Pandas - Merging DataFrames

5. Pandas - Merging DataFrames#

Often, we may want to pool and combine data from different sources.

The Pandas function merge() allows you to combine data horizontally (adding columns) from different DataFrames based on common values in specified columns.

If you are familiar with SQL, this is a JOIN.

pandas_merge

Let’s go through the arguments of this function:

left and right are used to denote the two dataframes being merged.
on, left_on, right_on are the key columns (can be one or more) by which rows are matched.
- If the two dataframes share column names, use on. The on column will only be represented once in the merged dataframe.
- If not, they can still be matched using left_on and right_on. The key columns will be preserved as separate in the merged dataframe.
how describes the type of merge, which rows are kept and which rows are discarded depending on keys that do and don’t have corresponding matches.
- ‘left’ keep all the rows from the left dataframe. For a row with no corresponding data from the right dataframe, there will be missing data in the right half.
- ‘right’ keep all the rows from the right dataframe. For a row with no corresponding data from the left dataframe, there will be missing data in the left half.
- ‘inner’ keep only data where there are corresponding keys. There will be no missing row data, but this method throws away the most data.
- ‘outer’ keep all rows from both dataframes. This method doesn’t discard any data, but you may have missing data in both the left and right halves.
- ‘cross’ doesn’t look for matches between left and right keys; in fact, it won’t accept keys as an argument. Instead it makes all the possible combinations of rows from the left and right dataframes.
suffixes are used to disambiguate columns that share the same name. After the merge, the suffixes indicate which columns came from the left and right dataframes.

5.1. More about ‘how’#

read more about the types of merges

merge_types

import pandas as pd

L_dict = {
    'k1':[1, 2, 3, 4, 5],
    'l2':['a', 'b', 'c', 'd', 'e'],
    'twos':[12, 34, 45, 56, 67],
    'threes':[123, 234, 345, 456, 567]
}
R_dict = {
    'k1':[2, 4, 5, 6],
    'r2':['b', 'd', 'c', 'a'],
    'threes': [555, 666, 777, 888],
    'fours': [5555, 6666, 7777, 8888]
}

L_df = pd.DataFrame(L_dict)
R_df = pd.DataFrame(R_dict)

display(L_df)
display(R_df)

	k1	l2	twos	threes
0	1	a	12	123
1	2	b	34	234
2	3	c	45	345
3	4	d	56	456
4	5	e	67	567

	k1	r2	threes	fours
0	2	b	555	5555
1	4	d	666	6666
2	5	c	777	7777
3	6	a	888	8888

# Left and Right

left = pd.merge(L_df, R_df, how = 'left', 
                left_on = ['k1', 'l2'], right_on = ['k1', 'r2'], 
                suffixes = ['_L', '_R'])

# left = pd.merge(L_df, R_df, how = 'left', 
#                 left_on = ['k1', 'l2'], right_on = ['k1', 'r2'], 
#                 suffixes = ['_L', '_R'])

# display(left)

right = pd.merge(L_df, R_df, how = 'right', left_on = 'l2', right_on = 'r2', suffixes = ['_L', '_R'])

display(right)

	k1_L	l2	twos	threes_L	k1_R	r2	threes_R	fours
0	2	b	34	234	2	b	555	5555
1	4	d	56	456	4	d	666	6666
2	3	c	45	345	5	c	777	7777
3	1	a	12	123	6	a	888	8888

# Outer and inner

outer = pd.merge(L_df, R_df, how = 'outer', on = 'k1', suffixes = ['_L', '_R'])
inner = pd.merge(L_df, R_df, how = 'inner', on = 'k1', suffixes = ['_L', '_R'])

display(outer)
display(inner)

	k1	l2	twos	threes_L	r2	threes_R	fours
0	1	a	12.0	123.0	NaN	NaN	NaN
1	2	b	34.0	234.0	b	555.0	5555.0
2	3	c	45.0	345.0	NaN	NaN	NaN
3	4	d	56.0	456.0	d	666.0	6666.0
4	5	e	67.0	567.0	c	777.0	7777.0
5	6	NaN	NaN	NaN	a	888.0	8888.0

	k1	l2	twos	threes_L	r2	threes_R	fours
0	2	b	34	234	b	555	5555
1	4	d	56	456	d	666	6666
2	5	e	67	567	c	777	7777

cross = pd.merge(L_df, R_df, how = 'cross', suffixes = ['_L', '_R'])

display(cross)

	k1_L	l2	twos	threes_L	k1_R	r2	threes_R	fours
0	1	a	12	123	2	b	555	5555
1	1	a	12	123	4	d	666	6666
2	1	a	12	123	5	c	777	7777
3	1	a	12	123	6	a	888	8888
4	2	b	34	234	2	b	555	5555
5	2	b	34	234	4	d	666	6666
6	2	b	34	234	5	c	777	7777
7	2	b	34	234	6	a	888	8888
8	3	c	45	345	2	b	555	5555
9	3	c	45	345	4	d	666	6666
10	3	c	45	345	5	c	777	7777
11	3	c	45	345	6	a	888	8888
12	4	d	56	456	2	b	555	5555
13	4	d	56	456	4	d	666	6666
14	4	d	56	456	5	c	777	7777
15	4	d	56	456	6	a	888	8888
16	5	e	67	567	2	b	555	5555
17	5	e	67	567	4	d	666	6666
18	5	e	67	567	5	c	777	7777
19	5	e	67	567	6	a	888	8888

5.1.1. Example#

Looking again at the dataframe of election results. It’s not formatted in the most convenient way. Let’s improve the organization of this dataframe.

Goal:

One row for each county-and-year combo.
Within a row, there should be columns for each party candidate and the number of votes they received.
No duplicated or unnecessary data.

elec_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/us_political/countypres_2000-2024.csv')

elec_df.head()

	year	state	state_po	county_name	county_fips	office	candidate	party	candidatevotes	totalvotes	version	mode
0	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	AL GORE	DEMOCRAT	4942	17208	20250821	TOTAL
1	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	GEORGE W. BUSH	REPUBLICAN	11993	17208	20250821	TOTAL
2	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	OTHER	OTHER	113	17208	20250821	TOTAL
3	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	RALPH NADER	GREEN	160	17208	20250821	TOTAL
4	2000	ALABAMA	AL	BALDWIN	1003.0	US PRESIDENT	AL GORE	DEMOCRAT	13997	56480	20250821	TOTAL

elec_df['mode'].unique()
elec_df = elec_df.query('mode in ["TOTAL", "TOTAL VOTES"]')
elec_df = elec_df.drop(columns = ['office', 'version', 'mode'])
elec_df.head()

	year	state	state_po	county_name	county_fips	candidate	party	candidatevotes	totalvotes
0	2000	ALABAMA	AL	AUTAUGA	1001.0	AL GORE	DEMOCRAT	4942	17208
1	2000	ALABAMA	AL	AUTAUGA	1001.0	GEORGE W. BUSH	REPUBLICAN	11993	17208
2	2000	ALABAMA	AL	AUTAUGA	1001.0	OTHER	OTHER	113	17208
3	2000	ALABAMA	AL	AUTAUGA	1001.0	RALPH NADER	GREEN	160	17208
4	2000	ALABAMA	AL	BALDWIN	1003.0	AL GORE	DEMOCRAT	13997	56480

elec_df.rename(columns = {'candidatevotes':'votes'}, inplace = True)

R_df = elec_df.query('party == "REPUBLICAN"')
D_df = elec_df.query('party == "DEMOCRAT"')

display(R_df.head(2))
display(D_df.head(2))

	year	state	state_po	county_name	county_fips	candidate	party	votes	totalvotes
1	2000	ALABAMA	AL	AUTAUGA	1001.0	GEORGE W. BUSH	REPUBLICAN	11993	17208
5	2000	ALABAMA	AL	BALDWIN	1003.0	GEORGE W. BUSH	REPUBLICAN	40872	56480

	year	state	state_po	county_name	county_fips	candidate	party	votes	totalvotes
0	2000	ALABAMA	AL	AUTAUGA	1001.0	AL GORE	DEMOCRAT	4942	17208
4	2000	ALABAMA	AL	BALDWIN	1003.0	AL GORE	DEMOCRAT	13997	56480

# First run this then copy the list of column headers
# elec_df.columns

columns_to_keep = ['year', 'state_po', 'county_name', 'county_fips', 'candidate',
       'votes','party',  'totalvotes']

elec_df = elec_df[columns_to_keep]

elec_df.head()

	year	state_po	county_name	county_fips	candidate	votes	party	totalvotes
0	2000	AL	AUTAUGA	1001.0	AL GORE	4942	DEMOCRAT	17208
1	2000	AL	AUTAUGA	1001.0	GEORGE W. BUSH	11993	REPUBLICAN	17208
2	2000	AL	AUTAUGA	1001.0	OTHER	113	OTHER	17208
3	2000	AL	AUTAUGA	1001.0	RALPH NADER	160	GREEN	17208
4	2000	AL	BALDWIN	1003.0	AL GORE	13997	DEMOCRAT	56480

Pandas - Merging DataFrames

Contents

5. Pandas - Merging DataFrames#

5.1. More about ‘how’#

5.1.1. Example#