PS06 - Helpful Hints

PS06 - Helpful Hints#

DS256, Gettysburg College

Prof Eatai Roth

Due: Nov 3, 2025 5p

Your Name: Prof Roth

Collaborators: Eatai

By submitting this work, you attest that:

it reflects your own effort and understanding of the material
you did not ask for help from students outside the course, other than PLAs
you did not use a generative AI to produce any code used below.

1.

Looking again at the dataframe of election results. It’s not formatted in the most convenient way. Let’s improve the organization of this dataframe.

Goal:

One row for each county-and-year combo.
Within a row, there should be columns for each party candidate and the number of votes they received.
No duplicated or unnecessary data.

Columns should be:

year
state_po
county
county_fips
candidate_D
votes_D
candidate_R
votes_R
totalvotes

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

elec_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/us_political/countypres_2000-2024.csv')
elec_df.head()

	year	state	state_po	county_name	county_fips	office	candidate	party	candidatevotes	totalvotes	version	mode
0	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	AL GORE	DEMOCRAT	4942	17208	20250821	TOTAL
1	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	GEORGE W. BUSH	REPUBLICAN	11993	17208	20250821	TOTAL
2	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	OTHER	OTHER	113	17208	20250821	TOTAL
3	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	RALPH NADER	GREEN	160	17208	20250821	TOTAL
4	2000	ALABAMA	AL	BALDWIN	1003.0	US PRESIDENT	AL GORE	DEMOCRAT	13997	56480	20250821	TOTAL

elec_df.columns

Index(['year', 'state', 'state_po', 'county_name', 'county_fips', 'office',
       'candidate', 'party', 'candidatevotes', 'totalvotes', 'version',
       'mode'],
      dtype='object')

# I copied the list above and then I rearrange and drop the columns as I want them to be
columns_to_keep = ['year', 'state_po', 'county_name', 'county_fips',
       'candidate', 'party', 'candidatevotes', 'totalvotes',
       'mode']

elec_df = elec_df[columns_to_keep]

# Just for convenience, I'm renaming the column something shorter
elec_df.rename(columns = {'candidatevotes':'votes'}, inplace = True)

# Keep only the vote totals (in 2020 and 2024, some states provide a breakdown of how ballots were cast)
elec_df = elec_df.query('mode in ["TOTAL", "TOTAL VOTES"]')

D_df = elec_df.query('party == "DEMOCRAT"')
R_df = elec_df.query('party=="REPUBLICAN"')

DR_df = pd.merge(D_df, R_df, how = 'left', on = ['year', 'state_po', 'county_name', 'county_fips', 'totalvotes'], suffixes = ['_D', '_R'])
DR_df.head()

	year	state_po	county_name	county_fips	candidate_D	party_D	votes_D	totalvotes	mode_D	candidate_R	party_R	votes_R	mode_R
0	2000	AL	AUTAUGA	1001.0	AL GORE	DEMOCRAT	4942	17208	TOTAL	GEORGE W. BUSH	REPUBLICAN	11993.0	TOTAL
1	2000	AL	BALDWIN	1003.0	AL GORE	DEMOCRAT	13997	56480	TOTAL	GEORGE W. BUSH	REPUBLICAN	40872.0	TOTAL
2	2000	AL	BARBOUR	1005.0	AL GORE	DEMOCRAT	5188	10395	TOTAL	GEORGE W. BUSH	REPUBLICAN	5096.0	TOTAL
3	2000	AL	BIBB	1007.0	AL GORE	DEMOCRAT	2710	7101	TOTAL	GEORGE W. BUSH	REPUBLICAN	4273.0	TOTAL
4	2000	AL	BLOUNT	1009.0	AL GORE	DEMOCRAT	4977	17973	TOTAL	GEORGE W. BUSH	REPUBLICAN	12667.0	TOTAL

columns_to_keep = ['year', 'state_po', 'county_name', 'county_fips',
                   'candidate_D', 'candidate_R', 
                   'votes_D', 'votes_R', 'totalvotes']

DR_df = DR_df[columns_to_keep]
DR_df

	year	state_po	county_name	county_fips	candidate_D	candidate_R	votes_D	votes_R	totalvotes
0	2000	AL	AUTAUGA	1001.0	AL GORE	GEORGE W. BUSH	4942	11993.0	17208
1	2000	AL	BALDWIN	1003.0	AL GORE	GEORGE W. BUSH	13997	40872.0	56480
2	2000	AL	BARBOUR	1005.0	AL GORE	GEORGE W. BUSH	5188	5096.0	10395
3	2000	AL	BIBB	1007.0	AL GORE	GEORGE W. BUSH	2710	4273.0	7101
4	2000	AL	BLOUNT	1009.0	AL GORE	GEORGE W. BUSH	4977	12667.0	17973
...	...	...	...	...	...	...	...	...	...
20716	2024	VA	WILLIAMSBURG CITY	51830.0	KAMALA D HARRIS	DONALD J TRUMP	5613	2119.0	7851
20717	2024	VA	WINCHESTER CITY	51840.0	KAMALA D HARRIS	DONALD J TRUMP	6407	5252.0	11822
20718	2024	VA	WISE	51195.0	KAMALA D HARRIS	DONALD J TRUMP	3036	13655.0	16776
20719	2024	VA	WYTHE	51197.0	KAMALA D HARRIS	DONALD J TRUMP	3075	12267.0	15439
20720	2024	VA	YORK	51199.0	KAMALA D HARRIS	DONALD J TRUMP	18296	20722.0	39540

20721 rows × 9 columns

An aside about indexing and finding max or min#

A data frame has column headers and indices (the numbers down the left side).

Indices and Slices#

Let’s look at the indices before and after a query.

display(DR_df.head())

pa_df = DR_df.query('state_po == "PA"')
display(pa_df.head())

	year	state_po	county_name	county_fips	candidate_D	candidate_R	votes_D	votes_R	totalvotes
0	2000	AL	AUTAUGA	1001.0	AL GORE	GEORGE W. BUSH	4942	11993.0	17208
1	2000	AL	BALDWIN	1003.0	AL GORE	GEORGE W. BUSH	13997	40872.0	56480
2	2000	AL	BARBOUR	1005.0	AL GORE	GEORGE W. BUSH	5188	5096.0	10395
3	2000	AL	BIBB	1007.0	AL GORE	GEORGE W. BUSH	2710	4273.0	7101
4	2000	AL	BLOUNT	1009.0	AL GORE	GEORGE W. BUSH	4977	12667.0	17973

	year	state_po	county_name	county_fips	candidate_D	candidate_R	votes_D	votes_R	totalvotes
2257	2000	PA	ADAMS	42001.0	AL GORE	GEORGE W. BUSH	11682	20848.0	33433
2258	2000	PA	ALLEGHENY	42003.0	AL GORE	GEORGE W. BUSH	329963	235361.0	582478
2259	2000	PA	ARMSTRONG	42005.0	AL GORE	GEORGE W. BUSH	11127	15508.0	27423
2260	2000	PA	BEAVER	42007.0	AL GORE	GEORGE W. BUSH	38925	32491.0	73593
2261	2000	PA	BEDFORD	42009.0	AL GORE	GEORGE W. BUSH	5474	13598.0	19435

Notice that the original dataframe had indices in order and starting at 0.

After a query, some rows will be discarded, so the indices will no longer be continuous (you’ll have missing values).

You can reset the indices using .reset_index()

DR_df.reset_index(drop=True, inplace=True)
DR_df.head()

	year	state_po	county_name	county_fips	candidate_D	candidate_R	votes_D	votes_R	totalvotes
0	2000	AL	AUTAUGA	1001.0	AL GORE	GEORGE W. BUSH	4942	11993.0	17208
1	2000	AL	BALDWIN	1003.0	AL GORE	GEORGE W. BUSH	13997	40872.0	56480
2	2000	AL	BARBOUR	1005.0	AL GORE	GEORGE W. BUSH	5188	5096.0	10395
3	2000	AL	BIBB	1007.0	AL GORE	GEORGE W. BUSH	2710	4273.0	7101
4	2000	AL	BLOUNT	1009.0	AL GORE	GEORGE W. BUSH	4977	12667.0	17973

.iloc vs loc

These are two ways to slice a dataframe.

loc uses the number index of the row and the column names. Other than that, you can do all the slicing operators you normally would.
iloc uses only numbers and slices based on the position of the row and column regardless of the number in the index or the name of the column. For example, row 0 will be the top row, regardless of the number in the index.

DR_df.loc[2:5, 'state_po':'votes_R'] # rows 2-5, columns state_po to votes_R

	state_po	county_name	county_fips	candidate_D	candidate_R	votes_D	votes_R
2	AL	BARBOUR	1005.0	AL GORE	GEORGE W. BUSH	5188	5096.0
3	AL	BIBB	1007.0	AL GORE	GEORGE W. BUSH	2710	4273.0
4	AL	BLOUNT	1009.0	AL GORE	GEORGE W. BUSH	4977	12667.0
5	AL	BULLOCK	1011.0	AL GORE	GEORGE W. BUSH	3395	1433.0

DR_df.iloc[2:5, 1:8]

	state_po	county_name	county_fips	candidate_D	candidate_R	votes_D	votes_R
2	AL	BARBOUR	1005.0	AL GORE	GEORGE W. BUSH	5188	5096.0
3	AL	BIBB	1007.0	AL GORE	GEORGE W. BUSH	2710	4273.0
4	AL	BLOUNT	1009.0	AL GORE	GEORGE W. BUSH	4977	12667.0

Finding max and min#

max() returns the maximum value of the specified column idxmax() returns the location of the maximum value

max_value = DR_df['votes_D'].max()
max_idx = DR_df['votes_D'].idxmax()

# use loc with max_idx to find the row that contains the max value
display(DR_df.loc[max_idx])

# or to get specific values from that row
DR_df.loc[max_idx, ['county_fips']]

year                        2020
state_po                      CA
county_name          LOS ANGELES
county_fips               6037.0
candidate_D    JOSEPH R BIDEN JR
candidate_R       DONALD J TRUMP
votes_D                  3028885
votes_R                1145530.0
totalvotes               4264365
Name: 15915, dtype: object

county_fips    6037.0
Name: 15915, dtype: object

2. The remainder of the assignment uses the dataframe you re-organized above in conjunction with two other dataframes:

ev_df - the Washington state electric vehicle dataframe from last class
agesex_df - a dataframe of age and sex demographics for all US counties.

ev_url = 'https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Electric_Vehicle_WA.csv'
ev_df = pd.read_csv(ev_url)

agesex_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/us_political/stats_america/Population-by-Age-and-Sex/Population%20by%20Age%20and%20Sex%20-%20US%2C%20States%2C%20Counties.csv')

display(ev_df.head())
display(agesex_df.head())

	VIN (1-10)	County	City	State	Postal Code	Model Year	Make	Model	Electric Vehicle Type	Clean Alternative Fuel Vehicle (CAFV) Eligibility	Electric Range	Legislative District	DOL Vehicle ID	Vehicle Location	Electric Utility	2020 Census Tract
0	WA1E2AFY8R	Thurston	Olympia	WA	98512.0	2024	AUDI	Q5 E	Plug-in Hybrid Electric Vehicle (PHEV)	Not eligible due to low battery range	23.0	22.0	263239938	POINT (-122.90787 46.9461)	PUGET SOUND ENERGY INC	5.306701e+10
1	WAUUPBFF4J	Yakima	Wapato	WA	98951.0	2018	AUDI	A3	Plug-in Hybrid Electric Vehicle (PHEV)	Not eligible due to low battery range	16.0	15.0	318160860	POINT (-120.42083 46.44779)	PACIFICORP	5.307794e+10
2	1N4AZ0CP0F	King	Seattle	WA	98125.0	2015	NISSAN	LEAF	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	84.0	46.0	184963586	POINT (-122.30253 47.72656)	CITY OF SEATTLE - (WA)\|CITY OF TACOMA - (WA)	5.303300e+10
3	WA1VAAGE5K	King	Kent	WA	98031.0	2019	AUDI	E-TRON	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	204.0	11.0	259426821	POINT (-122.17743 47.41185)	PUGET SOUND ENERGY INC\|\|CITY OF TACOMA - (WA)	5.303303e+10
4	7SAXCAE57N	Snohomish	Bothell	WA	98021.0	2022	TESLA	MODEL X	Battery Electric Vehicle (BEV)	Eligibility unknown as battery range has not b...	0.0	1.0	208182236	POINT (-122.18384 47.8031)	PUGET SOUND ENERGY INC	5.306105e+10

	Description	Year	Total Population	Population 0-4	Population 5-17	Population 18-24	Population 25-44	Population 45-64	Population 65+	Population Under 18	Population 18-54	Population 55+	Male Population	Female Population
0	U.S.	2000	282162411	19178293	53197896	27315274	84973340	62428040	35069568	72376189	150287588	59498634.0	138443407	143719004
1	U.S.	2001	284968955	19298217	53372958	27992652	84523274	64491563	35290291	72671175	151902194	60395586.0	139891492	145077463
2	U.S.	2002	287625193	19429192	53507265	28480708	83990295	66695526	35522207	72936457	152463197	62225539.0	141230559	146394634
3	U.S.	2003	290107933	19592446	53508312	28916746	83398001	68828899	35863529	73100758	153134701	63872474.0	142428897	147679036
4	U.S.	2004	292805298	19785885	53511850	29302179	83066831	70935234	36203319	73297735	153998940	65508623.0	143828012	148977286

2

a. What is the fips of the PA county with the greatest republican lean in 2024?
b. In the US in 2010, what is the fips (same as IBRC_Geo_ID) of the county that had the highest percentage of individuals aged 65+?
c. In Washington state 2024, how many electric vehicles were in the county with the highest percentage of people in the age range 18-44?
d. Rounded to the nearest percentage, what was the lean of that county (as a positive number)?
e. Find the most popular make of electric car in the Washington state county that has the greatest democratic lean. How many vehicles of that make are registered in that state?
f. Create three scatter plots, for each of 2016, 2020, 2024 in which each data point represents a county, the x-axis is percentage of people age 18-24 and the y-axis is political leaning of the county.

HINTS:#

a. How to create the lean column. Let’s define Lean = (R-D)/(R+D).

DR_df['lean'] = (DR_df['votes_R'] - DR_df['votes_D'])/(DR_df['votes_R'] + DR_df['votes_D'])
DR_df.head()

	year	state_po	county_name	county_fips	candidate_D	candidate_R	votes_D	votes_R	totalvotes	lean
0	2000	AL	AUTAUGA	1001.0	AL GORE	GEORGE W. BUSH	4942	11993.0	17208	0.416357
1	2000	AL	BALDWIN	1003.0	AL GORE	GEORGE W. BUSH	13997	40872.0	56480	0.489803
2	2000	AL	BARBOUR	1005.0	AL GORE	GEORGE W. BUSH	5188	5096.0	10395	-0.008946
3	2000	AL	BIBB	1007.0	AL GORE	GEORGE W. BUSH	2710	4273.0	7101	0.223829
4	2000	AL	BLOUNT	1009.0	AL GORE	GEORGE W. BUSH	4977	12667.0	17973	0.435842

c. This problem is much harder than I intended because the ev_df doesn’t have a fips column. Here’s the solution, you’ll need it for parts d and e.

Steps:

create a new fips_df that has the fips for every county.
- fips code is a 5-digit code:
  - first two numbers represent the state
  - the next three represent the county
  - if a fips is 4 digits, the first digit is implied 0 (e.g. 4001 —> state_fips = 04, county_fips = 001)
- IBRC_Geo_ID is the same as fips
create a column in agesex_df for the percentage of the population between 18 and 44.
find the Washington county with the highest value in that column

fips_df = elec_df[['state_po', 'county_name', 'county_fips']].drop_duplicates()
fips_df['state_fips'] = round(fips_df['county_fips']/1000)

# Get WA fips
wa_fips = fips_df.query('state_po=="WA"')['state_fips'].iloc[0]
wa_fips

np.float64(53.0)

# Make a column for the percent of the population age 18-44
agesex_df['perc_18_44'] = (agesex_df['Population 18-24'] + agesex_df['Population 25-44'])/ \
    agesex_df['Total Population']

# rename 'IBRC_Geo_ID' to 'county_fips' for more consistency across the dfs    
agesex_df.rename(columns = {'IBRC_Geo_ID':'county_fips'}, inplace=True)

# Filter agesex_df to only Washington state and 2024.
wa_1844 = agesex_df.query('Statefips==@wa_fips and Year == 2024')[['county_fips', 'perc_18_44']].reset_index()

# Find the maximum percentage and then the corresponding fips
max_idx = wa_1844['perc_18_44'].idxmax()
fips = wa_1844.loc[max_idx, 'county_fips']

# Find the name of the corresponding county
wa_county = \
    fips_df.query('county_fips==@fips')\
    .reset_index(drop=True)\
    .loc[0, 'county_name']

wa_county

'WHITMAN'

# Above, WHITMAN is all caps. In the ev_df, the counties are written in title-case (Whitman)
whitman_county = ev_df.query('County.str.lower() == @wa_county.lower()')

whitman_county['VIN (1-10)'].count()

np.int64(536)