PS06 - Helpful Hints#
DS256, Gettysburg College
Prof Eatai Roth
Due: Nov 3, 2025 5p
Your Name: Prof Roth
Collaborators: Eatai
By submitting this work, you attest that:
it reflects your own effort and understanding of the material
you did not ask for help from students outside the course, other than PLAs
you did not use a generative AI to produce any code used below.
1.
Looking again at the dataframe of election results. It’s not formatted in the most convenient way. Let’s improve the organization of this dataframe.
Goal:
One row for each county-and-year combo.
Within a row, there should be columns for each party candidate and the number of votes they received.
No duplicated or unnecessary data.
Columns should be:
year
state_po
county
county_fips
candidate_D
votes_D
candidate_R
votes_R
totalvotes
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
elec_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/us_political/countypres_2000-2024.csv')
elec_df.head()
| year | state | state_po | county_name | county_fips | office | candidate | party | candidatevotes | totalvotes | version | mode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | ALABAMA | AL | AUTAUGA | 1001.0 | US PRESIDENT | AL GORE | DEMOCRAT | 4942 | 17208 | 20250821 | TOTAL |
| 1 | 2000 | ALABAMA | AL | AUTAUGA | 1001.0 | US PRESIDENT | GEORGE W. BUSH | REPUBLICAN | 11993 | 17208 | 20250821 | TOTAL |
| 2 | 2000 | ALABAMA | AL | AUTAUGA | 1001.0 | US PRESIDENT | OTHER | OTHER | 113 | 17208 | 20250821 | TOTAL |
| 3 | 2000 | ALABAMA | AL | AUTAUGA | 1001.0 | US PRESIDENT | RALPH NADER | GREEN | 160 | 17208 | 20250821 | TOTAL |
| 4 | 2000 | ALABAMA | AL | BALDWIN | 1003.0 | US PRESIDENT | AL GORE | DEMOCRAT | 13997 | 56480 | 20250821 | TOTAL |
elec_df.columns
Index(['year', 'state', 'state_po', 'county_name', 'county_fips', 'office',
'candidate', 'party', 'candidatevotes', 'totalvotes', 'version',
'mode'],
dtype='object')
# I copied the list above and then I rearrange and drop the columns as I want them to be
columns_to_keep = ['year', 'state_po', 'county_name', 'county_fips',
'candidate', 'party', 'candidatevotes', 'totalvotes',
'mode']
elec_df = elec_df[columns_to_keep]
# Just for convenience, I'm renaming the column something shorter
elec_df.rename(columns = {'candidatevotes':'votes'}, inplace = True)
# Keep only the vote totals (in 2020 and 2024, some states provide a breakdown of how ballots were cast)
elec_df = elec_df.query('mode in ["TOTAL", "TOTAL VOTES"]')
D_df = elec_df.query('party == "DEMOCRAT"')
R_df = elec_df.query('party=="REPUBLICAN"')
DR_df = pd.merge(D_df, R_df, how = 'left', on = ['year', 'state_po', 'county_name', 'county_fips', 'totalvotes'], suffixes = ['_D', '_R'])
DR_df.head()
| year | state_po | county_name | county_fips | candidate_D | party_D | votes_D | totalvotes | mode_D | candidate_R | party_R | votes_R | mode_R | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | AL | AUTAUGA | 1001.0 | AL GORE | DEMOCRAT | 4942 | 17208 | TOTAL | GEORGE W. BUSH | REPUBLICAN | 11993.0 | TOTAL |
| 1 | 2000 | AL | BALDWIN | 1003.0 | AL GORE | DEMOCRAT | 13997 | 56480 | TOTAL | GEORGE W. BUSH | REPUBLICAN | 40872.0 | TOTAL |
| 2 | 2000 | AL | BARBOUR | 1005.0 | AL GORE | DEMOCRAT | 5188 | 10395 | TOTAL | GEORGE W. BUSH | REPUBLICAN | 5096.0 | TOTAL |
| 3 | 2000 | AL | BIBB | 1007.0 | AL GORE | DEMOCRAT | 2710 | 7101 | TOTAL | GEORGE W. BUSH | REPUBLICAN | 4273.0 | TOTAL |
| 4 | 2000 | AL | BLOUNT | 1009.0 | AL GORE | DEMOCRAT | 4977 | 17973 | TOTAL | GEORGE W. BUSH | REPUBLICAN | 12667.0 | TOTAL |
columns_to_keep = ['year', 'state_po', 'county_name', 'county_fips',
'candidate_D', 'candidate_R',
'votes_D', 'votes_R', 'totalvotes']
DR_df = DR_df[columns_to_keep]
DR_df
| year | state_po | county_name | county_fips | candidate_D | candidate_R | votes_D | votes_R | totalvotes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | AL | AUTAUGA | 1001.0 | AL GORE | GEORGE W. BUSH | 4942 | 11993.0 | 17208 |
| 1 | 2000 | AL | BALDWIN | 1003.0 | AL GORE | GEORGE W. BUSH | 13997 | 40872.0 | 56480 |
| 2 | 2000 | AL | BARBOUR | 1005.0 | AL GORE | GEORGE W. BUSH | 5188 | 5096.0 | 10395 |
| 3 | 2000 | AL | BIBB | 1007.0 | AL GORE | GEORGE W. BUSH | 2710 | 4273.0 | 7101 |
| 4 | 2000 | AL | BLOUNT | 1009.0 | AL GORE | GEORGE W. BUSH | 4977 | 12667.0 | 17973 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20716 | 2024 | VA | WILLIAMSBURG CITY | 51830.0 | KAMALA D HARRIS | DONALD J TRUMP | 5613 | 2119.0 | 7851 |
| 20717 | 2024 | VA | WINCHESTER CITY | 51840.0 | KAMALA D HARRIS | DONALD J TRUMP | 6407 | 5252.0 | 11822 |
| 20718 | 2024 | VA | WISE | 51195.0 | KAMALA D HARRIS | DONALD J TRUMP | 3036 | 13655.0 | 16776 |
| 20719 | 2024 | VA | WYTHE | 51197.0 | KAMALA D HARRIS | DONALD J TRUMP | 3075 | 12267.0 | 15439 |
| 20720 | 2024 | VA | YORK | 51199.0 | KAMALA D HARRIS | DONALD J TRUMP | 18296 | 20722.0 | 39540 |
20721 rows × 9 columns
An aside about indexing and finding max or min#
A data frame has column headers and indices (the numbers down the left side).
Indices and Slices#
Let’s look at the indices before and after a query.
display(DR_df.head())
pa_df = DR_df.query('state_po == "PA"')
display(pa_df.head())
| year | state_po | county_name | county_fips | candidate_D | candidate_R | votes_D | votes_R | totalvotes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | AL | AUTAUGA | 1001.0 | AL GORE | GEORGE W. BUSH | 4942 | 11993.0 | 17208 |
| 1 | 2000 | AL | BALDWIN | 1003.0 | AL GORE | GEORGE W. BUSH | 13997 | 40872.0 | 56480 |
| 2 | 2000 | AL | BARBOUR | 1005.0 | AL GORE | GEORGE W. BUSH | 5188 | 5096.0 | 10395 |
| 3 | 2000 | AL | BIBB | 1007.0 | AL GORE | GEORGE W. BUSH | 2710 | 4273.0 | 7101 |
| 4 | 2000 | AL | BLOUNT | 1009.0 | AL GORE | GEORGE W. BUSH | 4977 | 12667.0 | 17973 |
| year | state_po | county_name | county_fips | candidate_D | candidate_R | votes_D | votes_R | totalvotes | |
|---|---|---|---|---|---|---|---|---|---|
| 2257 | 2000 | PA | ADAMS | 42001.0 | AL GORE | GEORGE W. BUSH | 11682 | 20848.0 | 33433 |
| 2258 | 2000 | PA | ALLEGHENY | 42003.0 | AL GORE | GEORGE W. BUSH | 329963 | 235361.0 | 582478 |
| 2259 | 2000 | PA | ARMSTRONG | 42005.0 | AL GORE | GEORGE W. BUSH | 11127 | 15508.0 | 27423 |
| 2260 | 2000 | PA | BEAVER | 42007.0 | AL GORE | GEORGE W. BUSH | 38925 | 32491.0 | 73593 |
| 2261 | 2000 | PA | BEDFORD | 42009.0 | AL GORE | GEORGE W. BUSH | 5474 | 13598.0 | 19435 |
Notice that the original dataframe had indices in order and starting at 0.
After a query, some rows will be discarded, so the indices will no longer be continuous (you’ll have missing values).
You can reset the indices using .reset_index()
DR_df.reset_index(drop=True, inplace=True)
DR_df.head()
| year | state_po | county_name | county_fips | candidate_D | candidate_R | votes_D | votes_R | totalvotes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | AL | AUTAUGA | 1001.0 | AL GORE | GEORGE W. BUSH | 4942 | 11993.0 | 17208 |
| 1 | 2000 | AL | BALDWIN | 1003.0 | AL GORE | GEORGE W. BUSH | 13997 | 40872.0 | 56480 |
| 2 | 2000 | AL | BARBOUR | 1005.0 | AL GORE | GEORGE W. BUSH | 5188 | 5096.0 | 10395 |
| 3 | 2000 | AL | BIBB | 1007.0 | AL GORE | GEORGE W. BUSH | 2710 | 4273.0 | 7101 |
| 4 | 2000 | AL | BLOUNT | 1009.0 | AL GORE | GEORGE W. BUSH | 4977 | 12667.0 | 17973 |
.iloc vs loc
These are two ways to slice a dataframe.
locuses the number index of the row and the column names. Other than that, you can do all the slicing operators you normally would.ilocuses only numbers and slices based on the position of the row and column regardless of the number in the index or the name of the column. For example, row 0 will be the top row, regardless of the number in the index.
DR_df.loc[2:5, 'state_po':'votes_R'] # rows 2-5, columns state_po to votes_R
| state_po | county_name | county_fips | candidate_D | candidate_R | votes_D | votes_R | |
|---|---|---|---|---|---|---|---|
| 2 | AL | BARBOUR | 1005.0 | AL GORE | GEORGE W. BUSH | 5188 | 5096.0 |
| 3 | AL | BIBB | 1007.0 | AL GORE | GEORGE W. BUSH | 2710 | 4273.0 |
| 4 | AL | BLOUNT | 1009.0 | AL GORE | GEORGE W. BUSH | 4977 | 12667.0 |
| 5 | AL | BULLOCK | 1011.0 | AL GORE | GEORGE W. BUSH | 3395 | 1433.0 |
DR_df.iloc[2:5, 1:8]
| state_po | county_name | county_fips | candidate_D | candidate_R | votes_D | votes_R | |
|---|---|---|---|---|---|---|---|
| 2 | AL | BARBOUR | 1005.0 | AL GORE | GEORGE W. BUSH | 5188 | 5096.0 |
| 3 | AL | BIBB | 1007.0 | AL GORE | GEORGE W. BUSH | 2710 | 4273.0 |
| 4 | AL | BLOUNT | 1009.0 | AL GORE | GEORGE W. BUSH | 4977 | 12667.0 |
Finding max and min#
max() returns the maximum value of the specified column
idxmax() returns the location of the maximum value
max_value = DR_df['votes_D'].max()
max_idx = DR_df['votes_D'].idxmax()
# use loc with max_idx to find the row that contains the max value
display(DR_df.loc[max_idx])
# or to get specific values from that row
DR_df.loc[max_idx, ['county_fips']]
year 2020
state_po CA
county_name LOS ANGELES
county_fips 6037.0
candidate_D JOSEPH R BIDEN JR
candidate_R DONALD J TRUMP
votes_D 3028885
votes_R 1145530.0
totalvotes 4264365
Name: 15915, dtype: object
county_fips 6037.0
Name: 15915, dtype: object
2. The remainder of the assignment uses the dataframe you re-organized above in conjunction with two other dataframes:
ev_df- the Washington state electric vehicle dataframe from last classagesex_df- a dataframe of age and sex demographics for all US counties.
ev_url = 'https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Electric_Vehicle_WA.csv'
ev_df = pd.read_csv(ev_url)
agesex_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/us_political/stats_america/Population-by-Age-and-Sex/Population%20by%20Age%20and%20Sex%20-%20US%2C%20States%2C%20Counties.csv')
display(ev_df.head())
display(agesex_df.head())
| VIN (1-10) | County | City | State | Postal Code | Model Year | Make | Model | Electric Vehicle Type | Clean Alternative Fuel Vehicle (CAFV) Eligibility | Electric Range | Base MSRP | Legislative District | DOL Vehicle ID | Vehicle Location | Electric Utility | 2020 Census Tract | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WA1E2AFY8R | Thurston | Olympia | WA | 98512.0 | 2024 | AUDI | Q5 E | Plug-in Hybrid Electric Vehicle (PHEV) | Not eligible due to low battery range | 23.0 | 0 | 22.0 | 263239938 | POINT (-122.90787 46.9461) | PUGET SOUND ENERGY INC | 5.306701e+10 |
| 1 | WAUUPBFF4J | Yakima | Wapato | WA | 98951.0 | 2018 | AUDI | A3 | Plug-in Hybrid Electric Vehicle (PHEV) | Not eligible due to low battery range | 16.0 | 0 | 15.0 | 318160860 | POINT (-120.42083 46.44779) | PACIFICORP | 5.307794e+10 |
| 2 | 1N4AZ0CP0F | King | Seattle | WA | 98125.0 | 2015 | NISSAN | LEAF | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 84.0 | 0 | 46.0 | 184963586 | POINT (-122.30253 47.72656) | CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA) | 5.303300e+10 |
| 3 | WA1VAAGE5K | King | Kent | WA | 98031.0 | 2019 | AUDI | E-TRON | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 204.0 | 0 | 11.0 | 259426821 | POINT (-122.17743 47.41185) | PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA) | 5.303303e+10 |
| 4 | 7SAXCAE57N | Snohomish | Bothell | WA | 98021.0 | 2022 | TESLA | MODEL X | Battery Electric Vehicle (BEV) | Eligibility unknown as battery range has not b... | 0.0 | 0 | 1.0 | 208182236 | POINT (-122.18384 47.8031) | PUGET SOUND ENERGY INC | 5.306105e+10 |
| IBRC_Geo_ID | Statefips | Countyfips | Description | Year | Total Population | Population 0-4 | Population 5-17 | Population 18-24 | Population 25-44 | Population 45-64 | Population 65+ | Population Under 18 | Population 18-54 | Population 55+ | Male Population | Female Population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | U.S. | 2000 | 282162411 | 19178293 | 53197896 | 27315274 | 84973340 | 62428040 | 35069568 | 72376189 | 150287588 | 59498634.0 | 138443407 | 143719004 |
| 1 | 0 | 0 | 0 | U.S. | 2001 | 284968955 | 19298217 | 53372958 | 27992652 | 84523274 | 64491563 | 35290291 | 72671175 | 151902194 | 60395586.0 | 139891492 | 145077463 |
| 2 | 0 | 0 | 0 | U.S. | 2002 | 287625193 | 19429192 | 53507265 | 28480708 | 83990295 | 66695526 | 35522207 | 72936457 | 152463197 | 62225539.0 | 141230559 | 146394634 |
| 3 | 0 | 0 | 0 | U.S. | 2003 | 290107933 | 19592446 | 53508312 | 28916746 | 83398001 | 68828899 | 35863529 | 73100758 | 153134701 | 63872474.0 | 142428897 | 147679036 |
| 4 | 0 | 0 | 0 | U.S. | 2004 | 292805298 | 19785885 | 53511850 | 29302179 | 83066831 | 70935234 | 36203319 | 73297735 | 153998940 | 65508623.0 | 143828012 | 148977286 |
2
a. What is the fips of the PA county with the greatest republican lean in 2024?
b. In the US in 2010, what is the fips (same as IBRC_Geo_ID) of the county that had the highest percentage of individuals aged 65+?
c. In Washington state 2024, how many electric vehicles were in the county with the highest percentage of people in the age range 18-44?
d. Rounded to the nearest percentage, what was the lean of that county (as a positive number)?
e. Find the most popular make of electric car in the Washington state county that has the greatest democratic lean. How many vehicles of that make are registered in that state?
f. Create three scatter plots, for each of 2016, 2020, 2024 in which each data point represents a county, the x-axis is percentage of people age 18-24 and the y-axis is political leaning of the county.
HINTS:#
a. How to create the lean column. Let’s define Lean = (R-D)/(R+D).
DR_df['lean'] = (DR_df['votes_R'] - DR_df['votes_D'])/(DR_df['votes_R'] + DR_df['votes_D'])
DR_df.head()
| year | state_po | county_name | county_fips | candidate_D | candidate_R | votes_D | votes_R | totalvotes | lean | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | AL | AUTAUGA | 1001.0 | AL GORE | GEORGE W. BUSH | 4942 | 11993.0 | 17208 | 0.416357 |
| 1 | 2000 | AL | BALDWIN | 1003.0 | AL GORE | GEORGE W. BUSH | 13997 | 40872.0 | 56480 | 0.489803 |
| 2 | 2000 | AL | BARBOUR | 1005.0 | AL GORE | GEORGE W. BUSH | 5188 | 5096.0 | 10395 | -0.008946 |
| 3 | 2000 | AL | BIBB | 1007.0 | AL GORE | GEORGE W. BUSH | 2710 | 4273.0 | 7101 | 0.223829 |
| 4 | 2000 | AL | BLOUNT | 1009.0 | AL GORE | GEORGE W. BUSH | 4977 | 12667.0 | 17973 | 0.435842 |
c. This problem is much harder than I intended because the ev_df doesn’t have a fips column. Here’s the solution, you’ll need it for parts d and e.
Steps:
create a new fips_df that has the fips for every county.
fips code is a 5-digit code:
first two numbers represent the state
the next three represent the county
if a fips is 4 digits, the first digit is implied 0 (e.g. 4001 —> state_fips = 04, county_fips = 001)
IBRC_Geo_ID is the same as fips
create a column in agesex_df for the percentage of the population between 18 and 44.
find the Washington county with the highest value in that column
fips_df = elec_df[['state_po', 'county_name', 'county_fips']].drop_duplicates()
fips_df['state_fips'] = round(fips_df['county_fips']/1000)
# Get WA fips
wa_fips = fips_df.query('state_po=="WA"')['state_fips'].iloc[0]
wa_fips
np.float64(53.0)
# Make a column for the percent of the population age 18-44
agesex_df['perc_18_44'] = (agesex_df['Population 18-24'] + agesex_df['Population 25-44'])/ \
agesex_df['Total Population']
# rename 'IBRC_Geo_ID' to 'county_fips' for more consistency across the dfs
agesex_df.rename(columns = {'IBRC_Geo_ID':'county_fips'}, inplace=True)
# Filter agesex_df to only Washington state and 2024.
wa_1844 = agesex_df.query('Statefips==@wa_fips and Year == 2024')[['county_fips', 'perc_18_44']].reset_index()
# Find the maximum percentage and then the corresponding fips
max_idx = wa_1844['perc_18_44'].idxmax()
fips = wa_1844.loc[max_idx, 'county_fips']
# Find the name of the corresponding county
wa_county = \
fips_df.query('county_fips==@fips')\
.reset_index(drop=True)\
.loc[0, 'county_name']
wa_county
'WHITMAN'
# Above, WHITMAN is all caps. In the ev_df, the counties are written in title-case (Whitman)
whitman_county = ev_df.query('County.str.lower() == @wa_county.lower()')
whitman_county['VIN (1-10)'].count()
np.int64(536)