PS06 - Helpful Hints#

DS256, Gettysburg College

Prof Eatai Roth

Due: Nov 3, 2025 5p

Your Name: Prof Roth

Collaborators: Eatai

By submitting this work, you attest that:

  • it reflects your own effort and understanding of the material

  • you did not ask for help from students outside the course, other than PLAs

  • you did not use a generative AI to produce any code used below.

1.

Looking again at the dataframe of election results. It’s not formatted in the most convenient way. Let’s improve the organization of this dataframe.

Goal:

  • One row for each county-and-year combo.

  • Within a row, there should be columns for each party candidate and the number of votes they received.

  • No duplicated or unnecessary data.

Columns should be:

  • year

  • state_po

  • county

  • county_fips

  • candidate_D

  • votes_D

  • candidate_R

  • votes_R

  • totalvotes

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
elec_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/us_political/countypres_2000-2024.csv')
elec_df.head()
year state state_po county_name county_fips office candidate party candidatevotes totalvotes version mode
0 2000 ALABAMA AL AUTAUGA 1001.0 US PRESIDENT AL GORE DEMOCRAT 4942 17208 20250821 TOTAL
1 2000 ALABAMA AL AUTAUGA 1001.0 US PRESIDENT GEORGE W. BUSH REPUBLICAN 11993 17208 20250821 TOTAL
2 2000 ALABAMA AL AUTAUGA 1001.0 US PRESIDENT OTHER OTHER 113 17208 20250821 TOTAL
3 2000 ALABAMA AL AUTAUGA 1001.0 US PRESIDENT RALPH NADER GREEN 160 17208 20250821 TOTAL
4 2000 ALABAMA AL BALDWIN 1003.0 US PRESIDENT AL GORE DEMOCRAT 13997 56480 20250821 TOTAL
elec_df.columns
Index(['year', 'state', 'state_po', 'county_name', 'county_fips', 'office',
       'candidate', 'party', 'candidatevotes', 'totalvotes', 'version',
       'mode'],
      dtype='object')
# I copied the list above and then I rearrange and drop the columns as I want them to be
columns_to_keep = ['year', 'state_po', 'county_name', 'county_fips',
       'candidate', 'party', 'candidatevotes', 'totalvotes',
       'mode']

elec_df = elec_df[columns_to_keep]

# Just for convenience, I'm renaming the column something shorter
elec_df.rename(columns = {'candidatevotes':'votes'}, inplace = True)
# Keep only the vote totals (in 2020 and 2024, some states provide a breakdown of how ballots were cast)
elec_df = elec_df.query('mode in ["TOTAL", "TOTAL VOTES"]')
D_df = elec_df.query('party == "DEMOCRAT"')
R_df = elec_df.query('party=="REPUBLICAN"')

DR_df = pd.merge(D_df, R_df, how = 'left', on = ['year', 'state_po', 'county_name', 'county_fips', 'totalvotes'], suffixes = ['_D', '_R'])
DR_df.head()
year state_po county_name county_fips candidate_D party_D votes_D totalvotes mode_D candidate_R party_R votes_R mode_R
0 2000 AL AUTAUGA 1001.0 AL GORE DEMOCRAT 4942 17208 TOTAL GEORGE W. BUSH REPUBLICAN 11993.0 TOTAL
1 2000 AL BALDWIN 1003.0 AL GORE DEMOCRAT 13997 56480 TOTAL GEORGE W. BUSH REPUBLICAN 40872.0 TOTAL
2 2000 AL BARBOUR 1005.0 AL GORE DEMOCRAT 5188 10395 TOTAL GEORGE W. BUSH REPUBLICAN 5096.0 TOTAL
3 2000 AL BIBB 1007.0 AL GORE DEMOCRAT 2710 7101 TOTAL GEORGE W. BUSH REPUBLICAN 4273.0 TOTAL
4 2000 AL BLOUNT 1009.0 AL GORE DEMOCRAT 4977 17973 TOTAL GEORGE W. BUSH REPUBLICAN 12667.0 TOTAL
columns_to_keep = ['year', 'state_po', 'county_name', 'county_fips',
                   'candidate_D', 'candidate_R', 
                   'votes_D', 'votes_R', 'totalvotes']

DR_df = DR_df[columns_to_keep]
DR_df
year state_po county_name county_fips candidate_D candidate_R votes_D votes_R totalvotes
0 2000 AL AUTAUGA 1001.0 AL GORE GEORGE W. BUSH 4942 11993.0 17208
1 2000 AL BALDWIN 1003.0 AL GORE GEORGE W. BUSH 13997 40872.0 56480
2 2000 AL BARBOUR 1005.0 AL GORE GEORGE W. BUSH 5188 5096.0 10395
3 2000 AL BIBB 1007.0 AL GORE GEORGE W. BUSH 2710 4273.0 7101
4 2000 AL BLOUNT 1009.0 AL GORE GEORGE W. BUSH 4977 12667.0 17973
... ... ... ... ... ... ... ... ... ...
20716 2024 VA WILLIAMSBURG CITY 51830.0 KAMALA D HARRIS DONALD J TRUMP 5613 2119.0 7851
20717 2024 VA WINCHESTER CITY 51840.0 KAMALA D HARRIS DONALD J TRUMP 6407 5252.0 11822
20718 2024 VA WISE 51195.0 KAMALA D HARRIS DONALD J TRUMP 3036 13655.0 16776
20719 2024 VA WYTHE 51197.0 KAMALA D HARRIS DONALD J TRUMP 3075 12267.0 15439
20720 2024 VA YORK 51199.0 KAMALA D HARRIS DONALD J TRUMP 18296 20722.0 39540

20721 rows × 9 columns

An aside about indexing and finding max or min#

A data frame has column headers and indices (the numbers down the left side).

Indices and Slices#

Let’s look at the indices before and after a query.

display(DR_df.head())

pa_df = DR_df.query('state_po == "PA"')
display(pa_df.head())
year state_po county_name county_fips candidate_D candidate_R votes_D votes_R totalvotes
0 2000 AL AUTAUGA 1001.0 AL GORE GEORGE W. BUSH 4942 11993.0 17208
1 2000 AL BALDWIN 1003.0 AL GORE GEORGE W. BUSH 13997 40872.0 56480
2 2000 AL BARBOUR 1005.0 AL GORE GEORGE W. BUSH 5188 5096.0 10395
3 2000 AL BIBB 1007.0 AL GORE GEORGE W. BUSH 2710 4273.0 7101
4 2000 AL BLOUNT 1009.0 AL GORE GEORGE W. BUSH 4977 12667.0 17973
year state_po county_name county_fips candidate_D candidate_R votes_D votes_R totalvotes
2257 2000 PA ADAMS 42001.0 AL GORE GEORGE W. BUSH 11682 20848.0 33433
2258 2000 PA ALLEGHENY 42003.0 AL GORE GEORGE W. BUSH 329963 235361.0 582478
2259 2000 PA ARMSTRONG 42005.0 AL GORE GEORGE W. BUSH 11127 15508.0 27423
2260 2000 PA BEAVER 42007.0 AL GORE GEORGE W. BUSH 38925 32491.0 73593
2261 2000 PA BEDFORD 42009.0 AL GORE GEORGE W. BUSH 5474 13598.0 19435

Notice that the original dataframe had indices in order and starting at 0.

After a query, some rows will be discarded, so the indices will no longer be continuous (you’ll have missing values).

You can reset the indices using .reset_index()

DR_df.reset_index(drop=True, inplace=True)
DR_df.head()
year state_po county_name county_fips candidate_D candidate_R votes_D votes_R totalvotes
0 2000 AL AUTAUGA 1001.0 AL GORE GEORGE W. BUSH 4942 11993.0 17208
1 2000 AL BALDWIN 1003.0 AL GORE GEORGE W. BUSH 13997 40872.0 56480
2 2000 AL BARBOUR 1005.0 AL GORE GEORGE W. BUSH 5188 5096.0 10395
3 2000 AL BIBB 1007.0 AL GORE GEORGE W. BUSH 2710 4273.0 7101
4 2000 AL BLOUNT 1009.0 AL GORE GEORGE W. BUSH 4977 12667.0 17973

.iloc vs loc

These are two ways to slice a dataframe.

  • loc uses the number index of the row and the column names. Other than that, you can do all the slicing operators you normally would.

  • iloc uses only numbers and slices based on the position of the row and column regardless of the number in the index or the name of the column. For example, row 0 will be the top row, regardless of the number in the index.

DR_df.loc[2:5, 'state_po':'votes_R'] # rows 2-5, columns state_po to votes_R
state_po county_name county_fips candidate_D candidate_R votes_D votes_R
2 AL BARBOUR 1005.0 AL GORE GEORGE W. BUSH 5188 5096.0
3 AL BIBB 1007.0 AL GORE GEORGE W. BUSH 2710 4273.0
4 AL BLOUNT 1009.0 AL GORE GEORGE W. BUSH 4977 12667.0
5 AL BULLOCK 1011.0 AL GORE GEORGE W. BUSH 3395 1433.0
DR_df.iloc[2:5, 1:8]
state_po county_name county_fips candidate_D candidate_R votes_D votes_R
2 AL BARBOUR 1005.0 AL GORE GEORGE W. BUSH 5188 5096.0
3 AL BIBB 1007.0 AL GORE GEORGE W. BUSH 2710 4273.0
4 AL BLOUNT 1009.0 AL GORE GEORGE W. BUSH 4977 12667.0

Finding max and min#

max() returns the maximum value of the specified column idxmax() returns the location of the maximum value

max_value = DR_df['votes_D'].max()
max_idx = DR_df['votes_D'].idxmax()

# use loc with max_idx to find the row that contains the max value
display(DR_df.loc[max_idx])

# or to get specific values from that row
DR_df.loc[max_idx, ['county_fips']]
year                        2020
state_po                      CA
county_name          LOS ANGELES
county_fips               6037.0
candidate_D    JOSEPH R BIDEN JR
candidate_R       DONALD J TRUMP
votes_D                  3028885
votes_R                1145530.0
totalvotes               4264365
Name: 15915, dtype: object
county_fips    6037.0
Name: 15915, dtype: object

2. The remainder of the assignment uses the dataframe you re-organized above in conjunction with two other dataframes:

  • ev_df - the Washington state electric vehicle dataframe from last class

  • agesex_df - a dataframe of age and sex demographics for all US counties.

ev_url = 'https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Electric_Vehicle_WA.csv'
ev_df = pd.read_csv(ev_url)

agesex_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/us_political/stats_america/Population-by-Age-and-Sex/Population%20by%20Age%20and%20Sex%20-%20US%2C%20States%2C%20Counties.csv')

display(ev_df.head())
display(agesex_df.head())
VIN (1-10) County City State Postal Code Model Year Make Model Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility Electric Range Base MSRP Legislative District DOL Vehicle ID Vehicle Location Electric Utility 2020 Census Tract
0 WA1E2AFY8R Thurston Olympia WA 98512.0 2024 AUDI Q5 E Plug-in Hybrid Electric Vehicle (PHEV) Not eligible due to low battery range 23.0 0 22.0 263239938 POINT (-122.90787 46.9461) PUGET SOUND ENERGY INC 5.306701e+10
1 WAUUPBFF4J Yakima Wapato WA 98951.0 2018 AUDI A3 Plug-in Hybrid Electric Vehicle (PHEV) Not eligible due to low battery range 16.0 0 15.0 318160860 POINT (-120.42083 46.44779) PACIFICORP 5.307794e+10
2 1N4AZ0CP0F King Seattle WA 98125.0 2015 NISSAN LEAF Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 84.0 0 46.0 184963586 POINT (-122.30253 47.72656) CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA) 5.303300e+10
3 WA1VAAGE5K King Kent WA 98031.0 2019 AUDI E-TRON Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 204.0 0 11.0 259426821 POINT (-122.17743 47.41185) PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA) 5.303303e+10
4 7SAXCAE57N Snohomish Bothell WA 98021.0 2022 TESLA MODEL X Battery Electric Vehicle (BEV) Eligibility unknown as battery range has not b... 0.0 0 1.0 208182236 POINT (-122.18384 47.8031) PUGET SOUND ENERGY INC 5.306105e+10
IBRC_Geo_ID Statefips Countyfips Description Year Total Population Population 0-4 Population 5-17 Population 18-24 Population 25-44 Population 45-64 Population 65+ Population Under 18 Population 18-54 Population 55+ Male Population Female Population
0 0 0 0 U.S. 2000 282162411 19178293 53197896 27315274 84973340 62428040 35069568 72376189 150287588 59498634.0 138443407 143719004
1 0 0 0 U.S. 2001 284968955 19298217 53372958 27992652 84523274 64491563 35290291 72671175 151902194 60395586.0 139891492 145077463
2 0 0 0 U.S. 2002 287625193 19429192 53507265 28480708 83990295 66695526 35522207 72936457 152463197 62225539.0 141230559 146394634
3 0 0 0 U.S. 2003 290107933 19592446 53508312 28916746 83398001 68828899 35863529 73100758 153134701 63872474.0 142428897 147679036
4 0 0 0 U.S. 2004 292805298 19785885 53511850 29302179 83066831 70935234 36203319 73297735 153998940 65508623.0 143828012 148977286

2

  • a. What is the fips of the PA county with the greatest republican lean in 2024?

  • b. In the US in 2010, what is the fips (same as IBRC_Geo_ID) of the county that had the highest percentage of individuals aged 65+?

  • c. In Washington state 2024, how many electric vehicles were in the county with the highest percentage of people in the age range 18-44?

  • d. Rounded to the nearest percentage, what was the lean of that county (as a positive number)?

  • e. Find the most popular make of electric car in the Washington state county that has the greatest democratic lean. How many vehicles of that make are registered in that state?

  • f. Create three scatter plots, for each of 2016, 2020, 2024 in which each data point represents a county, the x-axis is percentage of people age 18-24 and the y-axis is political leaning of the county.

HINTS:#

a. How to create the lean column. Let’s define Lean = (R-D)/(R+D).

DR_df['lean'] = (DR_df['votes_R'] - DR_df['votes_D'])/(DR_df['votes_R'] + DR_df['votes_D'])
DR_df.head()
year state_po county_name county_fips candidate_D candidate_R votes_D votes_R totalvotes lean
0 2000 AL AUTAUGA 1001.0 AL GORE GEORGE W. BUSH 4942 11993.0 17208 0.416357
1 2000 AL BALDWIN 1003.0 AL GORE GEORGE W. BUSH 13997 40872.0 56480 0.489803
2 2000 AL BARBOUR 1005.0 AL GORE GEORGE W. BUSH 5188 5096.0 10395 -0.008946
3 2000 AL BIBB 1007.0 AL GORE GEORGE W. BUSH 2710 4273.0 7101 0.223829
4 2000 AL BLOUNT 1009.0 AL GORE GEORGE W. BUSH 4977 12667.0 17973 0.435842

c. This problem is much harder than I intended because the ev_df doesn’t have a fips column. Here’s the solution, you’ll need it for parts d and e.

Steps:

  • create a new fips_df that has the fips for every county.

    • fips code is a 5-digit code:

      • first two numbers represent the state

      • the next three represent the county

      • if a fips is 4 digits, the first digit is implied 0 (e.g. 4001 —> state_fips = 04, county_fips = 001)

    • IBRC_Geo_ID is the same as fips

  • create a column in agesex_df for the percentage of the population between 18 and 44.

  • find the Washington county with the highest value in that column

fips_df = elec_df[['state_po', 'county_name', 'county_fips']].drop_duplicates()
fips_df['state_fips'] = round(fips_df['county_fips']/1000)

# Get WA fips
wa_fips = fips_df.query('state_po=="WA"')['state_fips'].iloc[0]
wa_fips
np.float64(53.0)
# Make a column for the percent of the population age 18-44
agesex_df['perc_18_44'] = (agesex_df['Population 18-24'] + agesex_df['Population 25-44'])/ \
    agesex_df['Total Population']

# rename 'IBRC_Geo_ID' to 'county_fips' for more consistency across the dfs    
agesex_df.rename(columns = {'IBRC_Geo_ID':'county_fips'}, inplace=True)
# Filter agesex_df to only Washington state and 2024.
wa_1844 = agesex_df.query('Statefips==@wa_fips and Year == 2024')[['county_fips', 'perc_18_44']].reset_index()

# Find the maximum percentage and then the corresponding fips
max_idx = wa_1844['perc_18_44'].idxmax()
fips = wa_1844.loc[max_idx, 'county_fips']
# Find the name of the corresponding county
wa_county = \
    fips_df.query('county_fips==@fips')\
    .reset_index(drop=True)\
    .loc[0, 'county_name']

wa_county
'WHITMAN'
# Above, WHITMAN is all caps. In the ev_df, the counties are written in title-case (Whitman)
whitman_county = ev_df.query('County.str.lower() == @wa_county.lower()')

whitman_county['VIN (1-10)'].count()
np.int64(536)