Pandas - Grouping and Aggregation

4. Pandas - Grouping and Aggregation#

import pandas as pd

Data source: https://data.wa.gov/Transportation/Electric-Vehicle-Population-Data/f6w7-q2d2/data_preview

url = 'https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Electric_Vehicle_WA.csv'

ev_df = pd.read_csv(url)
ev_df.head()

	VIN (1-10)	County	City	State	Postal Code	Model Year	Make	Model	Electric Vehicle Type	Clean Alternative Fuel Vehicle (CAFV) Eligibility	Electric Range	Legislative District	DOL Vehicle ID	Vehicle Location	Electric Utility	2020 Census Tract
0	WA1E2AFY8R	Thurston	Olympia	WA	98512.0	2024	AUDI	Q5 E	Plug-in Hybrid Electric Vehicle (PHEV)	Not eligible due to low battery range	23.0	22.0	263239938	POINT (-122.90787 46.9461)	PUGET SOUND ENERGY INC	5.306701e+10
1	WAUUPBFF4J	Yakima	Wapato	WA	98951.0	2018	AUDI	A3	Plug-in Hybrid Electric Vehicle (PHEV)	Not eligible due to low battery range	16.0	15.0	318160860	POINT (-120.42083 46.44779)	PACIFICORP	5.307794e+10
2	1N4AZ0CP0F	King	Seattle	WA	98125.0	2015	NISSAN	LEAF	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	84.0	46.0	184963586	POINT (-122.30253 47.72656)	CITY OF SEATTLE - (WA)\|CITY OF TACOMA - (WA)	5.303300e+10
3	WA1VAAGE5K	King	Kent	WA	98031.0	2019	AUDI	E-TRON	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	204.0	11.0	259426821	POINT (-122.17743 47.41185)	PUGET SOUND ENERGY INC\|\|CITY OF TACOMA - (WA)	5.303303e+10
4	7SAXCAE57N	Snohomish	Bothell	WA	98021.0	2022	TESLA	MODEL X	Battery Electric Vehicle (BEV)	Eligibility unknown as battery range has not b...	0.0	1.0	208182236	POINT (-122.18384 47.8031)	PUGET SOUND ENERGY INC	5.306105e+10

ev_df.describe()

	Postal Code	Model Year	Electric Range	Legislative District	DOL Vehicle ID	2020 Census Tract
count	264619.000000	264628.000000	264624.000000	263969.000000	2.646280e+05	2.646190e+05
mean	98170.770500	2021.842556	41.713159	28.862897	2.422540e+08	5.297109e+10
std	2637.717847	3.044095	80.377977	14.883356	6.516028e+07	1.638317e+09
min	1469.000000	1999.000000	0.000000	1.000000	4.385000e+03	1.001020e+09
25%	98052.000000	2021.000000	0.000000	17.000000	2.174474e+08	5.303301e+10
50%	98133.000000	2023.000000	0.000000	32.000000	2.603598e+08	5.303303e+10
75%	98382.000000	2024.000000	34.000000	42.000000	2.758921e+08	5.305307e+10
max	99577.000000	2026.000000	337.000000	49.000000	4.791150e+08	6.601095e+10

4.1. Groupby and Aggregate#

Often in data science, we want to partition our data set into groups based on some feature and compare statistics of those groups.

In pandas, groupby is used to split the data into groups based on the values in one or more columns. After grouping, you can apply aggregation (agg) functions to summarize or transform each group.

groupby works by:

Splitting: The data is split into groups using the values of specified columns.
Applying: An aggregation function (like sum, mean, count, etc.) is applied to each group.
Combining: The results are combined into a new DataFrame or Series.

And we can group by multiple features as well.

Example

# Group by 'County' and calculate the average Electric Range for each county
ev_df.groupby(by = 'County')['Electric Range'].mean()

# Group by 'County' and count the number of cars by each manufacturer
ev_df.groupby(by = ['County', 'Make'])['VIN (1-10)'].count()

4.1.1. Common Aggregation Functions#

sum(): Total of values in each group
mean(): Average of values in each group
count(): Number of items in each group
max(), min(): Maximum or minimum value in each group

4.1.2. Multiple Aggregations#

You can apply multiple aggregation functions at once using .agg():

ev_df.groupby('County').agg({
    'Electric Range': ['mean', 'max', 'min'],
    'Model Year': 'count'
})

This produces a summary table with the specified statistics for each group.

4.1.3. Use Cases#

Summarizing data by categories (e.g., average sales per region)
Counting occurrences (e.g., number of vehicles per manufacturer)
Finding maximum/minimum values within groups

ev_df.groupby(by = ['County', 'Make'])['VIN (1-10)'].count()

County  Make      
Ada     TESLA          2
Adams   BMW            4
        CADILLAC       3
        CHEVROLET      7
        CHRYSLER       1
                      ..
Yakima  VOLKSWAGEN    31
        VOLVO         31
York    TESLA          2
        VOLKSWAGEN     1
Yuba    TESLA          1
Name: VIN (1-10), Length: 1543, dtype: int64

ev_df.groupby(by = 'County')['Electric Range'].describe()

	count	mean	std	min	25%	50%	75%	max
County
Ada	2.0	0.000000	0.000000	0.0	0.0	0.0	0.0	0.0
Adams	109.0	52.504587	93.387348	0.0	0.0	0.0	39.0	322.0
Alameda	10.0	2.100000	6.640783	0.0	0.0	0.0	0.0	21.0
Albemarle	2.0	104.000000	147.078210	0.0	52.0	104.0	156.0	208.0
Alexandria	7.0	86.142857	138.338267	0.0	0.0	0.0	136.5	330.0
...	...	...	...	...	...	...	...	...
Whitman	536.0	36.694030	67.823408	0.0	0.0	0.0	39.0	322.0
Williamson	5.0	48.600000	96.331719	0.0	0.0	0.0	23.0	220.0
Yakima	1791.0	49.169179	86.251936	0.0	0.0	0.0	39.5	330.0
York	3.0	0.000000	0.000000	0.0	0.0	0.0	0.0	0.0
Yuba	1.0	0.000000	NaN	0.0	0.0	0.0	0.0	0.0

239 rows × 8 columns

ev_df.groupby(by = 'County')['Electric Range'].mean()

County
Ada             0.000000
Adams          52.504587
Alameda         2.100000
Albemarle     104.000000
Alexandria     86.142857
                 ...    
Whitman        36.694030
Williamson     48.600000
Yakima         49.169179
York            0.000000
Yuba            0.000000
Name: Electric Range, Length: 239, dtype: float64

ev_df.groupby('County').agg({
    'Electric Range': ['mean', 'max', 'min'],
    'Model Year': 'count'
})

	Electric Range			Model Year
	mean	max	min	count
County
Ada	0.000000	0.0	0.0	2
Adams	52.504587	322.0	0.0	109
Alameda	2.100000	21.0	0.0	10
Albemarle	104.000000	208.0	0.0	2
Alexandria	86.142857	330.0	0.0	7
...	...	...	...	...
Whitman	36.694030	322.0	0.0	536
Williamson	48.600000	220.0	0.0	5
Yakima	49.169179	330.0	0.0	1791
York	0.000000	0.0	0.0	3
Yuba	0.000000	0.0	0.0	1

239 rows × 4 columns

4.2. Useful Pandas Functions for Sorting, Counting, and Finding Maxima/Minima in Data#

Sorting Values: Sort the DataFrame by a column (e.g., Electric Range):

ev_df.sort_values(by='Electric Range', ascending=False).head()

Counting Instances: Count the number of vehicles by manufacturer:

ev_df['Make'].value_counts()

Counting in Groups: Count the number of vehicles by county and model:

is this outdated ?

ev_df.groupby(['County', 'Model']).size()

Finding Max Value: Find the maximum value in a column:

ev_df['Electric Range'].max()

Finding the Data Row with the Max Value: Find the row with the maximum electric range:

ev_df.loc[ev_df['Electric Range'].idxmax()]

ev_df.sort_values(by='Electric Range', ascending=False).head(1)

	VIN (1-10)	County	City	State	Postal Code	Model Year	Make	Model	Electric Vehicle Type	Clean Alternative Fuel Vehicle (CAFV) Eligibility	Electric Range	Base MSRP	Legislative District	DOL Vehicle ID	Vehicle Location	Electric Utility	2020 Census Tract
109226	5YJSA1E48L	Pierce	Tacoma	WA	98422.0	2020	TESLA	MODEL S	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	337.0	0	27.0	128568478	POINT (-122.39471 47.28159)	BONNEVILLE POWER ADMINISTRATION\|\|CITY OF TACOM...	5.305394e+10

4.3. Crack the Code!#

a What is the maximum Electric Range of the vehicles in this dataset?
b How many electric vehicles are registered in Snohomish County?
c How many Tesla Model Ys are in King County?
d For the Model Year that has the most registered vehicles, how many vehicles are there?
e What is the highest average electric range for a postal code (there are six postal codes that are tied at first)?
f Of Chevrolet EVs, what is the count for the most popular model?
g Which Legislative District has the most EVs?
h What is the DOL Vehicle ID of the oldest car on this registry? There are two entries (likely a data entry mistake), pick the larger DOL Vehicle ID.

a = 11
b = 11
c = 11
d = 11
e = 11
f = 11
g = 11
h = 11

ans_list = [a, b, c, d, e, f, g, h]

code_a = '234202361 538145258415488147671 613621493620 538251 492671619262419197494605148094094094 653315 318671493490671197266 203494364'
code_b = '182482423319 309685 685309549490206325 475 538251266214541318 622605492619365209208 551150 271209377318375605156306424'

message_a = decode_message(code_a, ans_list)
message_b = decode_message(code_b, ans_list[::-1])

print(message_a)
print(message_b)

___ _______ ­µéǖ __ ____________ __ _______ ___
____ __ ______ # ______ _______ __ _________