4. Pandas - Grouping and Aggregation#
import pandas as pd
Data source: https://data.wa.gov/Transportation/Electric-Vehicle-Population-Data/f6w7-q2d2/data_preview
url = 'https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Electric_Vehicle_WA.csv'
ev_df = pd.read_csv(url)
ev_df.head()
| VIN (1-10) | County | City | State | Postal Code | Model Year | Make | Model | Electric Vehicle Type | Clean Alternative Fuel Vehicle (CAFV) Eligibility | Electric Range | Base MSRP | Legislative District | DOL Vehicle ID | Vehicle Location | Electric Utility | 2020 Census Tract | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WA1E2AFY8R | Thurston | Olympia | WA | 98512.0 | 2024 | AUDI | Q5 E | Plug-in Hybrid Electric Vehicle (PHEV) | Not eligible due to low battery range | 23.0 | 0 | 22.0 | 263239938 | POINT (-122.90787 46.9461) | PUGET SOUND ENERGY INC | 5.306701e+10 |
| 1 | WAUUPBFF4J | Yakima | Wapato | WA | 98951.0 | 2018 | AUDI | A3 | Plug-in Hybrid Electric Vehicle (PHEV) | Not eligible due to low battery range | 16.0 | 0 | 15.0 | 318160860 | POINT (-120.42083 46.44779) | PACIFICORP | 5.307794e+10 |
| 2 | 1N4AZ0CP0F | King | Seattle | WA | 98125.0 | 2015 | NISSAN | LEAF | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 84.0 | 0 | 46.0 | 184963586 | POINT (-122.30253 47.72656) | CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA) | 5.303300e+10 |
| 3 | WA1VAAGE5K | King | Kent | WA | 98031.0 | 2019 | AUDI | E-TRON | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 204.0 | 0 | 11.0 | 259426821 | POINT (-122.17743 47.41185) | PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA) | 5.303303e+10 |
| 4 | 7SAXCAE57N | Snohomish | Bothell | WA | 98021.0 | 2022 | TESLA | MODEL X | Battery Electric Vehicle (BEV) | Eligibility unknown as battery range has not b... | 0.0 | 0 | 1.0 | 208182236 | POINT (-122.18384 47.8031) | PUGET SOUND ENERGY INC | 5.306105e+10 |
ev_df.describe()
| Postal Code | Model Year | Electric Range | Legislative District | DOL Vehicle ID | 2020 Census Tract | |
|---|---|---|---|---|---|---|
| count | 264619.000000 | 264628.000000 | 264624.000000 | 263969.000000 | 2.646280e+05 | 2.646190e+05 |
| mean | 98170.770500 | 2021.842556 | 41.713159 | 28.862897 | 2.422540e+08 | 5.297109e+10 |
| std | 2637.717847 | 3.044095 | 80.377977 | 14.883356 | 6.516028e+07 | 1.638317e+09 |
| min | 1469.000000 | 1999.000000 | 0.000000 | 1.000000 | 4.385000e+03 | 1.001020e+09 |
| 25% | 98052.000000 | 2021.000000 | 0.000000 | 17.000000 | 2.174474e+08 | 5.303301e+10 |
| 50% | 98133.000000 | 2023.000000 | 0.000000 | 32.000000 | 2.603598e+08 | 5.303303e+10 |
| 75% | 98382.000000 | 2024.000000 | 34.000000 | 42.000000 | 2.758921e+08 | 5.305307e+10 |
| max | 99577.000000 | 2026.000000 | 337.000000 | 49.000000 | 4.791150e+08 | 6.601095e+10 |
4.1. Groupby and Aggregate#
Often in data science, we want to partition our data set into groups based on some feature and compare statistics of those groups.
In pandas, groupby is used to split the data into groups based on the values in one or more columns. After grouping, you can apply aggregation (agg) functions to summarize or transform each group.
groupby works by:
Splitting: The data is split into groups using the values of specified columns.
Applying: An aggregation function (like
sum,mean,count, etc.) is applied to each group.Combining: The results are combined into a new DataFrame or Series.
And we can group by multiple features as well.
Example
# Group by 'County' and calculate the average Electric Range for each county
ev_df.groupby(by = 'County')['Electric Range'].mean()
# Group by 'County' and count the number of cars by each manufacturer
ev_df.groupby(by = ['County', 'Make'])['VIN (1-10)'].count()
4.1.1. Common Aggregation Functions#
sum(): Total of values in each groupmean(): Average of values in each groupcount(): Number of items in each groupmax(),min(): Maximum or minimum value in each group
4.1.2. Multiple Aggregations#
You can apply multiple aggregation functions at once using .agg():
ev_df.groupby('County').agg({
'Electric Range': ['mean', 'max', 'min'],
'Model Year': 'count'
})
This produces a summary table with the specified statistics for each group.
4.1.3. Use Cases#
Summarizing data by categories (e.g., average sales per region)
Counting occurrences (e.g., number of vehicles per manufacturer)
Finding maximum/minimum values within groups
ev_df.groupby(by = ['County', 'Make'])['VIN (1-10)'].count()
County Make
Ada TESLA 2
Adams BMW 4
CADILLAC 3
CHEVROLET 7
CHRYSLER 1
..
Yakima VOLKSWAGEN 31
VOLVO 31
York TESLA 2
VOLKSWAGEN 1
Yuba TESLA 1
Name: VIN (1-10), Length: 1543, dtype: int64
ev_df.groupby(by = 'County')['Electric Range'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| County | ||||||||
| Ada | 2.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Adams | 109.0 | 52.504587 | 93.387348 | 0.0 | 0.0 | 0.0 | 39.0 | 322.0 |
| Alameda | 10.0 | 2.100000 | 6.640783 | 0.0 | 0.0 | 0.0 | 0.0 | 21.0 |
| Albemarle | 2.0 | 104.000000 | 147.078210 | 0.0 | 52.0 | 104.0 | 156.0 | 208.0 |
| Alexandria | 7.0 | 86.142857 | 138.338267 | 0.0 | 0.0 | 0.0 | 136.5 | 330.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Whitman | 536.0 | 36.694030 | 67.823408 | 0.0 | 0.0 | 0.0 | 39.0 | 322.0 |
| Williamson | 5.0 | 48.600000 | 96.331719 | 0.0 | 0.0 | 0.0 | 23.0 | 220.0 |
| Yakima | 1791.0 | 49.169179 | 86.251936 | 0.0 | 0.0 | 0.0 | 39.5 | 330.0 |
| York | 3.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Yuba | 1.0 | 0.000000 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
239 rows × 8 columns
ev_df.groupby(by = 'County')['Electric Range'].mean()
County
Ada 0.000000
Adams 52.504587
Alameda 2.100000
Albemarle 104.000000
Alexandria 86.142857
...
Whitman 36.694030
Williamson 48.600000
Yakima 49.169179
York 0.000000
Yuba 0.000000
Name: Electric Range, Length: 239, dtype: float64
ev_df.groupby('County').agg({
'Electric Range': ['mean', 'max', 'min'],
'Model Year': 'count'
})
| Electric Range | Model Year | |||
|---|---|---|---|---|
| mean | max | min | count | |
| County | ||||
| Ada | 0.000000 | 0.0 | 0.0 | 2 |
| Adams | 52.504587 | 322.0 | 0.0 | 109 |
| Alameda | 2.100000 | 21.0 | 0.0 | 10 |
| Albemarle | 104.000000 | 208.0 | 0.0 | 2 |
| Alexandria | 86.142857 | 330.0 | 0.0 | 7 |
| ... | ... | ... | ... | ... |
| Whitman | 36.694030 | 322.0 | 0.0 | 536 |
| Williamson | 48.600000 | 220.0 | 0.0 | 5 |
| Yakima | 49.169179 | 330.0 | 0.0 | 1791 |
| York | 0.000000 | 0.0 | 0.0 | 3 |
| Yuba | 0.000000 | 0.0 | 0.0 | 1 |
239 rows × 4 columns
4.2. Useful Pandas Functions for Sorting, Counting, and Finding Maxima/Minima in Data#
Sorting Values: Sort the DataFrame by a column (e.g., Electric Range):
ev_df.sort_values(by='Electric Range', ascending=False).head()
Counting Instances: Count the number of vehicles by manufacturer:
ev_df['Make'].value_counts()
Counting in Groups: Count the number of vehicles by county and model:
is this outdated ?
ev_df.groupby(['County', 'Model']).size()
Finding Max Value: Find the maximum value in a column:
ev_df['Electric Range'].max()
Finding the Data Row with the Max Value: Find the row with the maximum electric range:
ev_df.loc[ev_df['Electric Range'].idxmax()]
ev_df.sort_values(by='Electric Range', ascending=False).head(1)
| VIN (1-10) | County | City | State | Postal Code | Model Year | Make | Model | Electric Vehicle Type | Clean Alternative Fuel Vehicle (CAFV) Eligibility | Electric Range | Base MSRP | Legislative District | DOL Vehicle ID | Vehicle Location | Electric Utility | 2020 Census Tract | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 109226 | 5YJSA1E48L | Pierce | Tacoma | WA | 98422.0 | 2020 | TESLA | MODEL S | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 337.0 | 0 | 27.0 | 128568478 | POINT (-122.39471 47.28159) | BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM... | 5.305394e+10 |
4.3. Crack the Code!#
a What is the maximum Electric Range of the vehicles in this dataset?
b How many electric vehicles are registered in Snohomish County?
c How many Tesla Model Ys are in King County?
d For the Model Year that has the most registered vehicles, how many vehicles are there?
e What is the highest average electric range for a postal code (there are six postal codes that are tied at first)?
f Of Chevrolet EVs, what is the count for the most popular model?
g Which Legislative District has the most EVs?
h What is the DOL Vehicle ID of the oldest car on this registry? There are two entries (likely a data entry mistake), pick the larger DOL Vehicle ID.
Show code cell source
import numpy as np
def decode_word(codestr, ans):
np.random.seed(ans)
number = np.random.randint(0, 1000000000)
numberstr = str(number)
decoded = ''
while codestr:
code = int(codestr[0:3])
num = numberstr[0]
char = chr(code - (int(num)+1)*ord(num))
decoded += char
try:
codestr = codestr[3:]
except:
pass
if len(numberstr) == 1:
continue
else:
numberstr = numberstr[1:]
return decoded
def decode_message(encoded_message, ans_list):
code_list = encoded_message.split()
word_list = []
for code, ans in zip(code_list, ans_list):
try:
word_list.append(decode_word(code, ans))
except:
word_list.append('_' * int(len(code)/3))
return ' '.join(word_list)
a = 11
b = 11
c = 11
d = 11
e = 11
f = 11
g = 11
h = 11
ans_list = [a, b, c, d, e, f, g, h]
code_a = '234202361 538145258415488147671 613621493620 538251 492671619262419197494605148094094094 653315 318671493490671197266 203494364'
code_b = '182482423319 309685 685309549490206325 475 538251266214541318 622605492619365209208 551150 271209377318375605156306424'
message_a = decode_message(code_a, ans_list)
message_b = decode_message(code_b, ans_list[::-1])
print(message_a)
print(message_b)
___ _______ µéǖ __ ____________ __ _______ ___
____ __ ______ # ______ _______ __ _________