4. Pandas - Grouping and Aggregation#

import pandas as pd

Data source: https://data.wa.gov/Transportation/Electric-Vehicle-Population-Data/f6w7-q2d2/data_preview

url = 'https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Electric_Vehicle_WA.csv'

ev_df = pd.read_csv(url)
ev_df.head()
VIN (1-10) County City State Postal Code Model Year Make Model Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility Electric Range Base MSRP Legislative District DOL Vehicle ID Vehicle Location Electric Utility 2020 Census Tract
0 WA1E2AFY8R Thurston Olympia WA 98512.0 2024 AUDI Q5 E Plug-in Hybrid Electric Vehicle (PHEV) Not eligible due to low battery range 23.0 0 22.0 263239938 POINT (-122.90787 46.9461) PUGET SOUND ENERGY INC 5.306701e+10
1 WAUUPBFF4J Yakima Wapato WA 98951.0 2018 AUDI A3 Plug-in Hybrid Electric Vehicle (PHEV) Not eligible due to low battery range 16.0 0 15.0 318160860 POINT (-120.42083 46.44779) PACIFICORP 5.307794e+10
2 1N4AZ0CP0F King Seattle WA 98125.0 2015 NISSAN LEAF Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 84.0 0 46.0 184963586 POINT (-122.30253 47.72656) CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA) 5.303300e+10
3 WA1VAAGE5K King Kent WA 98031.0 2019 AUDI E-TRON Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 204.0 0 11.0 259426821 POINT (-122.17743 47.41185) PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA) 5.303303e+10
4 7SAXCAE57N Snohomish Bothell WA 98021.0 2022 TESLA MODEL X Battery Electric Vehicle (BEV) Eligibility unknown as battery range has not b... 0.0 0 1.0 208182236 POINT (-122.18384 47.8031) PUGET SOUND ENERGY INC 5.306105e+10
ev_df.describe()
Postal Code Model Year Electric Range Legislative District DOL Vehicle ID 2020 Census Tract
count 264619.000000 264628.000000 264624.000000 263969.000000 2.646280e+05 2.646190e+05
mean 98170.770500 2021.842556 41.713159 28.862897 2.422540e+08 5.297109e+10
std 2637.717847 3.044095 80.377977 14.883356 6.516028e+07 1.638317e+09
min 1469.000000 1999.000000 0.000000 1.000000 4.385000e+03 1.001020e+09
25% 98052.000000 2021.000000 0.000000 17.000000 2.174474e+08 5.303301e+10
50% 98133.000000 2023.000000 0.000000 32.000000 2.603598e+08 5.303303e+10
75% 98382.000000 2024.000000 34.000000 42.000000 2.758921e+08 5.305307e+10
max 99577.000000 2026.000000 337.000000 49.000000 4.791150e+08 6.601095e+10

4.1. Groupby and Aggregate#

Often in data science, we want to partition our data set into groups based on some feature and compare statistics of those groups.

In pandas, groupby is used to split the data into groups based on the values in one or more columns. After grouping, you can apply aggregation (agg) functions to summarize or transform each group.

groupby works by:

  • Splitting: The data is split into groups using the values of specified columns.

  • Applying: An aggregation function (like sum, mean, count, etc.) is applied to each group.

  • Combining: The results are combined into a new DataFrame or Series.

And we can group by multiple features as well.

Example

# Group by 'County' and calculate the average Electric Range for each county
ev_df.groupby(by = 'County')['Electric Range'].mean()
# Group by 'County' and count the number of cars by each manufacturer
ev_df.groupby(by = ['County', 'Make'])['VIN (1-10)'].count()

4.1.1. Common Aggregation Functions#

  • sum(): Total of values in each group

  • mean(): Average of values in each group

  • count(): Number of items in each group

  • max(), min(): Maximum or minimum value in each group

4.1.2. Multiple Aggregations#

You can apply multiple aggregation functions at once using .agg():

ev_df.groupby('County').agg({
    'Electric Range': ['mean', 'max', 'min'],
    'Model Year': 'count'
})

This produces a summary table with the specified statistics for each group.

4.1.3. Use Cases#

  • Summarizing data by categories (e.g., average sales per region)

  • Counting occurrences (e.g., number of vehicles per manufacturer)

  • Finding maximum/minimum values within groups

ev_df.groupby(by = ['County', 'Make'])['VIN (1-10)'].count()
County  Make      
Ada     TESLA          2
Adams   BMW            4
        CADILLAC       3
        CHEVROLET      7
        CHRYSLER       1
                      ..
Yakima  VOLKSWAGEN    31
        VOLVO         31
York    TESLA          2
        VOLKSWAGEN     1
Yuba    TESLA          1
Name: VIN (1-10), Length: 1543, dtype: int64
ev_df.groupby(by = 'County')['Electric Range'].describe()
count mean std min 25% 50% 75% max
County
Ada 2.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
Adams 109.0 52.504587 93.387348 0.0 0.0 0.0 39.0 322.0
Alameda 10.0 2.100000 6.640783 0.0 0.0 0.0 0.0 21.0
Albemarle 2.0 104.000000 147.078210 0.0 52.0 104.0 156.0 208.0
Alexandria 7.0 86.142857 138.338267 0.0 0.0 0.0 136.5 330.0
... ... ... ... ... ... ... ... ...
Whitman 536.0 36.694030 67.823408 0.0 0.0 0.0 39.0 322.0
Williamson 5.0 48.600000 96.331719 0.0 0.0 0.0 23.0 220.0
Yakima 1791.0 49.169179 86.251936 0.0 0.0 0.0 39.5 330.0
York 3.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
Yuba 1.0 0.000000 NaN 0.0 0.0 0.0 0.0 0.0

239 rows × 8 columns

ev_df.groupby(by = 'County')['Electric Range'].mean()
County
Ada             0.000000
Adams          52.504587
Alameda         2.100000
Albemarle     104.000000
Alexandria     86.142857
                 ...    
Whitman        36.694030
Williamson     48.600000
Yakima         49.169179
York            0.000000
Yuba            0.000000
Name: Electric Range, Length: 239, dtype: float64
ev_df.groupby('County').agg({
    'Electric Range': ['mean', 'max', 'min'],
    'Model Year': 'count'
})
Electric Range Model Year
mean max min count
County
Ada 0.000000 0.0 0.0 2
Adams 52.504587 322.0 0.0 109
Alameda 2.100000 21.0 0.0 10
Albemarle 104.000000 208.0 0.0 2
Alexandria 86.142857 330.0 0.0 7
... ... ... ... ...
Whitman 36.694030 322.0 0.0 536
Williamson 48.600000 220.0 0.0 5
Yakima 49.169179 330.0 0.0 1791
York 0.000000 0.0 0.0 3
Yuba 0.000000 0.0 0.0 1

239 rows × 4 columns

4.2. Useful Pandas Functions for Sorting, Counting, and Finding Maxima/Minima in Data#

  • Sorting Values: Sort the DataFrame by a column (e.g., Electric Range):

ev_df.sort_values(by='Electric Range', ascending=False).head()
  • Counting Instances: Count the number of vehicles by manufacturer:

ev_df['Make'].value_counts()
  • Counting in Groups: Count the number of vehicles by county and model:

is this outdated ?

ev_df.groupby(['County', 'Model']).size()

Finding Max Value: Find the maximum value in a column:

ev_df['Electric Range'].max()
  • Finding the Data Row with the Max Value: Find the row with the maximum electric range:

ev_df.loc[ev_df['Electric Range'].idxmax()]
ev_df.sort_values(by='Electric Range', ascending=False).head(1)
VIN (1-10) County City State Postal Code Model Year Make Model Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility Electric Range Base MSRP Legislative District DOL Vehicle ID Vehicle Location Electric Utility 2020 Census Tract
109226 5YJSA1E48L Pierce Tacoma WA 98422.0 2020 TESLA MODEL S Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 337.0 0 27.0 128568478 POINT (-122.39471 47.28159) BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM... 5.305394e+10

4.3. Crack the Code!#

  • a What is the maximum Electric Range of the vehicles in this dataset?

  • b How many electric vehicles are registered in Snohomish County?

  • c How many Tesla Model Ys are in King County?

  • d For the Model Year that has the most registered vehicles, how many vehicles are there?

  • e What is the highest average electric range for a postal code (there are six postal codes that are tied at first)?

  • f Of Chevrolet EVs, what is the count for the most popular model?

  • g Which Legislative District has the most EVs?

  • h What is the DOL Vehicle ID of the oldest car on this registry? There are two entries (likely a data entry mistake), pick the larger DOL Vehicle ID.

Hide code cell source
import numpy as np

def decode_word(codestr, ans):
    np.random.seed(ans)
    number = np.random.randint(0, 1000000000)
    
    numberstr = str(number)
    decoded = ''
    while codestr:
        code = int(codestr[0:3])
        num = numberstr[0]     

        char = chr(code - (int(num)+1)*ord(num))
        decoded += char
        
        try:
            codestr = codestr[3:]
        except:
            pass
        
        if len(numberstr) == 1:
            continue
        else:
            numberstr = numberstr[1:]
        
    return decoded

def decode_message(encoded_message, ans_list):
    code_list = encoded_message.split()
    word_list = []
    for code, ans in zip(code_list, ans_list):
        try:
            word_list.append(decode_word(code, ans))
        except:
            word_list.append('_' * int(len(code)/3))
            
    return ' '.join(word_list)
a = 11
b = 11
c = 11
d = 11
e = 11
f = 11
g = 11
h = 11

ans_list = [a, b, c, d, e, f, g, h]

code_a = '234202361 538145258415488147671 613621493620 538251 492671619262419197494605148094094094 653315 318671493490671197266 203494364'
code_b = '182482423319 309685 685309549490206325 475 538251266214541318 622605492619365209208 551150 271209377318375605156306424'

message_a = decode_message(code_a, ans_list)
message_b = decode_message(code_b, ans_list[::-1])

print(message_a)
print(message_b)
___ _______ ­µéǖ __ ____________ __ _______ ___
____ __ ______ # ______ _______ __ _________