6. Pandas Cheat Sheet#

Here’s an amazingly comprehensive Pandas cheat sheet. This includes many operations we haven’t covered in class.

import pandas as pd
import numpy as np

# Creating a dataframe from a dictionary
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['x', 'y', 'z', 'w'],
    'C': [10, 20, 30, 40],
    'D': [-40, -30, np.nan, -10]
})

# Importing a dataframe from a csv
fish_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Fish.csv')

# Writing a dataframe to a .csv file
df.to_csv('test_data.csv')

6.1. Initial Inspection#

# Get statistics on columns
df.describe()

# See data types and number of values by column
df.info()

# Count instances of values in a column
df.value_counts()

# See the first/last n rows (5 by default)
df.head(2)
df.tail(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      int64  
 1   B       4 non-null      object 
 2   C       4 non-null      int64  
 3   D       3 non-null      float64
dtypes: float64(1), int64(2), object(1)
memory usage: 260.0+ bytes
A B C D
2 3 z 30 NaN
3 4 w 40 -10.0

6.2. Row/Column selection and slicing#

  • iloc: Select by integer-location based indexing

  • loc: Select by label/index

  • fillna: Fill missing values

  • dropna: Remove missing values

df['A']           # Select column 'A'
df[['A', 'B']]    # Select multiple columns

# iloc (integer-location based indexing)
df.iloc[0]        # First row (as Series)
df.iloc[0:2]      # First two rows
df.iloc[:, 1]     # All rows, column at index 1 ('B')
df.iloc[1, 2]     # Row 1, column 2 (value 20)

# loc (row index and column name based location)
df.loc[1, 'A':'C']

# fill missing values, drop missing values
df.fillna(value = 0)
df.dropna(axis = 'index')
A B C D
0 1 x 10 -40.0
1 2 y 20 -30.0
3 4 w 40 -10.0

6.4. Grouping and aggregation#

  • groupby: Group DataFrame using a mapper or by a Series of columns

  • agg: Aggregate using one or more operations over the specified axis

# groupby
df.groupby('B')['C'].mean()   # Mean of 'C' grouped by 'B'
df.groupby('B').agg({'A': 'sum', 'C': 'max'})  # Multiple aggregations
A C
B
w 4 40
x 1 10
y 2 20
z 3 30

6.5. Merging#

# merge (joining DataFrames)
df2 = pd.DataFrame({'B': ['x', 'y'], 'D': [100, 200]})
pd.merge(df, df2, on='B', how='inner')   # Inner join on column 'B'
pd.merge(df, df2, on='B', how='left')    # Left join
A B C D_x D_y
0 1 x 10 -40.0 100.0
1 2 y 20 -30.0 200.0
2 3 z 30 NaN NaN
3 4 w 40 -10.0 NaN

6.6. Reordering and Renaming Columns, Replacing Values, Resetting Indices#

  • set_index: Set the DataFrame index using existing columns

  • reset_index: Reset the index of the DataFrame

  • rename: Alter axes labels

  • map: Map values of a Series according to an input mapping

# Reordering/removing columns
new_column_order = ['B', 'A', 'C']
df = df[new_column_order]

# Setting index
df.set_index('B')              # Set column 'B' as index

# Resetting index (useful after query or dropna)
df.reset_index(drop=True)      # Reset index to default integer index
B A C
0 x 1 10
1 y 2 20
2 z 3 30
3 w 4 40
# Renaming columns
df.rename(columns={'A': 'Alpha', 'B': 'Beta', 'C': 'Gamma'})

# Replacing variables
new_vals = {'x': 'apple', 'y': 'banana', 'z': 'cherry', 'w': 'pineapple'}
df['B_mapped'] = df['B'].map(new_vals)

6.7. Applying functions to columns#

  • apply: apply a function to the column of a dataframe

def double(x):
    return x*2

triple = lambda x: 3*x

# Applying function to a column
df['A_doubled'] = df['A'].apply(double) # apply a function defined above
df['A_tripled'] = df['A'].apply(triple) # apply a lambda function defined above
df['A_squared'] = df['A'].apply(lambda x: 4**2) # apply a lambda function defined on the fly
df['A_root'] = df['A'].apply(np.sqrt) # apply a pre-defined function