Pandas Cheat Sheet

6. Pandas Cheat Sheet#

Here’s an amazingly comprehensive Pandas cheat sheet. This includes many operations we haven’t covered in class.

import pandas as pd
import numpy as np

# Creating a dataframe from a dictionary
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['x', 'y', 'z', 'w'],
    'C': [10, 20, 30, 40],
    'D': [-40, -30, np.nan, -10]
})

# Importing a dataframe from a csv
fish_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Fish.csv')

# Writing a dataframe to a .csv file
df.to_csv('test_data.csv')

6.1. Initial Inspection#

# Get statistics on columns
df.describe()

# See data types and number of values by column
df.info()

# Count instances of values in a column
df.value_counts()

# See the first/last n rows (5 by default)
df.head(2)
df.tail(2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      int64  
 1   B       4 non-null      object 
 2   C       4 non-null      int64  
 3   D       3 non-null      float64
dtypes: float64(1), int64(2), object(1)
memory usage: 260.0+ bytes

	A	B	C	D
2	3	z	30	NaN
3	4	w	40	-10.0

6.2. Row/Column selection and slicing#

iloc: Select by integer-location based indexing
loc: Select by label/index
fillna: Fill missing values
dropna: Remove missing values

df['A']           # Select column 'A'
df[['A', 'B']]    # Select multiple columns

# iloc (integer-location based indexing)
df.iloc[0]        # First row (as Series)
df.iloc[0:2]      # First two rows
df.iloc[:, 1]     # All rows, column at index 1 ('B')
df.iloc[1, 2]     # Row 1, column 2 (value 20)

# loc (row index and column name based location)
df.loc[1, 'A':'C']

# fill missing values, drop missing values
df.fillna(value = 0)
df.dropna(axis = 'index')

	A	B	C	D
0	1	x	10	-40.0
1	2	y	20	-30.0
3	4	w	40	-10.0

6.3. Filter/Search#

query: Query the columns of a DataFrame with a boolean expression

threshold = 1.5

# query (boolean expression)
df.query('A > @threshold')    # Rows where column 'A' is greater than the variable threshold, notice the @
df.query('A > 2')             # Rows where column 'A' > 2
df.query('B == "y"')          # Rows where column 'B' is 'y'

	A	B	C	D
1	2	y	20	-30.0

6.4. Grouping and aggregation#

groupby: Group DataFrame using a mapper or by a Series of columns
agg: Aggregate using one or more operations over the specified axis

# groupby
df.groupby('B')['C'].mean()   # Mean of 'C' grouped by 'B'
df.groupby('B').agg({'A': 'sum', 'C': 'max'})  # Multiple aggregations

	A	C
B
w	4	40
x	1	10
y	2	20
z	3	30

6.5. Merging#

# merge (joining DataFrames)
df2 = pd.DataFrame({'B': ['x', 'y'], 'D': [100, 200]})
pd.merge(df, df2, on='B', how='inner')   # Inner join on column 'B'
pd.merge(df, df2, on='B', how='left')    # Left join

	A	B	C	D_x	D_y
0	1	x	10	-40.0	100.0
1	2	y	20	-30.0	200.0
2	3	z	30	NaN	NaN
3	4	w	40	-10.0	NaN

6.6. Reordering and Renaming Columns, Replacing Values, Resetting Indices#

set_index: Set the DataFrame index using existing columns
reset_index: Reset the index of the DataFrame
rename: Alter axes labels
map: Map values of a Series according to an input mapping

# Reordering/removing columns
new_column_order = ['B', 'A', 'C']
df = df[new_column_order]

# Setting index
df.set_index('B')              # Set column 'B' as index

# Resetting index (useful after query or dropna)
df.reset_index(drop=True)      # Reset index to default integer index

	B	A	C
0	x	1	10
1	y	2	20
2	z	3	30
3	w	4	40

# Renaming columns
df.rename(columns={'A': 'Alpha', 'B': 'Beta', 'C': 'Gamma'})

# Replacing variables
new_vals = {'x': 'apple', 'y': 'banana', 'z': 'cherry', 'w': 'pineapple'}
df['B_mapped'] = df['B'].map(new_vals)

6.7. Applying functions to columns#

apply: apply a function to the column of a dataframe

def double(x):
    return x*2

triple = lambda x: 3*x

# Applying function to a column
df['A_doubled'] = df['A'].apply(double) # apply a function defined above
df['A_tripled'] = df['A'].apply(triple) # apply a lambda function defined above
df['A_squared'] = df['A'].apply(lambda x: 4**2) # apply a lambda function defined on the fly
df['A_root'] = df['A'].apply(np.sqrt) # apply a pre-defined function

Pandas Cheat Sheet

Contents

6. Pandas Cheat Sheet#

6.1. Initial Inspection#

6.2. Row/Column selection and slicing#

6.3. Filter/Search#

6.4. Grouping and aggregation#

6.5. Merging#

6.6. Reordering and Renaming Columns, Replacing Values, Resetting Indices#

6.7. Applying functions to columns#