6. Pandas Cheat Sheet#
Here’s an amazingly comprehensive Pandas cheat sheet. This includes many operations we haven’t covered in class.
import pandas as pd
import numpy as np
# Creating a dataframe from a dictionary
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': ['x', 'y', 'z', 'w'],
'C': [10, 20, 30, 40],
'D': [-40, -30, np.nan, -10]
})
# Importing a dataframe from a csv
fish_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/Fish.csv')
# Writing a dataframe to a .csv file
df.to_csv('test_data.csv')
6.1. Initial Inspection#
# Get statistics on columns
df.describe()
# See data types and number of values by column
df.info()
# Count instances of values in a column
df.value_counts()
# See the first/last n rows (5 by default)
df.head(2)
df.tail(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null int64
1 B 4 non-null object
2 C 4 non-null int64
3 D 3 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 260.0+ bytes
| A | B | C | D | |
|---|---|---|---|---|
| 2 | 3 | z | 30 | NaN |
| 3 | 4 | w | 40 | -10.0 |
6.2. Row/Column selection and slicing#
iloc: Select by integer-location based indexingloc: Select by label/indexfillna: Fill missing valuesdropna: Remove missing values
df['A'] # Select column 'A'
df[['A', 'B']] # Select multiple columns
# iloc (integer-location based indexing)
df.iloc[0] # First row (as Series)
df.iloc[0:2] # First two rows
df.iloc[:, 1] # All rows, column at index 1 ('B')
df.iloc[1, 2] # Row 1, column 2 (value 20)
# loc (row index and column name based location)
df.loc[1, 'A':'C']
# fill missing values, drop missing values
df.fillna(value = 0)
df.dropna(axis = 'index')
| A | B | C | D | |
|---|---|---|---|---|
| 0 | 1 | x | 10 | -40.0 |
| 1 | 2 | y | 20 | -30.0 |
| 3 | 4 | w | 40 | -10.0 |
6.3. Filter/Search#
query: Query the columns of a DataFrame with a boolean expression
threshold = 1.5
# query (boolean expression)
df.query('A > @threshold') # Rows where column 'A' is greater than the variable threshold, notice the @
df.query('A > 2') # Rows where column 'A' > 2
df.query('B == "y"') # Rows where column 'B' is 'y'
| A | B | C | D | |
|---|---|---|---|---|
| 1 | 2 | y | 20 | -30.0 |
6.4. Grouping and aggregation#
groupby: Group DataFrame using a mapper or by a Series of columnsagg: Aggregate using one or more operations over the specified axis
# groupby
df.groupby('B')['C'].mean() # Mean of 'C' grouped by 'B'
df.groupby('B').agg({'A': 'sum', 'C': 'max'}) # Multiple aggregations
| A | C | |
|---|---|---|
| B | ||
| w | 4 | 40 |
| x | 1 | 10 |
| y | 2 | 20 |
| z | 3 | 30 |
6.5. Merging#
# merge (joining DataFrames)
df2 = pd.DataFrame({'B': ['x', 'y'], 'D': [100, 200]})
pd.merge(df, df2, on='B', how='inner') # Inner join on column 'B'
pd.merge(df, df2, on='B', how='left') # Left join
| A | B | C | D_x | D_y | |
|---|---|---|---|---|---|
| 0 | 1 | x | 10 | -40.0 | 100.0 |
| 1 | 2 | y | 20 | -30.0 | 200.0 |
| 2 | 3 | z | 30 | NaN | NaN |
| 3 | 4 | w | 40 | -10.0 | NaN |
6.6. Reordering and Renaming Columns, Replacing Values, Resetting Indices#
set_index: Set the DataFrame index using existing columnsreset_index: Reset the index of the DataFramerename: Alter axes labelsmap: Map values of a Series according to an input mapping
# Reordering/removing columns
new_column_order = ['B', 'A', 'C']
df = df[new_column_order]
# Setting index
df.set_index('B') # Set column 'B' as index
# Resetting index (useful after query or dropna)
df.reset_index(drop=True) # Reset index to default integer index
| B | A | C | |
|---|---|---|---|
| 0 | x | 1 | 10 |
| 1 | y | 2 | 20 |
| 2 | z | 3 | 30 |
| 3 | w | 4 | 40 |
# Renaming columns
df.rename(columns={'A': 'Alpha', 'B': 'Beta', 'C': 'Gamma'})
# Replacing variables
new_vals = {'x': 'apple', 'y': 'banana', 'z': 'cherry', 'w': 'pineapple'}
df['B_mapped'] = df['B'].map(new_vals)
6.7. Applying functions to columns#
apply: apply a function to the column of a dataframe
def double(x):
return x*2
triple = lambda x: 3*x
# Applying function to a column
df['A_doubled'] = df['A'].apply(double) # apply a function defined above
df['A_tripled'] = df['A'].apply(triple) # apply a lambda function defined above
df['A_squared'] = df['A'].apply(lambda x: 4**2) # apply a lambda function defined on the fly
df['A_root'] = df['A'].apply(np.sqrt) # apply a pre-defined function