1. Pandas#

1.1. Pandas Dataframe#

Pandas is a python package for importing, inspecting, cleaning, aggregating, transforming, and visualizing data. You import pandas as:

import pandas as pd

1.2. Creating a dataframe#

Pandas introduces a new datatype, the DataFrame.

The dataframe is a two dimensional container that organizes data into rows (samples) and columns (features) much like a spreadsheet.

We can create a dataframe using a dictionary. The keys of the dictionary will be the columns of the dataframe and the values will be the the row values. So, the values in our dictionary should be lists or numpy arrays.

For example, let’s extend our earning dictionary from above. For each key, replace the value with the following lists:

  • month - [‘Sept’, ‘Oct’, ‘Nov’]

  • gross - [22000, 24500, 18000]

  • expenses - [16000, 16000, 8000]

  • net - (gross - expenses)

Once we have the dict, we can create the dataframe using

earnings_df = pd.DataFrame(earnings)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
earnings_dict = dict(
    month = ['Sept', 'Oct', 'Nov'],
    gross = [22000, 24500, 18000],
    expenses = [16000, 16000, 8000]
)

earnings_dict
{'month': ['Sept', 'Oct', 'Nov'],
 'gross': [22000, 24500, 18000],
 'expenses': [16000, 16000, 8000]}

We can convert the dictionary into a dataframe using pd.DataFrame(). The keys become the column headers and the associated values become…the values.

A note of caution, to convert a dictionary into a dataframe, the value lists for each column must be the same length.

earnings_df = pd.DataFrame(earnings_dict)

earnings_df
month gross expenses
0 Sept 22000 16000
1 Oct 24500 16000
2 Nov 18000 8000

And like a dictionary, we can append new columns on the fly.

earnings_df['net'] = earnings_df['gross']- earnings_df['expenses']
earnings_df
month gross expenses net
0 Sept 22000 16000 6000
1 Oct 24500 16000 8500
2 Nov 18000 8000 10000

1.3. Calculations on Columns#

When extracting a column, you can use numpy functions to get different measures of those data:

  • .mean()

  • .std()

  • .min()

  • .max()

  • .sum()

  • .unique()

earnings_df['gross']
0    22000
1    24500
2    18000
Name: gross, dtype: int64
earnings_df['net'] = earnings_df['gross'] - earnings_df['expenses']

fig, ax = plt.subplots(1,1, figsize = (5,5))

ax.bar(earnings_df['month'], earnings_df['net'], label = 'net')
ax.bar(earnings_df['month'], earnings_df['expenses'],
              bottom = earnings_df['net'],
              label = 'expenses')

ax.set_ylabel('Amount ($)')
plt.legend()
plt.title('Net and Expenses by Month')
plt.show()
../_images/c7f431af05e9595e8507c387a764c6363cf1677caf204ec51eead45a252044bf.png

Pandas has its own custom plotting commands, but its built on matplotlib and the functionality is the same.

fig, ax = plt.subplots(1,1, figsize = (5,5))

earnings_df.plot(
    x='month',
    y=['net', 'expenses'],
    kind='bar',
    stacked=True,
    ax = ax
)

ax.set_ylabel('Amount ($)')
plt.show()
../_images/3335e4d5de20dd580d13d14dfe61f7f612d17fa2921478ee632e70e3d6f7068f.png

1.4. Importing and Inspecting Data#

Most of the time, we’ll be using data that exist outside of Python, either data that somebody else has collected or data that we’ve collected and saved to some other kind of file (.csv, .json, etc). We’ll import data into the dataframe using:

df = pd.read_csv(data.csv)

Where data.csv could be a filepath on your computer or a url to a file that exists online. There are other commands that allow pandas to import a variety of data filetypes.

Once a file is imported, we can inspect the data using:

  • df.head(n) - Look at the first n rows (5 rows by default, if you omit n)

  • df.tail(n) - Look at the last n rows

  • df.describe() - Get information and statistics of the data

  • df.columns - Get a list of feature names

We’ll import and inspect the Marvel Cinematic Universe (MCU) Movies & Series dataset, compiled by Stephen Tracy and published on Kaggle: https://www.kaggle.com/datasets/stevetracy/marvel-cinematic-universe-mcu-movies-and-series

1.5. Example: MCU dataset#

mcu_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/mcu_data_apr_2024.csv')
mcu_df.head()
movie_title Type mcu_phase_number mcu_phase_text release_date release_year rt_tomato_meter rt_audience_score imdb-rating metacritic_metascore metacritic_userscore production_budget domestic_box_office international_box_office worldwide_box_office
0 Iron Man Movie 1.0 MCU Phase 1 5/2/08 2008.0 94.0 91.0 7.9 79.0 8.6 $140,000,000 $319,034,126 $266,762,121 $585,796,247
1 The Incredible Hulk Movie 1.0 MCU Phase 1 6/13/08 2008.0 67.0 70.0 6.6 61.0 6.8 $140,000,000 $134,806,913 $128,620,638 $263,427,551
2 Iron Man 2 Movie 1.0 MCU Phase 1 5/7/10 2010.0 72.0 71.0 6.9 57.0 6.4 $200,000,000 $312,433,331 $311,500,000 $623,933,331
3 Thor Movie 1.0 MCU Phase 1 5/6/11 2011.0 77.0 76.0 7.0 57.0 7.0 $150,000,000 $181,030,624 $268,295,994 $449,326,618
4 Captain America: The First Avenger Movie 1.0 MCU Phase 1 7/22/11 2011.0 79.0 75.0 6.9 66.0 6.8 $140,000,000 $176,654,505 $193,915,269 $370,569,774

1.5.1. Data Cleaning#

The data have three main issues:

  • The last two rows are missing all data. Easy fix.

  • Some of the headings have leading and trailing spaces (e.g. ‘ worldwide_box_office ‘). Annoying.

  • All the budget and revenue columns list the dollar amounts as strings with $ and commas. Annnoyyyyying.

For now, I’m just cleaning the data for you. We’ll cover these tools later.

1.5.1.1. Dropping bad data#

Data can be ‘bad’ for numerous reasons, but most immediately obvious are missing data. We drop missing data with df.dropna (dropna documentation)

mcu_df.dropna(axis=0, inplace=True)
mcu_df.tail()
movie_title Type mcu_phase_number mcu_phase_text release_date release_year rt_tomato_meter rt_audience_score imdb-rating metacritic_metascore metacritic_userscore production_budget domestic_box_office international_box_office worldwide_box_office
38 Ant-Man and the Wasp: Quantumania Movie 5.0 MCU Phase 5 2/17/23 2023.0 46.0 82.0 6.1 48.0 5.5 $200,000,000 $214,504,909 $261,566,271 $476,071,180
39 Guardians of the Galaxy Vol. 3 Movie 5.0 MCU Phase 5 5/5/23 2023.0 82.0 94.0 7.9 64.0 7.9 $250,000,000 $358,995,815 $486,559,962 $845,555,777
40 Secret Invasion TV Show 5.0 MCU Phase 5 6/21/23 2023.0 53.0 46.0 5.9 63.0 3.5 $211,000,000 - - -
41 The Marvels Movie 5.0 MCU Phase 5 11/10/23 2023.0 62.0 82.0 5.6 50.0 3.8 $270,000,000 $84,500,223 $121,612,709 $206,112,932
42 Echo TV Show 5.0 MCU Phase 5 1/9/24 2024.0 71.0 61.0 6.0 62.0 5.2 $40,000,000 - - -

1.5.1.2. Renaming Columns#

We can rename columns using the function .rename() which takes as input a dictionary. The keys of the dictionary are the current column headers and the values are the replacement headers.

rename_dict = {col:col.strip() for col in mcu_df.columns}
mcu_df.rename(columns = rename_dict, inplace=True)

1.5.1.3. Converting strings to numbers#

We can use the map function to modify entries of the DataFrame by some function.

def convert_dollars(entry):
    if entry == ' - ':
        return np.nan
    elif isinstance(entry, str):
        entry = entry.replace('$','').replace(',','')
        return int(entry)
    else:
        return(entry)

money_cols = ['production_budget', 'domestic_box_office','international_box_office', 'worldwide_box_office']
    
mcu_df[money_cols] = mcu_df[money_cols].map(convert_dollars)

1.5.1.4. Inspection#

  • What movies/tv shows are on the list?

  • What are the statistics of the numeric columns?

mcu_df['movie_title'].unique()
array(['Iron Man', 'The Incredible Hulk', 'Iron Man 2', 'Thor',
       'Captain America: The First Avenger', 'The Avengers', 'Iron Man 3',
       'Thor: The Dark World', 'Captain America: The Winter Soldier',
       'Guardians of the Galaxy', 'Avengers: Age of Ultron', 'Ant-Man',
       'Captain America: Civil War', 'Doctor Strange',
       'Guardians of the Galaxy Vol. 2', 'Spider-Man: Homecoming',
       'Thor: Ragnarok', 'Black Panther', 'Avengers: Infinity War',
       'Ant-Man and the Wasp', 'Captain Marvel', 'Avengers: Endgame',
       'Spider-Man: Far From Home', 'WandaVision',
       'The Falcon and the Winter Soldier', 'Loki', 'Black Widow',
       'What If…?', 'Shang-Chi and the Legend of the Ten Rings',
       'Eternals', 'Hawkeye', 'Spider-Man: No Way Home', 'Moon Knight',
       'Doctor Strange: Multiverse of Madness', 'Ms. Marvel',
       'Thor: Love and Thunder', 'She-Hulk',
       'Black Panther: Wakanda Forever',
       'Ant-Man and the Wasp: Quantumania',
       'Guardians of the Galaxy Vol. 3', 'Secret Invasion', 'The Marvels',
       'Echo'], dtype=object)
mcu_df.tail()
movie_title Type mcu_phase_number mcu_phase_text release_date release_year rt_tomato_meter rt_audience_score imdb-rating metacritic_metascore metacritic_userscore production_budget domestic_box_office international_box_office worldwide_box_office
38 Ant-Man and the Wasp: Quantumania Movie 5.0 MCU Phase 5 2/17/23 2023.0 46.0 82.0 6.1 48.0 5.5 200000000.0 214504909.0 261566271.0 476071180.0
39 Guardians of the Galaxy Vol. 3 Movie 5.0 MCU Phase 5 5/5/23 2023.0 82.0 94.0 7.9 64.0 7.9 250000000.0 358995815.0 486559962.0 845555777.0
40 Secret Invasion TV Show 5.0 MCU Phase 5 6/21/23 2023.0 53.0 46.0 5.9 63.0 3.5 211000000.0 NaN NaN NaN
41 The Marvels Movie 5.0 MCU Phase 5 11/10/23 2023.0 62.0 82.0 5.6 50.0 3.8 270000000.0 84500223.0 121612709.0 206112932.0
42 Echo TV Show 5.0 MCU Phase 5 1/9/24 2024.0 71.0 61.0 6.0 62.0 5.2 40000000.0 NaN NaN NaN
mcu_describe = mcu_df.describe()

Describe creates another dataframe and we can save this as a variable and interact with it.

1.6. Slicing dataframes#

  • df.COLUMN_NAME or df[COLUMN_NAME] or df.loc[COLUMN_NAME] - slice an individual column

  • df[LIST_OF_COLUMN_NAMES] or df.loc[LIST_OF_COLUMN_NAME]- slice multiple columns

  • df.iloc[ROW_INDICES]- slice multiple rows

How does Pandas know whether you meant to slice a column or a row?

You can ‘slice’ columns or rows from a dataframe using:

  • df[LIST_OF_COLUMN_NAMES].iloc[ROW_INDICES] - slice along columns and rows

Let’s get the movie_title, release_date, rt_audience_score, metacritic_userscore, production_budget, and worldwide_box_office for the last ten most recent entries (as of the time this list was made).

# A single column
mcu_df['movie_title']
0                                      Iron Man
1                           The Incredible Hulk
2                                    Iron Man 2
3                                          Thor
4            Captain America: The First Avenger
5                                  The Avengers
6                                    Iron Man 3
7                          Thor: The Dark World
8           Captain America: The Winter Soldier
9                       Guardians of the Galaxy
10                      Avengers: Age of Ultron
11                                      Ant-Man
12                   Captain America: Civil War
13                               Doctor Strange
14               Guardians of the Galaxy Vol. 2
15                       Spider-Man: Homecoming
16                               Thor: Ragnarok
17                                Black Panther
18                       Avengers: Infinity War
19                         Ant-Man and the Wasp
20                               Captain Marvel
21                            Avengers: Endgame
22                    Spider-Man: Far From Home
23                                  WandaVision
24            The Falcon and the Winter Soldier
25                                         Loki
26                                  Black Widow
27                                    What If…?
28    Shang-Chi and the Legend of the Ten Rings
29                                     Eternals
30                                      Hawkeye
31                      Spider-Man: No Way Home
32                                  Moon Knight
33        Doctor Strange: Multiverse of Madness
34                                   Ms. Marvel
35                       Thor: Love and Thunder
36                                     She-Hulk
37               Black Panther: Wakanda Forever
38            Ant-Man and the Wasp: Quantumania
39               Guardians of the Galaxy Vol. 3
40                              Secret Invasion
41                                  The Marvels
42                                         Echo
Name: movie_title, dtype: object
# Mulitple columns as a list
mcu_df[['movie_title', 'release_year']]
movie_title release_year
0 Iron Man 2008.0
1 The Incredible Hulk 2008.0
2 Iron Man 2 2010.0
3 Thor 2011.0
4 Captain America: The First Avenger 2011.0
5 The Avengers 2012.0
6 Iron Man 3 2013.0
7 Thor: The Dark World 2013.0
8 Captain America: The Winter Soldier 2014.0
9 Guardians of the Galaxy 2014.0
10 Avengers: Age of Ultron 2015.0
11 Ant-Man 2015.0
12 Captain America: Civil War 2016.0
13 Doctor Strange 2016.0
14 Guardians of the Galaxy Vol. 2 2017.0
15 Spider-Man: Homecoming 2017.0
16 Thor: Ragnarok 2017.0
17 Black Panther 2018.0
18 Avengers: Infinity War 2018.0
19 Ant-Man and the Wasp 2018.0
20 Captain Marvel 2019.0
21 Avengers: Endgame 2019.0
22 Spider-Man: Far From Home 2019.0
23 WandaVision 2021.0
24 The Falcon and the Winter Soldier 2021.0
25 Loki 2021.0
26 Black Widow 2021.0
27 What If…? 2021.0
28 Shang-Chi and the Legend of the Ten Rings 2021.0
29 Eternals 2021.0
30 Hawkeye 2021.0
31 Spider-Man: No Way Home 2021.0
32 Moon Knight 2022.0
33 Doctor Strange: Multiverse of Madness 2022.0
34 Ms. Marvel 2022.0
35 Thor: Love and Thunder 2022.0
36 She-Hulk 2022.0
37 Black Panther: Wakanda Forever 2022.0
38 Ant-Man and the Wasp: Quantumania 2023.0
39 Guardians of the Galaxy Vol. 3 2023.0
40 Secret Invasion 2023.0
41 The Marvels 2023.0
42 Echo 2024.0
# Specific rows by index

mcu_df.iloc[5:10]
movie_title Type mcu_phase_number mcu_phase_text release_date release_year rt_tomato_meter rt_audience_score imdb-rating metacritic_metascore metacritic_userscore production_budget domestic_box_office international_box_office worldwide_box_office
5 The Avengers Movie 1.0 MCU Phase 1 5/4/12 2012.0 91.0 91.0 8.0 69.0 8.0 220000000.0 623357910.0 895442000.0 1.518800e+09
6 Iron Man 3 Movie 2.0 MCU Phase 2 5/3/13 2013.0 79.0 78.0 7.1 62.0 6.7 200000000.0 409013994.0 806426000.0 1.215440e+09
7 Thor: The Dark World Movie 2.0 MCU Phase 2 11/8/13 2013.0 66.0 75.0 6.8 54.0 7.0 170000000.0 206362140.0 438421000.0 6.447831e+08
8 Captain America: The Winter Soldier Movie 2.0 MCU Phase 2 4/4/14 2014.0 90.0 92.0 7.7 70.0 8.3 170000000.0 259766572.0 454654931.0 7.144215e+08
9 Guardians of the Galaxy Movie 2.0 MCU Phase 2 8/1/14 2014.0 92.0 92.0 8.0 76.0 8.2 170000000.0 333176600.0 439697800.0 7.728744e+08
# specified columns and in reverse order using [::-1] slice
mcu_df[['movie_title', 'release_year', 'rt_audience_score']].iloc[::-1]
movie_title release_year rt_audience_score
42 Echo 2024.0 61.0
41 The Marvels 2023.0 82.0
40 Secret Invasion 2023.0 46.0
39 Guardians of the Galaxy Vol. 3 2023.0 94.0
38 Ant-Man and the Wasp: Quantumania 2023.0 82.0
37 Black Panther: Wakanda Forever 2022.0 94.0
36 She-Hulk 2022.0 32.0
35 Thor: Love and Thunder 2022.0 76.0
34 Ms. Marvel 2022.0 80.0
33 Doctor Strange: Multiverse of Madness 2022.0 85.0
32 Moon Knight 2022.0 89.0
31 Spider-Man: No Way Home 2021.0 98.0
30 Hawkeye 2021.0 88.0
29 Eternals 2021.0 78.0
28 Shang-Chi and the Legend of the Ten Rings 2021.0 98.0
27 What If…? 2021.0 92.0
26 Black Widow 2021.0 91.0
25 Loki 2021.0 90.0
24 The Falcon and the Winter Soldier 2021.0 82.0
23 WandaVision 2021.0 88.0
22 Spider-Man: Far From Home 2019.0 95.0
21 Avengers: Endgame 2019.0 90.0
20 Captain Marvel 2019.0 45.0
19 Ant-Man and the Wasp 2018.0 81.0
18 Avengers: Infinity War 2018.0 91.0
17 Black Panther 2018.0 79.0
16 Thor: Ragnarok 2017.0 87.0
15 Spider-Man: Homecoming 2017.0 87.0
14 Guardians of the Galaxy Vol. 2 2017.0 87.0
13 Doctor Strange 2016.0 86.0
12 Captain America: Civil War 2016.0 89.0
11 Ant-Man 2015.0 85.0
10 Avengers: Age of Ultron 2015.0 83.0
9 Guardians of the Galaxy 2014.0 92.0
8 Captain America: The Winter Soldier 2014.0 92.0
7 Thor: The Dark World 2013.0 75.0
6 Iron Man 3 2013.0 78.0
5 The Avengers 2012.0 91.0
4 Captain America: The First Avenger 2011.0 75.0
3 Thor 2011.0 76.0
2 Iron Man 2 2010.0 71.0
1 The Incredible Hulk 2008.0 70.0
0 Iron Man 2008.0 91.0

1.7. Adding a column#

Just as in dictionaries, you can add a calculated column as you go.

# a new column indicating whether viewers or critics liked the movie more
mcu_df['audience_vs_critic'] = mcu_df['rt_audience_score'] - mcu_df['rt_tomato_meter']

# a new column calculating the net revenue
mcu_df['profit'] = mcu_df['worldwide_box_office'] - mcu_df['production_budget']

# a new column using the new column above
mcu_df['profit_margin'] = mcu_df['profit']/mcu_df['worldwide_box_office']

mcu_df.head()
movie_title Type mcu_phase_number mcu_phase_text release_date release_year rt_tomato_meter rt_audience_score imdb-rating metacritic_metascore metacritic_userscore production_budget domestic_box_office international_box_office worldwide_box_office audience_vs_critic profit profit_margin
0 Iron Man Movie 1.0 MCU Phase 1 5/2/08 2008.0 94.0 91.0 7.9 79.0 8.6 140000000.0 319034126.0 266762121.0 585796247.0 -3.0 445796247.0 0.761009
1 The Incredible Hulk Movie 1.0 MCU Phase 1 6/13/08 2008.0 67.0 70.0 6.6 61.0 6.8 140000000.0 134806913.0 128620638.0 263427551.0 3.0 123427551.0 0.468545
2 Iron Man 2 Movie 1.0 MCU Phase 1 5/7/10 2010.0 72.0 71.0 6.9 57.0 6.4 200000000.0 312433331.0 311500000.0 623933331.0 -1.0 423933331.0 0.679453
3 Thor Movie 1.0 MCU Phase 1 5/6/11 2011.0 77.0 76.0 7.0 57.0 7.0 150000000.0 181030624.0 268295994.0 449326618.0 -1.0 299326618.0 0.666167
4 Captain America: The First Avenger Movie 1.0 MCU Phase 1 7/22/11 2011.0 79.0 75.0 6.9 66.0 6.8 140000000.0 176654505.0 193915269.0 370569774.0 -4.0 230569774.0 0.622203

1.8. Filtering dataframes#

You can also select rows of a dataframe that satisfy some condition (e.g. greater than some value, string starts with, before some date, value in a range, etc).

There are two main ways to filter dataframes based on conditions:

  • df[CONDITION] - the condition in terms of df.COLUMN_NAME

  • df.query('CONDITION') - the condition in terms of COLUMN_NAME

The syntaxes of these two approaches are illustrated by an example.

Example:

Find me only movies that were released before 2018.

movie_type = "Movie"
year = 2018

# Standard filter
mcu_df[(mcu_df['Type']==movie_type) & (mcu_df['release_year']<year)]

# Query
mcu_df.query('Type == @movie_type and release_year < @year')
mcu_df.query('Type == "Movie" and release_year < 2018')
movie_title Type mcu_phase_number mcu_phase_text release_date release_year rt_tomato_meter rt_audience_score imdb-rating metacritic_metascore metacritic_userscore production_budget domestic_box_office international_box_office worldwide_box_office audience_vs_critic profit profit_margin
0 Iron Man Movie 1.0 MCU Phase 1 5/2/08 2008.0 94.0 91.0 7.9 79.0 8.6 140000000.0 319034126.0 266762121.0 5.857962e+08 -3.0 4.457962e+08 0.761009
1 The Incredible Hulk Movie 1.0 MCU Phase 1 6/13/08 2008.0 67.0 70.0 6.6 61.0 6.8 140000000.0 134806913.0 128620638.0 2.634276e+08 3.0 1.234276e+08 0.468545
2 Iron Man 2 Movie 1.0 MCU Phase 1 5/7/10 2010.0 72.0 71.0 6.9 57.0 6.4 200000000.0 312433331.0 311500000.0 6.239333e+08 -1.0 4.239333e+08 0.679453
3 Thor Movie 1.0 MCU Phase 1 5/6/11 2011.0 77.0 76.0 7.0 57.0 7.0 150000000.0 181030624.0 268295994.0 4.493266e+08 -1.0 2.993266e+08 0.666167
4 Captain America: The First Avenger Movie 1.0 MCU Phase 1 7/22/11 2011.0 79.0 75.0 6.9 66.0 6.8 140000000.0 176654505.0 193915269.0 3.705698e+08 -4.0 2.305698e+08 0.622203
5 The Avengers Movie 1.0 MCU Phase 1 5/4/12 2012.0 91.0 91.0 8.0 69.0 8.0 220000000.0 623357910.0 895442000.0 1.518800e+09 0.0 1.298800e+09 0.855149
6 Iron Man 3 Movie 2.0 MCU Phase 2 5/3/13 2013.0 79.0 78.0 7.1 62.0 6.7 200000000.0 409013994.0 806426000.0 1.215440e+09 -1.0 1.015440e+09 0.835451
7 Thor: The Dark World Movie 2.0 MCU Phase 2 11/8/13 2013.0 66.0 75.0 6.8 54.0 7.0 170000000.0 206362140.0 438421000.0 6.447831e+08 9.0 4.747831e+08 0.736345
8 Captain America: The Winter Soldier Movie 2.0 MCU Phase 2 4/4/14 2014.0 90.0 92.0 7.7 70.0 8.3 170000000.0 259766572.0 454654931.0 7.144215e+08 2.0 5.444215e+08 0.762045
9 Guardians of the Galaxy Movie 2.0 MCU Phase 2 8/1/14 2014.0 92.0 92.0 8.0 76.0 8.2 170000000.0 333176600.0 439697800.0 7.728744e+08 0.0 6.028744e+08 0.780042
10 Avengers: Age of Ultron Movie 2.0 MCU Phase 2 5/1/15 2015.0 76.0 83.0 7.3 66.0 7.1 250000000.0 459005868.0 943786000.0 1.402792e+09 7.0 1.152792e+09 0.821784
11 Ant-Man Movie 2.0 MCU Phase 2 7/17/15 2015.0 83.0 85.0 7.2 64.0 7.4 130000000.0 180202163.0 339109802.0 5.193120e+08 2.0 3.893120e+08 0.749669
12 Captain America: Civil War Movie 3.0 MCU Phase 3 5/6/16 2016.0 90.0 89.0 7.8 75.0 8.0 250000000.0 408084349.0 745231193.0 1.153316e+09 -1.0 9.033155e+08 0.783234
13 Doctor Strange Movie 3.0 MCU Phase 3 11/4/16 2016.0 89.0 86.0 7.5 72.0 8.0 165000000.0 232641920.0 445060000.0 6.777184e+08 -3.0 5.127184e+08 0.756536
14 Guardians of the Galaxy Vol. 2 Movie 3.0 MCU Phase 3 5/5/17 2017.0 85.0 87.0 7.6 67.0 7.8 200000000.0 389813101.0 473900000.0 8.637561e+08 2.0 6.637561e+08 0.768453
15 Spider-Man: Homecoming Movie 3.0 MCU Phase 3 7/7/17 2017.0 92.0 87.0 7.4 73.0 7.5 175000000.0 334201140.0 546171626.0 8.803728e+08 -5.0 7.053728e+08 0.801221
16 Thor: Ragnarok Movie 3.0 MCU Phase 3 11/3/17 2017.0 93.0 87.0 7.9 74.0 7.7 180000000.0 315058289.0 538958924.0 8.540172e+08 -6.0 6.740172e+08 0.789231

Here’s a snapshot of the code above. These lines of code all produce the same results.

filterquery

What differences are there between the standard filter and the query?

Standard

Query

Uses Python boolean indexing

Uses a string expression

Uses & and | (element-wise and/or)

uses ‘and’ and ‘or’

Column names must be referenced as df['col']

Column names referenced directly as col

Can use variables directly

Use @variable to reference Python variables

More flexible for complex conditions

More readable for simple conditions

Note: I use .query almost exclusively. I find it faster to write and easier to read. But many people prefer the standard format, so you should be familiar with it so you know what it means if you come across it in somebody else’s code.

1.9. Practice Problems#

  • Which movies made over $500,000,000 domestic box office? Provide the title, release_year, and domestic_box_office.

  • Which movies were liked by audiences more than by the critics? Provide all the info.

  • Find all the movies with Captain America in the title and tell me their metacritic scores.

# Which movies made over $500,000,000 domestic box office? Provide the title, release_year, and domestic_box_office.

mcu_df.query('domestic_box_office > 500000000')[['movie_title', 'release_year', 'domestic_box_office']]
movie_title release_year domestic_box_office
5 The Avengers 2012.0 623357910.0
17 Black Panther 2018.0 700059566.0
18 Avengers: Infinity War 2018.0 678815482.0
21 Avengers: Endgame 2019.0 858372000.0
31 Spider-Man: No Way Home 2021.0 814115070.0
# Which movies were liked by audiences more than by the critics? Provide all the info.

mcu_df.query('rt_audience_score>rt_tomato_meter')
movie_title Type mcu_phase_number mcu_phase_text release_date release_year rt_tomato_meter rt_audience_score imdb-rating metacritic_metascore metacritic_userscore production_budget domestic_box_office international_box_office worldwide_box_office audience_vs_critic profit profit_margin
1 The Incredible Hulk Movie 1.0 MCU Phase 1 6/13/08 2008.0 67.0 70.0 6.6 61.0 6.8 140000000.0 134806913.0 1.286206e+08 2.634276e+08 3.0 1.234276e+08 0.468545
7 Thor: The Dark World Movie 2.0 MCU Phase 2 11/8/13 2013.0 66.0 75.0 6.8 54.0 7.0 170000000.0 206362140.0 4.384210e+08 6.447831e+08 9.0 4.747831e+08 0.736345
8 Captain America: The Winter Soldier Movie 2.0 MCU Phase 2 4/4/14 2014.0 90.0 92.0 7.7 70.0 8.3 170000000.0 259766572.0 4.546549e+08 7.144215e+08 2.0 5.444215e+08 0.762045
10 Avengers: Age of Ultron Movie 2.0 MCU Phase 2 5/1/15 2015.0 76.0 83.0 7.3 66.0 7.1 250000000.0 459005868.0 9.437860e+08 1.402792e+09 7.0 1.152792e+09 0.821784
11 Ant-Man Movie 2.0 MCU Phase 2 7/17/15 2015.0 83.0 85.0 7.2 64.0 7.4 130000000.0 180202163.0 3.391098e+08 5.193120e+08 2.0 3.893120e+08 0.749669
14 Guardians of the Galaxy Vol. 2 Movie 3.0 MCU Phase 3 5/5/17 2017.0 85.0 87.0 7.6 67.0 7.8 200000000.0 389813101.0 4.739000e+08 8.637561e+08 2.0 6.637561e+08 0.768453
18 Avengers: Infinity War Movie 3.0 MCU Phase 3 4/27/18 2018.0 85.0 91.0 8.4 68.0 8.5 300000000.0 678815482.0 1.369544e+09 2.048360e+09 6.0 1.748360e+09 0.853541
22 Spider-Man: Far From Home Movie 3.0 MCU Phase 3 7/2/19 2019.0 90.0 95.0 7.4 69.0 7.5 160000000.0 390532085.0 7.412200e+08 1.131752e+09 5.0 9.717521e+08 0.858626
26 Black Widow Movie 4.0 MCU Phase 4 7/9/21 2021.0 79.0 91.0 6.7 68.0 6.0 288000000.0 183651655.0 1.961000e+08 3.797517e+08 12.0 9.175166e+07 0.241610
28 Shang-Chi and the Legend of the Ten Rings Movie 4.0 MCU Phase 4 9/3/21 2021.0 91.0 98.0 7.4 71.0 7.0 150000000.0 224543292.0 2.077000e+08 4.322433e+08 7.0 2.822433e+08 0.652973
29 Eternals Movie 4.0 MCU Phase 4 11/5/21 2021.0 47.0 78.0 6.3 52.0 6.1 236000000.0 164870234.0 2.371947e+08 4.020649e+08 31.0 1.660649e+08 0.413030
31 Spider-Man: No Way Home Movie 4.0 MCU Phase 4 12/17/21 2021.0 93.0 98.0 8.2 71.0 8.5 200000000.0 814115070.0 1.107732e+09 1.921847e+09 5.0 1.721847e+09 0.895933
32 Moon Knight TV Show 4.0 MCU Phase 4 3/30/22 2022.0 86.0 89.0 7.3 69.0 6.9 148000000.0 NaN NaN NaN 3.0 NaN NaN
33 Doctor Strange: Multiverse of Madness Movie 4.0 MCU Phase 4 5/6/22 2022.0 73.0 85.0 6.9 60.0 5.9 294000000.0 411331607.0 5.444442e+08 9.557758e+08 12.0 6.617758e+08 0.692396
35 Thor: Love and Thunder Movie 4.0 MCU Phase 4 7/8/22 2022.0 63.0 76.0 6.2 57.0 4.8 250000000.0 343256830.0 4.176713e+08 7.609281e+08 13.0 5.109281e+08 0.671454
37 Black Panther: Wakanda Forever Movie 4.0 MCU Phase 4 11/11/22 2022.0 83.0 94.0 6.7 67.0 5.2 200000000.0 453829060.0 4.053798e+08 8.592088e+08 11.0 6.592088e+08 0.767228
38 Ant-Man and the Wasp: Quantumania Movie 5.0 MCU Phase 5 2/17/23 2023.0 46.0 82.0 6.1 48.0 5.5 200000000.0 214504909.0 2.615663e+08 4.760712e+08 36.0 2.760712e+08 0.579895
39 Guardians of the Galaxy Vol. 3 Movie 5.0 MCU Phase 5 5/5/23 2023.0 82.0 94.0 7.9 64.0 7.9 250000000.0 358995815.0 4.865600e+08 8.455558e+08 12.0 5.955558e+08 0.704336
41 The Marvels Movie 5.0 MCU Phase 5 11/10/23 2023.0 62.0 82.0 5.6 50.0 3.8 270000000.0 84500223.0 1.216127e+08 2.061129e+08 20.0 -6.388707e+07 -0.309961
# Find all the movies with Captain America in the title and tell me their metacritic scores.
# Notice, to use string functions, you have to prepend the function with str.

mcu_df.query('movie_title.str.contains("Captain America")')[['movie_title', 'metacritic_metascore', 'metacritic_userscore']]
movie_title metacritic_metascore metacritic_userscore
4 Captain America: The First Avenger 66.0 6.8
8 Captain America: The Winter Soldier 70.0 8.3
12 Captain America: Civil War 75.0 8.0