Pandas Refresher

Pandas Refresher#

In your first problem set, you’ll be using the King County Housing dataset. This dataset contains information about the sales of residences (homes, condos, apartment buildings) within King County, WA (the county containing Seattle) from the 2014-2015 period.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

housing_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/kc_house_data.csv', parse_dates = ['date'])
housing_df['day'] = housing_df['date'].dt.day
housing_df['month'] = housing_df['date'].dt.month
housing_df['year'] = housing_df['date'].dt.year

housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 id             21613 non-null  int64         
 date           21613 non-null  datetime64[ns]
 price          21613 non-null  float64       
 bedrooms       21613 non-null  int64         
 bathrooms      21613 non-null  float64       
 sqft_living    21613 non-null  int64         
 sqft_lot       21613 non-null  int64         
 floors         21613 non-null  float64       
 waterfront     21613 non-null  int64         
 view           21613 non-null  int64         
condition      21613 non-null  int64         
grade          21613 non-null  int64         
sqft_above     21613 non-null  int64         
sqft_basement  21613 non-null  int64         
yr_built       21613 non-null  int64         
yr_renovated   21613 non-null  int64         
zipcode        21613 non-null  int64         
lat            21613 non-null  float64       
long           21613 non-null  float64       
sqft_living15  21613 non-null  int64         
sqft_lot15     21613 non-null  int64         
day            21613 non-null  int32         
month          21613 non-null  int32         
year           21613 non-null  int32         
dtypes: datetime64[ns](1), float64(5), int32(3), int64(15)
memory usage: 3.7 MB

housing_df[['date', 'day', 'month', 'year']]

	date	day	month	year
0	2014-10-13	13	10	2014
1	2014-12-09	9	12	2014
2	2015-02-25	25	2	2015
3	2014-12-09	9	12	2014
4	2015-02-18	18	2	2015
...	...	...	...	...
21608	2014-05-21	21	5	2014
21609	2015-02-23	23	2	2015
21610	2014-06-23	23	6	2014
21611	2015-01-16	16	1	2015
21612	2014-10-15	15	10	2014

21613 rows × 4 columns

Finding information#

Google - Better than ChatGPT for most things and only 10% of the energy consumption on a query basis. THINK OF THE TREES!
DuckDuckGo - Like Google but without storing and using your data.

Common Pandas Functions#

Data Exploration#

head() - Returns the first n rows of a DataFrame (default is 5 rows). Useful for quickly previewing data.
tail() - Returns the last n rows of a DataFrame (default is 5 rows). Helpful for checking the end of your dataset.
info() - Displays a concise summary of a DataFrame, including column names, data types, non-null counts, and memory usage.
describe() - Generates descriptive statistics for numerical columns (count, mean, std, min, 25%, 50%, 75%, max).
unique() - Returns unique values.

Data Selection and Filtering#

loc[] - Label-based indexing for selecting rows and columns by labels or conditions.
iloc[] - Integer location-based indexing for selecting rows and columns by position.
query() - Filters DataFrame rows using a query string expression, allowing for conditional selection with a more readable syntax.

Data Cleaning#

fillna() - Fills missing values with a specified value or method.
dropna() - Removes rows or columns with missing values.
drop() - Removes specified rows or columns from a DataFrame.

Data Transformation and Aggregation#

groupby() - Groups DataFrame rows by one or more columns, enabling aggregation and analysis by group.
agg() - Applies one or more aggregation functions to DataFrame columns, often used with groupby() for grouped statistics.
apply() - Applies a custom function along an axis of the DataFrame (row-wise or column-wise), enabling flexible data transformations.
sort_values() - Sorts DataFrame rows by one or more columns.
value_counts() - Returns counts of unique values, useful for categorical analysis.

Data Combining#

merge() - Combines two DataFrames based on common columns or indices (like SQL JOIN).
concat() - Concatenates multiple DataFrames along rows or columns, useful for combining datasets with the same structure.

Visualization#

This chart may help you decide on how you would want to visualize your data. Then, visit Matplotlib Gallery, find the type of chart you want to make and template your code based on the examples.

Together Questions#

How many listings are recorded in this dataframe?
How many of them have between 2-5 bedrooms and 1+ bathrooms?

I’m looking for a house with 3-5 bedrooms and 2+ bathrooms. Where should I look?
What can I expect to pay for a 1500 sq ft house in this neighborhood?

Practice Questions#

Think before you code. For each problem below, discuss the steps you would take and find the corresponding functions. Write out your plan in the markdown cell with the question.

1

What is the range of values for number of bedrooms, number of bathrooms, and square footage?
If we were only interested in single family homes, how might you filter the dataset?

2

How many houses were sold for more than $500,000?
What is the average price of houses with 3 or more bedrooms?

3

What is the average price per number of bedrooms?
What is the average price per number of bedrooms for each zip code? How would you effectively convey this information?

4

Create a plot that conveys some difference (your choice) between zip codes (e.g. price/sq ft, number of houses sold, lot size, grade).
The plot(s) should show the spatial relationship between zip codes with only one data point per zip code (that is, you’re not displaying individual sales).

5

House flippers buy houses in need of repair/renovation, fix them up quickly, and resell them. Identify sales that would have been good candidates for flipping.
- Consider, the house doesn’t have to be cheap, it just has to have the potential to sell for more. How do you quantify that?
Create a graphic for prospective home flippers. What information would be useful for them to know? How would you show that?

6

Think of a particular audience (e.g. home seller, home buyer, real estate agent, etc) and identify a question or want they may have. Create a graphic to address that question/want.

def quantize(data, bin_width):
    return bin_width * (data//bin_width)

housing_df['long_r'] = quantize(housing_df['long'], 0.05)
housing_df['lat_r'] = quantize(housing_df['lat'], 0.05)

housing_df.groupby(by = ['long_r', 'lat_r']).agg({'price_per_sqft':'median'})

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 housing_df.groupby(by = ['long_r', 'lat_r']).agg({'price_per_sqft':'median'})

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/pandas/core/groupby/generic.py:1432, in DataFrameGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
   1429     kwargs["engine_kwargs"] = engine_kwargs
   1431 op = GroupByApply(self, func, args=args, kwargs=kwargs)
-> 1432 result = op.agg()
   1433 if not is_dict_like(func) and result is not None:
   1434     # GH #52849
   1435     if not self.as_index and is_list_like(func):

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/pandas/core/apply.py:190, in Apply.agg(self)
    187     return self.apply_str()
    189 if is_dict_like(func):
--> 190     return self.agg_dict_like()
    191 elif is_list_like(func):
    192     # we require a list, but not a 'str'
    193     return self.agg_list_like()

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/pandas/core/apply.py:423, in Apply.agg_dict_like(self)
    415 def agg_dict_like(self) -> DataFrame | Series:
    416     """
    417     Compute aggregation in the case of a dict-like argument.
    418 
   (...)    421     Result of aggregation.
    422     """
--> 423     return self.agg_or_apply_dict_like(op_name="agg")

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/pandas/core/apply.py:1603, in GroupByApply.agg_or_apply_dict_like(self, op_name)
   1598     kwargs.update({"engine": engine, "engine_kwargs": engine_kwargs})
   1600 with com.temp_setattr(
   1601     obj, "as_index", True, condition=hasattr(obj, "as_index")
   1602 ):
-> 1603     result_index, result_data = self.compute_dict_like(
   1604         op_name, selected_obj, selection, kwargs
   1605     )
   1606 result = self.wrap_results_dict_like(selected_obj, result_index, result_data)
   1607 return result

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/pandas/core/apply.py:462, in Apply.compute_dict_like(self, op_name, selected_obj, selection, kwargs)
    460 is_groupby = isinstance(obj, (DataFrameGroupBy, SeriesGroupBy))
    461 func = cast(AggFuncTypeDict, self.func)
--> 462 func = self.normalize_dictlike_arg(op_name, selected_obj, func)
    464 is_non_unique_col = (
    465     selected_obj.ndim == 2
    466     and selected_obj.columns.nunique() < len(selected_obj.columns)
    467 )
    469 if selected_obj.ndim == 1:
    470     # key only used for output

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/pandas/core/apply.py:663, in Apply.normalize_dictlike_arg(self, how, obj, func)
    661     cols = Index(list(func.keys())).difference(obj.columns, sort=True)
    662     if len(cols) > 0:
--> 663         raise KeyError(f"Column(s) {list(cols)} do not exist")
    665 aggregator_types = (list, tuple, dict)
    667 # if we have a dict of any non-scalars
    668 # eg. {'A' : ['mean']}, normalize all to
    669 # be list-likes
    670 # Cannot use func.values() because arg may be a Series

KeyError: "Column(s) ['price_per_sqft'] do not exist"

housing_df['price_per_sqft'] = housing_df['price']/housing_df['sqft_living']
zip_price = housing_df.groupby(by='zipcode').agg({'price_per_sqft':'median', 'lat':'mean', 'long':'mean'}).sort_values(by='price_per_sqft', ascending=False)

zip_price.head()

	price_per_sqft	lat	long
zipcode
98039	565.165614	47.625840	-122.233540
98004	456.944444	47.616183	-122.205189
98109	427.696078	47.635602	-122.350092
98112	423.828125	47.629619	-122.297866
98119	416.652778	47.640034	-122.366918

plt.scatter(zip_price['long'], zip_price['lat'], c = zip_price['price_per_sqft'])
plt.show()

plt.hist(housing_df['lat'], bins = 50)
plt.show()
plt.hist(housing_df['long'], bins = 50)
plt.show()

../_images/2f466cff406dfcb3074ae8a668ada97369df5d137965795e53413d840ea7d063.png

../_images/71ab4b6d6f4f0c8bed47d6ec852b3dc8fe1f5d378ec68f3e0562bd78379e4ec4.png

../_images/de1de1c90082b9774592c3e64fbe841e5b4c22b84bd95af02fefe4fb35e605d8.png

import seaborn as sns
s = sns.jointplot(x='long', y='lat', hue = 'price_per_sqft', data = housing_df, kind = 'hist')
s.ax_joint.grid(False)
s.ax_marg_y.grid(False)
s.fig.suptitle("Sales by firm's age")

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[10], line 2
import seaborn as sns
----> 2 s = sns.jointplot(x='long', y='lat', hue = 'price_per_sqft', data = housing_df, kind = 'hist')
s.ax_joint.grid(False)
s.ax_marg_y.grid(False)

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/seaborn/axisgrid.py:2275, in jointplot(data, x, y, hue, kind, height, ratio, space, dropna, xlim, ylim, color, palette, hue_order, hue_norm, marginal_ticks, joint_kws, marginal_kws, **kwargs)
elif kind.startswith("hist"):

   # TODO process pair parameters for bins, etc. and pass
   # to both joint and marginal plots
   joint_kws.setdefault("color", color)
-> 2275     grid.plot_joint(histplot, **joint_kws)
   marginal_kws.setdefault("kde", False)
   marginal_kws.setdefault("color", color)

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/seaborn/axisgrid.py:1832, in JointGrid.plot_joint(self, func, **kwargs)
   self._inject_kwargs(func, kwargs, self._hue_params)
if str(func.__module__).startswith("seaborn"):
-> 1832     func(x=self.x, y=self.y, **kwargs)
else:
   func(self.x, self.y, **kwargs)

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/seaborn/distributions.py:1434, in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)
   p.plot_univariate_histogram(
       multiple=multiple,
       element=element,
   (...)   1429         **kwargs,
   )
else:
-> 1434     p.plot_bivariate_histogram(
       common_bins=common_bins,
       common_norm=common_norm,
       thresh=thresh,
       pthresh=pthresh,
       pmax=pmax,
       color=color,
       legend=legend,
       cbar=cbar,
       cbar_ax=cbar_ax,
       cbar_kws=cbar_kws,
       estimate_kws=estimate_kws,
       **kwargs,
   )
return ax

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/seaborn/distributions.py:850, in _DistributionPlotter.plot_bivariate_histogram(self, common_bins, common_norm, thresh, pthresh, pmax, color, legend, cbar, cbar_ax, cbar_kws, estimate_kws, **plot_kws)
   heights = np.ma.masked_less_equal(heights, thresh)
# pcolormesh is going to turn the grid off, but we want to keep it
# I'm not sure if there's a better way to get the grid state
--> 850 x_grid = any([l.get_visible() for l in ax.xaxis.get_gridlines()])
y_grid = any([l.get_visible() for l in ax.yaxis.get_gridlines()])
mesh = ax.pcolormesh(
   x_edges,
   y_edges,
   heights.T,
   **artist_kws,
)

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/matplotlib/axis.py:1423, in Axis.get_gridlines(self)
def get_gridlines(self):
   r"""Return this Axis' grid lines as a list of `.Line2D`\s."""
-> 1423     ticks = self.get_major_ticks()
   return cbook.silent_list('Line2D gridline',
                            [tick.gridline for tick in ticks])

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/matplotlib/axis.py:1660, in Axis.get_major_ticks(self, numticks)
r"""
Return the list of major `.Tick`\s.

   (...)   1657     Use `.set_tick_params` instead if possible.
"""
if numticks is None:
-> 1660     numticks = len(self.get_majorticklocs())
while len(self.majorTicks) < numticks:
   # Update the new tick label properties from the old.
   tick = self._get_tick(major=True)

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/matplotlib/axis.py:1525, in Axis.get_majorticklocs(self)
def get_majorticklocs(self):
   """Return this Axis' major tick locations in data coordinates."""
-> 1525     return self.major.locator()

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/matplotlib/ticker.py:2229, in MaxNLocator.__call__(self)
def __call__(self):
-> 2229     vmin, vmax = self.axis.get_view_interval()
   return self.tick_values(vmin, vmax)

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/matplotlib/axis.py:2347, in _make_getset_interval.<locals>.getter(self)
def getter(self):
   # docstring inherited.
-> 2347     return getattr(getattr(self.axes, lim_name), attr_name)

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/matplotlib/axes/_base.py:904, in _AxesBase.viewLim(self)
@property
def viewLim(self):
   """The view limits as `.Bbox` in data coordinates."""
--> 904     self._unstale_viewLim()
   return self._viewLim

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/matplotlib/axes/_base.py:898, in _AxesBase._unstale_viewLim(self)
   for ax in self._shared_axes[name].get_siblings(self):
       ax._stale_viewlims[name] = False
--> 898 self.autoscale_view(**{f"scale{name}": scale
                      for name, scale in need_scale.items()})

File ~/.pyenv/versions/3.13.1/envs/datascience/lib/python3.13/site-packages/matplotlib/axes/_base.py:3012, in _AxesBase.autoscale_view(self, tight, scalex, scaley)
       x_stickies = np.sort(np.concatenate([
           artist.sticky_edges.x
           for ax in self._shared_axes["x"].get_siblings(self)
           for artist in ax.get_children()]))
   if self._ymargin and scaley and self.get_autoscaley_on():
-> 3012         y_stickies = np.sort(np.concatenate([
           artist.sticky_edges.y
           for ax in self._shared_axes["y"].get_siblings(self)
           for artist in ax.get_children()]))
if self.get_xscale() == 'log':
   x_stickies = x_stickies[x_stickies > 0]

KeyboardInterrupt: