2.3. Example: Simple Linear Regression with the King County Housing Dataset#

In your second problem set, you’ll be using the King County Housing dataset. It contains information about residential sales (homes, condos, apartment buildings) in King County, WA (the county containing Seattle) during 2014–2015.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
housing_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/kc_house_data.csv', parse_dates = ['date'])
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             21613 non-null  int64         
 1   date           21613 non-null  datetime64[ns]
 2   price          21613 non-null  float64       
 3   bedrooms       21613 non-null  int64         
 4   bathrooms      21613 non-null  float64       
 5   sqft_living    21613 non-null  int64         
 6   sqft_lot       21613 non-null  int64         
 7   floors         21613 non-null  float64       
 8   waterfront     21613 non-null  int64         
 9   view           21613 non-null  int64         
 10  condition      21613 non-null  int64         
 11  grade          21613 non-null  int64         
 12  sqft_above     21613 non-null  int64         
 13  sqft_basement  21613 non-null  int64         
 14  yr_built       21613 non-null  int64         
 15  yr_renovated   21613 non-null  int64         
 16  zipcode        21613 non-null  int64         
 17  lat            21613 non-null  float64       
 18  long           21613 non-null  float64       
 19  sqft_living15  21613 non-null  int64         
 20  sqft_lot15     21613 non-null  int64         
dtypes: datetime64[ns](1), float64(5), int64(15)
memory usage: 3.5 MB
housing_df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 2014-10-13 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 2014-12-09 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 2015-02-25 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 2014-12-09 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 2015-02-18 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

2.3.1. Linear Regression#

2.3.1.1. Narrowing the scope of the problem#

For this example, we’ll filter the dataset down to 3 BR homes in a single zip code. And, for the sake of having enough data, we’ll use the zip code with the greatest number of 3 BR homes sold, which turns out to be 98042.

So you might imagine that you’re trying to sell your 3 BR home in 98042, and you want to estimate what a fair market price would be and set your asking price accordingly.

# Finding the zip code with the most 3 BR homes

housing_df.query('bedrooms==3')\
    .groupby(by = 'zipcode').agg({'id':'count'})\
        .sort_values(by='id', ascending = False)
id
zipcode
98042 317
98034 303
98038 292
98103 283
98023 281
... ...
98010 51
98102 51
98024 44
98148 34
98039 11

70 rows × 1 columns
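
As an aside, value_counts gives the same ranking more compactly, since it sorts counts in descending order by default (an equivalent sketch of the cell above):

# Equivalent, more compact count of 3 BR sales per zip code
housing_df.query('bedrooms == 3')['zipcode'].value_counts()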

# Filtering the data to only include 3 BR homes in that zip code.

housing3 = housing_df.query('bedrooms == 3 and zipcode == 98042')
housing3
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
57 2799800710 2015-04-07 301000.0 3 2.50 2420 4750 2.0 0 0 ... 8 2420 0 2003 0 98042 47.3663 -122.122 2690 4750
68 1274500060 2014-08-25 204000.0 3 1.00 1000 12070 1.0 0 0 ... 7 1000 0 1968 0 98042 47.3621 -122.110 1010 12635
74 3444100400 2015-03-16 349000.0 3 1.75 1790 50529 1.0 0 0 ... 7 1090 700 1965 0 98042 47.3511 -122.073 1940 50529
201 2222059065 2014-11-12 297000.0 3 2.50 1940 14952 2.0 0 0 ... 8 1940 0 1994 0 98042 47.3777 -122.165 2030 10450
296 5468730030 2014-08-22 265000.0 3 2.00 1320 8959 1.0 0 0 ... 7 1320 0 1993 0 98042 47.3536 -122.144 1740 7316
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21420 9478550110 2015-03-03 299950.0 3 2.50 1740 4497 2.0 0 0 ... 7 1740 0 2012 0 98042 47.3697 -122.117 1950 4486
21451 7140700690 2015-03-12 239950.0 3 1.75 1600 4888 1.0 0 0 ... 6 1600 0 2014 0 98042 47.3830 -122.097 2520 5700
21526 1760650820 2015-04-28 290000.0 3 2.25 1610 3764 2.0 0 0 ... 7 1610 0 2012 0 98042 47.3589 -122.083 1610 3825
21553 2522059251 2015-04-09 465000.0 3 2.50 2050 15035 2.0 0 0 ... 9 2050 0 2006 0 98042 47.3619 -122.122 1300 15836
21585 3832050760 2014-08-28 270000.0 3 2.50 1870 5000 2.0 0 0 ... 7 1870 0 2009 0 98042 47.3339 -122.055 2170 5399

317 rows × 21 columns

2.3.1.2. Cleaning the data and selecting a feature#

We’ll remove columns that are either uninformative (e.g. id, date) or essentially constant within this subset (e.g. bedrooms, zipcode, waterfront, view).

I like to do this by:

  1. requesting all the columns

  2. copying all the columns into a list (columns_to_keep)

  3. removing the ones I want removed

  4. extracting just those columns from the dataframe

housing3.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')
columns_to_keep = ['price', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']

housing3 = housing3[columns_to_keep]
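
Equivalently, you can drop the unwanted columns rather than keeping the rest. A sketch producing the same frame (housing3_alt is just an illustrative name):

# Same result, dropping the unwanted columns instead of keeping the rest
columns_to_drop = ['id', 'date', 'bedrooms', 'waterfront', 'view', 'zipcode']
housing3_alt = housing_df.query('bedrooms == 3 and zipcode == 98042')\
    .drop(columns=columns_to_drop)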

Next, we’ll calculate the correlation matrix and display it as a heat map. We’ll select the variable most strongly correlated with price as our feature for modeling.

housing_corr = housing3.corr()
housing_corr
price bathrooms sqft_living sqft_lot floors condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15
price 1.000000 0.545442 0.809535 0.540426 0.394416 -0.161509 0.725263 0.797889 0.048380 0.310763 -0.029596 -0.009834 -0.161651 0.661501 0.417028
bathrooms 0.545442 1.000000 0.700860 0.080940 0.644609 -0.431470 0.533476 0.685417 0.055520 0.707861 -0.162385 0.133747 -0.192372 0.610515 0.118165
sqft_living 0.809535 0.700860 1.000000 0.284833 0.522319 -0.307134 0.734776 0.923458 0.217862 0.432447 -0.112420 0.036043 -0.241881 0.770141 0.265556
sqft_lot 0.540426 0.080940 0.284833 1.000000 0.019590 -0.013589 0.136916 0.230406 0.145040 -0.123200 0.138612 0.019995 -0.016856 0.238468 0.562668
floors 0.394416 0.644609 0.522319 0.019590 1.000000 -0.409100 0.403710 0.639940 -0.287080 0.632382 -0.117929 0.009782 -0.157313 0.488257 -0.021495
condition -0.161509 -0.431470 -0.307134 -0.013589 -0.409100 1.000000 -0.201157 -0.297693 -0.031132 -0.504516 -0.113069 -0.000707 0.147085 -0.362378 -0.010971
grade 0.725263 0.533476 0.734776 0.136916 0.403710 -0.201157 1.000000 0.751795 -0.026266 0.337224 -0.164262 -0.016369 -0.264209 0.690822 0.092479
sqft_above 0.797889 0.685417 0.923458 0.230406 0.639940 -0.297693 0.751795 1.000000 -0.173297 0.480472 -0.119714 -0.004595 -0.237341 0.740524 0.165258
sqft_basement 0.048380 0.055520 0.217862 0.145040 -0.287080 -0.031132 -0.026266 -0.173297 1.000000 -0.112136 0.015948 0.104201 -0.017151 0.093178 0.261271
yr_built 0.310763 0.707861 0.432447 -0.123200 0.632382 -0.504516 0.337224 0.480472 -0.112136 1.000000 -0.175139 -0.010309 -0.036663 0.488356 -0.109969
yr_renovated -0.029596 -0.162385 -0.112420 0.138612 -0.117929 -0.113069 -0.164262 -0.119714 0.015948 -0.175139 1.000000 -0.038023 0.093257 -0.113313 0.157284
lat -0.009834 0.133747 0.036043 0.019995 0.009782 -0.000707 -0.016369 -0.004595 0.104201 -0.010309 -0.038023 1.000000 -0.430948 0.073604 -0.081197
long -0.161651 -0.192372 -0.241881 -0.016856 -0.157313 0.147085 -0.264209 -0.237341 -0.017151 -0.036663 0.093257 -0.430948 1.000000 -0.237247 -0.030228
sqft_living15 0.661501 0.610515 0.770141 0.238468 0.488257 -0.362378 0.690822 0.740524 0.093178 0.488356 -0.113313 0.073604 -0.237247 1.000000 0.092127
sqft_lot15 0.417028 0.118165 0.265556 0.562668 -0.021495 -0.010971 0.092479 0.165258 0.261271 -0.109969 0.157284 -0.081197 -0.030228 0.092127 1.000000
import seaborn as sns
fig, ax = plt.subplots(1,1, figsize = (8,8))

sns.heatmap(np.abs(housing_corr), vmin = 0, vmax = 1,
            annot = True, annot_kws = {'size':8},
            cbar = False)
plt.show()
[Figure: heat map of the absolute correlations between the housing3 columns]

With a correlation of 0.81, sqft_living is the variable most strongly correlated with price. So:

  • sqft_living is our feature

  • price is our target
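
In other words, we’ll be fitting the simple linear model

\[ \text{price} \approx \beta_0 + \beta_1 \cdot \text{sqft\_living} \]

where \(\beta_0\) (the y-intercept) and \(\beta_1\) (the slope, in $ per square foot) are the parameters the regression will estimate from the training data.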

Let’s plot price vs sqft_living to see our dataset.

x = housing3[['sqft_living']]
y = housing3[['price']]

plt.plot(x, y, '.')
plt.xlabel('Living Area (sq ft)')
plt.ylabel('Price ($)')
plt.show()
[Figure: scatter plot of price vs living area]

2.3.1.3. Modeling time!#

We will:

  1. Import the necessary functions

  2. Split our data into training and testing sets

  3. Create our model template, LinearRegression()

  4. Fit our model to the training data.

  5. Make predictions using both our training and testing data.

  6. Calculate \(R^2\) and RMSE for both training and testing sets.

  7. Visualize our results.

# Importing packages

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, r2_score
# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=13)
# Creating and fitting the model
linreg = LinearRegression()  #makes the template
linreg.fit(x_train, y_train) #finds the optimal slope and intercept
LinearRegression()
# Two endpoint living areas (0 and 4500 sq ft), shaped as a column vector,
# are enough to draw the fitted line
x_model = np.array([[0, 4500]]).T
y_model = linreg.predict(x_model)

plt.plot(x, y, '.', alpha = 0.8)
plt.plot(x_model, y_model, 'k--', label = 'Best-fit line')

plt.title('Predicting Sale Price from Living Area')
plt.xlabel('Living Area (sq ft)')
plt.ylabel('Price ($)')
plt.legend()
plt.show()
[Figure: scatter plot of the data with the best-fit line overlaid]

What do you think of the fit of this line? I see an issue… The fit is okay and I won’t change it here, but something bothers me. Do you see it? What could be done to improve the fit?

# Inspecting the model parameters
linreg.__dict__
{'fit_intercept': True,
 'copy_X': True,
 'tol': 1e-06,
 'n_jobs': None,
 'positive': False,
 'feature_names_in_': array(['sqft_living'], dtype=object),
 'n_features_in_': 1,
 'coef_': array([[133.30980501]]),
 'rank_': 1,
 'singular_': array([9115.35178206]),
 'intercept_': array([60720.98693839])}

Do you see the slope and y-intercept in the list of model properties above?

  • slope = 133.31 ($/sqft)

  • y-intercept = 60,720.99 ($)
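
Rather than digging through __dict__, you can read the fitted parameters from the model’s public attributes. A quick check (the extra indexing is because we fit with 2-D x and y):

# Reading the fitted slope and intercept directly
slope = linreg.coef_[0][0]        # $ per additional square foot of living area
intercept = linreg.intercept_[0]  # baseline price at 0 sq ft

print(f'price ~ {slope:.2f} * sqft_living + {intercept:.2f}')

So, for example, a 2,000 sq ft home gets a predicted price of about 133.31 × 2000 + 60,721 ≈ $327,341.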

2.3.1.4. Assessing the model#

# Use the training and testing features to make predictions using the model
y_pred_train = linreg.predict(x_train)
y_pred_test = linreg.predict(x_test)
# Calculate R2
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

# Calculate RMSE
rmse_train = root_mean_squared_error(y_train, y_pred_train)
rmse_test = root_mean_squared_error(y_test, y_pred_test)

print(f'R2 train: {r2_train:.2f}\t\tRMSE train: {rmse_train:.2f}')
print(f'R2 test: {r2_test:.2f}\t\tRMSE test: {rmse_test:.2f}')
R2 train: 0.66		RMSE train: 55145.81
R2 test: 0.65		RMSE test: 62150.55
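
For reference, these two metrics are defined as

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2} \]

where \(y_i\) are the actual prices, \(\hat{y}_i\) the predictions, and \(\bar{y}\) the mean actual price. RMSE is in the same units as the target, so an RMSE of roughly $55–62K means a typical prediction is off by about that much.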

Why do we assess model performance on both training and testing data?

We’ll talk more about model fit later, but in short, we want to see similar performance between the training and testing results, with testing generally slightly worse.

  • If the model does well with training data and much worse with test data, this typically indicates over-fitting, and you probably need a simpler model or more data. We are using about as simple a model as possible, so that’s not an issue here.

  • If the model does badly with both training and testing data, this indicates under-fitting, and we could probably use a more flexible model (more parameters or features).

Sometimes, depending on the random sampling of the train-test split, you may even get slightly better performance on the testing data. This really only happens with smaller data sets such as this one.
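
A quick way to see this split-to-split variability is to re-run the split with a few different seeds (a sketch, reusing x and y from above):

# How much does the test R2 move with the random split?
for seed in range(5):
    xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(xtr, ytr)
    print(f'seed {seed}: test R2 = {model.score(xte, yte):.2f}')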

2.3.1.5. Another Visualization#

We can also visualize the predicted vs actual values. Later, when we use more than one feature in our models, this will be our only good option for visualizing fit.

This plot helps visualize where the model is over-estimating or under-estimating the prices of homes. A point below the line means the actual price is greater than the predicted price, so the model is under-estimating; a point above the line means the predicted price is greater, so the model is over-estimating.

Do you see a problem now?

# Scale prices by $100K for readability; scale factor saved as HK
HK = 100000
low, high = 1, 8

plt.plot([low, high], [low, high], 'k--', label = 'perfect prediction')

# plot the training data
plt.plot(y_train/HK, y_pred_train/HK, 
         '.', color = 'teal', alpha = 0.5, 
         label = 'training data')

# plot the test data
plt.plot(y_test/HK, y_pred_test/HK, 
         '.', color = 'goldenrod', alpha = 0.8, 
         label = 'testing data')

plt.title('Predicted vs Actual Sale Price\nof 3-BR Homes in 98042 (King County, WA)')
plt.xlabel('Actual Sale Price ($100K)')
plt.ylabel('Predicted Sale Price ($100K)')
plt.legend()
plt.show()
[Figure: predicted vs actual sale price for training and testing data, with the perfect-prediction line]