PS02 - Regularization#

Your Name: YOUR NAME
Collaborators:

  • collaborator 1

  • collaborator 2

DS325, Gettysburg College
Prof Eatai Roth

Submit on Moodle
Due before Tuesday, Feb 17, 2026 at 9a. (The next assignment will not be a problem set; it will be assigned on Tuesday, Feb 17, 2026.)

Total pts: 20

IMPORTANT INSTRUCTIONS:#

  • When you submit your code, make sure that every cell runs without returning an error.

  • Once you have the results you need, edit out any extraneous code and outputs.

  • Do not rewrite code if it doesn’t need to be rewritten. For example, in the sample analysis, I write all the import statements you should need. I also perform data cleaning and scaling. Do not redo those in your analysis.

Problem 1#

In this assignment, you’ll be doing a start-to-finish analysis of the King County Housing data. In class, we fit different linear regression models to the data, but never managed to capture the trend sufficiently. Your aim is to fix that.

a. Fitting a LASSO model#

Fit a Lasso model to the original data. Play around with values of alpha (the regularization strength) until you feel you have a good balance of generalization (fewer nonzero coefficients) and goodness-of-fit (\(R^2\)).

  • I do some data cleaning for you.

  • Create dummy variables using pd.get_dummies for each zipcode. Then remove the zipcode column from the original data and add the new dummy columns (a sketch of this step appears just after this list).

  • Split the data 50-50 into training and testing set.

  • Fit two models, LinearRegression and LassoCV (use a range of alphas between 0.01 and 100), and make predictions on both the training and test sets.

  • Calculate \(R^2\) and RMSE for both the training and test sets. Compare these values.

    • Based on this comparison, how do you know if you are under- or over-fitting?

  • Plot the predicted vs actual prices as a scatter plot with the perfect-prediction line superimposed.

    • Where does the model do well? Where does the model over- or under-estimate house values?
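As a rough sketch (not the required solution), the dummy-variable step might look like the following. It assumes housing_df is the cleaned DataFrame produced in the cell further down, and the names used here (zip_dummies, the 'zip' prefix) are only illustrative.

# One-hot encode zipcode; each zipcode becomes its own 0/1 column
zip_dummies = pd.get_dummies(housing_df['zipcode'], prefix='zip', dtype=int)

# Drop the original zipcode column and append the dummy columns
housing_df = pd.concat([housing_df.drop(columns='zipcode'), zip_dummies], axis=1)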

Annotate and comment your code for readability PLEASE.

  • I recommend writing as much of the code from scratch as possible.

  • If copying from the notes, I recommend typing out the lines of code rather than copy-paste.

  • If copy-pasting, I recommend reading every line of code and understanding what it does.

  • You should not be using any generative AI to prompt code, but autocomplete is okay.

Below, I do some data cleaning for you. Start your work below the cell labeled ‘Your code starts here’. A rough sketch of one possible workflow appears just after that cell.

import pandas as pd

pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_colwidth', 15)

import numpy as np
import matplotlib.pyplot as plt
housing_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/kc_house_data.csv')
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
# Keep only houses with 2 to 5 bedrooms
housing_df = housing_df.query('2<=bedrooms<=5')

# Set aside the zipcode column, then drop columns we won't use in the analysis
zipcode_df = housing_df['zipcode']
housing_df.drop(columns = ['id', 'date', 'yr_renovated', 'yr_built', 'waterfront', 'view'], inplace = True)
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import root_mean_squared_error, r2_score
"Your code starts here."
'Your code starts here.'
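As a sketch of one possible workflow (illustrative only; the variable names and the alpha grid are my own choices, and the cell relies on the imports already run above):

# Features and target (assumes the zipcode dummy columns have already been added)
X = housing_df.drop(columns='price')
y = housing_df['price']

# 50-50 train/test split, as specified above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Plain linear regression
lin_model = LinearRegression().fit(X_train, y_train)

# Lasso with cross-validated alpha over roughly 0.01 to 100
alphas = np.logspace(-2, 2, 50)
lasso_model = LassoCV(alphas=alphas, max_iter=10000).fit(X_train, y_train)

# R^2 and RMSE on train and test for both models
for name, model in [('Linear', lin_model), ('Lasso', lasso_model)]:
    for split, X_, y_ in [('train', X_train, y_train), ('test', X_test, y_test)]:
        y_pred = model.predict(X_)
        print(f"{name} {split}: R^2 = {r2_score(y_, y_pred):.3f}, "
              f"RMSE = {root_mean_squared_error(y_, y_pred):,.0f}")

# Predicted vs actual prices on the test set, with the perfect-prediction line
y_pred_test = lasso_model.predict(X_test)
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred_test, s=10, alpha=0.4)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, 'k--', label='perfect prediction')
ax.set_xlabel('Actual price')
ax.set_ylabel('Predicted price')
ax.legend()
plt.show()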

Follow-up questions#

  1. What was the value of alpha you arrived at with LassoCV?

  • your answer here

  2. Which parameters changed most significantly between the linear regression and lasso models? Were any discarded (coefficient equal or close to zero)?

  • your answer here

  3. Give your interpretation of the coefficients for the zipcode features (the dummy columns). Which zipcodes have the highest and the lowest coefficients? What does this tell you about the neighborhoods? (One way to inspect and rank the coefficients is sketched after these questions.)

  • your answer here
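For questions 2 and 3, one way to line up the coefficients from both models and rank the zipcode dummies is sketched below. This is only an illustration; it assumes the lin_model, lasso_model, X_train, and 'zip_' prefix names from the sketch above.

# Coefficients side by side for the two models
coef_df = pd.DataFrame({'linear': lin_model.coef_, 'lasso': lasso_model.coef_},
                       index=X_train.columns)

# Features the lasso effectively discarded (coefficients at or near zero)
print(coef_df[coef_df['lasso'].abs() < 1e-6])

# Zipcode dummies ranked by lasso coefficient, highest to lowest
zip_coefs = coef_df.loc[coef_df.index.str.startswith('zip_'), 'lasso']
print(zip_coefs.sort_values(ascending=False))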

Problem 2#

The synthetic dataset below shows advertising spending on TV, radio, and newspaper, as well as the resulting sales.

Part I

  • Perform any exploratory analysis you feel is relevant.

  • Create training and testing sets with an 80-20 split.

  • Fit the best model you can to the data using LinearRegression, RidgeCV, LassoCV, and/or ElasticNetCV. You will likely want to compare more than one model (a sketch of one comparison appears after the starter cell below).

  • Assess your models with \(R^2\) and RMSE.

Part II

  • Create 2nd order polynomial features, including interaction terms.

  • Repeat the process from Part I, fitting the best model you can to the augmented data (the polynomial-feature step is sketched after the import cell at the end of this notebook).

Answer the questions below.

advert_df = pd.read_csv('https://raw.githubusercontent.com/sandeshajjampur/TV-Radio-Newspaper-Advertising/refs/heads/main/Advertising.csv')
advert_df
         TV  Radio  Newspaper  Sales
0    230.10  37.80      69.20  22.10
1     44.50  39.30      45.10  10.40
2     17.20  45.90      69.30  12.00
3    151.50  41.30      58.50  16.50
4    180.80  10.80      58.40  17.90
..      ...    ...        ...    ...
195   38.20   3.70      13.80   7.60
196   94.20   4.90       8.10  14.00
197  177.00   9.30       6.40  14.80
198  283.60  42.00      66.20  25.50
199  232.10   8.60       8.70  18.40

200 rows × 4 columns

X = advert_df[['TV', 'Radio', 'Newspaper']]
y = advert_df['Sales']

# 80-20 train-test split, as specified in Part I
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
'''Your code here. Add as many cells as you need.'''
'Your code here. Add as many cells as you need.'
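As a sketch of one possible Part I comparison (the alpha grid, l1_ratio values, and the models dictionary are illustrative choices, not requirements; ElasticNetCV needs the extra import shown here):

from sklearn.linear_model import ElasticNetCV  # LinearRegression, RidgeCV, LassoCV are imported above

alphas = np.logspace(-3, 2, 50)
models = {
    'Linear': LinearRegression(),
    'Ridge': RidgeCV(alphas=alphas),
    'Lasso': LassoCV(alphas=alphas, max_iter=10000),
    'ElasticNet': ElasticNetCV(alphas=alphas, l1_ratio=[0.2, 0.5, 0.8], max_iter=10000),
}

# Fit each candidate and report test-set performance
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: R^2 = {r2_score(y_test, y_pred):.3f}, "
          f"RMSE = {root_mean_squared_error(y_test, y_pred):.3f}")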

Follow-up questions#

Part I

  1. Which was your best model? State the model type, the hyper-parameter(s), and the performance measures for that model (evaluated on the test set).

  • your answer here

  2. Rank the features in order of importance, most to least.

  • your answer here

Part II

  1. Which was your best model? State the model type, the hyper-parameter(s), and the performance measures for that model (evaluated on the test set).

  • your answer here

  2. How did performance compare to the original model without polynomial features?

  • your answer here

  3. Rank the features in order of importance, most to least. Were polynomial features useful?

  • your answer here

from sklearn.preprocessing import PolynomialFeatures
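As a sketch of the Part II feature step (assuming the X_train/X_test split from the advertising cell above; degree=2 with include_bias=False produces the squared and interaction terms without a constant column, and the lasso refit at the end is just one illustrative choice):

# Second-order polynomial expansion: original features, their squares, and pairwise interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# The generated feature names make the coefficients easier to interpret
print(poly.get_feature_names_out(X.columns))

# Refit any of the Part I models on the augmented features, e.g. a cross-validated lasso
lasso_poly = LassoCV(alphas=np.logspace(-3, 2, 50), max_iter=10000).fit(X_train_poly, y_train)
print(f"Test R^2: {lasso_poly.score(X_test_poly, y_test):.3f}")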