PS02 - Regularization#

Your Name: YOUR NAME
Collaborators:

  • collaborator 1

  • collaborator 2

DS325, Gettysburg College
Prof Eatai Roth

Submit on Moodle
Due before Tuesday, Feb 17, 2026 at 9a. (The next assignment will not be a problem set; it will be assigned on Tuesday, Feb 17, 2026.)

Total pts: 20

IMPORTANT INSTRUCTIONS:#

  • When you submit your code, make sure that every cell runs without returning an error.

  • Once you have the results you need, edit out any extraneous code and outputs.

  • Do not rewrite code if it doesn’t need to be rewritten. For example, in the sample analysis, I write all the import statements you should need. I also perform data cleaning and scaling. Do not redo those in your analysis.

Problem 1#

In this assignment, you’ll be doing a start-to-finish analysis of the King County Housing data. In class, we fit different linear regression models to the data, but never managed to capture the trend sufficiently. Your aim is to fix that.

a. Fitting a LASSO model#

Fit a Lasso model to the original data. Play around with values of alpha (the regularization strength) until you feel you have a good balance of generalization (fewer nonzero coefficients) and goodness-of-fit (\(R^2\)).

  • I do some data cleaning for you.

  • Create dummy variables using pd.get_dummies for each zipcode. Then remove the zipcode column from the original data and add the new dummy columns (a sketch of this step appears just after this list).

  • Split the data 50-50 into training and testing set.

  • Fit two models, LinearRegression and LassoCV (use a range of alphas between 0.01 and 100), and make predictions on both the training and test sets.

  • Calculate \(R^2\) and RMSE for both the training and test sets. Compare these values.

    • Based on this comparison, how do you know if you are under- or over-fitting?

  • Plot the predicted vs actual prices as a scatter plot with the perfect-prediction line superimposed.

    • Where does the model do well? Where does the model over- or under-estimate house values?
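As a rough sketch (not the required solution), the dummy-variable step might look like the following. It assumes housing_df is the cleaned DataFrame produced in the cell further down, and the names used here (zip_dummies, the 'zip' prefix) are only illustrative.

# One-hot encode zipcode; each zipcode becomes its own 0/1 column
zip_dummies = pd.get_dummies(housing_df['zipcode'], prefix='zip', dtype=int)

# Drop the original zipcode column and append the dummy columns
housing_df = pd.concat([housing_df.drop(columns='zipcode'), zip_dummies], axis=1)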

Annotate and comment your code for readability PLEASE.

  • I recommend writing as much of the code from scratch as possible.

  • If copying from the notes, I recommend typing out the lines of code rather than copy-paste.

  • If copy-pasting, I recommend reading every line of code and understanding what it does.

  • You should not be using any generative AI to prompt code, but autocomplete is okay.

Below, I do some data cleaning for you. Start your work below the cell labeled ‘Your code starts here’. A rough sketch of one possible workflow appears just after that cell.

import pandas as pd

pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_colwidth', 15)

import numpy as np
import matplotlib.pyplot as plt
housing_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/kc_house_data.csv')
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
# Keep only houses with 2 to 5 bedrooms
housing_df = housing_df.query('2<=bedrooms<=5')

# Set aside the zipcode column, then drop columns we won't use in the analysis
zipcode_df = housing_df['zipcode']
housing_df.drop(columns = ['id', 'date', 'yr_renovated', 'yr_built', 'waterfront', 'view'], inplace = True)
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import root_mean_squared_error, r2_score
"Your code starts here."
'Your code starts here.'
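As a sketch of one possible workflow (illustrative only; the variable names and the alpha grid are my own choices, and the cell relies on the imports already run above):

# Features and target (assumes the zipcode dummy columns have already been added)
X = housing_df.drop(columns='price')
y = housing_df['price']

# 50-50 train/test split, as specified above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Plain linear regression
lin_model = LinearRegression().fit(X_train, y_train)

# Lasso with cross-validated alpha over roughly 0.01 to 100
alphas = np.logspace(-2, 2, 50)
lasso_model = LassoCV(alphas=alphas, max_iter=10000).fit(X_train, y_train)

# R^2 and RMSE on train and test for both models
for name, model in [('Linear', lin_model), ('Lasso', lasso_model)]:
    for split, X_, y_ in [('train', X_train, y_train), ('test', X_test, y_test)]:
        y_pred = model.predict(X_)
        print(f"{name} {split}: R^2 = {r2_score(y_, y_pred):.3f}, "
              f"RMSE = {root_mean_squared_error(y_, y_pred):,.0f}")

# Predicted vs actual prices on the test set, with the perfect-prediction line
y_pred_test = lasso_model.predict(X_test)
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred_test, s=10, alpha=0.4)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, 'k--', label='perfect prediction')
ax.set_xlabel('Actual price')
ax.set_ylabel('Predicted price')
ax.legend()
plt.show()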

Follow-up questions#

  1. What was the value of alpha you arrived at with LassoCV?

  • your answer here

  2. Which parameters changed most significantly between the linear regression and lasso models? Were any discarded (coefficient equal or close to zero)?

  • your answer here

  3. Give your interpretation of the coefficients for the zipcode features (the dummy columns). Which zipcodes have the highest and the lowest coefficients? What does this tell you about the neighborhoods? (One way to inspect and rank the coefficients is sketched after these questions.)

  • your answer here
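For questions 2 and 3, one way to line up the coefficients from both models and rank the zipcode dummies is sketched below. This is only an illustration; it assumes the lin_model, lasso_model, X_train, and 'zip_' prefix names from the sketch above.

# Coefficients side by side for the two models
coef_df = pd.DataFrame({'linear': lin_model.coef_, 'lasso': lasso_model.coef_},
                       index=X_train.columns)

# Features the lasso effectively discarded (coefficients at or near zero)
print(coef_df[coef_df['lasso'].abs() < 1e-6])

# Zipcode dummies ranked by lasso coefficient, highest to lowest
zip_coefs = coef_df.loc[coef_df.index.str.startswith('zip_'), 'lasso']
print(zip_coefs.sort_values(ascending=False))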

Problem 2#

The synthetic dataset below shows advertising spending on TV, radio, and newspaper, as well as the resulting sales.

Part I

  • Perform any exploratory analysis you feel is relevant.

  • Create training and testing sets with an 80-20 split.

  • Fit the best model you can to the data using LinearRegression, RidgeCV, LassoCV, and/or ElasticNetCV. You will likely want to compare more than one model (a sketch of one comparison appears after the starter cell below).

  • Assess your models with \(R^2\) and RMSE.

Part II

  • Create 2nd order polynomial features, including interaction terms.

  • Repeat the process from Part I, fitting the best model you can to the augmented data (the polynomial-feature step is sketched after the import cell at the end of this notebook).

Answer the questions below.

advert_df = pd.read_csv('https://raw.githubusercontent.com/sandeshajjampur/TV-Radio-Newspaper-Advertising/refs/heads/main/Advertising.csv')
advert_df
         TV  Radio  Newspaper  Sales
0    230.10  37.80      69.20  22.10
1     44.50  39.30      45.10  10.40
2     17.20  45.90      69.30  12.00
3    151.50  41.30      58.50  16.50
4    180.80  10.80      58.40  17.90
..      ...    ...        ...    ...
195   38.20   3.70      13.80   7.60
196   94.20   4.90       8.10  14.00
197  177.00   9.30       6.40  14.80
198  283.60  42.00      66.20  25.50
199  232.10   8.60       8.70  18.40

200 rows × 4 columns

X = advert_df[['TV', 'Radio', 'Newspaper']]
y = advert_df['Sales']

# 80-20 train-test split, as specified in Part I
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
'''Your code here. Add as many cells as you need.'''
'Your code here. Add as many cells as you need.'
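As a sketch of one possible Part I comparison (the alpha grid, l1_ratio values, and the models dictionary are illustrative choices, not requirements; ElasticNetCV needs the extra import shown here):

from sklearn.linear_model import ElasticNetCV  # LinearRegression, RidgeCV, LassoCV are imported above

alphas = np.logspace(-3, 2, 50)
models = {
    'Linear': LinearRegression(),
    'Ridge': RidgeCV(alphas=alphas),
    'Lasso': LassoCV(alphas=alphas, max_iter=10000),
    'ElasticNet': ElasticNetCV(alphas=alphas, l1_ratio=[0.2, 0.5, 0.8], max_iter=10000),
}

# Fit each candidate and report test-set performance
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: R^2 = {r2_score(y_test, y_pred):.3f}, "
          f"RMSE = {root_mean_squared_error(y_test, y_pred):.3f}")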

Follow-up questions#

Part I

  1. Which was your best model? State the model type, the hyper-parameter(s), and the performance measures for that model (evaluated on the test set).

  • your answer here

  2. Rank the features in order of importance, most to least.

  • your answer here

Part II

  1. Which was your best model? State the model type, the hyper-parameter(s), and the performance measures for that model (evaluated on the test set).

  • your answer here

  2. How did performance compare to the original model without polynomial features?

  • your answer here

  3. Rank the features in order of importance, most to least. Were polynomial features useful?

  • your answer here

from sklearn.preprocessing import PolynomialFeatures
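As a sketch of the Part II feature step (assuming the X_train/X_test split from the advertising cell above; degree=2 with include_bias=False produces the squared and interaction terms without a constant column, and the lasso refit at the end is just one illustrative choice):

# Second-order polynomial expansion: original features, their squares, and pairwise interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# The generated feature names make the coefficients easier to interpret
print(poly.get_feature_names_out(X.columns))

# Refit any of the Part I models on the augmented features, e.g. a cross-validated lasso
lasso_poly = LassoCV(alphas=np.logspace(-3, 2, 50), max_iter=10000).fit(X_train_poly, y_train)
print(f"Test R^2: {lasso_poly.score(X_test_poly, y_test):.3f}")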