PS02 - Regularization#
Your Name: YOUR NAME
Collaborators:
collaborator 1
collaborator 2
DS325, Gettysburg College
Prof Eatai Roth
Submit on Moodle
Due before Tuesday, Feb 17, 2026 at 9:00 a.m. (the next assignment will not be a problem set; it will be assigned on Tuesday, Feb 17, 2026).
Total pts: 20
IMPORTANT INSTRUCTIONS:#
When you submit your code, make sure that every cell runs without returning an error.
Once you have the results you need, edit out any extraneous code and outputs.
Do not rewrite code if it doesn’t need to be rewritten. For example, in the sample analysis, I write all the import statements you should need. I also perform data cleaning and scaling. Do not redo those in your analysis.
Problem 1#
In this assignment, you’ll be doing a start-to-finish analysis of the King County Housing data. In class, we fit different linear regression models to the data, but never managed to capture the trend sufficiently. Your aim is to fix that.
a. Fitting a LASSO model#
Fit a Lasso model to the original data. Play around with values of alpha (the regularization strength) until you feel you have a good balance of generalization (fewer parameters) and goodness-of-fit (\(R^2\)).
I do some data cleaning for you.
Create dummy variables using pd.get_dummies for each zip code. Then remove the zipcode column from the original data and add the new dummy columns.
Split the data 50-50 into training and testing set.
Fit two models using LinearRegression and LassoCV (use a range of alphas between 0.01 and 100) and make predictions using both the training and test set.
Calculate the \(R^2\) and RMSE for both training and test set. Compare these values.
Based on this comparison, how do you know if you are under- or over-fitting?
Plot the predicted vs actual prices as a scatter plot with the perfect-prediction line superimposed.
Where does the model do well? Where does the model over- or under-estimate house values?
Annotate and comment your code for readability PLEASE.
I recommend writing as much of the code from scratch as possible.
If copying from the notes, I recommend typing out the lines of code rather than copy-paste.
If copy-pasting, I recommend reading every line of code and understanding what it does.
You should not be using any generative AI to prompt code, but autocomplete is okay.
Below, I do some data cleaning for you. Start your work below the cell labeled ‘Your code starts here’.
import pandas as pd
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_colwidth', 15)
import numpy as np
import matplotlib.pyplot as plt
housing_df = pd.read_csv('https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/kc_house_data.csv')
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 21613 non-null int64
1 date 21613 non-null object
2 price 21613 non-null float64
3 bedrooms 21613 non-null int64
4 bathrooms 21613 non-null float64
5 sqft_living 21613 non-null int64
6 sqft_lot 21613 non-null int64
7 floors 21613 non-null float64
8 waterfront 21613 non-null int64
9 view 21613 non-null int64
10 condition 21613 non-null int64
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
housing_df = housing_df.query('2<=bedrooms<=5')
zipcode_df = housing_df['zipcode']
housing_df.drop(columns = ['id', 'date', 'yr_renovated', 'yr_built', 'waterfront', 'view'], inplace = True)
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import root_mean_squared_error, r2_score
"Your code starts here."
Follow-up questions#
What was the value of alpha you arrived at with LassoCV?
your answer here
Which parameters changed most significantly between the linear regression and lasso models? Were any discarded (coefficient equal or close to zero)?
your answer here
Give your interpretation of the coefficients for the zipcode features (the dummy columns). Which zipcodes have the highest and the lowest coefficients? What does this tell you about the neighborhoods?
your answer here
Problem 2#
The synthetic dataset below shows advertising spending in TV, radio, and newspaper as well as the resultant sales.
Part I
Perform any exploratory analysis you feel is relevant.
Create training and testing sets with an 80-20 split.
Fit the best model you can to the data using LinearRegression, RidgeCV, LassoCV, and/or ElasticNetCV. You will likely want to compare more than one model.
Assess your models with \(R^2\) and RMSE.
Part II
Create 2nd order polynomial features, including interaction terms.
Repeat the process for Part I, fitting the best model you can to the augmented data.
Answer the questions below.
advert_df = pd.read_csv('https://raw.githubusercontent.com/sandeshajjampur/TV-Radio-Newspaper-Advertising/refs/heads/main/Advertising.csv')
advert_df
| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.10 | 37.80 | 69.20 | 22.10 |
| 1 | 44.50 | 39.30 | 45.10 | 10.40 |
| 2 | 17.20 | 45.90 | 69.30 | 12.00 |
| 3 | 151.50 | 41.30 | 58.50 | 16.50 |
| 4 | 180.80 | 10.80 | 58.40 | 17.90 |
| ... | ... | ... | ... | ... |
| 195 | 38.20 | 3.70 | 13.80 | 7.60 |
| 196 | 94.20 | 4.90 | 8.10 | 14.00 |
| 197 | 177.00 | 9.30 | 6.40 | 14.80 |
| 198 | 283.60 | 42.00 | 66.20 | 25.50 |
| 199 | 232.10 | 8.60 | 8.70 | 18.40 |
200 rows × 4 columns
X = advert_df[['TV', 'Radio', 'Newspaper']]
y = advert_df['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
'''Your code here. Add as many cells as you need.'''
Follow-up questions#
Part I
Which was your best model? State the model type and the hyper-parameter(s), as well as the performance measures for that model (evaluated on the test set).
your answer here
Rank the features in order of importance, most to least.
your answer here
Part II
Which was your best model? State the model type and the hyper-parameter(s), as well as the performance measures for that model (evaluated on the test set).
your answer here
How did performance compare to the original model without polynomial features?
your answer here
Rank the features in order of importance, most to least. Were polynomial features useful?
your answer here
from sklearn.preprocessing import PolynomialFeatures