4. Polynomial Regression#

…because nothing in this world is truly linear.

Sometimes our data just aren’t linear. We can extend the linear regression approach to fitting ‘curvy’ data in several ways (e.g. we fit an exponential curve on PS1). Today we’ll introduce polynomial regression.

4.1. What is a polynomial?#

A polynomial is an equation composed of variables raised to non-negative integer powers (e.g. squared, cubed, etc), each multiplied by a coefficient, and summed together (possibly including products of different variables).

Examples:

  • \(y = x^2\)

  • \(y = 5 x^3 + 4.5 x - 3.6\)

  • \(y = x_1^2 + 3.7 x_1 x_2 +2.1 x_2^2\)

And Taylor’s theorem tells us that any smooth function can be approximated arbitrarily closely, within some range, by a polynomial (if you took calculus, this is the foundation of the Taylor series).
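To see where the hard-coded coefficients in the next cell come from, here is a quick sketch (the helper `taylor_coeff` is ours, not from any library) that computes the Taylor coefficients of \(\sin x + \cos 1.7x\) about \(x = 0\):

```python
import math

def taylor_coeff(n):
    """nth Taylor coefficient of sin(x) + cos(1.7x) about x = 0."""
    # The nth derivative of sin(x) at 0 cycles through 0, 1, 0, -1
    sin_part = [0, 1, 0, -1][n % 4]
    # The nth derivative of cos(1.7x) at 0 cycles through 1, 0, -1, 0, times 1.7^n
    cos_part = [1, 0, -1, 0][n % 4] * 1.7**n
    return (sin_part + cos_part) / math.factorial(n)

for n in range(5):
    print(n, round(taylor_coeff(n), 10))
# 0 -> 1.0, 1 -> 1.0, 2 -> -1.445, 3 -> -0.1666666667, 4 -> 0.3480041667
```

These match the coefficients used for `y1` through `y14` below.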

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-5, 5, 0.01)
y = np.sin(x) + np.cos(1.7 * x)

y1 = 1 + x
y2 = y1 - 1.445 * x**2
y3 = y2 - 0.1666666667 * x**3

y8 = y3 + 0.3480041667 * x**4 \
        + 0.0083333333 * x**5 \
        - 0.0335244014 * x**6 \
        - 0.0001984127 * x**7 \
        + 0.0017300986 * x**8

y14 = 1.0 \
        + 1.0 * x \
        - 1.445 * x**2 \
        - 0.1666666667 * x**3 \
        + 0.3480041667 * x**4 \
        + 0.0083333333 * x**5 \
        - 0.0335244014 * x**6 \
        - 0.0001984127 * x**7 \
        + 0.0017300986 * x**8 \
        + 0.0000027557 * x**9 \
        - 0.0000555554 * x**10 \
        - 0.0000000251 * x**11 \
        + 0.0000012163 * x**12 \
        + 0.0000000002 * x**13 \
        - 0.0000000193 * x**14

fig, ax = plt.subplots(1,5, figsize = (15, 3), sharex = True, sharey = True)
ax[0].plot(x, y, 'k', linewidth = 2)
ax[0].plot(x, y1, 'r--', label = '1st order')
ax[0].text(.2, 2, '1st order', fontsize = 13)

ax[1].plot(x, y, 'k', linewidth = 2)
ax[1].plot(x, y2, 'r--', label = '2nd order')
ax[1].text(.2, 2, '2nd order', fontsize = 13)

ax[2].plot(x, y, 'k', linewidth = 2)
ax[2].plot(x, y3, 'r--', label = '3rd order')
ax[2].text(.2, 2, '3rd order', fontsize = 13)

ax[3].plot(x, y, 'k', linewidth = 2)
ax[3].plot(x, y8, 'r--', label = '8th order')
ax[3].text(.2, 2, '8th order', fontsize = 13)

ax[4].plot(x, y, 'k', linewidth = 2)
ax[4].plot(x, y14, 'r--', label = '14th order')
ax[4].text(.2, 2, '14th order', fontsize = 13)

plt.ylim([-2.5, 2.5])

plt.show()
*(Figure: Taylor polynomial approximations of orders 1, 2, 3, 8, and 14 (red, dashed) overlaid on \(y = \sin x + \cos 1.7x\) (black).)*

4.2. But polynomials aren’t linear!#

True.

So we’re going to create new features from which we can build a polynomial, and our equation will be linear with respect to these new features. Whenever we create new features from existing features, we call that feature engineering.

Take for example the polynomial expression:

\[y = 5 + 3.1 x + 2.6 x^2 - 4 x^3\]

Now let’s make a new set of features:

\[\begin{split} \begin{align*} v_1 &= x\\ v_2 &= x^2\\ v_3 &= x^3 \end{align*} \end{split}\]

Now substituting our new variables into our polynomial:

\[ y = 5 + 3.1 v_1 + 2.6 v_2 - 4 v_3 \]

This equation is linear with respect to our new features (\(v_1, v_2, v_3\)), so we can use linear regression to find the best fit coefficients!
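Concretely, building these features by hand is just stacking powers of \(x\) as columns of a design matrix (a small sketch with made-up inputs):

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0])   # made-up sample inputs

# Each engineered feature v_i = x**i becomes one column of the design matrix
V = np.column_stack([x, x**2, x**3])

# y is then a *linear* combination of the columns of V
y = 5 + V @ np.array([3.1, 2.6, -4.0])
print(y)   # [  6.7   6.7 -10.4]
```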

4.2.1. Engineering Polynomial features#

Scikit-learn provides a transformer, PolynomialFeatures, that creates new polynomial features from our existing features. Let’s examine it.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

X = np.array([1, 2, 3]).reshape(-1, 1)

poly = PolynomialFeatures(degree=5, include_bias = False)
V = poly.fit_transform(X)

print(poly.get_feature_names_out())
print(V)

Notice that each column of V corresponds to one of our new features (\(x, x^2, \ldots, x^5\)), and each row to one of our original samples.

4.2.2. Creating some curvy synthetic data#

num_samples = 80

# Define the true polynomial's coefficients (highest degree first, as np.polyval expects)
poly_coeff = [-0.5, -4.2, 0, 0.9, 2]
my_poly = lambda z: np.polyval(poly_coeff, z)  # -0.5*z^4 - 4.2*z^3 + 0.9*z + 2

# Generate random x data points
np.random.seed(98)
x = np.sort(np.linspace(-1, 1, num_samples) + np.random.normal(scale=0.1, size=num_samples))

# Plug the x data points into the polynomial function and add some noise
y = my_poly(x) + np.random.normal(loc=0, scale=0.5, size=num_samples)

# x_range is an array of data points in the x direction. y_range is the polynomial plotted over that range.
# These will be used to display the models.
x_range = np.linspace(-1, 1, 100).reshape(-1, 1)
y_range = my_poly(x_range)

# Plot the data points and polynomial curves
plt.scatter(x, y, s=10)
plt.plot(x_range, y_range, 'r--')

# Get y-axis limits for the plot
my_ylims = plt.gca().get_ylim()

plt.show()
*(Figure: noisy synthetic data points with the true polynomial shown as a red dashed curve.)*

4.2.3. Fitting the models#

Suppose we were presented with the data above, not having made it ourselves. How would we know what degree of polynomial to use? We wouldn’t.

So, we’re going to fit polynomials of different degrees and inspect the results. We’ll also fit Lasso, Ridge, and Elastic Net models to the data.

  • Linear Regression with polynomial features up to degree 1, 2, 4, 8, and 16

  • Lasso, Ridge, and Elastic Net with polynomial features up to degree 16 (we’ll let regularization decide what stays and what goes)

## Linear Regressions

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x.reshape(-1,1), y, 
                                                    test_size = 0.5)

# Define the degrees of the polynomial features to be tested
degs = [1, 2, 4, 8, 16]

# Initialize a dictionary to store the models
models = {}

# Loop over each degree
for deg in degs:
    # Create polynomial features
    poly = PolynomialFeatures(degree=deg, include_bias=False)    
    V_train = poly.fit_transform(x_train)
    V_test = poly.transform(x_test)
    v_names = poly.get_feature_names_out()

    
    # Standardize the features
    ss = StandardScaler()
    Z_train = ss.fit_transform(V_train)
    Z_test = ss.transform(V_test)

    # Train a linear regression model on the scaled polynomial features
    reg_poly = LinearRegression()
    reg_poly.fit(Z_train, y_train)

    # Transform the x_range using the polynomial features and the scaler
    # These aren't data, they're x- and y-values used to plot the model
    z_model = ss.transform(poly.transform(x_range))
    y_model = reg_poly.predict(z_model)

    # Calculate the R^2 score for training and testing sets
    R2_train = reg_poly.score(Z_train, y_train)
    R2_test = reg_poly.score(Z_test, y_test)
    
    # Store the model and its details in the dictionary
    key = f'Poly_{deg}'
    models[key] = dict(
        title = f'Poly {deg}',
        feature_names = v_names,
        scaler = ss,
        model = reg_poly,
        z_model = z_model,
        y_model = y_model,
        R2_train = R2_train,
        R2_test = R2_test
    )
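The transform → scale → fit sequence in the loop above can also be bundled into a scikit-learn Pipeline, which keeps the steps together and guarantees the scaler is only ever fit on training data. A sketch on made-up data (the names `x_demo`/`y_demo` are just for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, size=(50, 1))
y_demo = 2 + 0.9 * x_demo[:, 0] - 4.2 * x_demo[:, 0]**3 + rng.normal(scale=0.1, size=50)

# The pipeline chains fit_transform/transform calls automatically, in order
pipe = make_pipeline(
    PolynomialFeatures(degree=4, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
pipe.fit(x_demo, y_demo)
print(pipe.score(x_demo, y_demo))
```

Calling `pipe.predict` on new x-values then applies the polynomial expansion and scaling for us, so there is no risk of forgetting a `transform` step.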
## Lasso

from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

alpha_test = np.logspace(-2,3,51)

# Note: Z_train and Z_test still hold the degree-16 features from the last loop iteration
lasso_poly = LassoCV(alphas = alpha_test)
lasso_poly.fit(Z_train, y_train)

y_model = lasso_poly.predict(z_model)

R2_train = lasso_poly.score(Z_train, y_train)
R2_test = lasso_poly.score(Z_test, y_test)

models['lasso'] = dict(
        title = f'Lasso (alpha:{lasso_poly.alpha_:0.2f})',
        params = lasso_poly.alpha_,
        feature_names = v_names,
        scaler = ss,
        model = lasso_poly,
        y_model = y_model,
        R2_train = R2_train,
        R2_test = R2_test
    )
## Ridge

ridge_poly = RidgeCV(alphas = alpha_test)
ridge_poly.fit(Z_train, y_train)

y_model = ridge_poly.predict(z_model)

R2_train = ridge_poly.score(Z_train, y_train)
R2_test = ridge_poly.score(Z_test, y_test)

models['ridge'] = dict(
        title = f'Ridge (alpha:{ridge_poly.alpha_:0.2f})',
        params = ridge_poly.alpha_,
        feature_names = v_names,
        model = ridge_poly,
        scaler = ss,
        y_model = y_model,
        R2_train = R2_train,
        R2_test = R2_test
    )
## Elastic Net

l1_ratio_test = [0.1, 0.25, 0.5, 0.75, 0.9]

elastic_poly = ElasticNetCV(alphas = alpha_test, l1_ratio = l1_ratio_test)
elastic_poly.fit(Z_train, y_train)

y_model = elastic_poly.predict(z_model)

R2_train = elastic_poly.score(Z_train, y_train)
R2_test = elastic_poly.score(Z_test, y_test)

models['elastic'] = dict(
        title = f'Elastic Net (alpha:{elastic_poly.alpha_:0.2f}, l1:{elastic_poly.l1_ratio_:0.2f})',
        params = [elastic_poly.alpha_, elastic_poly.l1_ratio_],
        feature_names = v_names,
        scaler = ss,
        model = elastic_poly,
        y_model = y_model,
        R2_train = R2_train,
        R2_test = R2_test
    )

4.2.4. Displaying the model results#

fig,ax = plt.subplots(len(models), 2, figsize = (10,30), sharex=True, sharey=True)

x_range = np.linspace(-1,1,100).reshape(-1,1)
y_true = my_poly(x_range)

for k, key in enumerate(models):
    reg = models[key]

    ax[k,0].scatter(x_train, y_train, s=10, color = 'teal')
    ax[k,0].plot(x_range, reg['y_model'], 'k--')
    ax[k,0].plot(x_range, y_range, 'r--', alpha = 0.3)

    ax[k,0].text(0.5, 0.8, f"{reg['title']}\n$R^2 = ${reg['R2_train']:.2}", transform=ax[k,0].transAxes)


    ax[k,0].set_ylim(my_ylims)

    ax[k,1].scatter(x_test, y_test, s=10, color = 'goldenrod')
    ax[k,1].plot(x_range, y_range, 'r--', alpha = 0.3)
    ax[k,1].plot(x_range, reg['y_model'], 'k--')  
    
    ax[k,1].text(0.5,0.8, f'$R^2 = ${reg["R2_test"]:.2}', transform=ax[k,1].transAxes)

ax[0,0].set_title('Training Set')
ax[0,1].set_title('Testing Set')
plt.show()
*(Figure: each model's fit (black, dashed) and the true polynomial (red, dashed) over the training set (left column) and testing set (right column), with \(R^2\) annotated.)*
# When we fit our model with standardized features, the coefficients we solve for are with respect to the scaled features
# This function converts the parameters back to the scale of the original features (before standardization)

def unscale_coefs(scaler, coef, intercept = 0):
    sc = np.array(scaler.scale_)
    me = np.array(scaler.mean_)
    
    coef_unsc = np.true_divide(coef, sc)
    intercept_unsc = intercept - np.dot(coef_unsc, me)
    
    return coef_unsc, intercept_unsc
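A quick sanity check, on made-up data, that the un-scaled coefficients reproduce the scaled model's predictions (the function is repeated here so the cell runs standalone):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

def unscale_coefs(scaler, coef, intercept=0):
    coef_unsc = coef / scaler.scale_
    intercept_unsc = intercept - np.dot(coef_unsc, scaler.mean_)
    return coef_unsc, intercept_unsc

rng = np.random.default_rng(1)
V = rng.normal(loc=5, scale=2, size=(20, 3))     # un-standardized features
y = V @ np.array([1.0, -2.0, 0.5]) + 3.0         # exact linear relationship

ss = StandardScaler().fit(V)
reg = LinearRegression().fit(ss.transform(V), y)

coef_u, int_u = unscale_coefs(ss, reg.coef_, reg.intercept_)

# Same predictions whether we scale first or use the un-scaled coefficients
print(np.allclose(V @ coef_u + int_u, reg.predict(ss.transform(V))))   # True
```

Because the made-up `y` is exactly linear in `V`, the un-scaled coefficients also recover the original \([1.0, -2.0, 0.5]\) and intercept \(3.0\).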

# A function to create a string of the equation of a model
def model_string(model = None, intercept=0, coef = [], X_names = [], model_name = None):
    # A function to print the equation of a linear model
    if model is not None:
        intercept = model.intercept_
        coef = model.coef_

    if model_name is None:
        model_str = f'y ='
    else:
        model_str = f'{model_name}:\n y ='

    if intercept != 0:
        model_str += f' {intercept:.2f}'

    for c, x in zip(coef, X_names):
        if np.isclose(c, 0):
            continue
        elif c < 0:
            model_str += f' - {-c:.2f}*{x}'
        else:
            model_str += f' + {c:.2f}*{x}'
    return model_str
    
import warnings
warnings.filterwarnings('ignore')

coeff_df = pd.DataFrame()
nans = [np.nan]*17
coeff_df['True'] = nans
coeff_df.loc[np.arange(len(poly_coeff)), 'True'] = poly_coeff[::-1]
model_eqs = []

for k, key in enumerate(models):
    
    reg = models[key]
    title = reg['title']
    
    coefs_unscaled, intercept_unscaled = unscale_coefs(reg['scaler'], reg['model'].coef_, reg['model'].intercept_)
  
    coeffs = np.concatenate((np.array([intercept_unscaled]), coefs_unscaled))
    coeff_df[title] = nans
    coeff_df.loc[np.arange(len(coeffs)), title] = coeffs
    
    # Model strings with scaled model parameters
    # model_str = model_string(model = reg['model'], X_names = reg['feature_names'], model_name = reg['title'])
    
    # Model strings with un-scaled model parameters
    model_str = model_string(coef=coefs_unscaled, intercept=intercept_unscaled, X_names = reg['feature_names'], model_name = reg['title'])

    model_eqs.append(model_str)


true_model_eq = model_string(coef = poly_coeff[-2::-1], intercept = poly_coeff[-1], X_names = reg['feature_names'], model_name = 'True model')
model_eqs.append(true_model_eq)
pd.options.display.float_format = '{:.2f}'.format
coeff_df.fillna('')
|    | True  | Poly 1 | Poly 2 | Poly 4 | Poly 8 | Poly 16   | Lasso (alpha:0.02) | Ridge (alpha:3.16) | Elastic Net (alpha:0.02, l1:0.90) |
|----|-------|--------|--------|--------|--------|-----------|--------------------|--------------------|-----------------------------------|
| 0  | 2.00  | 1.89   | 2.12   | 2.12   | 2.12   | 2.35      | 2.08               | 2.08               | 2.08                              |
| 1  | 0.90  | -2.12  | -2.10  | 1.02   | 0.38   | 2.63      | -0.00              | -0.21              | 0.00                              |
| 2  | 0.00  |        | -0.61  | -0.57  | -1.26  | -26.64    | -0.49              | -0.39              | -0.50                             |
| 3  | -4.20 |        |        | -4.35  | -1.71  | -91.67    | -0.73              | -0.91              | -0.78                             |
| 4  | -0.50 |        |        | 0.04   | 4.88   | 406.72    | -0.00              | -0.27              | -0.00                             |
| 5  |       |        |        |        | -1.95  | 845.30    | -2.67              | -0.89              | -2.61                             |
| 6  |       |        |        |        | -9.75  | -2567.79  | 0.00               | -0.15              | -0.00                             |
| 7  |       |        |        |        | -0.27  | -3314.14  | -0.00              | -0.65              | -0.00                             |
| 8  |       |        |        |        | 5.66   | 8352.32   | 0.00               | -0.01              | 0.00                              |
| 9  |       |        |        |        |        | 6567.95   | -0.00              | -0.40              | -0.00                             |
| 10 |       |        |        |        |        | -15246.20 | 0.00               | 0.08               | 0.00                              |
| 11 |       |        |        |        |        | -7029.38  | -0.00              | -0.23              | -0.00                             |
| 12 |       |        |        |        |        | 15772.43  | 0.00               | 0.14               | 0.00                              |
| 13 |       |        |        |        |        | 3911.09   | -0.00              | -0.12              | -0.00                             |
| 14 |       |        |        |        |        | -8640.13  | 0.00               | 0.17               | 0.00                              |
| 15 |       |        |        |        |        | -895.20   | -0.00              | -0.06              | -0.00                             |
| 16 |       |        |        |        |        | 1949.14   | 0.09               | 0.17               | 0.09                              |
for t in model_eqs:
    print(t)
Poly 1:
 y = 1.89 - 2.12*x0
Poly 2:
 y = 2.12 - 2.10*x0 - 0.61*x0^2
Poly 4:
 y = 2.12 + 1.02*x0 - 0.57*x0^2 - 4.35*x0^3 + 0.04*x0^4
Poly 8:
 y = 2.12 + 0.38*x0 - 1.26*x0^2 - 1.71*x0^3 + 4.88*x0^4 - 1.95*x0^5 - 9.75*x0^6 - 0.27*x0^7 + 5.66*x0^8
Poly 16:
 y = 2.35 + 2.63*x0 - 26.64*x0^2 - 91.67*x0^3 + 406.72*x0^4 + 845.30*x0^5 - 2567.79*x0^6 - 3314.14*x0^7 + 8352.32*x0^8 + 6567.95*x0^9 - 15246.20*x0^10 - 7029.38*x0^11 + 15772.43*x0^12 + 3911.09*x0^13 - 8640.13*x0^14 - 895.20*x0^15 + 1949.14*x0^16
Lasso (alpha:0.02):
 y = 2.08 - 0.49*x0^2 - 0.73*x0^3 - 2.67*x0^5 + 0.09*x0^16
Ridge (alpha:3.16):
 y = 2.08 - 0.21*x0 - 0.39*x0^2 - 0.91*x0^3 - 0.27*x0^4 - 0.89*x0^5 - 0.15*x0^6 - 0.65*x0^7 - 0.01*x0^8 - 0.40*x0^9 + 0.08*x0^10 - 0.23*x0^11 + 0.14*x0^12 - 0.12*x0^13 + 0.17*x0^14 - 0.06*x0^15 + 0.17*x0^16
Elastic Net (alpha:0.02, l1:0.90):
 y = 2.08 - 0.50*x0^2 - 0.78*x0^3 - 2.61*x0^5 + 0.09*x0^16
True model:
 y = 2.00 + 0.90*x0 - 4.20*x0^3 - 0.50*x0^4

Which model is best?

4.3. Example: Auto HP vs MPG#

Horsepower and fuel efficiency are known to have a non-linear relationship. Let's try resolving this relationship with polynomial features.
data_df = pd.read_csv('https://raw.githubusercontent.com/shreyamdg/automobile-data-set-analysis/refs/heads/master/cars.csv')
data_df.head()

data_df.columns
Index(['Unnamed: 0', 'symboling', 'normalized-losses', 'make', 'fuel-type',
       'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
       'engine-location', 'wheel-base', 'length', 'width', 'height',
       'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size',
       'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak-rpm', 'city-mpg', 'highway-mpg', 'price'],
      dtype='object')
cars_df = data_df.query('`fuel-type`=="gas"')

# Focus on horsepower vs MPG; .copy() avoids SettingWithCopyWarning on the edits below
cars_df = cars_df[['horsepower', 'city-mpg']].copy()
cars_df['horsepower'].unique()
array(['111', '154', '102', '115', '110', '140', '160', '101', '121',
       '182', '48', '70', '68', '88', '145', '58', '76', '60', '86',
       '100', '78', '90', '176', '262', '135', '84', '120', '155', '184',
       '175', '116', '69', '97', '152', '200', '95', '142', '143', '207',
       '288', '?', '73', '82', '94', '62', '112', '92', '161', '156',
       '85', '114', '162', '134'], dtype=object)
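That '?' entry means the column was read as strings. `pd.to_numeric` with `errors='coerce'` converts the valid strings and turns anything unparseable into NaN, e.g.:

```python
import pandas as pd

s = pd.Series(['111', '154', '?'])
print(pd.to_numeric(s, errors='coerce'))   # -> 111.0, 154.0, NaN
```

The NaN rows can then be dropped, which is exactly what the next cell does.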
cars_df['horsepower'] = pd.to_numeric(cars_df['horsepower'], errors = 'coerce')
cars_df.dropna(inplace=True)
cars_df['horsepower'].unique()
array([111., 154., 102., 115., 110., 140., 160., 101., 121., 182.,  48.,
        70.,  68.,  88., 145.,  58.,  76.,  60.,  86., 100.,  78.,  90.,
       176., 262., 135.,  84., 120., 155., 184., 175., 116.,  69.,  97.,
       152., 200.,  95., 142., 143., 207., 288.,  73.,  82.,  94.,  62.,
       112.,  92., 161., 156.,  85., 114., 162., 134.])
X = cars_df[['horsepower']]
y = cars_df['city-mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Fit linear model on the training set
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Fit polynomial model (degree 2) on the training set
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

lin_poly = LinearRegression()
lin_poly.fit(X_train_poly, y_train)

# Compare R^2 on the held-out test set
print(f"Linear R²: {linreg.score(X_test, y_test):.4f}")
print(f"Polynomial R²: {lin_poly.score(X_test_poly, y_test):.4f}")

# Visualize
X_sorted = np.sort(X, axis=0)
X_sorted_poly = poly.transform(X_sorted)

fig, ax = plt.subplots(1,2, figsize=(12, 5))

ax[0].plot(X, y, '.', alpha=0.5)
ax[0].plot(X_sorted, linreg.predict(X_sorted), 'b-', linewidth=2, label='Linear')

ax[0].set_xlabel('Horsepower')
ax[0].set_ylabel('MPG')
ax[0].set_title('Linear Model (just hp)')
ax[0].legend()

ax[1].plot(X, y, '.', alpha=0.5)
ax[1].plot(X_sorted, lin_poly.predict(X_sorted_poly), 'r-', linewidth=2, label='Polynomial (degree 2)')

ax[1].set_xlabel('Horsepower')
ax[1].set_ylabel('MPG')
ax[1].set_title('Polynomial Model (hp and hp^2)')
ax[1].legend()

plt.tight_layout()
plt.show()