Example: Multiple Linear Regression Estimating Fuel Efficiency

2.4. Example: Multiple Linear Regression Estimating Fuel Efficiency#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data_df = pd.read_csv('https://raw.githubusercontent.com/shreyamdg/automobile-data-set-analysis/refs/heads/master/cars.csv')
data_df.head()

	Unnamed: 0	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	0	3	?	alfa-romero	gas	std	two	convertible	rwd	front	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495
1	1	3	?	alfa-romero	gas	std	two	convertible	rwd	front	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
2	2	1	?	alfa-romero	gas	std	two	hatchback	rwd	front	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
3	3	2	164	audi	gas	std	four	sedan	fwd	front	...	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
4	4	2	164	audi	gas	std	four	sedan	4wd	front	...	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450

5 rows × 27 columns

data_df.columns

Index(['Unnamed: 0', 'symboling', 'normalized-losses', 'make', 'fuel-type',
       'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
       'engine-location', 'wheel-base', 'length', 'width', 'height',
       'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size',
       'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak-rpm', 'city-mpg', 'highway-mpg', 'price'],
      dtype='object')

columns_to_keep = data_df.columns[3:]
cars_df = data_df[columns_to_keep]

2.4.1. Exploratory Data Analysis#

# cars_df = cars_df[cars_df['fuel-type']=='gas']
cars_df = cars_df.query('`fuel-type`== "gas"')
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 185 entries, 0 to 204
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 make               185 non-null    object 
 fuel-type          185 non-null    object 
 aspiration         185 non-null    object 
 num-of-doors       185 non-null    object 
 body-style         185 non-null    object 
 drive-wheels       185 non-null    object 
 engine-location    185 non-null    object 
 wheel-base         185 non-null    float64
 length             185 non-null    float64
 width              185 non-null    float64
height             185 non-null    float64
curb-weight        185 non-null    int64  
engine-type        185 non-null    object 
num-of-cylinders   185 non-null    object 
engine-size        185 non-null    int64  
fuel-system        185 non-null    object 
bore               185 non-null    object 
stroke             185 non-null    object 
compression-ratio  185 non-null    float64
horsepower         185 non-null    object 
peak-rpm           185 non-null    object 
city-mpg           185 non-null    int64  
highway-mpg        185 non-null    int64  
price              185 non-null    object 
dtypes: float64(5), int64(4), object(15)
memory usage: 36.1+ KB

sns.pairplot(cars_df)
plt.show()

../_images/1b604403d88344f372a320afabd6b39901d6039f696f062dc100776e47039908.png

cars_corr = cars_df.corr(numeric_only=True)

sns.heatmap(np.abs(cars_corr), vmin = 0, vmax = 1, annot = True)
plt.show()

../_images/fe7d474d0435cae44eea181165884dae20a47f3836fd064d77f5ae78d568306d.png

2.4.2. Modeling#

Can we estimate fuel economy from other car specifications?

# Importing packages

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import root_mean_squared_error, r2_score

# Chose features from correlation heatmap
# curb-weight most correlated, and height and compression independent of curb-weight
features = ['curb-weight', 'height', 'compression-ratio']
target = ['city-mpg']

X = cars_df[features]
y = cars_df[target]

In this case, we’ve selected the features ourselves, using intuition about correlation and avoiding colinearity between features. Our next steps:

Split the data into training (80%) and testing sets (20%). Each set includes both a features (X) and corresponding targets (y).
Normalize the data with StandardScalar. We do this because linear regression relies on a distance calculation.
- First, create the StandardScalar transformer.
- Fit the scaler to the feature vector of the training set. This calculates the mean and stdev for every feature.
- Transform the training features using the scaler (X transformed to Z).
Create and fit the linear regression model.
- Create the ‘template’ of a LinearRegression model.
- Fit the model using the scaled features (Z) and targets (y) from the training set.
Make predictions.
- Use the linear model to calculate the predicted targets (\(\hat y\)).
- To make predictions using the test set, we must first scale the features.
Assess the model and visualize results.
- Calculate RMSE and R²
- Plot prediction vs actual

# Set aside data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Normalize the data

sc = StandardScaler()

# Calculates mean and stdev of X_train
sc.fit(X_train)

Z_train = sc.transform(X_train)

# Choose our model type (linear regression) and fit the parameters
linreg = LinearRegression()

# This fit is where we get values for thetas (our parameters)
linreg.fit(Z_train, y_train)

linreg.__dict__

{'fit_intercept': True,
 'copy_X': True,
 'tol': 1e-06,
 'n_jobs': None,
 'positive': False,
 'n_features_in_': 3,
 'coef_': array([[-5.2496175 ,  0.69410527,  0.96797346]]),
 'rank_': 3,
 'singular_': array([14.24305812, 11.83718112, 10.05069344]),
 'intercept_': array([24.72972973])}

2.4.3. Model Evaluation#

# Make predictions with the model
y_pred_train = linreg.predict(Z_train)

# Make predictions on our test data

Z_test = sc.transform(X_test)
y_pred_test = linreg.predict(Z_test)

R2_train = r2_score(y_train, y_pred_train)
R2_test = r2_score(y_test, y_pred_test)

RMSE_train = root_mean_squared_error(y_train, y_pred_train)
RMSE_test = root_mean_squared_error(y_test, y_pred_test)

print(f'Train: \tR2: {R2_train:.3f}\tRMSE: {RMSE_train:.3f}')
print(f'Test: \tR2: {R2_test:.3f}\tRMSE: {RMSE_test:.3f}')

Train: 	R2: 0.746	RMSE: 3.227
Test: 	R2: 0.789	RMSE: 2.689

plt.plot(y_test, y_pred_test, '.', 
         color = 'teal', alpha = 0.5,
         label = 'test')
plt.plot(y_train, y_pred_train, '.', 
         color = 'goldenrod', alpha = 0.5,
         label = 'train')

plt.plot([0, 40], [0, 40], 'k--',
         label = 'perfect prediction')

plt.xlabel('Actual MPG')
plt.ylabel('Predicted MPG')
plt.legend()
plt.show()

../_images/c71c7e9771de9a1f271a76e9f0008df0a2b1847bb9a9ca0a47e8430a66fb4132.png

Example: Multiple Linear Regression Estimating Fuel Efficiency

Contents

2.4. Example: Multiple Linear Regression Estimating Fuel Efficiency#

2.4.1. Exploratory Data Analysis#

2.4.2. Modeling#

2.4.3. Model Evaluation#