12. PS04 - Comparing Tree Methods#
Decision Trees vs. Random Forests vs. Gradient Boosting
This assignment works through the entire modeling process, comparing the different tree models in terms of performance, bias and variance.
We’ll be using US Census data to predict, based on other attributes, whether an individual has a yearly income above or below $50K (a binary classification).
Process Roadmap
Load and explore the data
Pre-process (ordinal + one-hot encoding, imputation, train/test split)
Modeling x3
Model Assessment, Head-to-head performance comparison
Interpretation
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
12.1. 0. Load & Explore#
## fetching the data
adult, income = fetch_openml('adult', version=2, as_frame=True, return_X_y = True, parser='auto')
adult.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | NaN | 103497 | Some-college | 10 | Never-married | NaN | Own-child | White | Female | 0 | 0 | 30 | United-States |
income.head()
0 <=50K
1 <=50K
2 >50K
3 >50K
4 <=50K
Name: class, dtype: category
Categories (2, object): ['<=50K', '>50K']
income.value_counts()
class
<=50K 37155
>50K 11687
Name: count, dtype: int64
12.2. Inspect the data#
Do we have missing values? How should we deal with them?
What is the split of over- and under-50K in our data?
Are there features we have questions about?
adult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 workclass 46043 non-null category
2 fnlwgt 48842 non-null int64
3 education 48842 non-null category
4 education-num 48842 non-null int64
5 marital-status 48842 non-null category
6 occupation 46033 non-null category
7 relationship 48842 non-null category
8 race 48842 non-null category
9 sex 48842 non-null category
10 capital-gain 48842 non-null int64
11 capital-loss 48842 non-null int64
12 hours-per-week 48842 non-null int64
13 native-country 47985 non-null category
dtypes: category(8), int64(6)
memory usage: 2.6 MB
12.3. 2. Pre-processing#
Imputation
In the past, when we’ve had missing data, we’ve discarded the entire row (sample). Here we’ll introduce imputation: inferring a value for missing data. We’ll start with the simplest form of imputation, replacing missing entries with a constant value.
The transformer we use is called SimpleImputer and it permits several options for filling in missing data, selectable through the strategy hyperparameter:
- mean: the mean of available data
- median: the median of available data
- most_frequent: the mode of available data
- constant: a user-selected value (must then set fill_value)
We’ll use ‘constant’.
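To see the strategies side by side, here is a minimal sketch on a made-up one-column frame (the data here is illustrative, not from the Census set):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0]})

# 'mean' infers a value from the data; 'constant' uses a fixed fill_value
mean_out = SimpleImputer(strategy='mean').fit_transform(toy).ravel()
const_out = SimpleImputer(strategy='constant', fill_value=-1).fit_transform(toy).ravel()

print(mean_out)   # NaN replaced by mean(1, 2, 4) ≈ 2.33
print(const_out)  # NaN replaced by -1
```

Note that for the categorical columns below, only ‘most_frequent’ and ‘constant’ apply, since a mean or median of strings is undefined.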
Encoding
Some columns will require encoding and we’ll apply OrdinalEncoder or OneHotEncoder when appropriate.
si = SimpleImputer(strategy = 'constant', fill_value = 'NA')
feature_names = adult.columns
adult = si.fit_transform(adult)
adult = pd.DataFrame(adult, columns = feature_names)
adult
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | NA | 103497 | Some-college | 10 | Never-married | NA | Own-child | White | Female | 0 | 0 | 30 | United-States |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States |
| 48838 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States |
| 48839 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 48840 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States |
| 48841 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States |
48842 rows × 14 columns
adult.drop(columns = 'education', inplace = True)
adult.head(2)
| | age | workclass | fnlwgt | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
cat_features = ['workclass', 'marital-status',
'occupation', 'relationship', 'race', 'sex',
'native-country']
oh = OneHotEncoder(drop = 'if_binary')
coltrans = ColumnTransformer(
    transformers = [('onehot', oh, cat_features)],
remainder = 'passthrough',
verbose_feature_names_out = False
)
adult_trans = coltrans.fit_transform(adult).toarray()
adult_trans = pd.DataFrame(adult_trans, columns = coltrans.get_feature_names_out())
adult_trans
| | workclass_Federal-gov | workclass_Local-gov | workclass_NA | workclass_Never-worked | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | workclass_Without-pay | marital-status_Divorced | ... | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 25.0 | 226802.0 | 7.0 | 0.0 | 0.0 | 40.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 38.0 | 89814.0 | 9.0 | 0.0 | 0.0 | 50.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 28.0 | 336951.0 | 12.0 | 0.0 | 0.0 | 40.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 44.0 | 160323.0 | 10.0 | 7688.0 | 0.0 | 40.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 18.0 | 103497.0 | 10.0 | 0.0 | 0.0 | 30.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 27.0 | 257302.0 | 12.0 | 0.0 | 0.0 | 38.0 |
| 48838 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 40.0 | 154374.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 48839 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 58.0 | 151910.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 48840 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 22.0 | 201490.0 | 9.0 | 0.0 | 0.0 | 20.0 |
| 48841 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 52.0 | 287927.0 | 9.0 | 15024.0 | 0.0 | 40.0 |
48842 rows × 91 columns
X_train, X_test, y_train, y_test = train_test_split(adult_trans, income, test_size=0.20, random_state=42)
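Given the roughly 3:1 class imbalance we saw in value_counts(), it can be worth passing stratify=income so both splits keep that ratio. A sketch on synthetic labels mimicking the imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_demo = np.array(['<=50K'] * 750 + ['>50K'] * 250)  # mimic the ~3:1 imbalance
X_demo = np.arange(len(y_demo)).reshape(-1, 1)       # dummy feature matrix

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

# Both splits preserve the ~25% positive rate
print((ytr == '>50K').mean(), (yte == '>50K').mean())
```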
12.4. 3. Modeling and Model Comparison#
Let’s run two comparisons: sweep max_depth with the ensemble size fixed at 100, then sweep n_estimators with max_depth fixed at 2.
Decision Tree → rapidly overfits: high variance at deep depths, high bias at shallow depths.
Random Forest → averaging over many trees reduces variance without increasing bias much.
Gradient Boosting → each stage corrects the residual of the previous one, reducing bias step-by-step.
For each model, calculate and store the accuracy on the training and testing sets.
### Sweep max_depth
depths = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20])
# Dictionaries to hold our accuracy scores
DT = {'train':[], 'test':[]}
RF = {'train':[], 'test':[]}
GB = {'train':[], 'test':[]}
for d in depths:
print(f'Fitting Models with depth: {d}')
# Create the models
tree_clf = DecisionTreeClassifier(max_depth = d)
forest_clf = RandomForestClassifier(max_depth = d, n_estimators = 100)
boosting_clf = GradientBoostingClassifier(max_depth = d, n_estimators = 100)
# Fit the models
tree_clf.fit(X_train, y_train)
forest_clf.fit(X_train, y_train)
boosting_clf.fit(X_train, y_train)
# Make predictions
y_tree_train = tree_clf.predict(X_train)
y_tree_test = tree_clf.predict(X_test)
y_forest_train = forest_clf.predict(X_train)
y_forest_test = forest_clf.predict(X_test)
y_boosting_train = boosting_clf.predict(X_train)
y_boosting_test = boosting_clf.predict(X_test)
# Calculate and store accuracies
DT['train'].append(accuracy_score(y_train, y_tree_train))
DT['test'].append(accuracy_score(y_test, y_tree_test))
RF['train'].append(accuracy_score(y_train, y_forest_train))
RF['test'].append(accuracy_score(y_test, y_forest_test))
GB['train'].append(accuracy_score(y_train, y_boosting_train))
GB['test'].append(accuracy_score(y_test, y_boosting_test))
print('Done.')
### Sweep n_estimators
n_est = np.array([2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
# Dictionaries to hold our accuracy scores
RF2 = {'train':[], 'test':[]}
GB2 = {'train':[], 'test':[]}
for n in n_est:
print(f'Fitting Models with n_est: {n}')
# Create the models
forest_clf = RandomForestClassifier(max_depth = 2, n_estimators = n)
boosting_clf = GradientBoostingClassifier(max_depth = 2, n_estimators = n)
# Fit the models
forest_clf.fit(X_train, y_train)
boosting_clf.fit(X_train, y_train)
# Make predictions
y_forest_train = forest_clf.predict(X_train)
y_forest_test = forest_clf.predict(X_test)
y_boosting_train = boosting_clf.predict(X_train)
y_boosting_test = boosting_clf.predict(X_test)
# Calculate and store accuracies
RF2['train'].append(accuracy_score(y_train, y_forest_train))
RF2['test'].append(accuracy_score(y_test, y_forest_test))
GB2['train'].append(accuracy_score(y_train, y_boosting_train))
GB2['test'].append(accuracy_score(y_test, y_boosting_test))
print('Done.')
12.5. 4. Visualization#
Create two sets of subplots, one for each hyper-parameter sweep. Each subplot corresponds to one of the model types.
For each of these sweeps and for each model, plot a line graph showing how accuracy changes as the hyperparameter is changed, comparing trends between the training and testing set.
fig, ax = plt.subplots(1, 3, figsize = (15, 5), sharey = True)
ax[0].plot(depths, DT['train'], 'b--', label = 'training')
ax[0].plot(depths, DT['test'], 'b', label = 'testing')
ax[1].plot(depths, RF['train'], 'b--', label = 'training')
ax[1].plot(depths, RF['test'], 'b', label = 'testing')
ax[2].plot(depths, GB['train'], 'b--', label = 'training')
ax[2].plot(depths, GB['test'], 'b', label = 'testing')
ax[0].set_title('Decision Tree')
ax[1].set_title('Random Forest')
ax[2].set_title('Gradient Boosting Trees')
ax[0].legend()
ax[0].set_ylim([0, 1])
plt.show()
fig, ax = plt.subplots(1, 2, figsize = (10, 5), sharey = True)
ax[0].plot(n_est, RF2['train'], 'b--', label = 'training')
ax[0].plot(n_est, RF2['test'], 'b', label = 'testing')
ax[1].plot(n_est, GB2['train'], 'b--', label = 'training')
ax[1].plot(n_est, GB2['test'], 'b', label = 'testing')
ax[0].set_title('Random Forest')
ax[1].set_title('Gradient Boosting Trees')
ax[0].legend()
ax[0].set_ylim([0.7, 1])
plt.show()
12.6. Summary#
| Property | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|
| Bias | Medium (tunable) | Medium | Low |
| Variance | High (overfits easily) | Low (averaging) | Low–Medium |
| Interpretability | ✅ High | ⚠️ Medium | ⚠️ Low |
| Training speed | ✅ Very fast | Medium | ⚠️ Slower |
| Prediction speed | ✅ Very fast | Medium | Medium |
| Peak accuracy | Lowest | Middle | Highest |
12.6.1. Key Takeaways#
Decision Trees are powerful building blocks but suffer from high variance. A tiny change to training data can completely alter the tree structure.
Random Forests reduce variance by averaging many independently grown trees (each trained on a bootstrap sample with a random feature subset). Even with deep, unconstrained trees, the ensemble generalises well. Adding more trees to a random forest does not cause overfitting, though the forest can still overfit through other choices, such as unlimited tree depth on very noisy data.
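One way to probe that claim is the forest’s built-in out-of-bag estimate: each tree is scored on the samples left out of its bootstrap. A sketch on synthetic data (make_classification stands in for a real dataset here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_syn, y_syn = make_classification(n_samples=2000, n_features=20, random_state=0)

for n in (50, 200):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=0, n_jobs=-1)
    rf.fit(X_syn, y_syn)
    # OOB accuracy approximates held-out accuracy; adding trees
    # should stabilise it, not degrade it
    print(n, round(rf.oob_score_, 3))
```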
Gradient Boosting aims at reducing bias: each new shallow tree is fit to the residuals of the previous ensemble, so the model starts with high bias and reduces it stage by stage. The cost is longer, inherently sequential training and more hyperparameters to tune.
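scikit-learn exposes this stage-by-stage refinement through staged_predict; a sketch on synthetic data showing training accuracy climbing as boosting stages accumulate:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X_syn, y_syn = make_classification(n_samples=1000, n_features=20, random_state=0)

gb = GradientBoostingClassifier(max_depth=2, n_estimators=100, random_state=0)
gb.fit(X_syn, y_syn)

# Training accuracy after each boosting stage: bias falls step by step
accs = [accuracy_score(y_syn, pred) for pred in gb.staged_predict(X_syn)]
print(round(accs[0], 3), '->', round(accs[-1], 3))
```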
Feature encoding matters: the categorical columns (workclass, marital-status, occupation, relationship, race, sex, native-country) were one-hot encoded, while education was dropped in favour of the already-numeric education-num. These encoded features contribute substantially to model performance; their relative contributions can be inspected via the fitted forests’ feature_importances_ attribute.
Practical advice: start with a Random Forest (robust, few hyperparameters), then try Gradient Boosting (XGBoost/LightGBM in production) if you need maximum performance. Use a Decision Tree when interpretability is a hard requirement.