12. PS04 - Comparing Tree Methods#

Decision Trees vs. Random Forests vs. Gradient Boosting

This assignment works through the entire modeling process, comparing three tree-based models in terms of performance, bias, and variance.

We’ll be using US Census data to predict, from the other attributes, whether an individual’s yearly income is above or below $50K (a binary classification task).

Process Roadmap

  1. Load and explore the data

  2. Pre-process (imputation, one-hot encoding, train/test split)

  3. Modeling x3

  4. Model Assessment, Head-to-head performance comparison

  5. Interpretation

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

12.1. 0. Load & Explore#

## fetching the data
adult, income = fetch_openml('adult', version=2, as_frame=True, return_X_y = True, parser='auto')
adult.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 NaN 103497 Some-college 10 Never-married NaN Own-child White Female 0 0 30 United-States
income.head()
0    <=50K
1    <=50K
2     >50K
3     >50K
4    <=50K
Name: class, dtype: category
Categories (2, object): ['<=50K', '>50K']
income.value_counts()
class
<=50K    37155
>50K     11687
Name: count, dtype: int64
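The counts above show a noticeable class imbalance, which is worth quantifying before modeling. A minimal sketch, rebuilding the proportions from the counts shown (a hypothetical stand-in for the real `income` Series):

```python
import pandas as pd

# Counts copied from the value_counts() output above
income_counts = pd.Series({'<=50K': 37155, '>50K': 11687}, name='count')

# Normalizing the counts gives the class proportions directly
proportions = income_counts / income_counts.sum()
print(proportions.round(3))
```

On the actual Series, `income.value_counts(normalize=True)` yields the same proportions (roughly 76% `<=50K`, 24% `>50K`).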

12.2. Inspect the data#

  • Do we have missing values? How should we deal with them?

  • What is the split of over- and under-50K in our data?

  • Are there features we have questions about?

adult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   workclass       46043 non-null  category
 2   fnlwgt          48842 non-null  int64   
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64   
 5   marital-status  48842 non-null  category
 6   occupation      46033 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64   
 11  capital-loss    48842 non-null  int64   
 12  hours-per-week  48842 non-null  int64   
 13  native-country  47985 non-null  category
dtypes: category(8), int64(6)
memory usage: 2.6 MB

12.3. 2. Pre-processing#

Imputation: In the past, when we’ve had missing data, we’ve discarded the entire row (sample). Here we introduce imputation, inferring a value for the missing data, starting with the simplest form: replacing missing entries with a constant value.

The transformer we use is called SimpleImputer and it permits several options for filling in missing data, selectable through the strategy hyperparameter:

  • mean - the mean of available data

  • median - the median of available data

  • most_frequent - the mode of available data

  • constant - a user-selected value (must then set fill_value)

We’ll use ‘constant’.
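The four strategies can be sketched on a toy column with one missing entry (toy data, not the Census frame):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single toy column whose last entry is missing
X = np.array([[1.0], [2.0], [2.0], [np.nan]])

results = {}
for strategy, kwargs in [('mean', {}), ('median', {}),
                         ('most_frequent', {}), ('constant', {'fill_value': -1})]:
    imp = SimpleImputer(strategy=strategy, **kwargs)
    # The NaN in the last row is filled according to the chosen strategy
    results[strategy] = imp.fit_transform(X).ravel()
    print(strategy, results[strategy])
```

Here `mean` fills 5/3, `median` and `most_frequent` fill 2.0, and `constant` fills the user-supplied −1.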

Encoding: The categorical columns require encoding. OrdinalEncoder suits ordered categories and OneHotEncoder unordered ones; below we drop the ordinal education column (its information is already captured by the numeric education-num) and one-hot encode the rest.

## fill missing values with the placeholder string 'NA'
si = SimpleImputer(strategy = 'constant', fill_value = 'NA')
feature_names = adult.columns

# fit_transform returns a NumPy array, so rebuild the DataFrame afterwards
adult = si.fit_transform(adult)

adult = pd.DataFrame(adult, columns = feature_names)
adult
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 NA 103497 Some-college 10 Never-married NA Own-child White Female 0 0 30 United-States
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48837 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States
48838 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States
48839 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States
48840 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States
48841 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States

48842 rows × 14 columns

adult.drop(columns = 'education', inplace = True)
adult.head(2)
age workclass fnlwgt education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 226802 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private 89814 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
cat_features = ['workclass', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 
       'native-country']

oh = OneHotEncoder(drop = 'if_binary')

coltrans = ColumnTransformer(
    transformers = [('onehot', oh, cat_features)],
    remainder = 'passthrough',
    verbose_feature_names_out = False
)

adult_trans = coltrans.fit_transform(adult).toarray()

adult_trans = pd.DataFrame(adult_trans, columns = coltrans.get_feature_names_out())
adult_trans
workclass_Federal-gov workclass_Local-gov workclass_NA workclass_Never-worked workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc workclass_State-gov workclass_Without-pay marital-status_Divorced ... native-country_Trinadad&Tobago native-country_United-States native-country_Vietnam native-country_Yugoslavia age fnlwgt education-num capital-gain capital-loss hours-per-week
0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 25.0 226802.0 7.0 0.0 0.0 40.0
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 38.0 89814.0 9.0 0.0 0.0 50.0
2 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 28.0 336951.0 12.0 0.0 0.0 40.0
3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 44.0 160323.0 10.0 7688.0 0.0 40.0
4 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 18.0 103497.0 10.0 0.0 0.0 30.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48837 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 27.0 257302.0 12.0 0.0 0.0 38.0
48838 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 40.0 154374.0 9.0 0.0 0.0 40.0
48839 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 58.0 151910.0 9.0 0.0 0.0 40.0
48840 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 22.0 201490.0 9.0 0.0 0.0 20.0
48841 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 52.0 287927.0 9.0 15024.0 0.0 40.0

48842 rows × 91 columns

X_train, X_test, y_train, y_test = train_test_split(adult_trans, income, test_size=0.20, random_state=42)
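Given the roughly 76/24 class imbalance, it can be worth passing `stratify` so both splits keep the same class proportions. A sketch on toy labels (the real `adult_trans`/`income` would slot in the same way):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 negatives, 20 positives
X_toy = pd.DataFrame({'feature': range(100)})
y_toy = pd.Series([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy)

# Both subsets keep the original 80/20 class balance
print(y_tr.mean(), y_te.mean())   # 0.2 0.2
```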

12.4. 3. Modeling and Model Comparison#

Let’s run two comparisons: sweeping max_depth while holding the ensemble size at 100 trees, then sweeping the number of trees (n_estimators) while holding max_depth at 2.

  • Decision Tree → rapidly overfits: high variance at deep depths, high bias at shallow depths.

  • Random Forest → averaging over many trees reduces variance without increasing bias much.

  • Gradient Boosting → each stage corrects the residual of the previous one, reducing bias step-by-step.

For each model, calculate and store the accuracy on the training and testing sets.
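Before comparing models, it helps to know the floor: a majority-class predictor scores about 76% on this data, so any model below that adds nothing. A sketch with scikit-learn’s `DummyClassifier` on toy stand-ins mirroring the Census class balance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy stand-ins with the ~76/24 balance of the Census labels
X_toy = np.zeros((100, 1))
y_toy = np.array(['<=50K'] * 76 + ['>50K'] * 24)

# Always predicts the majority class; its accuracy is the floor to beat
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
print(baseline.score(X_toy, y_toy))   # 0.76
```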

### Sweep max_depth
depths = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10,  15, 20])

# Dictionaries to hold our accuracy scores
DT = {'train':[], 'test':[]}
RF = {'train':[], 'test':[]}
GB = {'train':[], 'test':[]}

for d in depths:
    print(f'Fitting Models with depth: {d}')
    
    # Create the models (fixed seeds for reproducibility)
    tree_clf = DecisionTreeClassifier(max_depth = d, random_state = 42)
    forest_clf = RandomForestClassifier(max_depth = d, n_estimators = 100, random_state = 42)
    boosting_clf = GradientBoostingClassifier(max_depth = d, n_estimators = 100, random_state = 42)

    # Fit the models
    tree_clf.fit(X_train, y_train)
    forest_clf.fit(X_train, y_train)
    boosting_clf.fit(X_train, y_train)

    # Make predictions
    y_tree_train = tree_clf.predict(X_train)
    y_tree_test = tree_clf.predict(X_test)

    y_forest_train = forest_clf.predict(X_train)
    y_forest_test = forest_clf.predict(X_test)

    y_boosting_train = boosting_clf.predict(X_train)
    y_boosting_test = boosting_clf.predict(X_test)

    # Calculate and store accuracies
    DT['train'].append(accuracy_score(y_train, y_tree_train))
    DT['test'].append(accuracy_score(y_test, y_tree_test))

    RF['train'].append(accuracy_score(y_train, y_forest_train))
    RF['test'].append(accuracy_score(y_test, y_forest_test))

    GB['train'].append(accuracy_score(y_train, y_boosting_train))
    GB['test'].append(accuracy_score(y_test, y_boosting_test))

print('Done.')                       
    
### Sweep n_estimators
n_est = np.array([2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])

# Dictionaries to hold our accuracy scores
RF2 = {'train':[], 'test':[]}
GB2 = {'train':[], 'test':[]}

for n in n_est:
    print(f'Fitting Models with n_est: {n}')
    
    # Create the models (fixed seeds for reproducibility)
    forest_clf = RandomForestClassifier(max_depth = 2, n_estimators = n, random_state = 42)
    boosting_clf = GradientBoostingClassifier(max_depth = 2, n_estimators = n, random_state = 42)

    # Fit the models
    forest_clf.fit(X_train, y_train)
    boosting_clf.fit(X_train, y_train)

    # Make predictions
    y_forest_train = forest_clf.predict(X_train)
    y_forest_test = forest_clf.predict(X_test)

    y_boosting_train = boosting_clf.predict(X_train)
    y_boosting_test = boosting_clf.predict(X_test)

    # Calculate and store accuracies
    RF2['train'].append(accuracy_score(y_train, y_forest_train))
    RF2['test'].append(accuracy_score(y_test, y_forest_test))

    GB2['train'].append(accuracy_score(y_train, y_boosting_train))
    GB2['test'].append(accuracy_score(y_test, y_boosting_test))

print('Done.') 

12.5. 4. Visualization#

Create two sets of subplots, one for each hyperparameter sweep; each subplot corresponds to one of the model types.

For each sweep and each model, plot a line graph showing how accuracy changes as the hyperparameter varies, comparing trends between the training and testing sets.

fig, ax = plt.subplots(1, 3, figsize = (15, 5), sharey = True)

ax[0].plot(depths, DT['train'], 'b--', label = 'training')
ax[0].plot(depths, DT['test'], 'b', label = 'testing')

ax[1].plot(depths, RF['train'], 'b--', label = 'training')
ax[1].plot(depths, RF['test'], 'b', label = 'testing')

ax[2].plot(depths, GB['train'], 'b--', label = 'training')
ax[2].plot(depths, GB['test'], 'b', label = 'testing')



ax[0].set_title('Decision Tree')
ax[1].set_title('Random Forest')
ax[2].set_title('Gradient Boosting Trees')

ax[0].legend()
ax[0].set_ylim([0, 1])
plt.show()
fig, ax = plt.subplots(1, 2, figsize = (10, 5), sharey = True)

ax[0].plot(n_est, RF2['train'], 'b--', label = 'training')
ax[0].plot(n_est, RF2['test'], 'b', label = 'testing')

ax[1].plot(n_est, GB2['train'], 'b--', label = 'training')
ax[1].plot(n_est, GB2['test'], 'b', label = 'testing')


ax[0].set_title('Random Forest')
ax[1].set_title('Gradient Boosting Trees')

ax[0].legend()
ax[0].set_ylim([0.7, 1])
plt.show()

12.6. Summary#

| Property         | Decision Tree          | Random Forest    | Gradient Boosting |
|------------------|------------------------|------------------|-------------------|
| Bias             | Medium (tunable)       | Medium           | Low               |
| Variance         | High (overfits easily) | Low (averaging)  | Low–Medium        |
| Interpretability | ✅ High                | ⚠️ Medium        | ⚠️ Low            |
| Training speed   | ✅ Very fast           | Medium           | ⚠️ Slower         |
| Prediction speed | ✅ Very fast           | Medium           | Medium            |
| Peak accuracy    | Lowest                 | Middle           | Highest           |

12.6.1. Key Takeaways#

  1. Decision Trees are powerful building blocks but suffer from high variance. A tiny change to training data can completely alter the tree structure.

  2. Random Forests reduce variance by averaging many independently grown trees (each trained on a bootstrap sample, with a random feature subset considered at each split). Even with deep, unconstrained trees, the ensemble generalises well; a common rule of thumb (not a guarantee) is that adding more trees to a random forest does not cause it to overfit.

  3. Gradient Boosting aims at reducing bias: each new shallow tree is fit to the residuals of the previous ensemble, so the model starts with high bias and reduces it stage by stage. The cost is longer training time and more hyperparameters to tune.

  4. Feature encoding matters: the categorical columns (workclass, marital-status, occupation, relationship, race, sex, native-country) were one-hot encoded, while education was dropped in favour of the numeric education-num; these encoded features contribute substantially to model performance.

Practical advice: start with a Random Forest (robust, few hyperparameters), then try Gradient Boosting (XGBoost/LightGBM in production) if you need maximum performance. Use a Decision Tree when interpretability is a hard requirement.
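The residual-fitting idea behind gradient boosting can be sketched by hand for regression: start from the mean prediction, then repeatedly fit a shallow tree to the current residuals and add a shrunken copy of its predictions. This is a simplified sketch of the principle, not scikit-learn’s actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

# Hand-rolled boosting: begin with the mean, then repeatedly fit a shallow
# tree to the residuals and add a shrunken copy of its predictions
pred = np.full_like(y, y.mean())
learning_rate = 0.5
mse_history = []
for _ in range(50):
    stump = DecisionTreeRegressor(max_depth=2)
    stump.fit(X, y - pred)                    # fit the current residuals
    pred += learning_rate * stump.predict(X)  # additive update
    mse_history.append(np.mean((y - pred) ** 2))

# Training error falls stage by stage as bias is whittled away
print(round(mse_history[0], 4), round(mse_history[-1], 6))
```

Each stage shrinks the training error, which is exactly the bias-reduction behavior seen in the max_depth sweep above.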