12. PS04 - Comparing Tree Methods#
Decision Trees vs. Random Forests vs. Gradient Boosting
This assignment works through the entire modeling process, comparing the different tree models in terms of performance, bias and variance.
We’ll be using US Census data to predict, based on other attributes, whether an individual has a yearly income above or below $50K (a binary classification).
Process Roadmap
Load and explore the data
Pre-process (ordinal + one-hot encoding, imputation, train/test split)
Modeling x3
Model Assessment, Head-to-head performance comparison
Interpretation
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
12.1. 0. Load & Explore#
## fetching the data
adult, income = fetch_openml('adult', version=2, as_frame=True, return_X_y = True, parser='auto')
adult.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | NaN | 103497 | Some-college | 10 | Never-married | NaN | Own-child | White | Female | 0 | 0 | 30 | United-States |
income.head()
0 <=50K
1 <=50K
2 >50K
3 >50K
4 <=50K
Name: class, dtype: category
Categories (2, object): ['<=50K', '>50K']
income.value_counts()
class
<=50K 37155
>50K 11687
Name: count, dtype: int64
12.2. Inspect the data#
Do we have missing values? How should we deal with them?
What is the split of over- and under-50K in our data?
Are there features we have questions about?
adult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 workclass 46043 non-null category
2 fnlwgt 48842 non-null int64
3 education 48842 non-null category
4 education-num 48842 non-null int64
5 marital-status 48842 non-null category
6 occupation 46033 non-null category
7 relationship 48842 non-null category
8 race 48842 non-null category
9 sex 48842 non-null category
10 capital-gain 48842 non-null int64
11 capital-loss 48842 non-null int64
12 hours-per-week 48842 non-null int64
13 native-country 47985 non-null category
dtypes: category(8), int64(6)
memory usage: 2.6 MB
12.3. 2. Pre-processing#
Imputation
In the past, when we’ve had missing data, we’ve discarded the entire row (sample). Here we’ll introduce imputation: inferring a value for missing data. We’ll start with the simplest form of imputation, replacing missing entries with a constant value.
The transformer we use is called SimpleImputer and it permits several options for filling in missing data, selectable through the strategy hyperparameter:
- mean: the mean of available data
- median: the median of available data
- most_frequent: the mode of available data
- constant: a user-selected value (must then set fill_value)
We’ll use ‘constant’.
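To see the strategies side by side, here is a minimal sketch on a made-up one-column frame (the data here is illustrative, not from the Census set):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0]})

# 'mean' infers a value from the data; 'constant' uses a fixed fill_value
mean_out = SimpleImputer(strategy='mean').fit_transform(toy).ravel()
const_out = SimpleImputer(strategy='constant', fill_value=-1).fit_transform(toy).ravel()

print(mean_out)   # NaN replaced by mean(1, 2, 4) ≈ 2.33
print(const_out)  # NaN replaced by -1
```

Note that for the categorical columns below, only ‘most_frequent’ and ‘constant’ apply, since a mean or median of strings is undefined.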
Encoding
Some columns will require encoding and we’ll apply OrdinalEncoder or OneHotEncoder when appropriate.
si = SimpleImputer(strategy = 'constant', fill_value = 'NA')
feature_names = adult.columns
adult = si.fit_transform(adult)
adult = pd.DataFrame(adult, columns = feature_names)
adult
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | NA | 103497 | Some-college | 10 | Never-married | NA | Own-child | White | Female | 0 | 0 | 30 | United-States |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States |
| 48838 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States |
| 48839 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 48840 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States |
| 48841 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States |
48842 rows × 14 columns
adult.drop(columns = 'education', inplace = True)
adult.head(2)
| | age | workclass | fnlwgt | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
cat_features = ['workclass', 'marital-status',
'occupation', 'relationship', 'race', 'sex',
'native-country']
oh = OneHotEncoder(drop = 'if_binary')
coltrans = ColumnTransformer(
    transformers = [('onehot', oh, cat_features)],
remainder = 'passthrough',
verbose_feature_names_out = False
)
adult_trans = coltrans.fit_transform(adult).toarray()
adult_trans = pd.DataFrame(adult_trans, columns = coltrans.get_feature_names_out())
adult_trans
| | workclass_Federal-gov | workclass_Local-gov | workclass_NA | workclass_Never-worked | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | workclass_Without-pay | marital-status_Divorced | ... | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 25.0 | 226802.0 | 7.0 | 0.0 | 0.0 | 40.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 38.0 | 89814.0 | 9.0 | 0.0 | 0.0 | 50.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 28.0 | 336951.0 | 12.0 | 0.0 | 0.0 | 40.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 44.0 | 160323.0 | 10.0 | 7688.0 | 0.0 | 40.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 18.0 | 103497.0 | 10.0 | 0.0 | 0.0 | 30.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 27.0 | 257302.0 | 12.0 | 0.0 | 0.0 | 38.0 |
| 48838 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 40.0 | 154374.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 48839 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 58.0 | 151910.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 48840 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 22.0 | 201490.0 | 9.0 | 0.0 | 0.0 | 20.0 |
| 48841 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 52.0 | 287927.0 | 9.0 | 15024.0 | 0.0 | 40.0 |
48842 rows × 91 columns
X_train, X_test, y_train, y_test = train_test_split(adult_trans, income, test_size=0.20, random_state=42)
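Given the roughly 3:1 class imbalance we saw in value_counts(), it can be worth passing stratify=income so both splits keep that ratio. A sketch on synthetic labels mimicking the imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_demo = np.array(['<=50K'] * 750 + ['>50K'] * 250)  # mimic the ~3:1 imbalance
X_demo = np.arange(len(y_demo)).reshape(-1, 1)       # dummy feature matrix

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

# Both splits preserve the ~25% positive rate
print((ytr == '>50K').mean(), (yte == '>50K').mean())
```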
12.4. 3. Modeling and Model Comparison#
Let’s run two comparisons: sweep max_depth with the ensemble size fixed at 100, then sweep n_estimators with max_depth fixed at 2.
Decision Tree → rapidly overfits: high variance at deep depths, high bias at shallow depths.
Random Forest → averaging over many trees reduces variance without increasing bias much.
Gradient Boosting → each stage corrects the residual of the previous one, reducing bias step-by-step.
For each model, calculate and store the accuracy on the training and testing sets.
### Sweep max_depth
depths = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20])
# Dictionaries to hold our accuracy scores
DT = {'train':[], 'test':[]}
RF = {'train':[], 'test':[]}
GB = {'train':[], 'test':[]}
for d in depths:
print(f'Fitting Models with depth: {d}')
# Create the models
tree_clf = DecisionTreeClassifier(max_depth = d)
forest_clf = RandomForestClassifier(max_depth = d, n_estimators = 100)
boosting_clf = GradientBoostingClassifier(max_depth = d, n_estimators = 100)
# Fit the models
tree_clf.fit(X_train, y_train)
forest_clf.fit(X_train, y_train)
boosting_clf.fit(X_train, y_train)
# Make predictions
y_tree_train = tree_clf.predict(X_train)
y_tree_test = tree_clf.predict(X_test)
y_forest_train = forest_clf.predict(X_train)
y_forest_test = forest_clf.predict(X_test)
y_boosting_train = boosting_clf.predict(X_train)
y_boosting_test = boosting_clf.predict(X_test)
# Calculate and store accuracies
DT['train'].append(accuracy_score(y_train, y_tree_train))
DT['test'].append(accuracy_score(y_test, y_tree_test))
RF['train'].append(accuracy_score(y_train, y_forest_train))
RF['test'].append(accuracy_score(y_test, y_forest_test))
GB['train'].append(accuracy_score(y_train, y_boosting_train))
GB['test'].append(accuracy_score(y_test, y_boosting_test))
print('Done.')
### Sweep n_estimators
n_est = np.array([2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
# Dictionaries to hold our accuracy scores
RF2 = {'train':[], 'test':[]}
GB2 = {'train':[], 'test':[]}
for n in n_est:
print(f'Fitting Models with n_est: {n}')
# Create the models
forest_clf = RandomForestClassifier(max_depth = 2, n_estimators = n)
boosting_clf = GradientBoostingClassifier(max_depth = 2, n_estimators = n)
# Fit the models
forest_clf.fit(X_train, y_train)
boosting_clf.fit(X_train, y_train)
# Make predictions
y_forest_train = forest_clf.predict(X_train)
y_forest_test = forest_clf.predict(X_test)
y_boosting_train = boosting_clf.predict(X_train)
y_boosting_test = boosting_clf.predict(X_test)
# Calculate and store accuracies
RF2['train'].append(accuracy_score(y_train, y_forest_train))
RF2['test'].append(accuracy_score(y_test, y_forest_test))
GB2['train'].append(accuracy_score(y_train, y_boosting_train))
GB2['test'].append(accuracy_score(y_test, y_boosting_test))
print('Done.')
12.5. 4. Visualization#
Create two sets of subplots, one for each hyper-parameter sweep. Each subplot corresponds to one of the model types.
For each of these sweeps and for each model, plot a line graph showing how accuracy changes as the hyperparameter is changed, comparing trends between the training and testing set.
fig, ax = plt.subplots(1, 3, figsize = (15, 5), sharey = True)
ax[0].plot(depths, DT['train'], 'b--', label = 'training')
ax[0].plot(depths, DT['test'], 'b', label = 'testing')
ax[1].plot(depths, RF['train'], 'b--', label = 'training')
ax[1].plot(depths, RF['test'], 'b', label = 'testing')
ax[2].plot(depths, GB['train'], 'b--', label = 'training')
ax[2].plot(depths, GB['test'], 'b', label = 'testing')
ax[0].set_title('Decision Tree')
ax[1].set_title('Random Forest')
ax[2].set_title('Gradient Boosting Trees')
ax[0].legend()
ax[0].set_ylim([0, 1])
plt.show()
fig, ax = plt.subplots(1, 2, figsize = (10, 5), sharey = True)
ax[0].plot(n_est, RF2['train'], 'b--', label = 'training')
ax[0].plot(n_est, RF2['test'], 'b', label = 'testing')
ax[1].plot(n_est, GB2['train'], 'b--', label = 'training')
ax[1].plot(n_est, GB2['test'], 'b', label = 'testing')
ax[0].set_title('Random Forest')
ax[1].set_title('Gradient Boosting Trees')
ax[0].legend()
ax[0].set_ylim([0.7, 1])
plt.show()
12.6. Summary#
| Property | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|
| Bias | Medium (tunable) | Medium | Low |
| Variance | High (overfits easily) | Low (averaging) | Low–Medium |
| Interpretability | ✅ High | ⚠️ Medium | ⚠️ Low |
| Training speed | ✅ Very fast | Medium | ⚠️ Slower |
| Prediction speed | ✅ Very fast | Medium | Medium |
| Peak accuracy | Lowest | Middle | Highest |
12.6.1. Key Takeaways#
Decision Trees are powerful building blocks but suffer from high variance. A tiny change to training data can completely alter the tree structure.
Random Forests reduce variance by averaging many independently grown trees (each trained on a bootstrap sample with a random feature subset). Even with deep, unconstrained trees, the ensemble generalises well. Adding more trees to a random forest does not cause overfitting, though the forest can still overfit through other choices, such as unlimited tree depth on very noisy data.
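One way to probe that claim is the forest’s built-in out-of-bag estimate: each tree is scored on the samples left out of its bootstrap. A sketch on synthetic data (make_classification stands in for a real dataset here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_syn, y_syn = make_classification(n_samples=2000, n_features=20, random_state=0)

for n in (50, 200):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=0, n_jobs=-1)
    rf.fit(X_syn, y_syn)
    # OOB accuracy approximates held-out accuracy; adding trees
    # should stabilise it, not degrade it
    print(n, round(rf.oob_score_, 3))
```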
Gradient Boosting aims at reducing bias: each new shallow tree is fit to the residuals of the previous ensemble, so the model starts with high bias and reduces it stage by stage. The cost is longer, inherently sequential training and more hyperparameters to tune.
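scikit-learn exposes this stage-by-stage refinement through staged_predict; a sketch on synthetic data showing training accuracy climbing as boosting stages accumulate:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X_syn, y_syn = make_classification(n_samples=1000, n_features=20, random_state=0)

gb = GradientBoostingClassifier(max_depth=2, n_estimators=100, random_state=0)
gb.fit(X_syn, y_syn)

# Training accuracy after each boosting stage: bias falls step by step
accs = [accuracy_score(y_syn, pred) for pred in gb.staged_predict(X_syn)]
print(round(accs[0], 3), '->', round(accs[-1], 3))
```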
Feature encoding matters: the categorical columns (workclass, marital-status, occupation, relationship, race, sex, native-country) were one-hot encoded, while education was dropped in favour of the already-numeric education-num. These encoded features contribute substantially to model performance; their relative contributions can be inspected via the fitted forests’ feature_importances_ attribute.
Practical advice: start with a Random Forest (robust, few hyperparameters), then try Gradient Boosting (XGBoost/LightGBM in production) if you need maximum performance. Use a Decision Tree when interpretability is a hard requirement.