Forests and Trees

11. Forests and Trees#

Ensemble methods combine the predictions of multiple models to get a better prediction than any single model gives on its own.

weak learners - simple models that predict slightly better than random guessing
strong learners - more complex models that predict significantly better than chance

11.1. Random Forests (RandomForestClassifier)#

A random forest fits many decision trees independently of one another, so they can be built in parallel (at the same time). In these notes each tree is kept relatively shallow, sometimes only a single decision node (called a stump).

To make a prediction, a sample is processed by each tree, each tree makes a prediction, and the majority vote wins. But what makes the trees in the forest different from each other?

In fitting, diversity of trees is created using two methods: feature selection and bagging.

11.1.1. Random feature subsets#

Each tree only gets a subset of the features. For example, in a random forest deciding whether or not you should buy a car, one tree might make a prediction based on [‘reliability’, ‘fuel economy’, ‘price’], while another uses [‘top speed’, ‘interior room’, ‘cost to repair’], and another uses [‘resale value’, ‘fuel economy’, ‘value of standard tech’] and another… (In scikit-learn's RandomForestClassifier, the random subset is drawn again at every split rather than once per tree; the max_features parameter controls its size.)
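As a rough illustration of the per-tree version described above, the sketch below draws a random feature subset for each of three trees. The subset size of 3 and the seed are arbitrary choices for the example, and only the standard library is used.

```python
import random

# Feature names taken from the car example above.
features = [
    "reliability", "fuel economy", "price", "top speed",
    "interior room", "cost to repair", "resale value", "value of standard tech",
]

rng = random.Random(0)   # seeded so the subsets are repeatable
n_trees = 3              # illustrative value
subset_size = 3          # illustrative value

# random.sample draws without replacement, so a subset never repeats a feature.
feature_subsets = [rng.sample(features, subset_size) for _ in range(n_trees)]

for k, subset in enumerate(feature_subsets, start=1):
    print(f"tree {k} sees: {subset}")
```

Different trees can share a feature (as 'fuel economy' is shared in the example above), but no single tree sees every feature.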

11.1.2. Bagging (Bootstrap aggregating)#

Bootstrapping is a method for creating new data sets by sampling existing data sets. In bootstrapping, you select samples randomly and allow a sample to be selected multiple times (called sampling with replacement).

Bagging is a method that uses bootstrapping to create a different training set for each tree, and then aggregates the trees' results.
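A minimal sketch of bootstrapping, assuming a stand-in data set of 1000 row indices (the size and seed are arbitrary). Because sampling is done with replacement, some rows appear several times and others are left out entirely.

```python
import random

rng = random.Random(42)          # seeded for repeatability
dataset = list(range(1000))      # stand-in for the row indices of a training set

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return rng.choices(rows, k=len(rows))

sample = bootstrap_sample(dataset, rng)

# The bootstrap sample has the same size as the original data,
# but only part of the original rows show up in it.
unique_fraction = len(set(sample)) / len(dataset)
print(f"rows in sample: {len(sample)}")
print(f"fraction of distinct original rows: {unique_fraction:.1%}")
```

For large data sets, roughly 63% of the original rows end up in any one bootstrap sample, so each tree is trained on a noticeably different view of the data.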

11.1.3. Why does it work?#

The idea is that no single tree will be great, but the trees will make different mistakes, and there will be more overlap in their correct guesses than in their mistakes. So for any given sample, the majority vote is more likely to be correct than any one tree.
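The argument can be made concrete with a small calculation. The sketch below assumes a two-class problem in which every tree is correct 60% of the time and the trees err independently of each other. That independence is an idealization, since real trees trained on overlapping data are correlated and the gain is smaller in practice.

```python
from math import comb

def majority_correct_probability(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Odd forest sizes avoid tied votes.
for n in (1, 11, 101):
    print(f"{n:>3} trees: P(majority correct) = {majority_correct_probability(n, 0.6):.3f}")
```

Under these assumptions, the probability that the majority is right climbs toward 1 as trees are added, even though each individual tree stays at 60%.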

11.2. Gradient Boosted Trees#

Whereas Random Forests fit trees in parallel and every tree gets an equal vote, Boosted trees create trees sequentially, each new tree focusing on the shortcomings of the previous. And at the voting stage, some trees get more say than others.
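To show what "focusing on the shortcomings of the previous" can mean, here is a minimal sketch of gradient boosting for a regression problem with squared error, where the shortcomings are simply the residuals left by the trees so far. The toy data, the learning rate of 0.1, and the function names are all illustrative assumptions. Library implementations differ in many details.

```python
def fit_stump(xs, residuals):
    """Best single split on one feature; returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                 # start from the mean prediction
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Each new stump is fit to what the current ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, predictions = boost(xs, ys)

mse_start = sum((y - base) ** 2 for y in ys) / len(ys)
mse_end = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
print(f"MSE with mean only: {mse_start:.3f}")
print(f"MSE after boosting: {mse_end:.3f}")
```

Unlike a random forest, the trees here cannot be built in parallel, because each one depends on the predictions of all the trees before it.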

There are many flavors of boosted trees, including AdaBoost, XGBoost, and CatBoost.

They all work a little differently, but here’s an outline of AdaBoost as an example:

11.2.1. AdaBoost (Adaptive Boosting) (`AdaBoostClassifier`)#

In AdaBoost, a tree comprises only one decision node; this kind of tree is called a stump. In each iteration, a new stump is created that splits the data based on a different condition. As the algorithm iterates, it keeps track of:

Sample Weight - in each iteration, the algorithm focuses more on misclassified samples.
- A sample that is classified correctly is down-weighted. We get this right, so don’t spend more energy on this case.
- A sample that is classified incorrectly is up-weighted. We get this wrong, so focus on this case.

Tree Influence - how much say a tree will have in the final vote. Trees that do better at classifying get more say.
- A tree that is 50% correct gets no say. This tree is just guessing.
- A tree that is >50% correct gets a positive vote (0 to infinity). A tree that is 100% correct gets an infinite vote! Listen to that tree!
- A tree that is <50% correct gets a negative vote (0 to -infinity). A tree that is 0% correct gets a -infinite vote! Do the opposite of that tree!
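One common way to write these two quantities comes from the classic two-class AdaBoost formulation. The symbols $w_i$ (weight of sample $i$), $\varepsilon$ (weighted error of a stump), and $\alpha$ (Tree Influence) are notation introduced here for illustration:

$$
\varepsilon = \frac{\sum_{i\,:\,\text{misclassified}} w_i}{\sum_i w_i},
\qquad
\alpha = \frac{1}{2}\,\ln\!\left(\frac{1-\varepsilon}{\varepsilon}\right)
$$

With $\varepsilon = 0.5$ this gives $\alpha = 0$. As $\varepsilon \to 0$, $\alpha \to +\infty$, and as $\varepsilon \to 1$, $\alpha \to -\infty$, which matches the three cases listed above. The sample weights are then typically updated as

$$
w_i \leftarrow w_i\, e^{\alpha} \ \ \text{(misclassified)},
\qquad
w_i \leftarrow w_i\, e^{-\alpha} \ \ \text{(correctly classified)},
$$

and rescaled so that they sum to 1.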

The AdaBoost process:

1. Start with all samples weighted equally, so each one counts the same.
2. Same as in a decision tree, pick a question that splits the data to minimize Gini Impurity, taking the sample weights into account.
3. Sum up the sample weights of the misclassified samples and calculate the Tree Influence.
4. Assign new weights to the samples, increasing the weights on mistakes and decreasing the weights on correct classifications.
5. Create a new stump, and repeat steps 2-5 until the classification error is below some threshold you choose.

When you predict, you feed the sample through all the stumps, and each stump votes according to its influence.
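The process above can be sketched in plain Python. This is a simplified illustration and not how any particular library implements it. It assumes a single numeric feature and labels coded as -1/+1, it picks each stump by lowest weighted error instead of Gini Impurity, and it stops after a fixed number of stumps instead of an error threshold. All function names are made up for the example.

```python
import math

def stump_predict(x, threshold, sign):
    """A stump asks one question: is x greater than the threshold?"""
    return sign if x > threshold else -sign

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            error = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, sign) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best

def adaboost_fit(xs, ys, n_stumps=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # every sample counts the same
    ensemble = []                                # (influence, threshold, sign) per stump
    for _ in range(n_stumps):
        error, threshold, sign = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)            # keep the log finite
        influence = 0.5 * math.log((1 - error) / error)      # Tree Influence
        # Mistakes (y * prediction == -1) are up-weighted, correct samples down-weighted.
        weights = [
            w * math.exp(-influence * y * stump_predict(x, threshold, sign))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]               # rescale to sum to 1
        ensemble.append((influence, threshold, sign))
    return ensemble

def adaboost_predict(ensemble, x):
    """Each stump votes with a strength equal to its influence."""
    score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate: the +1 class sits in the middle.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]

ensemble = adaboost_fit(xs, ys, n_stumps=3)
train_preds = [adaboost_predict(ensemble, x) for x in xs]
print("stumps (influence, threshold, sign):", ensemble)
print("all training samples correct:", train_preds == ys)
```

On this toy data the best single stump gets 3 of the 10 samples wrong, while the weighted vote of three stumps classifies all 10 correctly.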

11.2.2. Example: Spam email prediction#

# !pip install ucimlrepo

from ucimlrepo import fetch_ucirepo 
import pandas as pd
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
Y = spambase.data.targets 

# to make y a compatible shape for sklearn models
y = Y['Class']
labels = ['ham', 'spam']

# metadata 
# print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 

                          name     role        type demographic  \
             word_freq_make  Feature  Continuous        None   
          word_freq_address  Feature  Continuous        None   
              word_freq_all  Feature  Continuous        None   
               word_freq_3d  Feature  Continuous        None   
              word_freq_our  Feature  Continuous        None   
             word_freq_over  Feature  Continuous        None   
           word_freq_remove  Feature  Continuous        None   
         word_freq_internet  Feature  Continuous        None   
            word_freq_order  Feature  Continuous        None   
             word_freq_mail  Feature  Continuous        None   
         word_freq_receive  Feature  Continuous        None   
            word_freq_will  Feature  Continuous        None   
          word_freq_people  Feature  Continuous        None   
          word_freq_report  Feature  Continuous        None   
       word_freq_addresses  Feature  Continuous        None   
            word_freq_free  Feature  Continuous        None   
        word_freq_business  Feature  Continuous        None   
           word_freq_email  Feature  Continuous        None   
             word_freq_you  Feature  Continuous        None   
          word_freq_credit  Feature  Continuous        None   
            word_freq_your  Feature  Continuous        None   
            word_freq_font  Feature  Continuous        None   
             word_freq_000  Feature  Continuous        None   
           word_freq_money  Feature  Continuous        None   
              word_freq_hp  Feature  Continuous        None   
             word_freq_hpl  Feature  Continuous        None   
          word_freq_george  Feature  Continuous        None   
             word_freq_650  Feature  Continuous        None   
             word_freq_lab  Feature  Continuous        None   
            word_freq_labs  Feature  Continuous        None   
          word_freq_telnet  Feature  Continuous        None   
             word_freq_857  Feature  Continuous        None   
            word_freq_data  Feature  Continuous        None   
             word_freq_415  Feature  Continuous        None   
              word_freq_85  Feature  Continuous        None   
      word_freq_technology  Feature  Continuous        None   
            word_freq_1999  Feature  Continuous        None   
           word_freq_parts  Feature  Continuous        None   
              word_freq_pm  Feature  Continuous        None   
          word_freq_direct  Feature  Continuous        None   
              word_freq_cs  Feature  Continuous        None   
         word_freq_meeting  Feature  Continuous        None   
        word_freq_original  Feature  Continuous        None   
         word_freq_project  Feature  Continuous        None   
              word_freq_re  Feature  Continuous        None   
             word_freq_edu  Feature  Continuous        None   
           word_freq_table  Feature  Continuous        None   
      word_freq_conference  Feature  Continuous        None   
               char_freq_;  Feature  Continuous        None   
               char_freq_(  Feature  Continuous        None   
               char_freq_[  Feature  Continuous        None   
               char_freq_!  Feature  Continuous        None   
               char_freq_$  Feature  Continuous        None   
               char_freq_#  Feature  Continuous        None   
capital_run_length_average  Feature  Continuous        None   
capital_run_length_longest  Feature  Continuous        None   
  capital_run_length_total  Feature  Continuous        None   
                     Class   Target      Binary        None   

                 description units missing_values  
                     None  None             no  
                     None  None             no  
                     None  None             no  
                     None  None             no  
                     None  None             no  
                     None  None             no  
                     None  None             no  
                     None  None             no  
                     None  None             no  
                     None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
                    None  None             no  
spam (1) or not spam (0)  None             no  

X.head()

	word_freq_make	word_freq_address	word_freq_all	word_freq_our	word_freq_over	word_freq_remove	word_freq_internet	word_freq_order	word_freq_mail	...	char_freq_;	char_freq_(	char_freq_!	char_freq_$	char_freq_#	capital_run_length_average	capital_run_length_longest	capital_run_length_total
0	0.00	0.64	0.64	0.32	0.00	0.00	0.00	0.00	0.00	...	0.00	0.000	0.778	0.000	0.000	3.756	61	278
1	0.21	0.28	0.50	0.14	0.28	0.21	0.07	0.00	0.94	...	0.00	0.132	0.372	0.180	0.048	5.114	101	1028
2	0.06	0.00	0.71	1.23	0.19	0.19	0.12	0.64	0.25	...	0.01	0.143	0.276	0.184	0.010	9.821	485	2259
3	0.00	0.00	0.00	0.63	0.00	0.31	0.63	0.31	0.63	...	0.00	0.137	0.137	0.000	0.000	3.537	40	191
4	0.00	0.00	0.00	0.63	0.00	0.31	0.63	0.31	0.63	...	0.00	0.135	0.135	0.000	0.000	3.537	40	191

5 rows × 57 columns

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

tree_params = {
    "max_depth": [1, 4, 8]
}

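# n_jobs = -1 runs the cross-validation fits on all available CPU cores
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params, cv = 5, n_jobs = -1)

# Note: the 5000-tree candidates dominate the run time; drop them if the search is too slow
forest_params = {
    "n_estimators":[5, 50, 500, 5000],
    "max_depth": [1, 2, 4]
}

grid_forest = GridSearchCV(RandomForestClassifier(), forest_params, cv = 5, n_jobs = -1)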


# boosted_params = {"n_estimators" : [5, 50, 500, 5000]}
# grid_boosted = GridSearchCV(GradientBoostingClassifier(), boosted_params, cv = 5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

grid_tree.fit(X_train, y_train)
grid_forest.fit(X_train, y_train)

tree = grid_tree.best_estimator_
forest = grid_forest.best_estimator_

tree_pred_train = tree.predict(X_train)
tree_pred_test = tree.predict(X_test)

forest_pred_train = forest.predict(X_train)
forest_pred_test = forest.predict(X_test)

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(1,2, figsize = (14, 6))

# Left: training set, right: test set (rows normalized so each true class sums to 1)
ConfusionMatrixDisplay.from_predictions(y_train, tree_pred_train, ax = ax[0], normalize = 'true')
ConfusionMatrixDisplay.from_predictions(y_test, tree_pred_test, ax = ax[1], normalize = 'true')
plt.show()

../_images/7c8b7e5b35a1cf8a5eea2200a965dba4490ff4d02a3552594984f724d28baae6.png

fig, ax = plt.subplots(1,2, figsize = (14, 6))

# Left: training set, right: test set (rows normalized so each true class sums to 1)
ConfusionMatrixDisplay.from_predictions(y_train, forest_pred_train, ax = ax[0], normalize = 'true')
ConfusionMatrixDisplay.from_predictions(y_test, forest_pred_test, ax = ax[1], normalize = 'true')
plt.show()

../_images/512f4374f03d2210233be9a0694f262eb8cb861320e49abe221904511affe9b0.png

n_estimators = [5, 50, 500, 5000]

boosted_dict = {}

for ne in n_estimators:
    print(f'Training Boosted Tree with {ne} estimators')
    boosted_dict[ne] = GradientBoostingClassifier(n_estimators = ne)
    boosted_dict[ne].fit(X_train, y_train)

Training Boosted Tree with 5 estimators
Training Boosted Tree with 50 estimators
Training Boosted Tree with 500 estimators
Training Boosted Tree with 5000 estimators

fig, ax = plt.subplots(4, 2, figsize = (10, 20))

for k, ne in enumerate(n_estimators):
    y_pred_train = boosted_dict[ne].predict(X_train)
    y_pred_test = boosted_dict[ne].predict(X_test)

    ConfusionMatrixDisplay.from_predictions(y_train, y_pred_train, ax = ax[k, 0], normalize = 'true')
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test, ax = ax[k, 1], normalize = 'true')

plt.show()
    

../_images/1b94cec6932bdf44cdf6c16b3a1acde82097b0e1e4bdd247f3f2f6028c65fa7b.png

11.2.3. Example: Palmer Penguins#

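# Note: the Palmer Penguins loading code below is commented out, so the cells in
# this section still use the spam data (X, y, labels) defined earlier.
# palmer = pd.read_csv('https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv', index_col = 'rowid')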

# palmer.dropna(axis = 0, inplace=True)
# palmer.reset_index(drop = True, inplace=True)

# features = ['bill_length_mm', 'bill_depth_mm',
#        'flipper_length_mm', 'body_mass_g']

# target = 'species'
# labels = ['Adelie', 'Chinstrap', 'Gentoo']

# X = palmer[features]
# y = palmer[target]

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

Best Decision Tree params: {'max_depth': 6, 'min_samples_split': 10}
Best Random Forest params: {'max_depth': 2, 'n_estimators': 100}
Best Gradient Boosted params: {'max_depth': 2, 'n_estimators': 100}
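The cell that produced the output above is not shown, yet the cells that follow rely on names such as `y_tree_train` and `boosted`. A minimal sketch of what such a cell could look like is given below. The helper name `fit_and_predict`, its signature, and any grid values passed to it are assumptions for illustration, not the original settings.

```python
from sklearn.model_selection import GridSearchCV

def fit_and_predict(estimator, param_grid, X_train, y_train, X_test, cv=5):
    """Grid-search one model, then predict with the best estimator found.

    Hypothetical helper: the name and signature are illustrative only.
    Returns (best_estimator, best_params, train_predictions, test_predictions).
    """
    grid = GridSearchCV(estimator, param_grid, cv=cv)
    grid.fit(X_train, y_train)
    best = grid.best_estimator_   # refit on all of X_train with the best parameters
    return best, grid.best_params_, best.predict(X_train), best.predict(X_test)
```

Under these assumptions, a call such as `tree, tree_best, y_tree_train, y_tree_test = fit_and_predict(DecisionTreeClassifier(), {"max_depth": [2, 4, 6], "min_samples_split": [2, 10]}, X_train, y_train, X_test)`, repeated with `RandomForestClassifier()` and `GradientBoostingClassifier()` and their own grids, would define the `forest`, `boosted`, and `y_..._train` / `y_..._test` names used below.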

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

import matplotlib.pyplot as plt

models = [
    ('Decision Tree', y_tree_train, y_tree_test),
    ('Random Forest', y_forest_train, y_forest_test),
    ('Gradient Boosted', y_boosted_train, y_boosted_test)
]

fig, axes = plt.subplots(3, 2, figsize=(8, 9))

for i, (name, y_pred_train, y_pred_test) in enumerate(models):
    cm_train = confusion_matrix(y_train, y_pred_train)
    cm_test = confusion_matrix(y_test, y_pred_test)

    # Use the same class labels on both panels so train and test are comparable
    disp_train = ConfusionMatrixDisplay(cm_train, display_labels=labels)
    disp_test = ConfusionMatrixDisplay(cm_test, display_labels=labels)

    disp_train.plot(ax=axes[i, 0], cmap='Blues', values_format='d')
    axes[i, 0].set_title(f'{name} - Train')

    disp_test.plot(ax=axes[i, 1], cmap='Blues', values_format='d')
    axes[i, 1].set_title(f'{name} - Test')

plt.tight_layout()

../_images/90d5b6c7d339ad08ef7fa1d751fd96e96a00c8a47e5dd49cc9ca2cf79eaf5bbc.png

import numpy as np

def display_feature_importance(model):
    imps = model.feature_importances_
    features = model.feature_names_in_
    
    sort_idx = np.argsort(imps)[::-1]
    imps = imps[sort_idx]
    features = features[sort_idx]
    
    for k, (feature, imp) in enumerate(zip(features, imps), start = 1):
        print(f'{k:>3}. {feature:_<30}{imp:.4f}')
        

print('\nDECISION TREE\n====================================')
display_feature_importance(tree)

DECISION TREE
====================================
char_freq_$___________________0.4501
word_freq_remove______________0.2068
char_freq_!___________________0.0992
word_freq_hp__________________0.0598
capital_run_length_total______0.0470
word_freq_free________________0.0320
word_freq_edu_________________0.0168
word_freq_george______________0.0163
word_freq_000_________________0.0162
capital_run_length_longest____0.0110
word_freq_you_________________0.0108
capital_run_length_average____0.0089
word_freq_1999________________0.0081
word_freq_hpl_________________0.0066
word_freq_email_______________0.0038
word_freq_over________________0.0023
word_freq_conference__________0.0022
word_freq_mail________________0.0022
word_freq_font________________0.0000
word_freq_business____________0.0000
word_freq_your________________0.0000
word_freq_credit______________0.0000
word_freq_address_____________0.0000
word_freq_report______________0.0000
word_freq_addresses___________0.0000
word_freq_all_________________0.0000
word_freq_people______________0.0000
word_freq_will________________0.0000
word_freq_order_______________0.0000
word_freq_internet____________0.0000
word_freq_our_________________0.0000
word_freq_3d__________________0.0000
word_freq_receive_____________0.0000
word_freq_lab_________________0.0000
word_freq_money_______________0.0000
word_freq_cs__________________0.0000
char_freq_#___________________0.0000
char_freq_[___________________0.0000
char_freq_(___________________0.0000
char_freq_;___________________0.0000
word_freq_table_______________0.0000
word_freq_re__________________0.0000
word_freq_project_____________0.0000
word_freq_original____________0.0000
word_freq_meeting_____________0.0000
word_freq_direct______________0.0000
word_freq_650_________________0.0000
word_freq_pm__________________0.0000
word_freq_parts_______________0.0000
word_freq_technology__________0.0000
word_freq_85__________________0.0000
word_freq_415_________________0.0000
word_freq_data________________0.0000
word_freq_857_________________0.0000
word_freq_telnet______________0.0000
word_freq_labs________________0.0000
word_freq_make________________0.0000

print('\nRANDOM FOREST\n====================================')
display_feature_importance(forest)

RANDOM FOREST
====================================
char_freq_$___________________0.1412
char_freq_!___________________0.1224
word_freq_remove______________0.1150
word_freq_free________________0.0709
capital_run_length_longest____0.0601
word_freq_your________________0.0578
capital_run_length_average____0.0521
word_freq_money_______________0.0515
capital_run_length_total______0.0507
word_freq_george______________0.0482
word_freq_000_________________0.0414
word_freq_hp__________________0.0302
word_freq_internet____________0.0232
word_freq_hpl_________________0.0213
word_freq_our_________________0.0199
word_freq_you_________________0.0195
word_freq_all_________________0.0141
word_freq_1999________________0.0098
word_freq_business____________0.0094
word_freq_receive_____________0.0072
word_freq_over________________0.0065
word_freq_make________________0.0051
word_freq_address_____________0.0045
word_freq_edu_________________0.0028
word_freq_will________________0.0026
word_freq_order_______________0.0025
word_freq_lab_________________0.0024
word_freq_re__________________0.0016
word_freq_addresses___________0.0016
word_freq_credit______________0.0012
word_freq_meeting_____________0.0011
word_freq_labs________________0.0006
char_freq_;___________________0.0005
word_freq_415_________________0.0003
char_freq_(___________________0.0002
word_freq_conference__________0.0002
word_freq_original____________0.0002
word_freq_mail________________0.0001
word_freq_pm__________________0.0001
word_freq_technology__________0.0001
char_freq_[___________________0.0000
word_freq_3d__________________0.0000
char_freq_#___________________0.0000
word_freq_table_______________0.0000
word_freq_85__________________0.0000
word_freq_people______________0.0000
word_freq_report______________0.0000
word_freq_project_____________0.0000
word_freq_font________________0.0000
word_freq_cs__________________0.0000
word_freq_direct______________0.0000
word_freq_parts_______________0.0000
word_freq_650_________________0.0000
word_freq_telnet______________0.0000
word_freq_857_________________0.0000
word_freq_data________________0.0000
word_freq_email_______________0.0000

print('\nBOOSTED TREE\n====================================')
display_feature_importance(boosted)

BOOSTED TREE
====================================
char_freq_$___________________0.2472
char_freq_!___________________0.2080
word_freq_remove______________0.1500
word_freq_free________________0.0722
word_freq_hp__________________0.0679
capital_run_length_average____0.0658
capital_run_length_longest____0.0375
word_freq_george______________0.0357
word_freq_your________________0.0247
word_freq_money_______________0.0200
word_freq_our_________________0.0175
capital_run_length_total______0.0100
word_freq_edu_________________0.0094
word_freq_650_________________0.0068
word_freq_re__________________0.0041
word_freq_meeting_____________0.0032
word_freq_000_________________0.0032
word_freq_1999________________0.0031
word_freq_receive_____________0.0026
word_freq_internet____________0.0020
word_freq_you_________________0.0018
word_freq_business____________0.0017
word_freq_over________________0.0015
char_freq_;___________________0.0012
word_freq_3d__________________0.0008
word_freq_font________________0.0008
word_freq_project_____________0.0004
word_freq_conference__________0.0004
word_freq_report______________0.0003
word_freq_will________________0.0002
word_freq_addresses___________0.0000
word_freq_people______________0.0000
word_freq_email_______________0.0000
word_freq_all_________________0.0000
word_freq_address_____________0.0000
word_freq_mail________________0.0000
word_freq_order_______________0.0000
word_freq_lab_________________0.0000
word_freq_credit______________0.0000
word_freq_pm__________________0.0000
char_freq_#___________________0.0000
char_freq_[___________________0.0000
char_freq_(___________________0.0000
word_freq_table_______________0.0000
word_freq_original____________0.0000
word_freq_cs__________________0.0000
word_freq_direct______________0.0000
word_freq_parts_______________0.0000
word_freq_hpl_________________0.0000
word_freq_technology__________0.0000
word_freq_85__________________0.0000
word_freq_415_________________0.0000
word_freq_data________________0.0000
word_freq_857_________________0.0000
word_freq_telnet______________0.0000
word_freq_labs________________0.0000
word_freq_make________________0.0000

11.2.4. In class exercise#

The following dataset (Adult) is available from the UCI ML Repository.

Based on census information, can we predict whether an individual makes over $50K/yr?

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 

X

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
48837	39	Private	215419	Bachelors	13	Divorced	Prof-specialty	Not-in-family	White	Female	0	0	36	United-States
48838	64	NaN	321403	HS-grad	9	Widowed	NaN	Other-relative	Black	Male	0	0	40	United-States
48839	38	Private	374983	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	White	Male	0	0	50	United-States
48840	44	Private	83891	Bachelors	13	Divorced	Adm-clerical	Own-child	Asian-Pac-Islander	Male	5455	0	40	United-States
48841	35	Self-emp-inc	182148	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	60	United-States

48842 rows × 14 columns

# Some labels carry a trailing period ('<=50K.' / '>50K.'); map them onto the plain labels.
# Building y from adult.data.targets each time (no inplace=True) keeps this cell safe to re-run.
y = adult.data.targets['income'].replace({'<=50K.': '<=50K', '>50K.': '>50K'}).to_numpy()

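# 'education' repeats the information in 'education-num', so drop the text version
X = X.drop(columns='education')
# Mark missing values with '?' so they are encoded as their own category
X = X.replace({np.nan: '?'})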

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

ord_features = ['sex']
oe = OrdinalEncoder(categories = [['Male', 'Female']])

cat_features = ['workclass', 'marital-status', 'occupation', 'relationship', 'race', 'native-country']
oh = OneHotEncoder()

ss = StandardScaler()

num_features = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
ct = ColumnTransformer([
    ('ord', oe, ord_features),
    ('oh', oh, cat_features),
    ('ss', ss, num_features)
],
    sparse_threshold = 0,
    verbose_feature_names_out=False)

Xt = ct.fit_transform(X)

columns = ct.get_feature_names_out()

Xt_df = pd.DataFrame(Xt, columns = columns)

Xt_df.head()

	sex	workclass_Private	workclass_Self-emp-not-inc	workclass_State-gov	...	native-country_United-States	age	fnlwgt	education-num	capital-gain	capital-loss	hours-per-week
0	0.0	0.0	0.0	1.0	...	1.0	0.025996	-1.061979	1.136512	0.146932	-0.217127	-0.034087
1	0.0	0.0	1.0	0.0	...	1.0	0.828308	-1.007104	1.136512	-0.144804	-0.217127	-2.213032
2	0.0	1.0	0.0	0.0	...	1.0	-0.046942	0.246034	-0.419335	-0.144804	-0.217127	-0.034087
3	0.0	1.0	0.0	0.0	...	1.0	1.047121	0.426663	-1.197259	-0.144804	-0.217127	-0.034087
4	1.0	1.0	0.0	0.0	...	0.0	-0.776316	1.408530	1.136512	-0.144804	-0.217127	-0.034087

5 rows × 91 columns

# Split data
X_train, X_test, y_train, y_test = train_test_split(Xt_df, y, test_size=0.5, random_state=42)

# Decision Tree
dt_params = {'max_depth': [3, 6, 9, 12], 'min_samples_split': [10, 30, 100]}
dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_params, cv=5, n_jobs=-1)
dt_grid.fit(X_train, y_train)

# Random Forest
rf_params = {'n_estimators': [10, 100, 1000], 'max_depth': [1, 3]}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=5, n_jobs=-1)
rf_grid.fit(X_train, y_train)

# Gradient Boosted Trees
gb_params = {'n_estimators': [10, 100, 1000], 'max_depth': [1, 3]}
gb_grid = GridSearchCV(GradientBoostingClassifier(random_state=42), gb_params, cv=5, n_jobs=-1)
gb_grid.fit(X_train, y_train)


# Get the best models
print("Best Decision Tree params:", dt_grid.best_params_)
print("Best Random Forest params:", rf_grid.best_params_)
print("Best Gradient Boosted params:", gb_grid.best_params_)

tree = dt_grid.best_estimator_
forest = rf_grid.best_estimator_
boosted = gb_grid.best_estimator_

Best Decision Tree params: {'max_depth': 9, 'min_samples_split': 30}
Best Random Forest params: {'max_depth': 3, 'n_estimators': 1000}
Best Gradient Boosted params: {'max_depth': 3, 'n_estimators': 1000}
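Both ensemble searches picked the largest values offered (max_depth 3, n_estimators 1000), which hints that the grid may be too narrow. One way to check is to look at the full table of cross-validation scores instead of only best_params_. The sketch below does this on synthetic stand-in data with a smaller grid so it runs quickly; `X_demo`, `y_demo`, `demo_params`, and `demo_grid` are illustrative names, not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the census data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

demo_params = {'n_estimators': [10, 50], 'max_depth': [1, 3]}
demo_grid = GridSearchCV(RandomForestClassifier(random_state=42), demo_params, cv=3)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best mean CV score first
cols = ['param_n_estimators', 'param_max_depth', 'mean_test_score', 'rank_test_score']
results = pd.DataFrame(demo_grid.cv_results_)[cols].sort_values('rank_test_score')
print(results)

# Flag any best value that sits at the largest setting tried
for name, values in demo_params.items():
    if demo_grid.best_params_[name] == max(values):
        print(f'{name}: best value {demo_grid.best_params_[name]} is at the top of the grid')
```

If a best value sits on the edge of the grid, it is usually worth widening the grid in that direction and searching again.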

y_tree_train = tree.predict(X_train)
y_tree_test = tree.predict(X_test)

y_forest_train = forest.predict(X_train)
y_forest_test = forest.predict(X_test)

y_boosted_train = boosted.predict(X_train)
y_boosted_test = boosted.predict(X_test)

models = [
    ('Decision Tree', y_tree_train, y_tree_test),
    ('Random Forest', y_forest_train, y_forest_test),
    ('Gradient Boosted', y_boosted_train, y_boosted_test)
]

fig, axes = plt.subplots(3, 2, figsize=(8, 9))

for i, (name, y_pred_train, y_pred_test) in enumerate(models):
    cm_train = confusion_matrix(y_train, y_pred_train, normalize = 'true')
    cm_test = confusion_matrix(y_test, y_pred_test, normalize = 'true')

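    # Label both plots with the fitted class names so train and test panels match
    disp_train = ConfusionMatrixDisplay(cm_train, display_labels=tree.classes_)
    disp_test = ConfusionMatrixDisplay(cm_test, display_labels=tree.classes_)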

    disp_train.plot(ax=axes[i, 0], cmap='Blues', values_format='.2f')
    axes[i, 0].set_title(f'{name} - Train')

    disp_test.plot(ax=axes[i, 1], cmap='Blues', values_format='.2f')
    axes[i, 1].set_title(f'{name} - Test')

plt.tight_layout()

[Figure: 3 × 2 grid of row-normalized confusion matrices, train (left) and test (right), for the Decision Tree, Random Forest, and Gradient Boosted models]
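Confusion matrices show where each model goes wrong; a single train-versus-test accuracy per model makes overfitting easier to spot at a glance. Here is a minimal sketch on synthetic stand-in data. The data, the hyperparameters, and names such as `demo_models` and `scores` are assumptions for illustration; in the exercise you would pass `tree`, `forest`, and `boosted` with the real train/test split instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the census data)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42)

demo_models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=9, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42),
    'Gradient Boosted': GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
}

scores = {}
for name, model in demo_models.items():
    model.fit(Xd_train, yd_train)
    train_acc = accuracy_score(yd_train, model.predict(Xd_train))
    test_acc = accuracy_score(yd_test, model.predict(Xd_test))
    scores[name] = (train_acc, test_acc)
    # A large positive gap means the model fits the training data much better than new data
    print(f'{name:<18} train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:+.3f}')
```

Keep in mind that accuracy can look good on an imbalanced target simply by favoring the majority class, so it complements the normalized confusion matrices and does not replace them.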