Trees and Ensemble methods

9. Trees and Ensemble methods#

9.1. Decision/Classification Trees#

Tree-based models — Decision/Classification Trees and Regression Trees — are a family of algorithms that approach classification and regression using a heirarchy of conditional/if-else statements (imagine a flow chart or a game of 20 questions). You might describe tree-based models as human-like in their approach to problem solving. And these models are not only intuitive and interpretable, they are some of the most ubiquitous and powerful algorithms in ML.

We’ll start by looking at a cartoon example of a decision tree, this one predicts whether a casual golfer is expected to shoot below or above par on a given outing.

decision tree

9.1.1. Anatomy of a tree#

Decision node - a binary question that divides the data set
Root node - the top decision node in the tree
Leaf node - a terminal node where the classification is made
Parent/Child node - the node before/after a split
Split - the forks off a node
Branch - all the decision and leaf nodes that spawn from a single up-stream decision

While the cartoon does a good job showing the structure and flow of a decision tree, it’s important to clarify an important detail about the nodes.

In the example, some of the nodes are truly categorical questions:

“With friends?”
“Walk or cart?”

While others are seemingly subjective. To make the subjective nodes binary, we define the question in terms of inequalities:

“Windy?” is ambiguous. Is 2 mph winds windy? 5 mph? 20 mph? Instead we would use an inequality on the feature: wind > 5 mph
“Cold?” is similarly ambiguous. So a node splitting on temperature would need to be formulated as an inequality: temp < 50 F

9.1.2. How do we choose what a good condition for a node would be?#

It is preferable for tree-models to have fewer nodes; smaller models are better. And ideally, each leaf node would represent a set of samples that all belonged to the same class, that is to say, if you followed the branches down to that leaf node, your classification would be correct (and this is not always possible).

Intuitively, what questions might you ask to classify:

animals into their classes: Mammals, Birds, Reptiles, Amphibians, and Fish.
beers into their types: pilsner, lager, ale, porter, stout
movies into hits or flops
voters into their political leanings: Liberal, Conservative, Libertarian, Populist, etc.

Towards this ideal, at each node, we would want a conditional statement that split the set in a way so that the resulting subsets are as homogeneous (mostly or all one class) as possible. This homogeneity is referred to as the “purity” of the node, and there are different ways of quantifying said purity: Gini Impurity and Information Gain.

9.1.3. Gini Impurity#

Gini impurity is the likelihood of randomly misclassifying a randomly selected sample given the distribution of samples in the set; that is, if you randomly select a sample and randomly select a label, Gini impurity is the likelihood of a mismatch. A Gini impurity of zero represents a perfectly homogeneous set (all the same class).

If using Gini impurity as the criterion, the fitting algorithm will choose conditions to minimize the Gini impurity at each successive split (using a greedy algorithm).

The Gini index for a sub-node is:

\[\begin{split} I(p) = \sum_{i=1}^M p_i(1-p_i) \\ = 1- \sum_{i=1}^M p_i^2 \end{split}\]

where:

\(i\) is the class index
\(M\) is the number of classes
\(p_i\) is the proportion of a given class in the subset.

The Gini impurity for a split is the sum of the Gini index for each resultant node, weighted by the number of samples in each node.

\[ I(p) = \frac{1}{N}\sum_{k=1}^M n_k I_k(p) \]

where:

\(N\) is the sample size at the parent node
\(n_k\) is the sample size at each sub-node

Example: Dogs, Cat, Mouse classifier.

9.2. Information Gain#

Entropy in physics quantifies dis-order. Analagously, entropy in information theory quantifies uncertainty.

Entropy (H) of a set is defined as:

\[ H(p) = - \sum_{i=1}^M p_i \log(p_i) \]

What is the entropy if \(p_i = 0\) or \(p_i = 1\)?

When entropy is zero, there is no uncertainity; in the case of a decision tree, the subsample is perfectly homogeneous.

Information gain is the reduction of entropy between the parent and child node.

\[ IG(p) = H_{\text{parent}}(p) - \sum H_{\text{children}}(p) \]

If using Information Gain as the criterion, the fitting algorithm will choose conditions to maximize Information Gain (reduce entropy) at each successive split (using a greedy algorithm).

9.3. Example: Drug prescriptions#

For this example, we’ll be looking at the drug200 data set (https://www.kaggle.com/datasets/prathamtripathi/drug-classification).

The dataset summarizes the demographics and health condition of 200 patients (Age, sex, blood, pressure, cholesterol, and sodium/potassium ratio) as well as which of five drugs had best results with the patient. The goal is to build a classifier to choose which drug to prescribe to a future patient.

import pandas as pd

drugs_df = pd.read_csv('https://raw.githubusercontent.com/kvinlazy/Dataset/refs/heads/master/drug200.csv')

drugs_df

	Age	Sex	BP	Cholesterol	Na_to_K	Drug
0	23	F	HIGH	HIGH	25.355	drugY
1	47	M	LOW	HIGH	13.093	drugC
2	47	M	LOW	HIGH	10.114	drugC
3	28	F	NORMAL	HIGH	7.798	drugX
4	61	F	LOW	HIGH	18.043	drugY
...	...	...	...	...	...	...
195	56	F	LOW	HIGH	11.567	drugC
196	16	M	LOW	HIGH	12.006	drugC
197	52	M	NORMAL	HIGH	9.894	drugX
198	23	M	NORMAL	NORMAL	14.020	drugX
199	40	F	LOW	NORMAL	11.349	drugX

200 rows × 6 columns

9.3.1. Transform (Encode) the data#

y = drugs_df['Drug']
X = drugs_df.drop(columns = 'Drug')

X

	Age	Sex	BP	Cholesterol	Na_to_K
0	23	F	HIGH	HIGH	25.355
1	47	M	LOW	HIGH	13.093
2	47	M	LOW	HIGH	10.114
3	28	F	NORMAL	HIGH	7.798
4	61	F	LOW	HIGH	18.043
...	...	...	...	...	...
195	56	F	LOW	HIGH	11.567
196	16	M	LOW	HIGH	12.006
197	52	M	NORMAL	HIGH	9.894
198	23	M	NORMAL	NORMAL	14.020
199	40	F	LOW	NORMAL	11.349

200 rows × 5 columns

drugs_df['Cholesterol'].unique()

array(['HIGH', 'NORMAL'], dtype=object)

from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

ord_features = ['Sex', 'BP', 'Cholesterol']
sex = ['M', 'F']
BP = ['LOW', 'NORMAL', 'HIGH']
Chol = ['NORMAL', 'HIGH']

ordenc = OrdinalEncoder(categories = [sex, BP, Chol])

num_features = ['Age', 'Na_to_K']
sc = StandardScaler()

coltrans = ColumnTransformer(
    transformers = [
        ('ordinal', ordenc, ord_features),
        ('numerical', sc, num_features)
    ],
    remainder = 'drop',
    verbose_feature_names_out= False
)

X_trans = coltrans.fit_transform(X)

feature_names = coltrans.get_feature_names_out()
X_df = pd.DataFrame(X_trans, columns = feature_names)

 X_df

	Sex	BP	Cholesterol	Age	Na_to_K
0	1.0	2.0	1.0	-1.291591	1.286522
1	0.0	0.0	1.0	0.162699	-0.415145
2	0.0	0.0	1.0	0.162699	-0.828558
3	1.0	1.0	1.0	-0.988614	-1.149963
4	1.0	0.0	1.0	1.011034	0.271794
...	...	...	...	...	...
195	1.0	0.0	1.0	0.708057	-0.626917
196	0.0	0.0	1.0	-1.715759	-0.565995
197	0.0	1.0	1.0	0.465676	-0.859089
198	0.0	1.0	0.0	-1.291591	-0.286500
199	1.0	0.0	0.0	-0.261469	-0.657170