8. Encoding
Dealing with discrete/categorical data: Ordinal and Categorical
Discrete data come in two flavors, ordinal and categorical:
ordinal data - can be ordered
‘low’, ‘medium’, ‘high’
‘F’, ‘D-’, ‘D’, ‘D+’, ‘C-’, ‘C’, ‘C+’, …, ‘A’, ‘A+’
‘1 br’, ‘2 br’, ‘3 br’, …
categorical (non-ordinal) data - have no natural order
‘cat’, ‘dog’, ‘parrot’, ‘hamster’
‘faculty’, ‘staff’, ‘student’
‘never licensed’, ‘valid permit’, ‘valid license’, ‘suspended/expired license’
To use these features in sklearn, we need to convert the categories into numerical values. Several transforms are available, and the right choice depends on whether the data are ordered.
OrdinalEncoder(categories = list_of_lists_of_categories) - for ordinal data. We tell the encoder the order of the string values using a list of lists, one inner list per column (see the example below). OrdinalEncoder can be applied to multiple columns at a time (e.g. [‘low’, ‘med’, ‘high’] -> [0, 1, 2]).
Ordinal encoding replaces each ordered category with its position in that list (0, 1, 2, …), keeping a single column per feature.
OneHotEncoder() - for categorical data whose values are mutually exclusive and have no order.
One-hot encoding creates a new feature for each possible value. Each row has a 1 in exactly one of these columns and 0 in the others, hence "one hot".
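As a minimal sketch of the two encoders described above, the snippet below applies each one to a single toy column. The `sizes` and `pets` data are hypothetical and chosen only for illustration; they are not part of the roster example that follows.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy columns (each inner list is one row with one feature)
sizes = [['low'], ['high'], ['med'], ['low']]
pets = [['cat'], ['dog'], ['parrot'], ['dog']]

# Ordinal: pass one list of categories per column, in increasing order
ord_enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
sizes_encoded = ord_enc.fit_transform(sizes)
print(sizes_encoded.ravel())      # a single column of position codes

# One-hot: one new column per distinct value
onehot_enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for display
pets_encoded = onehot_enc.fit_transform(pets).toarray()
print(onehot_enc.categories_)     # column order of the one-hot output
print(pets_encoded)
```

The ordinal result keeps one column whose values follow the order we supplied, while the one-hot result has as many columns as there are distinct values.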
And one more:
LabelEncoder() - for categorical data. Label encoding is intended only for the label/target y, so it accepts a single 1-D column at a time and should not be used on the features in X.
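A short sketch of label encoding on a target column may help here. The `y` values are hypothetical and used only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels, for illustration only
y = ['pass', 'fail', 'pass', 'incomplete', 'fail']

label_enc = LabelEncoder()
# LabelEncoder expects a 1-D array-like, not a 2-D feature matrix
y_encoded = label_enc.fit_transform(y)

print(label_enc.classes_)   # the classes, stored in sorted order
print(y_encoded)            # integer code of each label
print(label_enc.inverse_transform(y_encoded))   # back to the original strings
```

Note that LabelEncoder sorts the classes itself and has no parameter for supplying an order, which is one more reason to use OrdinalEncoder for ordered features.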
We can apply different transforms to different columns of our data using ColumnTransformer().
Let’s look at an example.
import pandas as pd
import numpy as np
grade_roster = dict(
    student_name = ['Alma', 'Barak', 'Connell', 'Devon', 'Erin', 'Felicia'],
    student_level = ['first-year', 'senior', 'sophomore', 'junior', 'senior', 'sophomore'],
    student_grade = ['A', 'A+', 'B+', 'B', 'C+', 'A'],
    student_major = ['Business', 'Econ', 'Business', 'CS', 'CS', 'Econ']
)
grade_roster_df = pd.DataFrame(grade_roster)
grade_roster_df
| | student_name | student_level | student_grade | student_major |
|---|---|---|---|---|
| 0 | Alma | first-year | A | Business |
| 1 | Barak | senior | A+ | Econ |
| 2 | Connell | sophomore | B+ | Business |
| 3 | Devon | junior | B | CS |
| 4 | Erin | senior | C+ | CS |
| 5 | Felicia | sophomore | A | Econ |
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
# Encoding Ordinal Features
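# Categories listed from lowest to highest; the encoder assigns 0, 1, 2, ... in this order
levels = ['first-year', 'sophomore', 'junior', 'senior']
grades = ['F', 'D-', 'D', 'D+', 'C-', 'C', 'C+',
          'B-', 'B', 'B+', 'A-', 'A', 'A+']
ord_features = ['student_level', 'student_grade']
# One category list per ordinal column, in the same order as ord_features
ordEnc = OrdinalEncoder(categories = [levels, grades])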
# Encoding non-ordinal features
cat_features = ['student_major']
oneHotEnc = OneHotEncoder()
coltrans = ColumnTransformer(
    transformers=[
        ("ord", ordEnc, ord_features),  # (nickname, transformer object, columns to apply it to)
        ("onehot", oneHotEnc, cat_features)
    ],
    remainder = 'drop',  # what to do with any feature not listed above
    verbose_feature_names_out=False)
X_trans = coltrans.fit_transform(grade_roster_df)
X_trans
array([[ 0., 11.,  1.,  0.,  0.],
       [ 3., 12.,  0.,  0.,  1.],
       [ 1.,  9.,  1.,  0.,  0.],
       [ 2.,  8.,  0.,  1.,  0.],
       [ 3.,  6.,  0.,  1.,  0.],
       [ 1., 11.,  0.,  0.,  1.]])
new_feature_names = coltrans.get_feature_names_out()
new_feature_names
array(['student_level', 'student_grade', 'student_major_Business',
       'student_major_CS', 'student_major_Econ'], dtype=object)
grade_roster_df2 = pd.DataFrame(X_trans, columns = new_feature_names)
grade_roster_df2
| | student_level | student_grade | student_major_Business | student_major_CS | student_major_Econ |
|---|---|---|---|---|---|
| 0 | 0.0 | 11.0 | 1.0 | 0.0 | 0.0 |
| 1 | 3.0 | 12.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1.0 | 9.0 | 1.0 | 0.0 | 0.0 |
| 3 | 2.0 | 8.0 | 0.0 | 1.0 | 0.0 |
| 4 | 3.0 | 6.0 | 0.0 | 1.0 | 0.0 |
| 5 | 1.0 | 11.0 | 0.0 | 0.0 | 1.0 |