8. Encoding

Dealing with discrete/categorical data: Ordinal and Categorical

Discrete data come in two flavors, ordinal and categorical:

ordinal data - can be ordered
- ‘low’, ‘medium’, ‘high’
- ‘F’, ‘D-’, ‘D’, ‘D+’, ‘C-’, ‘C’, ‘C+’, …, ‘A’, ‘A+’
- ‘1 br’, ‘2 br’, ‘3 br’, …
categorical (non-ordinal) data - have no natural order
- ‘cat’, ‘dog’, ‘parrot’, ‘hamster’
- ‘faculty’, ‘staff’, ‘student’
- ‘never licensed’, ‘valid permit’, ‘valid license’, ‘suspended/expired license’

To use these data with sklearn, we need to convert the categories into numerical values. Several transformers are available, and the right choice depends on the data.

OrdinalEncoder(categories = list_of_lists_of_categories) - for ordinal data. We tell the encoder the order of the string values by passing one list of categories per column (see example below), so OrdinalEncoder can be applied to multiple columns at a time. (e.g. [‘low’, ‘med’, ‘high’] -> [0, 1, 2])
- Ordinal encoding replaces each ordered value with a number (0, 1, 2, …) in a single column.
OneHotEncoder() - for categorical data. One-hot encoding is meant for mutually exclusive categorical variables.
- One-hot encoding creates a new feature for each possible value. Each row has a 1 in exactly one of these columns and 0 in the others, hence “one hot”.
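Before the full example, here is a minimal sketch of the two encoders on their own. The toy arrays `sizes` and `pets` are made up for illustration and are not part of the roster data used later. Note that OneHotEncoder returns a sparse matrix by default, so the sketch converts it with `.toarray()` for display.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Hypothetical toy data: one column each, shape (n_samples, 1)
sizes = np.array([['low'], ['high'], ['med'], ['low']])
pets = np.array([['cat'], ['dog'], ['parrot'], ['dog']])

# Ordinal: we supply the order explicitly, one list per column
ord_enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
sizes_encoded = ord_enc.fit_transform(sizes)
print(sizes_encoded.ravel())      # low -> 0, med -> 1, high -> 2

# One-hot: one new column per category (sorted alphabetically by default)
onehot_enc = OneHotEncoder()
pets_encoded = onehot_enc.fit_transform(pets).toarray()
print(onehot_enc.categories_[0])  # column order of the one-hot output
print(pets_encoded)
```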

And one more:

LabelEncoder() - for categorical data. Label encoding is intended to be used only on the label/target y, so it can only be applied to one column at a time.
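As a small sketch of how LabelEncoder behaves, assume a made-up target `y` of pet names (not part of the roster example). The encoder takes a 1-D sequence, numbers the classes in sorted order, and can map the integers back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, a 1-D list rather than a 2-D feature matrix
y = ['dog', 'cat', 'parrot', 'cat']

label_enc = LabelEncoder()
y_encoded = label_enc.fit_transform(y)

print(label_enc.classes_)   # classes in sorted order; the index is the code
print(y_encoded)            # integer code for each entry of y

# Recover the original string labels from the integer codes
y_decoded = label_enc.inverse_transform(y_encoded)
print(y_decoded)
```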

We can apply different transforms to different columns of our data using ColumnTransformer().

Let’s look at an example.

import pandas as pd
import numpy as np

grade_roster = dict(
    student_name = ['Alma', 'Barak', 'Connell', 'Devon', 'Erin', 'Felicia'],
    student_level = ['first-year', 'senior', 'sophomore', 'junior', 'senior', 'sophomore'], 
    student_grade = ['A', 'A+', 'B+', 'B', 'C+', 'A'],
    student_major = ['Business', 'Econ', 'Business', 'CS', 'CS', 'Econ']
)

grade_roster_df = pd.DataFrame(grade_roster)
grade_roster_df

	student_name	student_level	student_grade	student_major
0	Alma	first-year	A	Business
1	Barak	senior	A+	Econ
2	Connell	sophomore	B+	Business
3	Devon	junior	B	CS
4	Erin	senior	C+	CS
5	Felicia	sophomore	A	Econ

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# Encoding Ordinal Features
levels = ['first-year', 'sophomore', 'junior', 'senior']
grades = ['F', 'D-', 'D', 'D+', 'C-', 'C', 'C+',
          'B-', 'B', 'B+', 'A-', 'A', 'A+']

ord_features = ['student_level', 'student_grade']
ordEnc = OrdinalEncoder(categories = [levels, grades])

# Encoding non-ordinal features
cat_features = ['student_major']
oneHotEnc = OneHotEncoder()

coltrans = ColumnTransformer(
    transformers=[ 
        ("ord", ordEnc, ord_features),   # (nickname, transformer object variable, which columns to apply to)
        ("onehot", oneHotEnc, cat_features)
        ],
    remainder = 'drop',  # what to do with any feature not listed above
    verbose_feature_names_out=False)

X_trans = coltrans.fit_transform(grade_roster_df)
X_trans

array([[ 0., 11.,  1.,  0.,  0.],
       [ 3., 12.,  0.,  0.,  1.],
       [ 1.,  9.,  1.,  0.,  0.],
       [ 2.,  8.,  0.,  1.,  0.],
       [ 3.,  6.,  0.,  1.,  0.],
       [ 1., 11.,  0.,  0.,  1.]])

new_feature_names = coltrans.get_feature_names_out()
new_feature_names

array(['student_level', 'student_grade', 'student_major_Business',
       'student_major_CS', 'student_major_Econ'], dtype=object)

grade_roster_df2 = pd.DataFrame(X_trans, columns = new_feature_names)
grade_roster_df2

	student_level	student_grade	student_major_Business	student_major_CS	student_major_Econ
0	0.0	11.0	1.0	0.0	0.0
1	3.0	12.0	0.0	0.0	1.0
2	1.0	9.0	1.0	0.0	0.0
3	2.0	8.0	0.0	1.0	0.0
4	3.0	6.0	0.0	1.0	0.0
5	1.0	11.0	0.0	0.0	1.0