Encoding

8. Encoding

Dealing with discrete/categorical data: Ordinal and Categorical

Discrete data come in two flavors, ordinal and categorical:

  • ordinal data - can be ordered

    • ‘low’, ‘medium’, ‘high’

    • ‘F’, ‘D-’, ‘D’, ‘D+’, ‘C-’, ‘C’, ‘C+’, …, ‘A’, ‘A+’

    • ‘1 br’, ‘2 br’, ‘3 br’, …

  • categorical (not ordinal) data - have no inherent order

    • ‘cat’, ‘dog’, ‘parrot’, ‘hamster’

    • ‘faculty’, ‘staff’, ‘student’

    • ‘never licensed’, ‘valid permit’, ‘valid license’, ‘suspended/expired license’
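As a small aside, pandas can record whether a set of categories is ordered. The sketch below only illustrates the distinction and is not required for the sklearn encoders that follow; the variable names and toy values are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy data (not part of the roster example used later)
ratings = pd.Series(pd.Categorical(
    ['low', 'high', 'medium', 'low'],
    categories=['low', 'medium', 'high'],  # the order we declare
    ordered=True))
pets = pd.Series(pd.Categorical(['cat', 'dog', 'parrot', 'dog']))  # no order declared

print(ratings.cat.ordered)              # True: comparisons and sorting are meaningful
print(pets.cat.ordered)                 # False: the categories are just names
print(ratings.sort_values().tolist())   # sorted by the declared order, not alphabetically
print(ratings.max())                    # largest category under the declared order
```

Asking for the maximum of the unordered pets column would raise an error, which is a useful reminder that an order only exists when we declare one.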

To use these data with sklearn, we need to convert the categories into numerical values. Several transformers are available, and which one we choose depends on the data.

  • OrdinalEncoder(categories = list_of_lists_of_categories) - for ordinal data. We tell the ordinal encoder the order of the string values by passing one ordered list per column (see example below). For example, [‘low’, ‘med’, ‘high’] → [0, 1, 2]. OrdinalEncoder can be applied to multiple columns at a time.

    • Ordinal encoding replaces each ordered value with a number (0, 1, 2, …), keeping a single column per feature.

  • OneHotEncoder() - for categorical data. Performs one-hot encoding of mutually exclusive categorical variables.

    • One-hot encoding creates a new feature (column) for each possible value. Each row has a 1 in exactly one of those columns and 0 in the others, hence ‘one hot’.
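Before moving on, here is a minimal sketch of the two encoders applied to one column each. The toy DataFrame and its column names (priority, pet) are made up for this illustration and are not part of the roster example further down.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one ordinal column and one categorical column
toy = pd.DataFrame({
    'priority': ['low', 'high', 'med', 'low'],
    'pet': ['cat', 'dog', 'parrot', 'dog'],
})

# Ordinal: we supply the order ourselves, one list per column
ord_enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
priority_encoded = ord_enc.fit_transform(toy[['priority']])
print(priority_encoded.ravel())   # each value becomes its position in our list

# One-hot: one new column per distinct value
onehot_enc = OneHotEncoder()
# The default output is sparse, so we convert it to a dense array for display
pet_encoded = onehot_enc.fit_transform(toy[['pet']]).toarray()
print(onehot_enc.categories_)     # the categories found, in column order
print(pet_encoded)                # exactly one 1 per row
```

Note that the ordinal result stays in a single column, while the one-hot result has as many columns as there are distinct pets.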

And one more:

  • LabelEncoder() - for categorical data. Label encoding is intended only for the label/target y, so it can only be applied to one column at a time.
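A short sketch of label encoding on a target vector, assuming a made-up list of class labels y. LabelEncoder takes a 1D array of labels rather than a table of features.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels, for illustration only
y = ['dog', 'cat', 'parrot', 'cat']

label_enc = LabelEncoder()
y_encoded = label_enc.fit_transform(y)   # 1D input, 1D output

print(label_enc.classes_)   # the classes in sorted order; the position is the code
print(y_encoded)            # integer code for each label
print(label_enc.inverse_transform(y_encoded))   # back to the original strings
```

The codes follow the sorted order of the class names, so they carry no ordinal meaning. That is fine for a target y, but it is one reason not to use LabelEncoder on ordinal features.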

We can apply different transforms to different columns of our data using ColumnTransformer().

Let’s look at an example.

import pandas as pd
import numpy as np

grade_roster = dict(
    student_name = ['Alma', 'Barak', 'Connell', 'Devon', 'Erin', 'Felicia'],
    student_level = ['first-year', 'senior', 'sophomore', 'junior', 'senior', 'sophomore'], 
    student_grade = ['A', 'A+', 'B+', 'B', 'C+', 'A'],
    student_major = ['Business', 'Econ', 'Business', 'CS', 'CS', 'Econ']
)

grade_roster_df = pd.DataFrame(grade_roster)
grade_roster_df
  student_name student_level student_grade student_major
0         Alma    first-year             A      Business
1        Barak        senior            A+          Econ
2      Connell     sophomore            B+      Business
3        Devon        junior             B            CS
4         Erin        senior            C+            CS
5      Felicia     sophomore             A          Econ

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# Encoding Ordinal Features
levels = ['first-year', 'sophomore', 'junior', 'senior']
grades = ['F', 'D-', 'D', 'D+', 'C-', 'C', 'C+',
          'B-', 'B', 'B+', 'A-', 'A', 'A+']

ord_features = ['student_level', 'student_grade']
ordEnc = OrdinalEncoder(categories = [levels, grades])

# Encoding non-ordinal features
cat_features = ['student_major']
oneHotEnc = OneHotEncoder()

coltrans = ColumnTransformer(
    transformers=[ 
        ("ord", ordEnc, ord_features),   # (nickname, transformer object variable, which columns to apply to)
        ("onehot", oneHotEnc, cat_features)
        ],
    remainder = 'drop',  # what to do with any feature not listed above
    verbose_feature_names_out=False)

X_trans = coltrans.fit_transform(grade_roster_df)
X_trans
array([[ 0., 11.,  1.,  0.,  0.],
       [ 3., 12.,  0.,  0.,  1.],
       [ 1.,  9.,  1.,  0.,  0.],
       [ 2.,  8.,  0.,  1.,  0.],
       [ 3.,  6.,  0.,  1.,  0.],
       [ 1., 11.,  0.,  0.,  1.]])
new_feature_names = coltrans.get_feature_names_out()
new_feature_names
array(['student_level', 'student_grade', 'student_major_Business',
       'student_major_CS', 'student_major_Econ'], dtype=object)
grade_roster_df2 = pd.DataFrame(X_trans, columns = new_feature_names)
grade_roster_df2
  student_level student_grade student_major_Business student_major_CS student_major_Econ
0           0.0          11.0                    1.0              0.0                0.0
1           3.0          12.0                    0.0              0.0                1.0
2           1.0           9.0                    1.0              0.0                0.0
3           2.0           8.0                    0.0              1.0                0.0
4           3.0           6.0                    0.0              1.0                0.0
5           1.0          11.0                    0.0              0.0                1.0