14. PCA: Examples and Observations

Principal Component Analysis (PCA) is one of the more complex concepts in data science. In this notebook, we work through some examples and make general observations about the (beneficial) effects and uses of PCA.

14.1. The Smoothie Bar: A PCA Analogy

Menu — Individual Ingredients with Measurements

| # | Name | Ingredients & Amounts |
|---|------|-----------------------|
| 1 | Purple Rain | Acai 4oz, Banana 1, Peanut Butter 2 tbsp, Strawberry ½ cup, Blueberry ½ cup, Raspberry ½ cup |
| 2 | Morning Warrior | Greek Yogurt 6oz, Protein Powder 1 scoop, Almond Milk 8oz, Maca Powder 1 tsp, Espresso 1oz, Ginger ½ tsp |
| 3 | Green Goddess | Spinach 1 cup, Kale ½ cup, Cucumber 3oz, Orange Juice 4oz, Lime Juice 1 tbsp, Lemon 1 tbsp, Bee Pollen 1 tsp, Chia Seeds 1 tbsp, Spirulina 1 tsp |
| 4 | Tropical Storm | Coconut Milk 8oz, Pineapple ½ cup, Mango ½ cup, Peach ½ cup, Cherry ¼ cup, Apricot ¼ cup |
| 5 | Berry Protein | Greek Yogurt 6oz, Protein Powder 1 scoop, Almond Milk 8oz, Strawberry ½ cup, Blueberry ½ cup, Raspberry ½ cup |
| 6 | Island Dream | Coconut Milk 8oz, Pineapple ½ cup, Mango ½ cup, Orange Juice 4oz, Lime Juice 1 tbsp, Lemon 1 tbsp |
| 7 | Acai Superfood | Acai 4oz, Banana 1, Peanut Butter 2 tbsp, Bee Pollen 1 tsp, Chia Seeds 1 tbsp, Spirulina 1 tsp |
| 8 | Choco Thunder | Greek Yogurt 6oz, Protein Powder 1 scoop, Almond Milk 8oz, Cacao Powder 2 tbsp, Cacao Nibs 1 tbsp, Honey 1 tbsp, Almond Butter 2 tbsp, Rolled Oats ¼ cup |
| 9 | Detox Green | Spinach 1 cup, Kale ½ cup, Cucumber 3oz, Orange Juice 4oz, Lime Juice 1 tbsp, Lemon 1 tbsp, Maca Powder 1 tsp, Espresso 1oz, Ginger ½ tsp |
| 10 | Sunrise | Acai 4oz, Banana 1, Peanut Butter 2 tbsp, Peach ½ cup, Cherry ¼ cup, Apricot ¼ cup, Orange Juice 4oz, Lime Juice 1 tbsp, Lemon 1 tbsp |
| 11 | Jungle Juice | Coconut Milk 6oz, Pineapple ½ cup, Mango ½ cup, Spinach 1 cup, Kale ½ cup, Cucumber 3oz |
| 12 | Power Shake | Greek Yogurt 6oz, Protein Powder 2 scoops, Almond Milk 8oz, Cacao Powder 2 tbsp, Cacao Nibs 1 tbsp, Maca Powder 1 tsp, Espresso 1oz, Ginger ½ tsp |
| 13 | Beachside | Coconut Milk 8oz, Pineapple ½ cup, Mango ½ cup, Strawberry ½ cup, Blueberry ½ cup, Raspberry ½ cup |
| 14 | Zen Garden | Spinach 1 cup, Kale ½ cup, Cucumber 3oz, Honey 1 tbsp, Almond Butter 2 tbsp, Rolled Oats ¼ cup, Bee Pollen 1 tsp, Chia Seeds 1 tbsp, Spirulina 1 tsp |
| 15 | Triple Berry | Acai 4oz, Banana 1, Peanut Butter 2 tbsp, Strawberry ½ cup, Blueberry ½ cup, Raspberry ½ cup, Honey 1 tbsp, Almond Butter 2 tbsp, Rolled Oats ¼ cup |
| 16 | Coco Loco | Coconut Milk 8oz, Pineapple ½ cup, Mango ½ cup, Cacao Powder 2 tbsp, Cacao Nibs 1 tbsp, Honey 1 tbsp, Almond Butter 2 tbsp, Rolled Oats ¼ cup |
| 17 | Citrus Protein | Greek Yogurt 6oz, Protein Powder 1 scoop, Almond Milk 8oz, Orange Juice 4oz, Lime Juice 1 tbsp, Lemon 1 tbsp, Bee Pollen 1 tsp, Chia Seeds 1 tbsp, Spirulina 1 tsp |
| 18 | Green Machine | Spinach 1 cup, Kale ½ cup, Cucumber 3oz, Greek Yogurt 4oz, Protein Powder 1 scoop, Almond Milk 6oz, Bee Pollen 1 tsp, Chia Seeds 1 tbsp, Spirulina 1 tsp |
| 19 | Peachy Keen | Acai 4oz, Banana 1, Peanut Butter 2 tbsp, Peach ½ cup, Cherry ¼ cup, Apricot ¼ cup, Honey 1 tbsp, Almond Butter 2 tbsp, Rolled Oats ¼ cup |
| 20 | The Works | Acai 4oz, Banana 1, Peanut Butter 2 tbsp, Coconut Milk 4oz, Pineapple ¼ cup, Mango ¼ cup, Strawberry ½ cup, Blueberry ½ cup, Raspberry ½ cup, Bee Pollen 1 tsp, Chia Seeds 1 tbsp, Spirulina 1 tsp |


Principal Components with Measurements

| PC Name | Category | Ingredients & Amounts |
|---------|----------|-----------------------|
| Acai Base | Base | Acai 4oz, Banana 1, Peanut Butter 2 tbsp |
| Protein Base | Base | Greek Yogurt 6oz, Protein Powder 1 scoop, Almond Milk 8oz |
| Green Base | Base | Spinach 1 cup, Kale ½ cup, Cucumber 3oz |
| Tropical Base | Base | Coconut Milk 8oz, Pineapple ½ cup, Mango ½ cup |
| Berry Blend | Flavor | Strawberry ½ cup, Blueberry ½ cup, Raspberry ½ cup |
| Citrus Zing | Flavor | Orange Juice 4oz, Lime Juice 1 tbsp, Lemon 1 tbsp |
| Stone Fruit | Flavor | Peach ½ cup, Cherry ¼ cup, Apricot ¼ cup |
| Dark Chocolate | Flavor | Cacao Powder 2 tbsp, Cacao Nibs 1 tbsp |
| Superfood Boost | Add-on | Bee Pollen 1 tsp, Chia Seeds 1 tbsp, Spirulina 1 tsp |
| Energy Boost | Add-on | Maca Powder 1 tsp, Espresso 1oz, Ginger ½ tsp |
| Creamy Boost | Add-on | Honey 1 tbsp, Almond Butter 2 tbsp, Rolled Oats ¼ cup |

Menu Re-expressed as Principal Components

| # | Name | Base(s) | Flavor(s) | Add-on(s) |
|---|------|---------|-----------|-----------|
| 1 | Purple Rain | Acai Base (1×) | Berry Blend (1×) | |
| 2 | Morning Warrior | Protein Base (1×) | | Energy Boost (1×) |
| 3 | Green Goddess | Green Base (1×) | Citrus Zing (1×) | Superfood Boost (1×) |
| 4 | Tropical Storm | Tropical Base (1×) | Stone Fruit (1×) | |
| 5 | Berry Protein | Protein Base (1×) | Berry Blend (1×) | |
| 6 | Island Dream | Tropical Base (1×) | Citrus Zing (1×) | |
| 7 | Acai Superfood | Acai Base (1×) | | Superfood Boost (1×) |
| 8 | Choco Thunder | Protein Base (1×) | Dark Chocolate (1×) | Creamy Boost (1×) |
| 9 | Detox Green | Green Base (1×) | Citrus Zing (1×) | Energy Boost (1×) |
| 10 | Sunrise | Acai Base (1×) | Stone Fruit (1×), Citrus Zing (1×) | |
| 11 | Jungle Juice | Tropical Base (0.75×), Green Base (1×) | | |
| 12 | Power Shake | Protein Base (1×) + extra scoop | Dark Chocolate (1×) | Energy Boost (1×) |
| 13 | Beachside | Tropical Base (1×) | Berry Blend (1×) | |
| 14 | Zen Garden | Green Base (1×) | | Superfood Boost (1×), Creamy Boost (1×) |
| 15 | Triple Berry | Acai Base (1×) | Berry Blend (1×) | Creamy Boost (1×) |
| 16 | Coco Loco | Tropical Base (1×) | Dark Chocolate (1×) | Creamy Boost (1×) |
| 17 | Citrus Protein | Protein Base (1×) | Citrus Zing (1×) | Superfood Boost (1×) |
| 18 | Green Machine | Green Base (1×), Protein Base (0.67×) | | Superfood Boost (1×) |
| 19 | Peachy Keen | Acai Base (1×) | Stone Fruit (1×) | Creamy Boost (1×) |
| 20 | The Works | Acai Base (1×), Tropical Base (0.5×) | Berry Blend (1×) | Superfood Boost (1×) |
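
The re-expressed menu can be read as a linear combination: each smoothie is a short list of coefficients applied to fixed ingredient bundles. The sketch below illustrates this for three of the bundles, using a reduced ingredient list. The array names (`components`, `triple_berry`, and so on) are made up for this illustration. The bundles share no ingredients, so their vectors are orthogonal, as true principal components are. Unlike true PCs, they are not unit length, and nothing here is mean-centered.

```python
import numpy as np

# Ingredient order used for every vector below (a subset of the menu, for brevity).
ingredients = ["Acai (oz)", "Banana", "Peanut Butter (tbsp)",
               "Strawberry (cup)", "Blueberry (cup)", "Raspberry (cup)",
               "Honey (tbsp)", "Almond Butter (tbsp)", "Rolled Oats (cup)"]

# Each row is one "component": a fixed bundle of ingredients from the table above.
components = np.array([
    [4, 1, 2, 0,   0,   0,   0, 0, 0],     # Acai Base
    [0, 0, 0, 0.5, 0.5, 0.5, 0, 0, 0],     # Berry Blend
    [0, 0, 0, 0,   0,   0,   1, 2, 0.25],  # Creamy Boost
])

# Coefficients: how much of each component goes into a smoothie.
purple_rain = np.array([1, 1, 0])    # Acai Base + Berry Blend
triple_berry = np.array([1, 1, 1])   # Acai Base + Berry Blend + Creamy Boost

# Coefficients times components gives back the full ingredient list.
recipe = triple_berry @ components
for name, amount in zip(ingredients, recipe):
    print(f"{name:22s} {amount:g}")
```

Three numbers describe Triple Berry here instead of nine, which is the kind of compression PCA aims for.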

Working through the examples, observe:

  • Features that vary together load on the same principal component. Because the PC scores are mutually uncorrelated, replacing correlated features with their PCs removes the collinearity among them.

  • You can approximately reconstruct the original feature values using only the first few PCs rather than all of them.
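
Both observations can be checked numerically. The following is a minimal sketch on hypothetical toy data (two hidden factors driving four features, plus a little noise), not on any dataset from this notebook. The seed and all variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n = 500

# Two hidden factors drive four observed features (hypothetical toy data).
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
noise = 0.05 * rng.normal(size=(n, 4))
X = np.column_stack([f1, 2 * f1, f2, -f2]) + noise

pca = PCA().fit(X)
scores = pca.transform(X)

# Observation 1: the PC scores are mutually uncorrelated,
# even though the original features are strongly correlated.
feature_corr = np.corrcoef(X, rowvar=False)
score_corr = np.corrcoef(scores, rowvar=False)
print("corr(feature 0, feature 1):", round(feature_corr[0, 1], 3))
print("largest off-diagonal score correlation:",
      np.abs(score_corr - np.eye(4)).max())

# Observation 2: two PCs are enough to rebuild X almost exactly.
pca2 = PCA(n_components=2).fit(X)
X_hat = pca2.inverse_transform(pca2.transform(X))
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X - X.mean(axis=0))
print("relative reconstruction error with 2 PCs:", rel_err)
```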

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

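def plotPCA(data, pca=None, num_components=None, idx=0, feature_names=None):
    """Plot the PCA components (top row) and the step-by-step
    reconstruction of one data instance, data[idx] (bottom row).

    Returns the fitted PCA object, the figure, and the array of axes.
    """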

    if pca is None:
        if num_components is None:
            pca = PCA()
        else:
            pca = PCA(n_components = num_components)
    
    X = pca.fit_transform(data)

    if num_components is None:
        num_components = pca.n_components_
    
    
    if isinstance(data, pd.DataFrame):
        x0 = data.iloc[idx,:]
    else:
        x0 = data[idx,:]
        
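    # running reconstruction; float dtype so the in-place += below also works for integer data
    x_pca = np.zeros(data.shape[1], dtype=float)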
    
    coeffs = X[idx,:]
    
    if feature_names is None:
        try:
            # only set when the PCA was fitted on a DataFrame
            feature_names = pca.feature_names_in_
        except AttributeError:
            feature_names = [str(i) for i in range(data.shape[1])]

    C0 = 'goldenrod'
    C1 = 'dodgerblue'
    alpha = 0.6
    

    offset = 2
    num_cols = num_components + offset
    width_ratios = [1, 0.2] + [1]*num_components
        
    fig, ax = plt.subplots(2, num_cols, figsize = ((num_cols)*2, 8), 
                        sharex = True, sharey = False,
                        gridspec_kw = dict(width_ratios=width_ratios,
                                            hspace = 0.1, wspace = 0.1)                       
                        )
    
    if not np.all(np.isclose(pca.mean_, 0)):
        ax[0, 0].bar(feature_names, pca.mean_, color = C1, alpha = alpha)
        ax[0, 0].bar(feature_names, x0, color = C0, alpha = alpha)
        
        ax[0, 0].sharey(ax[1,offset])
    
    ax[0,0].set_title('Mean')

    ax[1, 0].bar(feature_names, x0-pca.mean_, color = C0, alpha = alpha)
    ax[1, 0].bar(feature_names, x_pca, color = C1, alpha = alpha)
    ax[1, 0].set_xlabel('Features - Mean = ', fontsize = 16)
    ax[1, 0].tick_params(axis='x', labelrotation = 90, )
    ax[1, 0].sharey(ax[1,offset])


    # ax[0, 2].set_title(f'PC1')   

    ax[0,1].set_visible(False)
    ax[1,1].set_visible(False)

    for r_idx, (r, coef) in enumerate(zip(pca.components_, coeffs)):
        x_pca += coef * r
        ax[0, r_idx+offset].bar(feature_names, r, color = C1, alpha = alpha)
        ax[0, r_idx+offset].text(0.5, 0.05, f'coef = {coef:0.2f}', ha = 'center', transform=ax[0, r_idx+offset].transAxes)
        
        ax[1, r_idx+offset].bar(feature_names, x0-pca.mean_, color = C0, alpha = alpha)
        ax[1, r_idx+offset].bar(feature_names, x_pca, alpha = alpha, color = C1)
        
        if r_idx == 0:
            ax_str = f'{coef:0.2f} * PC1'
        
        if r_idx>0:
            ax[0, r_idx+offset].sharey(ax[0,offset])
            ax[1, r_idx+offset].sharey(ax[1,offset])
            plt.setp(ax[0,r_idx+offset].get_yticklabels(), visible=False)
            plt.setp(ax[1,r_idx+offset].get_yticklabels(), visible=False)
            
            # same fixed-point format as the 'coef = ...' labels above
            if coef > 0:
                ax_str = f' + {coef:0.2f} * PC{r_idx+1}'
            elif coef < 0:
                ax_str = f' - {-1*coef:0.2f} * PC{r_idx+1}'
            else:
                ax_str = f' + 0'
                        
        ax[0, r_idx+offset].set_title(f'PC{r_idx+1}')   
        ax[1, r_idx+offset].set_xlabel(ax_str, fontsize = 16)
        ax[1, r_idx+offset].tick_params(axis='x', labelrotation = 90, )


    ax[0, offset].set_ylim([-1, 1])
    fig.suptitle('Principal components (top row) and a sample reconstruction (bottom row)')

    return pca, fig, ax
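
The bottom row of the figure drawn by plotPCA accumulates one scaled component at a time: the centered sample is approximated by coef_1 * PC1 + coef_2 * PC2 + ..., where the coefficients are that sample's row of the transformed data. The sketch below reproduces that accumulation without any plotting. It uses hypothetical random data and a fixed seed, both assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # hypothetical toy data

pca = PCA().fit(X)
idx = 0
coeffs = pca.transform(X)[idx]   # one coefficient per PC for this sample

# Add one scaled PC at a time, as the bottom row of the figure does.
x_hat = np.zeros(X.shape[1])
for k, (pc, coef) in enumerate(zip(pca.components_, coeffs), start=1):
    x_hat += coef * pc
    err = np.linalg.norm((X[idx] - pca.mean_) - x_hat)
    print(f"after PC{k}: residual norm = {err:.4f}")
```

With all components included, the residual should vanish up to floating-point error. Stopping early gives the approximate reconstruction.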

14.1.1. Example 0: Synthetic data (boxes)

num_data = 100

boxes_dict = {'L_ft': 5*np.random.randn(num_data),
              'W_ft': 10*np.random.randn(num_data)
              }

boxes_df = pd.DataFrame(boxes_dict)
boxes_df
L_ft W_ft
0 -1.507725 -8.747656
1 -2.087854 -6.173149
2 1.611289 1.033444
3 -4.126524 -0.973193
4 2.107764 -12.159853
... ... ...
95 -5.899326 1.027081
96 -6.501513 1.506319
97 1.828042 1.278987
98 2.147934 -2.109605
99 -9.168768 6.091723

100 rows × 2 columns

boxes_df[['L_in', 'W_in']] = boxes_df[['L_ft','W_ft']]*12
boxes_df['P_ft'] = 2*(boxes_df['L_ft']+boxes_df['W_ft'])
boxes_df['P_in'] = 2*(boxes_df['L_in']+boxes_df['W_in'])

boxes_df
L_ft W_ft L_in W_in P_ft P_in
0 -1.507725 -8.747656 -18.092697 -104.971877 -20.510762 -246.129148
1 -2.087854 -6.173149 -25.054250 -74.077787 -16.522006 -198.264074
2 1.611289 1.033444 19.335466 12.401324 5.289465 63.473579
3 -4.126524 -0.973193 -49.518293 -11.678312 -10.199434 -122.393210
4 2.107764 -12.159853 25.293172 -145.918232 -20.104177 -241.250120
... ... ... ... ... ... ...
95 -5.899326 1.027081 -70.791912 12.324976 -9.744489 -116.933873
96 -6.501513 1.506319 -78.018161 18.075824 -9.990390 -119.884675
97 1.828042 1.278987 21.936501 15.347843 6.214057 74.568688
98 2.147934 -2.109605 25.775213 -25.315255 0.076660 0.919916
99 -9.168768 6.091723 -110.025214 73.100678 -6.154089 -73.849072

100 rows × 6 columns

pca, fig, ax = plotPCA(boxes_df)
plt.show()
[Figure: principal components (top row) and a sample reconstruction (bottom row) for the unscaled boxes data]
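
PCA looks for directions of maximum variance, so a column expressed in a smaller unit (inches rather than feet) has a numerically larger variance and tends to dominate the leading component. The sketch below compares the first component's loadings before and after standardizing. It rebuilds a feet/inches frame like boxes_df with a fixed seed, and the seed and the names `raw`, `scaled`, `load_raw`, `load_scaled` are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # seed chosen only for repeatability
df = pd.DataFrame({"L_ft": 5 * rng.standard_normal(100),
                   "W_ft": 10 * rng.standard_normal(100)})
df["L_in"] = 12 * df["L_ft"]
df["W_in"] = 12 * df["W_ft"]

raw = PCA().fit(df)                                      # no scaling
scaled = PCA().fit(StandardScaler().fit_transform(df))   # zero mean, unit variance

# Absolute loadings of the first principal component in each case.
load_raw = pd.Series(np.abs(raw.components_[0]), index=df.columns)
load_scaled = pd.Series(np.abs(scaled.components_[0]), index=df.columns)
print(pd.DataFrame({"raw": load_raw, "standardized": load_scaled}).round(3))
```

On raw data, the inch column carries twelve times the loading of its feet copy. After standardizing, the two columns are identical and receive equal weight.
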
ss = StandardScaler()
boxes_scaled = ss.fit_transform(boxes_df)
boxes_scaled_df = pd.DataFrame(boxes_scaled, columns = boxes_df.columns)

pca_scaled, fig, ax = plotPCA(boxes_scaled_df)
plt.show()
[Figure: principal components (top row) and a sample reconstruction (bottom row) for the standardized boxes data]
plt.plot(pca_scaled.explained_variance_, 'b.')
plt.show()
[Figure: explained variance of each principal component for the standardized boxes data]
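
All six box columns are built from just two independent quantities (length and width), so only two components should carry variance. A minimal sketch of checking this with `explained_variance_ratio_` follows. It rebuilds the same six columns with a fixed seed, and the 0.999 threshold and the variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # seed chosen only for repeatability
L = 5 * rng.standard_normal(100)
W = 10 * rng.standard_normal(100)
# Same six columns as boxes_df: feet, inches, and the two perimeters.
X = np.column_stack([L, W, 12 * L, 12 * W, 2 * (L + W), 24 * (L + W)])

pca = PCA().fit(StandardScaler().fit_transform(X))
ratio = pca.explained_variance_ratio_   # fraction of total variance per PC
cumulative = np.cumsum(ratio)

# Smallest number of PCs whose cumulative share reaches the threshold.
n_needed = int(np.searchsorted(cumulative, 0.999) + 1)
print("explained variance ratio:", np.round(ratio, 4))
print("components needed for 99.9% of the variance:", n_needed)
```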

14.1.2. Example 1: Macro-nutrients

macros_df = pd.read_csv('https://raw.githubusercontent.com/f-imp/Principal-Component-Analysis-PCA-over-3-datasets/refs/heads/master/datasets/Pizza.csv')
macros_df.head()
brand id mois prot fat ash sodium carb cal
0 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
1 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
2 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
3 A 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
4 A 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
features_to_keep = ['mois', 'prot', 'fat', 'ash', 'sodium', 'carb', 'cal']

macros_df = macros_df[features_to_keep]
macros_df
mois prot fat ash sodium carb cal
0 27.82 21.43 44.87 5.11 1.77 0.77 4.93
1 28.49 21.26 43.89 5.34 1.79 1.02 4.84
2 28.35 19.99 45.78 5.08 1.63 0.80 4.95
3 30.55 20.15 43.13 4.79 1.61 1.38 4.74
4 30.49 21.28 41.65 4.82 1.64 1.76 4.67
... ... ... ... ... ... ... ...
295 44.91 11.07 17.00 2.49 0.66 25.36 2.91
296 43.15 11.79 18.46 2.43 0.67 24.17 3.10
297 44.55 11.01 16.03 2.43 0.64 25.98 2.92
298 47.60 10.43 15.18 2.32 0.56 24.47 2.76
299 46.84 9.91 15.50 2.27 0.57 25.48 2.81

300 rows × 7 columns

pca = PCA(n_components = 6)
macros_pca = pca.fit_transform(macros_df)

pca_df = pd.DataFrame(data = pca.components_, columns = macros_df.columns)
pca_df
mois prot fat ash sodium carb cal
0 -0.276963 -0.266941 -0.278934 -0.055434 -0.011142 0.878084 -0.000603
1 0.747074 -0.055733 -0.657845 -0.040604 -0.023814 0.006818 -0.061254
2 -0.352016 0.809718 -0.467976 0.022225 -0.026245 -0.012469 -0.010062
3 -0.195900 -0.255747 -0.259802 0.871443 0.201453 -0.164525 -0.040678
4 0.059475 0.083719 0.035776 -0.166634 0.978316 0.057470 0.001497
5 0.440974 0.443490 0.448624 0.450220 -0.030463 0.444405 -0.080452
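
One way to read a table of components is to ask which feature carries the largest absolute loading in each row. The sketch below does this for the first three rows of the output above, with the values copied by hand. The names `loadings` and `dominant` are assumptions made for the illustration.

```python
import pandas as pd

features = ["mois", "prot", "fat", "ash", "sodium", "carb", "cal"]

# First three rows of pca_df, copied from the output above.
loadings = pd.DataFrame(
    [[-0.276963, -0.266941, -0.278934, -0.055434, -0.011142, 0.878084, -0.000603],
     [0.747074, -0.055733, -0.657845, -0.040604, -0.023814, 0.006818, -0.061254],
     [-0.352016, 0.809718, -0.467976, 0.022225, -0.026245, -0.012469, -0.010062]],
    columns=features, index=["PC1", "PC2", "PC3"])

# Feature with the largest absolute loading in each component.
dominant = loadings.abs().idxmax(axis=1)
print(dominant)

# Each component is a unit vector, so the squared loadings sum to about 1.
print((loadings ** 2).sum(axis=1).round(4))
```

Because this PCA was fitted on unscaled data, features with small numeric ranges such as sodium and cal barely register in these leading components.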
feature_names = list(macros_df.columns)
ss = StandardScaler()

X = ss.fit_transform(macros_df)
pca, fig, ax = plotPCA(X, feature_names = feature_names)
plt.show()
[Figure: principal components (top row) and a sample reconstruction (bottom row) for the standardized macro-nutrient data]

14.1.3. Example 2: Automobile specs

cars_df = pd.read_csv('https://raw.githubusercontent.com/shreyamdg/automobile-data-set-analysis/refs/heads/master/cars.csv')
cars_df.head()
Unnamed: 0 symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 0 3 ? alfa-romero gas std two convertible rwd front ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 1 3 ? alfa-romero gas std two convertible rwd front ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 2 1 ? alfa-romero gas std two hatchback rwd front ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 3 2 164 audi gas std four sedan fwd front ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 4 2 164 audi gas std four sedan 4wd front ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 27 columns

specs_df = cars_df.select_dtypes(include = 'number').drop(columns = ['Unnamed: 0', 'symboling'])
names_df = cars_df[['make',  'num-of-doors', 'body-style', 'fuel-type']]
idx = 7

ss = StandardScaler()
X = ss.fit_transform(specs_df)
feature_names = list(specs_df.columns)

display(names_df.iloc[[idx]])

plotPCA(X, num_components = 5, feature_names = feature_names, idx = idx)
plt.show()
make num-of-doors body-style fuel-type
7 audi four wagon gas
[Figure: first 5 principal components (top row) and the reconstruction of sample 7 (bottom row) for the standardized automobile specs]
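
Note that `select_dtypes(include = 'number')` keeps only columns pandas already stores as numbers. A column such as normalized-losses, which uses '?' for missing values, is read as text and is therefore dropped from specs_df. If such a column were wanted in the PCA, one possible approach is to coerce it with `pd.to_numeric`. This is a sketch on a small made-up frame, not on cars.csv itself, and the values are illustrative.

```python
import pandas as pd

# Toy stand-in for a few cars.csv columns (values are illustrative, not from the file).
toy = pd.DataFrame({
    "normalized-losses": ["?", "164", "158"],
    "horsepower": [111, 102, 115],
    "make": ["alfa-romero", "audi", "audi"],
})

# As in the cell above: only columns already stored as numbers survive.
print(list(toy.select_dtypes(include="number").columns))

# Coerce the '?'-marked column to numbers; '?' becomes NaN.
toy["normalized-losses"] = pd.to_numeric(toy["normalized-losses"], errors="coerce")
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))
print(numeric.isna().sum())
```

scikit-learn's PCA does not accept NaN values, so the resulting gaps would still need to be dropped or imputed before fitting.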