What is LDA (Linear Discriminant Analysis)?

The idea behind LDA is simple. Mathematically speaking, we need to find a new feature space to project the data in order to maximize classes separability.

Linear Discriminant Analysis is a supervised algorithm as it take the class label in consideration. It is a way to reduce 'dimensionality' while at the same time preserving as much of the class discrimination information as possible.

Basically LDA finds a centroid of each class datapoints. For example with thirteen different features LDA will find the centroid of each of its class using the thirteen different dataset. Now on the basis of this, It determines a new dimension which is nothing but an axis which should satisfy two criteria:

  1. Maximize the distance between the centroid of each class.
  2. Minimize the variation (which LDA calls scatter and is represented by s2), within each category.

Whati is PCA

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It tries to preserve the essential parts that have more variation of the data and remove the non-essential parts with fewer variation.

When to use PCA

  • Data Visualization
  • Speeding Machine Learning(ML) Algorithm

How to do PCA

We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.

When creating the class, the number of components can be specified as a parameter.

The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function.

Once fit, the eigenvalues and principal components can be accessed on the PCA class via the explained_variance__ and components_attributes.

eigenvales:固有値

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
In [0]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
In [3]:
data = pd.read_csv('/content/drive/My Drive/kaggle/Feature_Selection_by_Filter_Method/data/santander-train.csv', nrows=20000)
data.head()
Out[3]:
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 imp_op_var40_ult1 imp_op_var41_comer_ult1 imp_op_var41_comer_ult3 imp_op_var41_efect_ult1 imp_op_var41_efect_ult3 imp_op_var41_ult1 imp_op_var39_efect_ult1 imp_op_var39_efect_ult3 imp_op_var39_ult1 imp_sal_var16_ult1 ind_var1_0 ind_var1 ind_var2_0 ind_var2 ind_var5_0 ind_var5 ind_var6_0 ind_var6 ind_var8_0 ind_var8 ind_var12_0 ind_var12 ind_var13_0 ind_var13_corto_0 ind_var13_corto ind_var13_largo_0 ind_var13_largo ind_var13_medio_0 ind_var13_medio ind_var13 ... saldo_medio_var5_ult1 saldo_medio_var5_ult3 saldo_medio_var8_hace2 saldo_medio_var8_hace3 saldo_medio_var8_ult1 saldo_medio_var8_ult3 saldo_medio_var12_hace2 saldo_medio_var12_hace3 saldo_medio_var12_ult1 saldo_medio_var12_ult3 saldo_medio_var13_corto_hace2 saldo_medio_var13_corto_hace3 saldo_medio_var13_corto_ult1 saldo_medio_var13_corto_ult3 saldo_medio_var13_largo_hace2 saldo_medio_var13_largo_hace3 saldo_medio_var13_largo_ult1 saldo_medio_var13_largo_ult3 saldo_medio_var13_medio_hace2 saldo_medio_var13_medio_hace3 saldo_medio_var13_medio_ult1 saldo_medio_var13_medio_ult3 saldo_medio_var17_hace2 saldo_medio_var17_hace3 saldo_medio_var17_ult1 saldo_medio_var17_ult3 saldo_medio_var29_hace2 saldo_medio_var29_hace3 saldo_medio_var29_ult1 saldo_medio_var29_ult3 saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
0 1 2 23 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 39205.170000 0
1 3 2 34 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 ... 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 300.0 122.22 300.0 240.75 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 49278.030000 0
2 4 2 23 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 3.00 2.07 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 67333.770000 0
3 8 2 37 0.0 195.0 195.0 0.0 0.0 0 0 0.0 195.0 195.0 0.0 0.0 195.0 0.0 0.0 195.0 0.0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 91.56 138.84 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 64007.970000 0
4 10 2 39 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 ... 40501.08 13501.47 0.0 0.0 0.0 0.0 0.0 0.0 85501.89 85501.89 0.0 0.00 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 117310.979016 0

5 rows × 371 columns

In [4]:
x = data.drop('TARGET', axis=1)
y = data['TARGET']

x.shape, y.shape
Out[4]:
((20000, 370), (20000,))
In [0]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, stratify=y)

Remove Constant, Quasi Constant and Duplicate Features

In [0]:
# remove constant and quasi constant features
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(x_train)
x_train_filter = constant_filter.transform(x_train)
x_test_filter = constant_filter.transform(x_test)
In [7]:
x_train_filter.shape, x_test_filter.shape
Out[7]:
((16000, 245), (4000, 245))
In [0]:
# remove duplicated features
x_train_T = x_train_filter.T
x_test_T = x_test_filter.T
In [0]:
x_train_T = pd.DataFrame(x_train_T)
x_test_T = pd.DataFrame(x_test_T)
In [10]:
x_train_T.duplicated().sum()
Out[10]:
18
In [0]:
duplicated_features = x_train_T.duplicated()
In [0]:
features_to_keep = [not index for index in duplicated_features]

x_train_unique = x_train_T[features_to_keep].T
x_test_unique = x_test_T[features_to_keep].T
In [0]:
# standardization
scaler = StandardScaler().fit(x_train_unique)
x_train_unique = scaler.transform(x_train_unique)
x_test_unique = scaler.transform(x_test_unique)
In [0]:
x_train_unique = pd.DataFrame(x_train_unique)
x_test_unique = pd.DataFrame(x_test_unique)
In [17]:
x_train_unique.shape, x_test_unique.shape
Out[17]:
((16000, 227), (4000, 227))

Removal of coorelated Feature

In [0]:
corrmat = x_train_unique.corr()
In [19]:
# find correlated features
def get_correlation(data, threshold):
    corr_col = set()
    corrmat = data.corr()
    for i in range(len(corrmat.columns)):
        for j in range(i):
            if abs(corrmat.iloc[i, j]) > threshold:
                colname = corrmat.columns[i]
                corr_col.add(colname)
    return corr_col

corr_features = get_correlation(x_train_unique, 0.70)
print('correlated features: ', len(set(corr_features)))
correlated features:  148
In [0]:
x_train_uncorr = x_train_unique.drop(labels=corr_features, axis=1)
x_test_uncorr = x_test_unique.drop(labels=corr_features, axis=1)
In [21]:
x_train_uncorr.shape, x_test_uncorr.shape
Out[21]:
((16000, 79), (4000, 79))

Feature Dimension Reduction by LDA or ls it a Classifier

In [0]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
In [0]:
lda = LDA(n_components=1) # クラス数−1の次元数(このデータは2クラス分類なので1次元)にデータが写像される
# ※ LDAはクラス数に基づき次元を削減する。写像先(n_components)の次元は、クラス数-1次元になる。
x_train_lda = lda.fit_transform(x_train_uncorr, y_train)
In [24]:
x_train_lda.shape
Out[24]:
(16000, 1)
In [0]:
x_test_lda = lda.transform(x_test_uncorr)
In [0]:
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on tes set: ')
    print(accuracy_score(y_test, y_pred))
In [27]:
%%time
run_randomForest(x_train_lda, x_test_lda, y_train, y_test)
Accuracy on tes set: 
0.93025
CPU times: user 2.96 s, sys: 15.7 ms, total: 2.97 s
Wall time: 1.67 s
In [28]:
%%time
run_randomForest(x_train, x_test, y_train, y_test)
Accuracy on tes set: 
0.9585
CPU times: user 4.43 s, sys: 31.4 ms, total: 4.47 s
Wall time: 2.39 s

Feature Reduction by PCA

In [0]:
from sklearn.decomposition import PCA
In [30]:
pca = PCA(n_components=2, random_state=42)
# ※ PCAは、特徴量の次元を指定する次元まで落とすので、n_componetsでその写像先の次元数を指定する。
# ここでは、2次元に指定している。
pca.fit(x_test_uncorr)
Out[30]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=42,
    svd_solver='auto', tol=0.0, whiten=False)
In [31]:
x_train_pca = pca.transform(x_train_uncorr)
x_test_pca = pca.transform(x_test_uncorr)
x_train_pca.shape, x_train_uncorr.shape
# pcaによりデータセットが79次元から2次元になった。
Out[31]:
((16000, 2), (16000, 79))
In [32]:
%%time
run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
Accuracy on tes set: 
0.9565
CPU times: user 2.69 s, sys: 25.6 ms, total: 2.72 s
Wall time: 1.55 s
In [33]:
# 次に次元数を3次元にすると以下のようなスコアになる
pca = PCA(n_components=3, random_state=42)
pca.fit(x_test_uncorr)
Out[33]:
PCA(copy=True, iterated_power='auto', n_components=3, random_state=42,
    svd_solver='auto', tol=0.0, whiten=False)
In [34]:
x_train_pca = pca.transform(x_train_uncorr)
x_test_pca = pca.transform(x_test_uncorr)
x_train_pca.shape, x_train_uncorr.shape
# pcaによりデータセットが79次元から3次元になった。
Out[34]:
((16000, 3), (16000, 79))
In [35]:
%%time
run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
# 時間はほとんどかわらず、スコアが少し上昇した。
Accuracy on tes set: 
0.95725
CPU times: user 2.63 s, sys: 17.8 ms, total: 2.65 s
Wall time: 1.46 s

上記のように、PCAの場合、最適な写像先の次元がわからないので、以下のようにforループで最もスコアの高い次元を探してみる。

In [36]:
for component in range(1, 79):
    pca = PCA(n_components=component, random_state=42)
    pca.fit(x_test_uncorr)
    x_train_pca = pca.transform(x_train_uncorr)
    x_test_pca = pca.transform(x_test_uncorr)
    print('Selected Comp: ', component)
    run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
    print()
Selected Comp:  1
Accuracy on tes set: 
0.931

Selected Comp:  2
Accuracy on tes set: 
0.9565

Selected Comp:  3
Accuracy on tes set: 
0.95725

Selected Comp:  4
Accuracy on tes set: 
0.956

Selected Comp:  5
Accuracy on tes set: 
0.955

Selected Comp:  6
Accuracy on tes set: 
0.95575

Selected Comp:  7
Accuracy on tes set: 
0.9565

Selected Comp:  8
Accuracy on tes set: 
0.956

Selected Comp:  9
Accuracy on tes set: 
0.957

Selected Comp:  10
Accuracy on tes set: 
0.956

Selected Comp:  11
Accuracy on tes set: 
0.956

Selected Comp:  12
Accuracy on tes set: 
0.9555

Selected Comp:  13
Accuracy on tes set: 
0.95625

Selected Comp:  14
Accuracy on tes set: 
0.956

Selected Comp:  15
Accuracy on tes set: 
0.95525

Selected Comp:  16
Accuracy on tes set: 
0.95575

Selected Comp:  17
Accuracy on tes set: 
0.9555

Selected Comp:  18
Accuracy on tes set: 
0.95475

Selected Comp:  19
Accuracy on tes set: 
0.9565

Selected Comp:  20
Accuracy on tes set: 
0.95625

Selected Comp:  21
Accuracy on tes set: 
0.95575

Selected Comp:  22
Accuracy on tes set: 
0.9555

Selected Comp:  23
Accuracy on tes set: 
0.95575

Selected Comp:  24
Accuracy on tes set: 
0.95525

Selected Comp:  25
Accuracy on tes set: 
0.95525

Selected Comp:  26
Accuracy on tes set: 
0.95525

Selected Comp:  27
Accuracy on tes set: 
0.955

Selected Comp:  28
Accuracy on tes set: 
0.95575

Selected Comp:  29
Accuracy on tes set: 
0.95575

Selected Comp:  30
Accuracy on tes set: 
0.956

Selected Comp:  31
Accuracy on tes set: 
0.956

Selected Comp:  32
Accuracy on tes set: 
0.9565

Selected Comp:  33
Accuracy on tes set: 
0.956

Selected Comp:  34
Accuracy on tes set: 
0.9565

Selected Comp:  35
Accuracy on tes set: 
0.956

Selected Comp:  36
Accuracy on tes set: 
0.956

Selected Comp:  37
Accuracy on tes set: 
0.956

Selected Comp:  38
Accuracy on tes set: 
0.956

Selected Comp:  39
Accuracy on tes set: 
0.956

Selected Comp:  40
Accuracy on tes set: 
0.956

Selected Comp:  41
Accuracy on tes set: 
0.95575

Selected Comp:  42
Accuracy on tes set: 
0.95575

Selected Comp:  43
Accuracy on tes set: 
0.95625

Selected Comp:  44
Accuracy on tes set: 
0.9555

Selected Comp:  45
Accuracy on tes set: 
0.95575

Selected Comp:  46
Accuracy on tes set: 
0.95525

Selected Comp:  47
Accuracy on tes set: 
0.955

Selected Comp:  48
Accuracy on tes set: 
0.956

Selected Comp:  49
Accuracy on tes set: 
0.9555

Selected Comp:  50
Accuracy on tes set: 
0.95625

Selected Comp:  51
Accuracy on tes set: 
0.95675

Selected Comp:  52
Accuracy on tes set: 
0.95575

Selected Comp:  53
Accuracy on tes set: 
0.9555

Selected Comp:  54
Accuracy on tes set: 
0.9555

Selected Comp:  55
Accuracy on tes set: 
0.955

Selected Comp:  56
Accuracy on tes set: 
0.956

Selected Comp:  57
Accuracy on tes set: 
0.9555

Selected Comp:  58
Accuracy on tes set: 
0.95575

Selected Comp:  59
Accuracy on tes set: 
0.95525

Selected Comp:  60
Accuracy on tes set: 
0.95575

Selected Comp:  61
Accuracy on tes set: 
0.9555

Selected Comp:  62
Accuracy on tes set: 
0.956

Selected Comp:  63
Accuracy on tes set: 
0.957

Selected Comp:  64
Accuracy on tes set: 
0.95525

Selected Comp:  65
Accuracy on tes set: 
0.95475

Selected Comp:  66
Accuracy on tes set: 
0.95525

Selected Comp:  67
Accuracy on tes set: 
0.9555

Selected Comp:  68
Accuracy on tes set: 
0.956

Selected Comp:  69
Accuracy on tes set: 
0.9555

Selected Comp:  70
Accuracy on tes set: 
0.956

Selected Comp:  71
Accuracy on tes set: 
0.95525

Selected Comp:  72
Accuracy on tes set: 
0.9555

Selected Comp:  73
Accuracy on tes set: 
0.95575

Selected Comp:  74
Accuracy on tes set: 
0.95625

Selected Comp:  75
Accuracy on tes set: 
0.95575

Selected Comp:  76
Accuracy on tes set: 
0.955

Selected Comp:  77
Accuracy on tes set: 
0.955

Selected Comp:  78
Accuracy on tes set: 
0.9555

結果:78次元から3次元への写像が一番スコアが良かった。

In [0]: