The idea behind LDA is simple: mathematically, we need to find a new feature space to project the data into in order to maximize class separability.
Linear Discriminant Analysis is a supervised algorithm, as it takes the class labels into consideration. It is a way to reduce dimensionality while preserving as much of the class-discrimination information as possible.
Basically, LDA finds the centroid of each class's data points. For example, with thirteen features, LDA computes each class centroid in that thirteen-dimensional space. On this basis, it determines a new dimension, which is nothing but an axis, that should satisfy two criteria: maximize the distance between the class centroids, and minimize the variation (scatter) within each class.
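The two criteria above can be made concrete with a small sketch. The code below uses synthetic two-class data (not the dataset used later in this notebook) to compute the class centroids, the within-class scatter, and the classic two-class LDA direction `w ∝ S_w⁻¹(m₂ − m₁)`; projecting onto `w` separates the two classes along a single axis. All variable names here are illustrative.

```python
import numpy as np

# Toy 2-class data with two features (synthetic, for illustration only)
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3.0, 1.0], scale=0.5, size=(50, 2))

# Criterion 1: the class centroids, whose separation LDA maximizes
centroid_a = class_a.mean(axis=0)
centroid_b = class_b.mean(axis=0)

# Criterion 2: the within-class scatter, which LDA minimizes
s_w = np.cov(class_a, rowvar=False) + np.cov(class_b, rowvar=False)

# Two-class LDA direction: w is proportional to S_w^{-1} (m2 - m1)
w = np.linalg.solve(s_w, centroid_b - centroid_a)

# Projecting onto w collapses the data to one axis with the classes apart
proj_a = class_a @ w
proj_b = class_b @ w
print(proj_a.mean(), proj_b.mean())
```

With well-separated centroids and small within-class scatter, the projected class means end up far apart relative to the projected spread, which is exactly the ratio LDA optimizes.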
Principal Component Analysis (PCA) is a linear dimensionality-reduction technique that extracts information from a high-dimensional space by projecting it onto a lower-dimensional subspace. It tries to preserve the essential parts of the data, those with more variation, and remove the non-essential parts with less variation.
We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.
When creating the class, the number of components can be specified as a parameter.
The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function.
Once fit, the eigenvalues and principal components can be accessed on the PCA class via the explained_variance_ and components_ attributes.
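The fit/transform workflow described above can be sketched on a small synthetic matrix (the data and shapes here are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

# Small synthetic matrix: 10 samples, 3 features (illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))

pca = PCA(n_components=2)   # keep the top 2 principal components
pca.fit(X)                  # learn the projection from the data

X_proj = pca.transform(X)   # project onto the 2-D subspace
print(X_proj.shape)         # (10, 2)

# eigenvalues of the covariance matrix and the component directions
print(pca.explained_variance_)  # shape (2,)
print(pca.components_)          # shape (2, 3)

# the learned projection can be reused on new data again and again
X_new = rng.normal(size=(4, 3))
print(pca.transform(X_new).shape)  # (4, 2)
```

Note that `fit` is called once on the training data and `transform` can then be applied to any data with the same number of features.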
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
data = pd.read_csv('/content/drive/My Drive/kaggle/Feature_Selection_by_Filter_Method/data/santander-train.csv', nrows=20000)
data.head()
x = data.drop('TARGET', axis=1)
y = data['TARGET']
x.shape, y.shape
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, stratify=y)
# remove constant and quasi-constant features
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(x_train)
x_train_filter = constant_filter.transform(x_train)
x_test_filter = constant_filter.transform(x_test)
x_train_filter.shape, x_test_filter.shape
# remove duplicated features
x_train_T = x_train_filter.T
x_test_T = x_test_filter.T
x_train_T = pd.DataFrame(x_train_T)
x_test_T = pd.DataFrame(x_test_T)
x_train_T.duplicated().sum()
duplicated_features = x_train_T.duplicated()
features_to_keep = [not index for index in duplicated_features]
x_train_unique = x_train_T[features_to_keep].T
x_test_unique = x_test_T[features_to_keep].T
# standardization
scaler = StandardScaler().fit(x_train_unique)
x_train_unique = scaler.transform(x_train_unique)
x_test_unique = scaler.transform(x_test_unique)
x_train_unique = pd.DataFrame(x_train_unique)
x_test_unique = pd.DataFrame(x_test_unique)
x_train_unique.shape, x_test_unique.shape
corrmat = x_train_unique.corr()
# find correlated features
def get_correlation(data, threshold):
    corr_col = set()
    corrmat = data.corr()
    for i in range(len(corrmat.columns)):
        for j in range(i):
            if abs(corrmat.iloc[i, j]) > threshold:
                colname = corrmat.columns[i]
                corr_col.add(colname)
    return corr_col
corr_features = get_correlation(x_train_unique, 0.70)
print('correlated features: ', len(set(corr_features)))
x_train_uncorr = x_train_unique.drop(labels=corr_features, axis=1)
x_test_uncorr = x_test_unique.drop(labels=corr_features, axis=1)
x_train_uncorr.shape, x_test_uncorr.shape
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)  # the data is projected onto (number of classes - 1) dimensions; this is a 2-class problem, so 1 dimension
# Note: LDA reduces dimensionality based on the number of classes; the target space (n_components) has at most (number of classes - 1) dimensions.
x_train_lda = lda.fit_transform(x_train_uncorr, y_train)
x_train_lda.shape
x_test_lda = lda.transform(x_test_uncorr)
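Because LDA can keep at most (number of classes − 1) components, asking for more raises an error in scikit-learn. A quick check with a toy 3-class problem (synthetic data, illustrative names):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 3-class toy problem: LDA can keep at most (n_classes - 1) = 2 components
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 5))
y = np.repeat([0, 1, 2], 30)

lda_toy = LinearDiscriminantAnalysis(n_components=2)
X_lda_toy = lda_toy.fit_transform(X, y)
print(X_lda_toy.shape)  # (90, 2)

# Requesting n_components >= n_classes fails
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
except ValueError as e:
    print(e)
```

This is the constraint that forces `n_components=1` for the binary TARGET above.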
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on test set: ')
    print(accuracy_score(y_test, y_pred))
%%time
run_randomForest(x_train_lda, x_test_lda, y_train, y_test)
%%time
run_randomForest(x_train, x_test, y_train, y_test)
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
# Note: PCA reduces the feature space to the number of dimensions you specify, so n_components sets the dimensionality of the projection.
# Here it is set to 2 dimensions.
pca.fit(x_train_uncorr)  # fit on the training data (not the test data) to avoid leakage
x_train_pca = pca.transform(x_train_uncorr)
x_test_pca = pca.transform(x_test_uncorr)
x_train_pca.shape, x_train_uncorr.shape
# PCA reduced the dataset from 79 dimensions to 2.
%%time
run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
# Next, setting the number of dimensions to 3 gives the following score.
pca = PCA(n_components=3, random_state=42)
pca.fit(x_train_uncorr)  # fit on the training data, as above
x_train_pca = pca.transform(x_train_uncorr)
x_test_pca = pca.transform(x_test_uncorr)
x_train_pca.shape, x_train_uncorr.shape
# PCA reduced the dataset from 79 dimensions to 3.
%%time
run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
# The runtime is almost unchanged, and the score improved slightly.
As shown above, with PCA the best number of target dimensions is not known in advance, so the for loop below searches for the number of components that gives the highest score.
for component in range(1, 79):
    pca = PCA(n_components=component, random_state=42)
    pca.fit(x_train_uncorr)  # fit on the training data to avoid leakage
    x_train_pca = pca.transform(x_train_uncorr)
    x_test_pca = pca.transform(x_test_uncorr)
    print('Selected Comp: ', component)
    run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
    print()
Result: projecting from 78 dimensions down to 3 gave the best score.