The idea behind LDA is simple: mathematically, we need to find a new feature space to project the data into in order to maximize class separability.
Linear Discriminant Analysis is a supervised algorithm, as it takes the class labels into consideration. It is a way to reduce dimensionality while preserving as much of the class-discrimination information as possible.
Basically, LDA finds the centroid of each class's data points. For example, with thirteen features, LDA computes each class centroid in that thirteen-dimensional space. On this basis, it determines a new dimension, which is nothing but an axis, that should satisfy two criteria: maximize the distance between the class centroids, and minimize the variation (scatter) within each class.
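The two criteria above can be made concrete with a small sketch. The code below uses synthetic two-class data (not the dataset used later in this notebook) to compute the class centroids, the within-class scatter, and the classic two-class LDA direction `w ∝ S_w⁻¹(m₂ − m₁)`; projecting onto `w` separates the two classes along a single axis. All variable names here are illustrative.

```python
import numpy as np

# Toy 2-class data with two features (synthetic, for illustration only)
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3.0, 1.0], scale=0.5, size=(50, 2))

# Criterion 1: the class centroids, whose separation LDA maximizes
centroid_a = class_a.mean(axis=0)
centroid_b = class_b.mean(axis=0)

# Criterion 2: the within-class scatter, which LDA minimizes
s_w = np.cov(class_a, rowvar=False) + np.cov(class_b, rowvar=False)

# Two-class LDA direction: w is proportional to S_w^{-1} (m2 - m1)
w = np.linalg.solve(s_w, centroid_b - centroid_a)

# Projecting onto w collapses the data to one axis with the classes apart
proj_a = class_a @ w
proj_b = class_b @ w
print(proj_a.mean(), proj_b.mean())
```

With well-separated centroids and small within-class scatter, the projected class means end up far apart relative to the projected spread, which is exactly the ratio LDA optimizes.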
Principal Component Analysis (PCA) is a linear dimensionality-reduction technique that extracts information from a high-dimensional space by projecting it onto a lower-dimensional subspace. It tries to preserve the essential parts of the data, those with more variation, and remove the non-essential parts with less variation.
We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.
When creating the class, the number of components can be specified as a parameter.
The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function.
Once fit, the eigenvalues and principal components can be accessed on the PCA class via the explained_variance_ and components_ attributes.
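The fit/transform workflow described above can be sketched on a small synthetic matrix (the data and shapes here are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

# Small synthetic matrix: 10 samples, 3 features (illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))

pca = PCA(n_components=2)   # keep the top 2 principal components
pca.fit(X)                  # learn the projection from the data

X_proj = pca.transform(X)   # project onto the 2-D subspace
print(X_proj.shape)         # (10, 2)

# eigenvalues of the covariance matrix and the component directions
print(pca.explained_variance_)  # shape (2,)
print(pca.components_)          # shape (2, 3)

# the learned projection can be reused on new data again and again
X_new = rng.normal(size=(4, 3))
print(pca.transform(X_new).shape)  # (4, 2)
```

Note that `fit` is called once on the training data and `transform` can then be applied to any data with the same number of features.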
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
data = pd.read_csv('/content/drive/My Drive/kaggle/Feature_Selection_by_Filter_Method/data/santander-train.csv', nrows=20000)
data.head()
x = data.drop('TARGET', axis=1)
y = data['TARGET']
x.shape, y.shape
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, stratify=y)
# remove constant and quasi-constant features
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(x_train)
x_train_filter = constant_filter.transform(x_train)
x_test_filter = constant_filter.transform(x_test)
x_train_filter.shape, x_test_filter.shape
# remove duplicated features
x_train_T = x_train_filter.T
x_test_T = x_test_filter.T
x_train_T = pd.DataFrame(x_train_T)
x_test_T = pd.DataFrame(x_test_T)
x_train_T.duplicated().sum()
duplicated_features = x_train_T.duplicated()
features_to_keep = [not index for index in duplicated_features]
x_train_unique = x_train_T[features_to_keep].T
x_test_unique = x_test_T[features_to_keep].T
# standardization
scaler = StandardScaler().fit(x_train_unique)
x_train_unique = scaler.transform(x_train_unique)
x_test_unique = scaler.transform(x_test_unique)
x_train_unique = pd.DataFrame(x_train_unique)
x_test_unique = pd.DataFrame(x_test_unique)
x_train_unique.shape, x_test_unique.shape
corrmat = x_train_unique.corr()
# find correlated features
def get_correlation(data, threshold):
    corr_col = set()
    corrmat = data.corr()
    for i in range(len(corrmat.columns)):
        for j in range(i):
            if abs(corrmat.iloc[i, j]) > threshold:
                colname = corrmat.columns[i]
                corr_col.add(colname)
    return corr_col
corr_features = get_correlation(x_train_unique, 0.70)
print('correlated features: ', len(set(corr_features)))
x_train_uncorr = x_train_unique.drop(labels=corr_features, axis=1)
x_test_uncorr = x_test_unique.drop(labels=corr_features, axis=1)
x_train_uncorr.shape, x_test_uncorr.shape
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)  # the data is projected onto (number of classes - 1) dimensions; this is a 2-class problem, so 1 dimension
# Note: LDA reduces dimensionality based on the number of classes; the target space (n_components) has at most (number of classes - 1) dimensions.
x_train_lda = lda.fit_transform(x_train_uncorr, y_train)
x_train_lda.shape
x_test_lda = lda.transform(x_test_uncorr)
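Because LDA can keep at most (number of classes − 1) components, asking for more raises an error in scikit-learn. A quick check with a toy 3-class problem (synthetic data, illustrative names):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 3-class toy problem: LDA can keep at most (n_classes - 1) = 2 components
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 5))
y = np.repeat([0, 1, 2], 30)

lda_toy = LinearDiscriminantAnalysis(n_components=2)
X_lda_toy = lda_toy.fit_transform(X, y)
print(X_lda_toy.shape)  # (90, 2)

# Requesting n_components >= n_classes fails
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
except ValueError as e:
    print(e)
```

This is the constraint that forces `n_components=1` for the binary TARGET above.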
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on test set: ')
    print(accuracy_score(y_test, y_pred))
%%time
run_randomForest(x_train_lda, x_test_lda, y_train, y_test)
%%time
run_randomForest(x_train, x_test, y_train, y_test)
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
# Note: PCA reduces the feature space to the number of dimensions you specify, so n_components sets the dimensionality of the projection.
# Here it is set to 2 dimensions.
pca.fit(x_train_uncorr)  # fit on the training data (not the test data) to avoid leakage
x_train_pca = pca.transform(x_train_uncorr)
x_test_pca = pca.transform(x_test_uncorr)
x_train_pca.shape, x_train_uncorr.shape
# PCA reduced the dataset from 79 dimensions to 2.
%%time
run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
# Next, setting the number of dimensions to 3 gives the following score.
pca = PCA(n_components=3, random_state=42)
pca.fit(x_train_uncorr)  # fit on the training data, as above
x_train_pca = pca.transform(x_train_uncorr)
x_test_pca = pca.transform(x_test_uncorr)
x_train_pca.shape, x_train_uncorr.shape
# PCA reduced the dataset from 79 dimensions to 3.
%%time
run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
# The runtime is almost unchanged, and the score improved slightly.
As shown above, with PCA the best number of target dimensions is not known in advance, so the for loop below searches for the number of components that gives the highest score.
for component in range(1, 79):
    pca = PCA(n_components=component, random_state=42)
    pca.fit(x_train_uncorr)  # fit on the training data to avoid leakage
    x_train_pca = pca.transform(x_train_uncorr)
    x_test_pca = pca.transform(x_test_uncorr)
    print('Selected Comp: ', component)
    run_randomForest(x_train_pca, x_test_pca, y_train, y_test)
    print()
Result: projecting from 78 dimensions down to 3 gave the best score.