Use of mlxtend in Wrapper Methods

https://github.com/rasbt/mlxtend

In [0]:
!pip install mlxtend

How it works

Sequential feature selection algorithms are a family of greedy search algorithms used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d.

In a nutshell, SFAs remove or add one feature at a time based on classifier performance until a feature subset of the desired size k is reached. There are four different flavors of SFAs available via the SequentialFeatureSelector:

  • Sequential Forward Selection (SFS)
  • Sequential Backward Selection (SBS)
  • Sequential Forward Floating Selection (SFFS)
  • Sequential Backward Floating Selection (SBFS)
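The greedy add-one-at-a-time loop behind forward selection can be sketched in a few lines of plain Python. This is a minimal illustration with a hypothetical toy scoring function, not the mlxtend implementation:

```python
# Minimal sketch of greedy sequential forward selection (SFS).
def sequential_forward_selection(n_features, k, score):
    """Greedily add the feature that most improves `score` until k are chosen."""
    selected = []
    remaining = set(range(n_features))
    while len(selected) < k:
        # Evaluate adding each remaining feature; keep the best one.
        best_feat = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected

# Toy score for illustration: subsets containing features 2 and 5 score
# highest, with a small penalty per extra feature.
toy_score = lambda subset: len({2, 5} & set(subset)) - 0.01 * len(subset)
print(sequential_forward_selection(8, 3, toy_score))
```

In the real selector, `score` is the cross-validated classifier accuracy; backward selection (SBS) runs the same loop in reverse, removing the least useful feature each step.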

Step Forward Selection (SFS)

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [0]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
In [0]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
In [0]:
data = load_wine()
In [6]:
data.keys()
Out[6]:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
In [7]:
print(data.DESCR)
.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.

Original Owners: 

Forina, M. et al, PARVUS - 
An Extendible Package for Data Exploration, Classification and Correlation. 
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science. 

.. topic:: References

  (1) S. Aeberhard, D. Coomans and O. de Vel, 
  Comparison of Classifiers in High Dimensional Settings, 
  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Technometrics). 

  The data was used with many others for comparing various 
  classifiers. The classes are separable, though only RDA 
  has achieved 100% correct classification. 
  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
  (All results using the leave-one-out technique) 

  (2) S. Aeberhard, D. Coomans and O. de Vel, 
  "THE CLASSIFICATION PERFORMANCE OF RDA" 
  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of 
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Journal of Chemometrics).

In [0]:
x = pd.DataFrame(data.data)
y = data.target
In [10]:
x.columns = data.feature_names
x.head()
Out[10]:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
In [11]:
x.isnull().sum()
Out[11]:
alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64
In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train.shape, x_test.shape
Out[12]:
((142, 13), (36, 13))

Step Forward Feature Selection (SFS)

In [16]:
# Select 7 features (the fitted properties keep the result of selecting exactly 7,
# so the 7-feature subset is not necessarily the best overall)
sfs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          k_features = 7,
          forward = True,
          floating = False,
          verbose = 2,
          scoring = 'accuracy',
          cv = 4,
          n_jobs = -1
        ).fit(x_train, y_train)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:   10.7s finished

[2020-04-24 05:35:51] Features: 1/7 -- score: 0.7674603174603174[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    8.7s finished

[2020-04-24 05:36:00] Features: 2/7 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    8.5s finished

[2020-04-24 05:36:09] Features: 3/7 -- score: 0.9859126984126985[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    7.3s finished

[2020-04-24 05:36:16] Features: 4/7 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    7.2s finished

[2020-04-24 05:36:23] Features: 5/7 -- score: 0.9720238095238095[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    5.8s finished

[2020-04-24 05:36:29] Features: 6/7 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:    5.7s finished

[2020-04-24 05:36:35] Features: 7/7 -- score: 0.9791666666666666
In [17]:
# The 7 selected features
sfs.k_feature_names_
Out[17]:
('alcohol',
 'ash',
 'magnesium',
 'flavanoids',
 'proanthocyanins',
 'color_intensity',
 'proline')
In [18]:
# Column indices of the 7 selected features
sfs.k_feature_idx_
Out[18]:
(0, 2, 4, 6, 8, 9, 12)
In [19]:
# Score
sfs.k_score_
Out[19]:
0.9791666666666666
In [21]:
# Check how the score changes as features are added
pd.DataFrame.from_dict(sfs.get_metric_dict()).T
Out[21]:
feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err
1 (6,) [0.7222222222222222, 0.8333333333333334, 0.742... 0.76746 (flavanoids,) 0.0670901 0.0418533 0.024164
2 (6, 9) [0.9444444444444444, 1.0, 0.9714285714285714, ... 0.971825 (flavanoids, color_intensity) 0.031492 0.0196459 0.0113425
3 (4, 6, 9) [0.9722222222222222, 1.0, 0.9714285714285714, ... 0.985913 (magnesium, flavanoids, color_intensity) 0.0225862 0.0140901 0.00813492
4 (4, 6, 9, 12) [0.9722222222222222, 0.9722222222222222, 0.971... 0.978968 (magnesium, flavanoids, color_intensity, proline) 0.0194714 0.012147 0.00701308
5 (2, 4, 6, 9, 12) [0.9444444444444444, 0.9722222222222222, 0.971... 0.972024 (ash, magnesium, flavanoids, color_intensity, ... 0.0314903 0.0196449 0.011342
6 (2, 4, 6, 8, 9, 12) [0.9722222222222222, 0.9722222222222222, 0.971... 0.978968 (ash, magnesium, flavanoids, proanthocyanins, ... 0.0194714 0.012147 0.00701308
7 (0, 2, 4, 6, 8, 9, 12) [0.9444444444444444, 0.9722222222222222, 1.0, ... 0.979167 (alcohol, ash, magnesium, flavanoids, proantho... 0.0369201 0.0230321 0.0132976

From the results above, we can pick the feature subset with the highest score and build the prediction model with it (here, the 3-feature subset magnesium, flavanoids, color_intensity scores best).
With this approach we have to read off the highest-scoring subset manually; with the approach below, the best combination is stored in the k_feature_names_ property.

In [33]:
# Search subsets of 1 to 8 features and keep the combination with the highest score.
sfs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          k_features = (1,8),
          forward = True,
          floating = False,
          verbose = 2,
          scoring = 'accuracy',
          cv = 4,
          n_jobs = -1
        ).fit(x_train, y_train) 
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:   10.8s finished

[2020-04-24 06:07:55] Features: 1/8 -- score: 0.7674603174603174[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    8.8s finished

[2020-04-24 06:08:04] Features: 2/8 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    8.5s finished

[2020-04-24 06:08:12] Features: 3/8 -- score: 0.9859126984126985[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    7.4s finished

[2020-04-24 06:08:20] Features: 4/8 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    7.1s finished

[2020-04-24 06:08:27] Features: 5/8 -- score: 0.9720238095238095[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    6.0s finished

[2020-04-24 06:08:33] Features: 6/8 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:    5.7s finished

[2020-04-24 06:08:38] Features: 7/8 -- score: 0.9791666666666666[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    4.5s finished

[2020-04-24 06:08:43] Features: 8/8 -- score: 0.9791666666666666
In [23]:
sfs.k_score_
Out[23]:
0.9859126984126985
In [24]:
sfs.k_feature_names_
Out[24]:
('magnesium', 'flavanoids', 'color_intensity')

As shown above, the highest-scoring combination was extracted.

Step Backward Selection (SBS)

In [39]:
# Search subsets from 8 features down to 1 and keep the combination with the highest score.
sbs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          k_features = (1, 8),
          forward = False,
          floating = False,
          verbose = 2,
          scoring = 'accuracy',
          cv = 4,
          n_jobs = -1
          ).fit(x_train.values, y_train)

# Note: with step backward selection (forward=False), passing a pd.DataFrame raises an error,
# so pass x_train.values instead. Details: https://github.com/rasbt/mlxtend/issues/505
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:   10.0s finished

[2020-04-24 06:12:18] Features: 12/1 -- score: 0.9861111111111112[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    8.8s finished

[2020-04-24 06:12:26] Features: 11/1 -- score: 0.9861111111111112[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    8.6s finished

[2020-04-24 06:12:35] Features: 10/1 -- score: 0.9791666666666666[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    7.3s finished

[2020-04-24 06:12:42] Features: 9/1 -- score: 0.9861111111111112[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    7.2s finished

[2020-04-24 06:12:49] Features: 8/1 -- score: 0.9859126984126985[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    5.9s finished

[2020-04-24 06:12:55] Features: 7/1 -- score: 0.978968253968254[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:    5.7s finished

[2020-04-24 06:13:01] Features: 6/1 -- score: 0.9859126984126985[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    4.6s finished

[2020-04-24 06:13:06] Features: 5/1 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    4.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    4.2s finished

[2020-04-24 06:13:10] Features: 4/1 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    3.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    3.0s finished

[2020-04-24 06:13:13] Features: 3/1 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    2.9s finished

[2020-04-24 06:13:16] Features: 2/1 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    1.5s finished

[2020-04-24 06:13:17] Features: 1/1 -- score: 0.7674603174603174
In [40]:
sbs.k_score_
Out[40]:
0.9859126984126985
In [41]:
sbs.k_feature_names_
# Since the DataFrame's values were passed, positional indices are returned instead of names
Out[41]:
('0', '1', '2', '3', '4', '6', '7', '9')
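To recover the original column names, map `k_feature_idx_` back through the DataFrame's columns. A tiny stand-alone demo of that mapping, using a toy column list and indices for illustration:

```python
import pandas as pd

# Toy stand-ins for the notebook's `x` DataFrame and `sbs.k_feature_idx_`.
x_demo = pd.DataFrame(columns=['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash'])
idx = (0, 2, 3)

# Map positional feature indices back to column names.
names = [x_demo.columns[i] for i in idx]
print(names)  # ['alcohol', 'ash', 'alcalinity_of_ash']
```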

Exhaustive Feature Selection (EFS)

In [0]:
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
In [47]:
# Search with a minimum of 4 and a maximum of 5 features
efs = EFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          min_features = 4,
          max_features = 5,
          scoring = 'accuracy',
          cv= None,
          n_jobs = -1
        ).fit(x_train, y_train)
Features: 2002/2002

In this case, the search covers every combination of 4 out of 13 features plus every combination of 5 out of 13, i.e. 2002 candidate subsets in total, so the computational cost is large and it takes a long time.

In [46]:
# Number of combinations: choose 4 out of 13 plus choose 5 out of 13
from scipy.special import comb
comb(13, 4, exact=True) + comb(13, 5, exact=True)
Out[46]:
2002
In [48]:
efs.best_score_
Out[48]:
1.0
In [49]:
efs.best_feature_names_
Out[49]:
('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash')
In [50]:
efs.best_idx_
Out[50]:
(0, 1, 2, 3)
In [0]:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
In [52]:
plot_sfs(efs.get_metric_dict(), kind='std_dev')
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:217: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Out[52]:
(plot of average score per feature subset; the RuntimeWarnings above appear because cv=None yields a single score per subset, so the standard deviation for kind='std_dev' is undefined)
In [0]: