Use of mlxtend in Wrapper Methods¶

https://github.com/rasbt/mlxtend

!pip install mlxtend

How it works¶

Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d.

In a nutshell, SFAs remove or add one feature at a time based on classifier performance until a feature subset of the desired size k is reached. Four flavors of SFA are available via the SequentialFeatureSelector:

Sequential Forward Selection (SFS)
Sequential Backward Selection (SBS)
Sequential Forward Floating Selection (SFFS)
Sequential Backward Floating Selection (SBFS)
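
The greedy procedure described above can be sketched in plain Python. This is a minimal illustration of forward selection only, not mlxtend's implementation: `forward_select` and `toy_score` are hypothetical names introduced here, and the toy scorer stands in for cross-validated classifier performance.

```python
def forward_select(n_features, k, score):
    """Greedy forward selection: grow the subset one feature at a time.

    `score` is any callable mapping a tuple of feature indices to a number
    (higher is better). In practice it would wrap cross-validated model
    performance; here it is left abstract.
    """
    selected = ()
    history = {}
    while len(selected) < k:
        remaining = [f for f in range(n_features) if f not in selected]
        # Evaluate every subset obtained by adding one remaining feature.
        candidates = [tuple(sorted(selected + (f,))) for f in remaining]
        # Keep the best candidate; ties go to the first one encountered.
        selected = max(candidates, key=score)
        history[len(selected)] = (selected, score(selected))
    return selected, history


def toy_score(subset):
    # Hypothetical scorer: features 1 and 3 help, every other feature hurts a little.
    useful = {1: 50, 3: 30}
    return sum(useful.get(f, -1) for f in subset)


subset, history = forward_select(n_features=5, k=3, score=toy_score)
print(subset)
print(history)
```

In this toy run the score peaks at two features and drops when a third is forced in, which is why a fixed target size k does not necessarily give the best-scoring subset. Backward selection would start from all features and remove one per step instead.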

Step Forward Selection (SFS)¶

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

data = load_wine()

data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

print(data.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.

Original Owners: 

Forina, M. et al, PARVUS - 
An Extendible Package for Data Exploration, Classification and Correlation. 
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science. 

.. topic:: References

  (1) S. Aeberhard, D. Coomans and O. de Vel, 
  Comparison of Classifiers in High Dimensional Settings, 
  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Technometrics). 

  The data was used with many others for comparing various 
  classifiers. The classes are separable, though only RDA 
  has achieved 100% correct classification. 
  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
  (All results using the leave-one-out technique) 

  (2) S. Aeberhard, D. Coomans and O. de Vel, 
  "THE CLASSIFICATION PERFORMANCE OF RDA" 
  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of 
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Journal of Chemometrics).

x = pd.DataFrame(data.data)
y = data.target

x.columns = data.feature_names
x.head()

x.isnull().sum()

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train.shape, x_test.shape

((142, 13), (36, 13))
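
The shapes above follow from simple split arithmetic. As a quick check, assuming the test count is rounded up and the remaining rows go to the training set (which matches the output shown):

```python
import math

n_samples = 178   # rows in the wine dataset
test_size = 0.2   # fraction passed to train_test_split

n_test = math.ceil(n_samples * test_size)   # 35.6 rounds up to 36
n_train = n_samples - n_test                # the remaining rows

print(n_train, n_test)
```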

Step Forward Feature Selection (SFS)¶

# Select 7 features (the attributes hold the result for exactly 7 features, so this 7-feature subset is not necessarily the best-scoring one)
sfs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          k_features = 7,
          forward = True,
          floating = False,
          verbose = 2,
          scoring = 'accuracy',
          cv = 4,
          n_jobs = -1
        ).fit(x_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:   10.7s finished

[2020-04-24 05:35:51] Features: 1/7 -- score: 0.7674603174603174[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    8.7s finished

[2020-04-24 05:36:00] Features: 2/7 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    8.5s finished

[2020-04-24 05:36:09] Features: 3/7 -- score: 0.9859126984126985[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    7.3s finished

[2020-04-24 05:36:16] Features: 4/7 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    7.2s finished

[2020-04-24 05:36:23] Features: 5/7 -- score: 0.9720238095238095[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    5.8s finished

[2020-04-24 05:36:29] Features: 6/7 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:    5.7s finished

[2020-04-24 05:36:35] Features: 7/7 -- score: 0.9791666666666666
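
The `Done 13 out of 13`, `Done 12 out of 12`, ... lines in the log suggest that each step scores one candidate subset per feature not yet selected. Under that reading (an interpretation of the log, not something the notebook states), the work done by this run can be tallied as follows; the fold count comes from `cv = 4` above.

```python
import math

d = 13   # total number of features
k = 7    # target subset size (k_features)
cv = 4   # cross-validation folds per candidate

# Step i (i = 0..k-1) evaluates one candidate per feature not yet selected.
candidates_per_step = [d - i for i in range(k)]
total_candidates = sum(candidates_per_step)
total_fits = total_candidates * cv

print(candidates_per_step)   # candidates scored at each step
print(total_candidates)      # candidate subsets scored overall
print(total_fits)            # model fits, one per fold per candidate
print(math.comb(d, k))       # number of 7-feature subsets that exist in total
```

The greedy search scores 70 candidate subsets here, whereas 1716 distinct 7-feature subsets exist, which is the trade-off behind the sequential approach.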

# The 7 selected features
sfs.k_feature_names_

('alcohol',
 'ash',
 'magnesium',
 'flavanoids',
 'proanthocyanins',
 'color_intensity',
 'proline')

# Column indices of the 7 selected features
sfs.k_feature_idx_

(0, 2, 4, 6, 8, 9, 12)

# Score
sfs.k_score_

0.9791666666666666

# Check how the score changes as features are added
pd.DataFrame.from_dict(sfs.get_metric_dict()).T

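From the results above, we can pick the feature subset with the highest score and build the predictive model on it.
(Here the score is best with 3 features: magnesium, flavanoids, color_intensity.)
With the approach above, the best-scoring subset has to be read off manually. With the approach below, the best combination is stored in the k_feature_names_ attribute.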
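
Reading the best subset off the metric table by hand can also be done programmatically. The sketch below uses the `feature_idx` and `avg_score` values reported for this run; the dictionary is typed in by hand for illustration, and in the notebook one would iterate over `sfs.get_metric_dict()` instead.

```python
# number of features -> (feature_idx, avg_score), copied from the metric table
metric = {
    1: ((6,), 0.76746),
    2: ((6, 9), 0.971825),
    3: ((4, 6, 9), 0.985913),
    4: ((4, 6, 9, 12), 0.978968),
    5: ((2, 4, 6, 9, 12), 0.972024),
    6: ((2, 4, 6, 8, 9, 12), 0.978968),
    7: ((0, 2, 4, 6, 8, 9, 12), 0.979167),
}

# Pick the subset size whose average cross-validation score is highest.
best_k = max(metric, key=lambda k: metric[k][1])
best_idx, best_score = metric[best_k]

print(best_k, best_idx, best_score)
```

Indices 4, 6 and 9 correspond to magnesium, flavanoids and color_intensity.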

# Find the best-scoring combination among subsets of 1 to 8 features.
sfs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          k_features = (1,8),
          forward = True,
          floating = False,
          verbose = 2,
          scoring = 'accuracy',
          cv = 4,
          n_jobs = -1
        ).fit(x_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:   10.8s finished

[2020-04-24 06:07:55] Features: 1/8 -- score: 0.7674603174603174[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    8.8s finished

[2020-04-24 06:08:04] Features: 2/8 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    8.5s finished

[2020-04-24 06:08:12] Features: 3/8 -- score: 0.9859126984126985[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    7.4s finished

[2020-04-24 06:08:20] Features: 4/8 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    7.1s finished

[2020-04-24 06:08:27] Features: 5/8 -- score: 0.9720238095238095[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    6.0s finished

[2020-04-24 06:08:33] Features: 6/8 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:    5.7s finished

[2020-04-24 06:08:38] Features: 7/8 -- score: 0.9791666666666666[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    4.5s finished

[2020-04-24 06:08:43] Features: 8/8 -- score: 0.9791666666666666

sfs.k_score_

0.9859126984126985

sfs.k_feature_names_

('magnesium', 'flavanoids', 'color_intensity')

As shown above, the highest-scoring combination was extracted.

Step Backward Selection (SBS)¶

# Find the best-scoring combination among subsets of 8 down to 1 features.
sbs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          k_features = (1, 8),
          forward = False,
          floating = False,
          verbose = 2,
          scoring = 'accuracy',
          cv = 4,
          n_jobs = -1
          ).fit(x_train.values, y_train)

# Note: with step backward (forward = False), passing a pd.DataFrame raises an error, so pass x_train.values instead.
# Error details: https://github.com/rasbt/mlxtend/issues/505

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:   10.0s finished

[2020-04-24 06:12:18] Features: 12/1 -- score: 0.9861111111111112[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    8.8s finished

[2020-04-24 06:12:26] Features: 11/1 -- score: 0.9861111111111112[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    8.6s finished

[2020-04-24 06:12:35] Features: 10/1 -- score: 0.9791666666666666[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    7.3s finished

[2020-04-24 06:12:42] Features: 9/1 -- score: 0.9861111111111112[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    7.2s finished

[2020-04-24 06:12:49] Features: 8/1 -- score: 0.9859126984126985[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    5.9s finished

[2020-04-24 06:12:55] Features: 7/1 -- score: 0.978968253968254[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:    5.7s finished

[2020-04-24 06:13:01] Features: 6/1 -- score: 0.9859126984126985[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    4.6s finished

[2020-04-24 06:13:06] Features: 5/1 -- score: 0.9789682539682539[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    4.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    4.2s finished

[2020-04-24 06:13:10] Features: 4/1 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    3.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    3.0s finished

[2020-04-24 06:13:13] Features: 3/1 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    2.9s finished

[2020-04-24 06:13:16] Features: 2/1 -- score: 0.9718253968253968[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    1.5s finished

[2020-04-24 06:13:17] Features: 1/1 -- score: 0.7674603174603174

sbs.k_score_

0.9859126984126985
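
Note that the log contains higher scores (0.9861 at 12, 11 and 9 features) than the reported `k_score_`; those subset sizes fall outside the requested range `(1, 8)`. The sketch below replays the logged scores to show the range-restricted choice. The tie-breaking rule (keep the first maximum in search order) is an assumption inferred from the eight-feature result, not documented behaviour.

```python
# Scores in the order the backward search logged them: n_features -> score
logged = {
    12: 0.9861111111111112,
    11: 0.9861111111111112,
    10: 0.9791666666666666,
    9: 0.9861111111111112,
    8: 0.9859126984126985,
    7: 0.978968253968254,
    6: 0.9859126984126985,
    5: 0.9789682539682539,
    4: 0.9718253968253968,
    3: 0.9718253968253968,
    2: 0.9718253968253968,
    1: 0.7674603174603174,
}
min_k, max_k = 1, 8   # the k_features range passed to the selector

best_k, best_score = None, float("-inf")
for k, score in logged.items():
    if not (min_k <= k <= max_k):
        continue                  # skip subset sizes outside the range
    if score > best_score:        # strict comparison keeps the first maximum
        best_k, best_score = k, score

print(best_k, best_score)
```

Sizes 8 and 6 tie on score in this run; under the assumed rule the 8-feature subset is kept because the backward search reaches it first.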

sbs.k_feature_names_
# Because the DataFrame's values (a NumPy array) were passed, column indices are shown instead of names

('0', '1', '2', '3', '4', '6', '7', '9')
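
Since the selector returns indices as strings here, they can be mapped back to column names by hand. A minimal sketch, with the feature list typed in from the dataset description above (in the notebook, `data.feature_names` or `x_train.columns` would serve the same purpose):

```python
# Column names in dataset order (same order as data.feature_names)
feature_names = [
    'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
    'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
    'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline',
]

# The string indices reported by sbs.k_feature_names_
selected = ('0', '1', '2', '3', '4', '6', '7', '9')

# Convert each string index to an int and look up the column name.
selected_names = tuple(feature_names[int(i)] for i in selected)
print(selected_names)
```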

Exhaustive Feature Selection (EFS)¶

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

# Search with a minimum of 4 and a maximum of 5 features
efs = EFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          min_features = 4,
          max_features = 5,
          scoring = 'accuracy',
          cv= None,
          n_jobs = -1
        ).fit(x_train, y_train)

Features: 2002/2002

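In this case, the search covers every way of choosing 4 of the 13 features and every way of choosing 5 of the 13, so 2002 combinations are evaluated in total. The computational cost is therefore very high and the run takes a long time.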

# Number of ways to choose 4 of 13 plus the number of ways to choose 5 of 13
from scipy.special import comb
comb(13, 4, exact=True) + comb(13, 5, exact=True)

2002
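
To make the cost concrete, the sketch below enumerates the same 2002 candidates with `itertools.combinations` and keeps the best one. It illustrates the exhaustive idea only and is not mlxtend's implementation: `exhaustive_select` and `toy_score` are hypothetical names, and the scorer is a stand-in for model performance.

```python
from itertools import combinations


def exhaustive_select(n_features, min_features, max_features, score):
    """Score every subset whose size lies in [min_features, max_features]."""
    best_subset, best_score, n_evaluated = None, float("-inf"), 0
    for size in range(min_features, max_features + 1):
        for subset in combinations(range(n_features), size):
            n_evaluated += 1
            s = score(subset)
            if s > best_score:   # strict comparison keeps the first maximum
                best_subset, best_score = subset, s
    return best_subset, best_score, n_evaluated


def toy_score(subset):
    # Hypothetical scorer: features 2, 5, 7 and 11 help, all others hurt slightly.
    return sum(10 if f in {2, 5, 7, 11} else -1 for f in subset)


best_subset, best_score, n_evaluated = exhaustive_select(13, 4, 5, toy_score)
print(best_subset, best_score, n_evaluated)
```

Unlike the greedy search, every candidate is scored, so the result is optimal for the given scorer at the price of 2002 evaluations.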

efs.best_score_

1.0

efs.best_feature_names_

('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash')

efs.best_idx_

(0, 1, 2, 3)
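
A detail worth noting about this result: `(0, 1, 2, 3)` is also the very first 4-feature combination in enumeration order. Together with `cv = None` (which, per the mlxtend documentation, disables cross-validation so that scores are computed on the data passed to `fit`), this suggests that the score of 1.0 reflects training accuracy and that the first subset to reach it was kept. This is an inference from the output rather than something the notebook verifies; rerunning with an integer `cv` would be one way to check.

```python
from itertools import combinations

# First 4-feature subset of 13 features in lexicographic enumeration order
first_subset = next(combinations(range(13), 4))
print(first_subset)

# The subset reported by efs.best_idx_ in this run
reported_best = (0, 1, 2, 3)
print(first_subset == reported_best)
```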

from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

plot_sfs(efs.get_metric_dict(), kind='std_dev')

/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:217: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

Output of x.head():

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0

Output of pd.DataFrame.from_dict(sfs.get_metric_dict()).T (forward selection with k_features = 7):

	feature_idx	cv_scores	avg_score	feature_names	ci_bound	std_dev	std_err
1	(6,)	[0.7222222222222222, 0.8333333333333334, 0.742...	0.76746	(flavanoids,)	0.0670901	0.0418533	0.024164
2	(6, 9)	[0.9444444444444444, 1.0, 0.9714285714285714, ...	0.971825	(flavanoids, color_intensity)	0.031492	0.0196459	0.0113425
3	(4, 6, 9)	[0.9722222222222222, 1.0, 0.9714285714285714, ...	0.985913	(magnesium, flavanoids, color_intensity)	0.0225862	0.0140901	0.00813492
4	(4, 6, 9, 12)	[0.9722222222222222, 0.9722222222222222, 0.971...	0.978968	(magnesium, flavanoids, color_intensity, proline)	0.0194714	0.012147	0.00701308
5	(2, 4, 6, 9, 12)	[0.9444444444444444, 0.9722222222222222, 0.971...	0.972024	(ash, magnesium, flavanoids, color_intensity, ...	0.0314903	0.0196449	0.011342
6	(2, 4, 6, 8, 9, 12)	[0.9722222222222222, 0.9722222222222222, 0.971...	0.978968	(ash, magnesium, flavanoids, proanthocyanins, ...	0.0194714	0.012147	0.00701308
7	(0, 2, 4, 6, 8, 9, 12)	[0.9444444444444444, 0.9722222222222222, 1.0, ...	0.979167	(alcohol, ash, magnesium, flavanoids, proantho...	0.0369201	0.0230321	0.0132976
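
The `std_dev` and `std_err` columns of the metric table appear to be related by std_err ≈ std_dev / sqrt(cv - 1) with `cv = 4`. This is an observation from the numbers shown, not a statement of how mlxtend defines these columns; the quick check below uses the values copied from the table.

```python
import math

cv = 4   # number of folds used in the forward-selection run

# (std_dev, std_err) pairs copied from rows 1 to 7 of the metric table
rows = [
    (0.0418533, 0.024164),
    (0.0196459, 0.0113425),
    (0.0140901, 0.00813492),
    (0.012147, 0.00701308),
    (0.0196449, 0.011342),
    (0.012147, 0.00701308),
    (0.0230321, 0.0132976),
]

expected_ratio = math.sqrt(cv - 1)
ratios = [std_dev / std_err for std_dev, std_err in rows]

for ratio in ratios:
    # Each ratio should be close to sqrt(3), up to rounding in the table.
    print(round(ratio, 4), round(expected_ratio, 4))
```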