Linear Regression's Basic Assumptions

  • Linear relationship with the target y
  • Features in X should follow a Gaussian distribution
  • Features are not correlated with each other
  • Features are on the same scale, i.e. have the same variance

Lasso (L1) and Ridge (L2) Regularization

Regularization is a technique that discourages model complexity by adding a penalty term to the loss function, which helps to prevent overfitting.

  • L1 regularization (also called Lasso)
  • L2 regularization (also called Ridge)
  • L1/L2 regularization (also called Elastic net)

A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
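As a quick sketch of what the two penalties look like (the weight vector and λ here are made-up values for illustration only):

```python
import numpy as np

# Hypothetical weight vector and regularization strength (illustration only)
w = np.array([0.5, -1.2, 0.0, 3.0])
lam = 0.1

# Lasso (L1) adds lambda * sum(|w|) to the loss
l1_penalty = lam * np.sum(np.abs(w))

# Ridge (L2) adds lambda * sum(w^2) to the loss
l2_penalty = lam * np.sum(w ** 2)

print(l1_penalty, l2_penalty)
```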

About the bias-variance trade-off:
Trying to reduce the bias increases the variance, and trying to reduce the variance increases the bias. So the model must be built at a point where bias and variance are each appropriately balanced; the parameter that tunes this balance is λ.

  • Large λ -> high bias, low variance (underfitting)
  • Small λ -> low bias, high variance (overfitting)
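A minimal sketch of the λ effect (in scikit-learn's Ridge, λ is the `alpha` parameter; the data here is synthetic, for illustration only): a larger alpha shrinks the coefficients toward zero, which is the high-bias / low-variance direction.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustration only): y depends linearly on 5 features
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.0, 2.0, -1.5, 0.5, 3.0]) + 0.1 * rng.randn(100)

# Small lambda (alpha): coefficients stay close to least squares -> low bias, high variance
coef_small = Ridge(alpha=0.01).fit(X, y).coef_
# Large lambda (alpha): coefficients are shrunk toward zero -> high bias, low variance
coef_large = Ridge(alpha=100.0).fit(X, y).coef_

print(np.abs(coef_small).sum(), np.abs(coef_large).sum())
```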

Difference between L1 and L2 regularization

L1 Regularization

  • L1 penalizes the sum of the absolute values of the weights
  • L1 yields a sparse solution
  • L1 can have multiple solutions
  • L1 has built-in feature selection
  • L1 is robust to outliers
  • L1 generates models that are simple and interpretable but cannot learn complex patterns

L2 Regularization

  • L2 penalizes the sum of the squared weights
  • L2 yields a non-sparse solution
  • L2 has a single solution
  • L2 has no feature selection
  • L2 is not robust to outliers
  • L2 gives better predictions when the output variable is a function of all input features
  • L2 regularization is able to learn complex data patterns
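The sparsity difference can be seen directly by fitting Lasso and Ridge side by side (synthetic data, alpha chosen for illustration only): Lasso sets the coefficients of irrelevant features exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustration only): only the first 3 of 10 features matter
rng = np.random.RandomState(42)
X = rng.randn(200, 10)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Lasso zeroes out irrelevant features (built-in feature selection);
# Ridge only shrinks them, so no coefficient is exactly zero
print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())
```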

Load the dataset

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [0]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.metrics import accuracy_score
In [0]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.feature_selection import SelectFromModel
In [0]:
titanic = sns.load_dataset('titanic')
In [6]:
titanic.head()
Out[6]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [7]:
titanic.isnull().sum()
Out[7]:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
In [0]:
titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)
In [0]:
titanic = titanic.dropna()
In [10]:
titanic.isnull().sum()
Out[10]:
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
In [0]:
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()
In [12]:
data.head()
Out[12]:
pclass sex sibsp parch embarked who alone
0 3 male 1 0 S man False
1 1 female 1 0 C woman False
2 3 female 0 0 S woman True
3 1 female 1 0 S woman False
4 3 male 0 0 S man True
In [13]:
data.isnull().sum()
Out[13]:
pclass      0
sex         0
sibsp       0
parch       0
embarked    0
who         0
alone       0
dtype: int64
In [0]:
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)
In [15]:
data.head()
Out[15]:
pclass sex sibsp parch embarked who alone
0 3 0 1 0 S man False
1 1 1 1 0 C woman False
2 3 1 0 0 S woman True
3 1 1 1 0 S woman False
4 3 0 0 0 S man True
In [0]:
ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)
In [0]:
who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)
In [0]:
alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)
In [19]:
data.head()
Out[19]:
pclass sex sibsp parch embarked who alone
0 3 0 1 0 0 0 0
1 1 1 1 0 1 1 0
2 3 1 0 0 0 1 1
3 1 1 1 0 0 1 0
4 3 0 0 0 0 0 1
In [0]:
x = data.copy()
y = titanic['survived']
In [21]:
x.shape, y.shape
Out[21]:
((889, 7), (889,))
In [0]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=43)

Estimation of Linear Regression Coefficients

In [0]:
# Select the features whose linear regression coefficients are above the mean (in absolute value)
sel = SelectFromModel(LinearRegression())
In [24]:
sel.fit(x_train, y_train)
Out[24]:
SelectFromModel(estimator=LinearRegression(copy_X=True, fit_intercept=True,
                                           n_jobs=None, normalize=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)
In [26]:
# Selected features are marked True
sel.get_support()
Out[26]:
array([ True,  True, False, False, False,  True, False])
In [27]:
# Inspect the coefficients
sel.estimator_.coef_
Out[27]:
array([-0.13750402,  0.26606466, -0.07470416, -0.0668525 ,  0.04793674,
        0.23857799, -0.12929595])
In [0]:
mean = np.mean(np.abs(sel.estimator_.coef_))
In [31]:
mean
Out[31]:
0.13727657291370773
In [32]:
np.abs(sel.estimator_.coef_)
Out[32]:
array([0.13750402, 0.26606466, 0.07470416, 0.0668525 , 0.04793674,
       0.23857799, 0.12929595])
In [33]:
# Extract the names of the selected features
features = x_train.columns[sel.get_support()]
features
Out[33]:
Index(['pclass', 'sex', 'who'], dtype='object')
In [0]:
# Reduce the data to the selected features
x_train_reg = sel.transform(x_train)
x_test_reg = sel.transform(x_test)
In [35]:
x_test_reg.shape
Out[35]:
(267, 3)
In [0]:
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf = clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
In [40]:
%%time
run_randomForest(x_train_reg, x_test_reg, y_train, y_test)
Accuracy:  0.8239700374531835
CPU times: user 263 ms, sys: 56.8 ms, total: 320 ms
Wall time: 348 ms
In [38]:
%%time
run_randomForest(x_train, x_test, y_train, y_test)
Accuracy:  0.8239700374531835
CPU times: user 255 ms, sys: 49.7 ms, total: 304 ms
Wall time: 355 ms
In [39]:
x_train.shape
Out[39]:
(622, 7)

Logistic Regression Coefficients with L1 Regularization

In [42]:
sel = SelectFromModel(LogisticRegression(penalty='l1', C=0.05, solver='liblinear'))
sel.fit(x_train, y_train)
sel.get_support()
# C controls the regularization strength: the smaller C is, the stronger the L1 penalty,
# so more coefficients become exactly 0 and features are thereby selected.
Out[42]:
array([ True,  True,  True, False, False,  True, False])
In [43]:
sel.estimator_.coef_
# As shown below, the features marked False have coefficients of 0
Out[43]:
array([[-0.54047865,  0.78075177, -0.14072298,  0.        ,  0.        ,
         0.94084999,  0.        ]])
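The effect of C can also be sketched on synthetic data (illustration only, separate from the Titanic data above): as C shrinks, the L1 penalty strengthens and more coefficients are driven to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic classification data (illustration only)
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Smaller C = stronger L1 penalty = more coefficients forced to exactly 0
zeros = {}
for C in [1.0, 0.1, 0.01]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X, y)
    zeros[C] = int((clf.coef_ == 0).sum())
print(zeros)
```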
In [0]:
x_train_l1 = sel.transform(x_train)
x_test_l1 = sel.transform(x_test)
In [46]:
%%time
run_randomForest(x_train_l1, x_test_l1, y_train, y_test)
Accuracy:  0.8277153558052435
CPU times: user 239 ms, sys: 66.5 ms, total: 306 ms
Wall time: 355 ms

L2 Regularization

In [48]:
sel = SelectFromModel(LogisticRegression(penalty='l2', C=0.05, solver='liblinear'))
sel.fit(x_train, y_train)
sel.get_support()
Out[48]:
array([ True,  True, False, False, False,  True, False])
In [49]:
sel.estimator_.coef_
Out[49]:
array([[-0.55749685,  0.85692344, -0.30436065, -0.11841967,  0.2435823 ,
         1.00124155, -0.29875898]])
In [0]:
x_train_l2 = sel.transform(x_train)
x_test_l2 = sel.transform(x_test)
In [51]:
%%time
run_randomForest(x_train_l2, x_test_l2, y_train, y_test)
Accuracy:  0.8239700374531835
CPU times: user 230 ms, sys: 48.1 ms, total: 278 ms
Wall time: 347 ms