Linear Regression's Basic Assumptions

  • Linear relationship with the target y
  • Features in X should follow a Gaussian distribution
  • Features are not correlated with each other
  • Features are on the same scale, i.e. have the same variance

Lasso (L1) and Ridge (L2) Regularization

Regularization is a technique that discourages model complexity by adding a penalty term to the loss function, which helps to prevent overfitting.

  • L1 regularization (also called Lasso)
  • L2 regularization (also called Ridge)
  • L1/L2 regularization (also called Elastic net)

A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
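As a quick sketch of what the two penalties look like (the weight vector and λ here are made-up values for illustration only):

```python
import numpy as np

# Hypothetical weight vector and regularization strength (illustration only)
w = np.array([0.5, -1.2, 0.0, 3.0])
lam = 0.1

# Lasso (L1) adds lambda * sum(|w|) to the loss
l1_penalty = lam * np.sum(np.abs(w))

# Ridge (L2) adds lambda * sum(w^2) to the loss
l2_penalty = lam * np.sum(w ** 2)

print(l1_penalty, l2_penalty)
```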

About the bias-variance trade-off:
Trying to reduce the bias increases the variance, and trying to reduce the variance increases the bias. So the model must be built at a point where bias and variance are each appropriately balanced; the parameter that tunes this balance is λ.

  • Large λ -> high bias, low variance (underfitting)
  • Small λ -> low bias, high variance (overfitting)
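A minimal sketch of the λ effect (in scikit-learn's Ridge, λ is the `alpha` parameter; the data here is synthetic, for illustration only): a larger alpha shrinks the coefficients toward zero, which is the high-bias / low-variance direction.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustration only): y depends linearly on 5 features
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.0, 2.0, -1.5, 0.5, 3.0]) + 0.1 * rng.randn(100)

# Small lambda (alpha): coefficients stay close to least squares -> low bias, high variance
coef_small = Ridge(alpha=0.01).fit(X, y).coef_
# Large lambda (alpha): coefficients are shrunk toward zero -> high bias, low variance
coef_large = Ridge(alpha=100.0).fit(X, y).coef_

print(np.abs(coef_small).sum(), np.abs(coef_large).sum())
```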

Difference between L1 and L2 regularization

L1 Regularization

  • L1 penalizes the sum of the absolute values of the weights
  • L1 yields a sparse solution
  • L1 can have multiple solutions
  • L1 has built-in feature selection
  • L1 is robust to outliers
  • L1 generates models that are simple and interpretable but cannot learn complex patterns

L2 Regularization

  • L2 penalizes the sum of the squared weights
  • L2 yields a non-sparse solution
  • L2 has a single solution
  • L2 has no feature selection
  • L2 is not robust to outliers
  • L2 gives better predictions when the output variable is a function of all input features
  • L2 regularization is able to learn complex data patterns
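The sparsity difference can be seen directly by fitting Lasso and Ridge side by side (synthetic data, alpha chosen for illustration only): Lasso sets the coefficients of irrelevant features exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustration only): only the first 3 of 10 features matter
rng = np.random.RandomState(42)
X = rng.randn(200, 10)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Lasso zeroes out irrelevant features (built-in feature selection);
# Ridge only shrinks them, so no coefficient is exactly zero
print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())
```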

Load the dataset

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [0]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.metrics import accuracy_score
In [0]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.feature_selection import SelectFromModel
In [0]:
titanic = sns.load_dataset('titanic')
In [6]:
titanic.head()
Out[6]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [7]:
titanic.isnull().sum()
Out[7]:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
In [0]:
titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)
In [0]:
titanic = titanic.dropna()
In [10]:
titanic.isnull().sum()
Out[10]:
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
In [0]:
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()
In [12]:
data.head()
Out[12]:
pclass sex sibsp parch embarked who alone
0 3 male 1 0 S man False
1 1 female 1 0 C woman False
2 3 female 0 0 S woman True
3 1 female 1 0 S woman False
4 3 male 0 0 S man True
In [13]:
data.isnull().sum()
Out[13]:
pclass      0
sex         0
sibsp       0
parch       0
embarked    0
who         0
alone       0
dtype: int64
In [0]:
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)
In [15]:
data.head()
Out[15]:
pclass sex sibsp parch embarked who alone
0 3 0 1 0 S man False
1 1 1 1 0 C woman False
2 3 1 0 0 S woman True
3 1 1 1 0 S woman False
4 3 0 0 0 S man True
In [0]:
ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)
In [0]:
who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)
In [0]:
alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)
In [19]:
data.head()
Out[19]:
pclass sex sibsp parch embarked who alone
0 3 0 1 0 0 0 0
1 1 1 1 0 1 1 0
2 3 1 0 0 0 1 1
3 1 1 1 0 0 1 0
4 3 0 0 0 0 0 1
In [0]:
x = data.copy()
y = titanic['survived']
In [21]:
x.shape, y.shape
Out[21]:
((889, 7), (889,))
In [0]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=43)

Estimation of Linear Regression Coefficients

In [0]:
# Select the features whose linear regression coefficients are above the mean (in absolute value)
sel = SelectFromModel(LinearRegression())
In [24]:
sel.fit(x_train, y_train)
Out[24]:
SelectFromModel(estimator=LinearRegression(copy_X=True, fit_intercept=True,
                                           n_jobs=None, normalize=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)
In [26]:
# Selected features are marked True
sel.get_support()
Out[26]:
array([ True,  True, False, False, False,  True, False])
In [27]:
# Inspect the coefficients
sel.estimator_.coef_
Out[27]:
array([-0.13750402,  0.26606466, -0.07470416, -0.0668525 ,  0.04793674,
        0.23857799, -0.12929595])
In [0]:
mean = np.mean(np.abs(sel.estimator_.coef_))
In [31]:
mean
Out[31]:
0.13727657291370773
In [32]:
np.abs(sel.estimator_.coef_)
Out[32]:
array([0.13750402, 0.26606466, 0.07470416, 0.0668525 , 0.04793674,
       0.23857799, 0.12929595])
In [33]:
# Extract the names of the selected features
features = x_train.columns[sel.get_support()]
features
Out[33]:
Index(['pclass', 'sex', 'who'], dtype='object')
In [0]:
# Reduce the data to the selected features
x_train_reg = sel.transform(x_train)
x_test_reg = sel.transform(x_test)
In [35]:
x_test_reg.shape
Out[35]:
(267, 3)
In [0]:
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf = clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
In [40]:
%%time
run_randomForest(x_train_reg, x_test_reg, y_train, y_test)
Accuracy:  0.8239700374531835
CPU times: user 263 ms, sys: 56.8 ms, total: 320 ms
Wall time: 348 ms
In [38]:
%%time
run_randomForest(x_train, x_test, y_train, y_test)
Accuracy:  0.8239700374531835
CPU times: user 255 ms, sys: 49.7 ms, total: 304 ms
Wall time: 355 ms
In [39]:
x_train.shape
Out[39]:
(622, 7)

Logistic Regression Coefficients with L1 Regularization

In [42]:
sel = SelectFromModel(LogisticRegression(penalty='l1', C=0.05, solver='liblinear'))
sel.fit(x_train, y_train)
sel.get_support()
# C controls the regularization strength: the smaller C is, the stronger the L1 penalty,
# so more coefficients become exactly 0 and features are thereby selected.
Out[42]:
array([ True,  True,  True, False, False,  True, False])
In [43]:
sel.estimator_.coef_
# As shown below, the features marked False have coefficients of 0
Out[43]:
array([[-0.54047865,  0.78075177, -0.14072298,  0.        ,  0.        ,
         0.94084999,  0.        ]])
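The effect of C can also be sketched on synthetic data (illustration only, separate from the Titanic data above): as C shrinks, the L1 penalty strengthens and more coefficients are driven to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic classification data (illustration only)
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Smaller C = stronger L1 penalty = more coefficients forced to exactly 0
zeros = {}
for C in [1.0, 0.1, 0.01]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X, y)
    zeros[C] = int((clf.coef_ == 0).sum())
print(zeros)
```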
In [0]:
x_train_l1 = sel.transform(x_train)
x_test_l1 = sel.transform(x_test)
In [46]:
%%time
run_randomForest(x_train_l1, x_test_l1, y_train, y_test)
Accuracy:  0.8277153558052435
CPU times: user 239 ms, sys: 66.5 ms, total: 306 ms
Wall time: 355 ms

L2 Regularization

In [48]:
sel = SelectFromModel(LogisticRegression(penalty='l2', C=0.05, solver='liblinear'))
sel.fit(x_train, y_train)
sel.get_support()
Out[48]:
array([ True,  True, False, False, False,  True, False])
In [49]:
sel.estimator_.coef_
Out[49]:
array([[-0.55749685,  0.85692344, -0.30436065, -0.11841967,  0.2435823 ,
         1.00124155, -0.29875898]])
In [0]:
x_train_l2 = sel.transform(x_train)
x_test_l2 = sel.transform(x_test)
In [51]:
%%time
run_randomForest(x_train_l2, x_test_l2, y_train, y_test)
Accuracy:  0.8239700374531835
CPU times: user 230 ms, sys: 48.1 ms, total: 278 ms
Wall time: 347 ms