
[05. Regression] Linear Regression, Polynomial Regression, Regularized Regression, Logistic Regression, Regression Trees

by isdawell 2022. 3. 25.

👀 Regression Analysis

- A statistical technique based on the tendency of data values to regress toward a constant value such as the mean. 

- A collective term for techniques that model the relationship between several independent variables and a single dependent variable. 

- The dependent variable is a numeric (continuous) value. 

- The core of machine-learning regression prediction is finding the optimal regression coefficients! 

Number of independent variables - one: simple regression / several: multiple regression
Combination of regression coefficients - linear: linear regression / non-linear: non-linear regression

- Objective of regression analysis: find the regression coefficients (w) that minimize the RSS (residual sum of squares)

 

03. Gradient Descent 


📌 Overview 

 

💡 Gradient Descent

 

  • The key technique that made the idea of algorithms learning from data on their own possible 
  • Finds the W parameters that minimize the error by incrementally updating them through repeated calculations 

 

import numpy as np

np.random.seed(0)

# approximate y = 4X + 6 (w1=4, w0=6); random noise is added on top
X = 2*np.random.rand(100,1)
y = 6 + 4*X + np.random.randn(100,1)

# define the cost function (MSE)
def get_cost(y, y_pred) : 
  N = len(y)
  cost = np.sum(np.square(y-y_pred))/N
  return cost 

# weight update function
def get_weight_updates(w1, w0, X, y, learning_rate=0.01) : 
  N = len(y) 
  w1_update = np.zeros_like(w1)
  w0_update = np.zeros_like(w0) 
  
  y_pred = np.dot(X, w1.T) + w0 
  diff = y - y_pred
  
  # matrix of ones so the w0 gradient can also be computed as a dot product 
  w0_factors = np.ones((N,1)) 
  
  # update terms: negative gradient scaled by the learning rate
  w1_update = -(2/N)*learning_rate*(np.dot(X.T, diff)) 
  w0_update = -(2/N)*learning_rate*(np.dot(w0_factors.T, diff)) 
  
  return w1_update, w0_update

 

# repeatedly apply the w1, w0 updates as many times as the iters argument specifies 

def gradient_descent_steps(X, y, iters=1000) : 
  # initialization
  w0 = np.zeros((1,1)) 
  w1 = np.zeros((1,1)) 
  
  for ind in range(iters) : 
    w1_update, w0_update = get_weight_updates(w1, w0, X, y, learning_rate=0.01) 
    w1 = w1 - w1_update
    w0 = w0 - w0_update 
  
  return w1, w0


w1, w0 = gradient_descent_steps(X, y, iters=1000) 

y_pred = w1[0,0] * X + w0

 

 💡 Stochastic Gradient Descent (SGD) 

 

  • Computes the w updates using only a subset of the data instead of the entire dataset.  
  • Fast 
  • For large datasets, stochastic gradient descent or mini-batch stochastic gradient descent is used to minimize the cost function. 
  • The only difference from the GD code is the addition of a batch_size variable 
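The batch_size change can be sketched as below; the sampling scheme and the toy data here are illustrative choices (the helper mirrors the get_weight_updates function above), not a definitive implementation:

```python
import numpy as np

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 6 + 4 * X + np.random.randn(100, 1)

def get_weight_updates(w1, w0, X, y, learning_rate=0.01):
    N = len(y)
    y_pred = np.dot(X, w1.T) + w0
    diff = y - y_pred
    w0_factors = np.ones((N, 1))
    w1_update = -(2 / N) * learning_rate * np.dot(X.T, diff)
    w0_update = -(2 / N) * learning_rate * np.dot(w0_factors.T, diff)
    return w1_update, w0_update

def stochastic_gradient_descent_steps(X, y, batch_size=10, iters=1000):
    w0 = np.zeros((1, 1))
    w1 = np.zeros((1, 1))
    for ind in range(iters):
        np.random.seed(ind)
        # sample a mini-batch instead of using the full dataset
        sample_index = np.random.permutation(X.shape[0])
        sample_X = X[sample_index[:batch_size]]
        sample_y = y[sample_index[:batch_size]]
        w1_update, w0_update = get_weight_updates(w1, w0, sample_X, sample_y, learning_rate=0.01)
        w1 = w1 - w1_update
        w0 = w0 - w0_update
    return w1, w0

w1, w0 = stochastic_gradient_descent_steps(X, y, batch_size=10, iters=1000)
print(w1[0, 0], w0[0, 0])  # close to the true slope 4 and intercept 6
```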

 

04. Linear Regression in Practice - Predicting Boston House Prices 


📌 Overview 

💡 LinearRegression 

 

🔸 Parameters

  • fit_intercept : whether to compute an intercept (True/False). If set to False, the intercept is fixed at 0. 
  • normalize : True/False, default False. If True, the input dataset is normalized before the regression is performed. 

 

🔸 Attributes

  • coef_ : the regression coefficients. Shape is (number of targets, number of features) 
  • intercept_ : the intercept 

 

🔸 Caveats 

  • Multicollinearity: when the input variables are highly correlated with one another, the variance of the estimates grows very large and the model becomes very sensitive to error 👉 keep only the important, mutually independent features and drop the rest, or apply regularization 
  • When many features suffer from multicollinearity, dimensionality reduction via PCA is also worth considering. 
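A quick way to screen for this before fitting is to inspect the feature correlation matrix; the arrays below are made-up toy features, with x2 built as a near-copy of x1:

```python
import numpy as np

rng = np.random.RandomState(42)
x1 = rng.rand(200)
x2 = 0.95 * x1 + 0.05 * rng.rand(200)   # nearly a copy of x1
x3 = rng.rand(200)                      # an unrelated feature

# rows are variables, so corr[i, j] is the correlation between feature i and feature j
corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))
```

Here corr[0, 1] comes out close to 1, flagging x1/x2 as candidates for removal, PCA, or regularization.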

 

๐Ÿ”ธ ํ‰๊ฐ€์ง€ํ‘œ 

  • MAE : ์‹ค์ œ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’์˜ ์ฐจ์ด๋ฅผ ์ ˆ๋Œ“๊ฐ’์œผ๋กœ ๋ณ€ํ™˜ํ•ด ํ‰๊ท  
  • MSE : ์‹ค์ œ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’์˜ ์ฐจ์ด๋ฅผ ์ œ๊ณฑํ•˜์—ฌ ํ‰๊ท 
  • RMSE : MSE ์— ๋ฃจํŠธ๋ฅผ ์”Œ์šด ๊ฒƒ 
  • R^2 : ๋ถ„์‚ฐ ๊ธฐ๋ฐ˜์œผ๋กœ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•œ๋‹ค. 1์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก ์˜ˆ์ธก ์ •ํ™•๋„๊ฐ€ ๋†’๋‹ค. 
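These four metrics can be spelled out in plain NumPy (the scikit-learn functions in the table that follows compute the same quantities); the sample arrays are arbitrary:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
rmse = np.sqrt(mse)                      # root of the MSE

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, round(rmse, 3), round(r2, 3))  # 0.5 0.375 0.612 0.949
```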

 

Metric | scikit-learn scoring function | scoring parameter value
MAE | metrics.mean_absolute_error | neg_mean_absolute_error
MSE | metrics.mean_squared_error | neg_mean_squared_error
R^2 | metrics.r2_score | r2

 

✔ The scoring parameter value is what is passed to the scoring parameter when evaluating with cross_val_score or GridSearchCV. 

✔ The scoring functions return neg (negative) values because scikit-learn's scoring convention automatically treats a larger score as a better result. Regression metrics based on the error between actual and predicted values mean a worse model as they grow, so the sign is flipped to compensate: a smaller error becomes the larger number. 

 

 

๐Ÿ‘€ ๋ฐ์ดํ„ฐ ๋กœ๋“œ 

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.datasets import load_boston
%matplotlib inline

# load the boston dataset
boston = load_boston()

# convert the boston dataset to a DataFrame 
bostonDF = pd.DataFrame(boston.data , columns = boston.feature_names)

# the boston dataset's target array holds the house prices; add it to the DataFrame as the PRICE column 
bostonDF['PRICE'] = boston.target
print('Boston dataset shape :',bostonDF.shape)
bostonDF.head()

# Boston dataset shape : (506, 14)

 

 

👀 Examining the relationship between the target and the explanatory variables: sns.regplot 

# visualize how strongly each explanatory variable relates to PRICE 
fig, axs = plt.subplots(figsize=(16,8), ncols=4, nrows=2) 
lm_features = ['RM','ZN','INDUS','NOX','AGE', 'PTRATIO', 'LSTAT', 'RAD'] 
for i, feature in enumerate(lm_features) : 

  row = int(i/4) 
  col = i%4 

  sns.regplot(x=feature, y='PRICE', data = bostonDF, ax = axs[row][col]) 

# RM (number of rooms) and LSTAT (% lower-status population) show the strongest relationship with PRICE

 

 

👀 Training/predicting/evaluating a linear regression model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error , r2_score

y_target = bostonDF['PRICE']
X_data = bostonDF.drop(['PRICE'],axis=1,inplace=False)

X_train , X_test , y_train , y_test = train_test_split(X_data , y_target ,test_size=0.3, random_state=156)

# train/predict/evaluate with Linear Regression (OLS) 
lr = LinearRegression()
lr.fit(X_train ,y_train )
y_preds = lr.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)

print('MSE : {0:.3f} , RMSE : {1:.3f}'.format(mse , rmse))
print('Variance score : {0:.3f}'.format(r2_score(y_test, y_preds))) 


# MSE : 17.297 , RMSE : 4.159
# Variance score : 0.757

 

๐Ÿ‘€ ์ ˆํŽธ๊ณผ ํšŒ๊ท€๊ณ„์ˆ˜ ๊ฐ’ ํ™•์ธ 

print('์ ˆํŽธ ๊ฐ’ : ', lr.intercept_)
print('ํšŒ๊ท€ ๊ณ„์ˆ˜ ๊ฐ’ : ',np.round(lr.coef_ , 1))

# ์ ˆํŽธ ๊ฐ’ :  40.995595172164755
# ํšŒ๊ท€ ๊ณ„์ˆ˜ ๊ฐ’ :  [ -0.1   0.1   0.    3.  -19.8   3.4   0.   -1.7   0.4  -0.   -0.9   0.  -0.6]

 

👀 Cross-validation

# cross-validation 
from sklearn.model_selection import cross_val_score 

y_target = bostonDF['PRICE'] 
X_data = bostonDF.drop(['PRICE'], axis = 1, inplace = False) 
lr = LinearRegression() 

# compute the MSE over 5 folds, then derive the RMSE from it 
neg_mse_scores = cross_val_score(lr, X_data, y_target, scoring = 'neg_mean_squared_error', cv = 5) 
rmse_scores = np.sqrt(-1*neg_mse_scores) 
avg_rmse = np.mean(rmse_scores) 

# the values returned by cross_val_score(scoring = 'neg_mean_squared_error') are all negative 
print('5-fold individual Negative MSE scores : ', np.round(neg_mse_scores,2)) 
print('5-fold individual RMSE scores : ', np.round(rmse_scores,2)) 
print('5-fold average RMSE : {0:.3f}'.format(avg_rmse))

# 5-fold individual Negative MSE scores :  [-12.46 -26.05 -33.07 -80.76 -33.31]
# 5-fold individual RMSE scores :  [3.53 5.1  5.75 8.99 5.77]
# 5-fold average RMSE : 5.829

 

 

 

05. Polynomial Regression 


📌 Overview 

 

💡 Polynomial Regression

 

  • Regression expressed as a polynomial, such as a quadratic or cubic equation, rather than a first-degree expression is called polynomial regression. 
  • Note: polynomial regression is easy to mistake for non-linear regression, but it is linear regression. The linear/non-linear distinction depends on whether the regression coefficients enter linearly, not on whether the independent variables do. 
  • Because polynomial regression is linear regression, it is implemented by feeding non-linear feature transformations into a linear model 👉 PolynomialFeatures

 

👀 Example: generating polynomial features 

from sklearn.preprocessing import PolynomialFeatures 
import numpy as np 

# create the first-degree features to be converted into a polynomial 
X = np.arange(4).reshape(2,2) 
print('first-degree features : \n', X) #[[0,1],[2,3]] 



# convert into a second-degree polynomial using PolynomialFeatures with degree = 2 
poly = PolynomialFeatures(degree=2) # [1, x1, x2, x1^2, x1*x2, x2^2]
poly.fit(X) 
poly_ftr = poly.transform(X) 
print('transformed second-degree polynomial features : \n', poly_ftr) 

#  [[1. 0. 1. 0. 0. 1.],
# [1. 2. 3. 4. 6. 9.]]

 

 

👀 Implementing polynomial regression 

# implement polynomial regression 
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import PolynomialFeatures 
from sklearn.linear_model import LinearRegression 
import numpy as np

def polynomial_func(X) : 
  y = 1 + 2*X[:,0] + 3*X[:,0]**2 + 4*X[:,1]**3 # y = 1 + 2*x1 + 3*x1^2 + 4*x2^3 
  return y 

# connect the polynomial transform and the linear regression with a Pipeline object 
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression())]) 
X = np.arange(4).reshape(2,2) # ((x1,x2),(x1,x2)) 
y = polynomial_func(X) 
model = model.fit(X,y) 

print('Polynomial regression coefficients\n', np.round(model.named_steps['linear'].coef_,2))

#  [0.   0.18 0.18 0.36 0.54 0.72 0.72 1.08 1.62 2.34]

 

 

📌 Understanding underfitting and overfitting 

  • The best model is a balanced one that captures the patterns in the training data without being overly complex. 

The model in the middle generalizes best

 

 

📌 The bias-variance trade-off 

 

  • A model like Degree 1 is said to have high bias 👉 the model is skewed too strongly toward one direction
  • A model like Degree 15 is said to have high variance 👉 it learns even the fine details of the training data and ends up with excessively high variability 
  • Higher bias comes with lower variance (underfitting), and higher variance comes with lower bias (overfitting). 

 

 

 

 

 

 

06. Regularized Linear Models - Ridge, Lasso, Elastic Net  


📌 Overview 

 

💡 Why regularization is needed 

 

  • A regression model should fit the data appropriately while also keeping the regression coefficients from growing excessively large. 
  • Cost function of the optimal model = minimize the residual error on the training data ➕ control the size of the regression coefficients 

 

Cost function objective = minimize( RSS(W) + alpha * ||W||^2 )

✨ alpha is a tuning parameter 

👉 find the W vector that minimizes this cost function 

 

  • Making alpha large forces the coefficients W toward 0 to keep the cost small, which can mitigate overfitting. 
  • Making alpha small allows the coefficients W to grow somewhat while still offsetting the cost, which can improve the fit to the training data. 

 

💡 Regularization

 

  • A way to mitigate overfitting by adding an alpha-weighted penalty to the cost function that shrinks the size of the regression coefficients 
  • Divided into the L1 and L2 approaches. 
  • L1 regularization penalizes the absolute values of W and is called Lasso regression. Applying L1 regularization sets the coefficients of less influential features to 0. 
  • L2 regularization penalizes the squares of W and is called Ridge regression. 

 

📌 Ridge regression 

 

💡 Ridge 

  • Key parameter : alpha 

 

💡 Practice

 

👀 Ridge regression with alpha = 10 

from sklearn.linear_model import Ridge 
from sklearn.model_selection import cross_val_score 

# run ridge regression with alpha = 10 
ridge = Ridge(alpha=10) 
neg_mse_scores = cross_val_score(ridge, X_data, y_target, scoring = 'neg_mean_squared_error', cv=5) 
rmse_scores = np.sqrt(-1*neg_mse_scores) 
avg_rmse = np.mean(rmse_scores)

print('5-fold individual Negative MSE scores : ', np.round(neg_mse_scores,3)) 
print('5-fold individual RMSE scores : ', np.round(rmse_scores, 3)) 
print('5-fold average RMSE : {0:.3f}'.format(avg_rmse))

 # unregularized regression scored 5.829 --> regularization gives better predictive performance 
 
# 5-fold individual Negative MSE scores :  [-11.422 -24.294 -28.144 -74.599 -28.517]
# 5-fold individual RMSE scores :  [3.38  4.929 5.305 8.637 5.34 ]
# 5-fold average RMSE : 5.518

 

👀 Measuring RMSE for several alpha values 

# measure while varying alpha 

alphas = [0,0.1,1,10,100] 

# compute the average RMSE for each alpha 

for alpha in alphas : 
  ridge = Ridge(alpha = alpha) 

  # use cross_val_score to average over the 5 folds 
  neg_mse_scores = cross_val_score(ridge , X_data, y_target, scoring='neg_mean_squared_error', cv =5) 
  avg_rmse = np.mean(np.sqrt(-1*neg_mse_scores)) 
  print('alpha {0} : 5-fold average RMSE : {1:.3f}'.format(alpha, avg_rmse)) 


# alpha = 100 performs best 
  
  
# alpha 0 : 5-fold average RMSE : 5.829
# alpha 0.1 : 5-fold average RMSE : 5.788
# alpha 1 : 5-fold average RMSE : 5.653
# alpha 10 : 5-fold average RMSE : 5.518
# alpha 100 : 5-fold average RMSE : 5.330

 

👀 How the coefficient magnitudes change with alpha 

 

As alpha increases, the regression coefficients shrink steadily. The NOX feature in particular shrinks by a large amount.
Note that ridge regression never drives coefficients all the way to 0.

 

 

๐Ÿ“Œ ๋ผ์˜ ํšŒ๊ท€ 

 

๐Ÿ’ก Lasso

  • ์ฃผ์š” ํŒŒ๋ผ๋ฏธํ„ฐ : alpha 
  • L2 ๊ทœ์ œ๋Š” ํšŒ๊ท€๊ณ„์ˆ˜์˜ ํฌ๊ธฐ๋ฅผ ๊ฐ์†Œ์‹œํ‚ค๋Š”๋ฐ ๋ฐ˜ํ•˜์—ฌ L1 ๊ทœ์ œ๋Š” ๋ถˆํ•„์š”ํ•œ ํšŒ๊ท€๊ณ„์ˆ˜๋ฅผ ๊ธ‰๊ฒฉํ•˜๊ฒŒ ๊ฐ์†Œ์‹œ์ผœ 0์œผ๋กœ ๋งŒ๋“ค๊ณ  ์ œ๊ฑฐํ•ด๋ฒ„๋ฆฐ๋‹ค ๐Ÿ‘‰ ์ ์ ˆํ•œ ํ”ผ์ฒ˜๋งŒ ํšŒ๊ท€์— ํฌํ•จ์‹œํ‚ค๋Š” ํ”ผ์ฒ˜ ์„ ํƒ์˜ ํŠน์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. 

 

💡 Practice

 

 

👀 Writing a model-evaluation helper 

from sklearn.linear_model import Lasso, ElasticNet 

# print the cross-validated average RMSE for each alpha value and return the coefficients as a DataFrame 
def get_linear_reg_eval(model_name , params = None, X_data_n = None, y_target_n = None, verbose =True)  : 
  coeff_df = pd.DataFrame() 
  if verbose : print('#####', model_name, '#####') 

  for param in params : 
    if model_name == 'Ridge' : model = Ridge(alpha=param) 
    elif model_name == 'Lasso' : model = Lasso(alpha=param) 
    elif model_name == 'ElasticNet' : model = ElasticNet(alpha = param, l1_ratio = 0.7) 

    neg_mse_scores = cross_val_score(model, X_data_n, y_target_n, scoring = 'neg_mean_squared_error', cv =5) 
    avg_rmse = np.mean(np.sqrt(-1*neg_mse_scores)) 
    print('alpha {0} : 5-fold average RMSE : {1:.3f}'.format(param, avg_rmse))

    # cross_val_score only returns the evaluation metric, so refit the model to extract the coefficients 
    model.fit(X_data_n, y_target_n) 

    # convert the per-feature coefficients for this alpha into a Series and add it as a DataFrame column 
    # (skipped when the input is a plain ndarray without column names, e.g. after scaling) 
    if isinstance(X_data_n, pd.DataFrame) : 
      coeff = pd.Series(data = model.coef_, index = X_data_n.columns) 
      colname = 'alpha:' + str(param) 
      coeff_df[colname] = coeff 

  return coeff_df

 

👀 Results for several alpha values 

lasso_alphas = [0.07, 0.1, 0.5, 1, 3] 
coeff_lasso_df = get_linear_reg_eval('Lasso', params = lasso_alphas, X_data_n = X_data, y_target_n = y_target)

##### Lasso #####
# alpha 0.07 : 5-fold average RMSE : 5.612
# alpha 0.1 : 5-fold average RMSE : 5.615
# alpha 0.5 : 5-fold average RMSE : 5.669
# alpha 1 : 5-fold average RMSE : 5.776
# alpha 3 : 5-fold average RMSE : 6.189

 

👀 Printing the coefficients 

# print the regression coefficients 
sort_column = 'alpha:' + str(lasso_alphas[0]) 
coeff_lasso_df.sort_values(by=sort_column, ascending = False)

 

As alpha grows, the coefficients of some features drop all the way to 0.

 

๐Ÿ“Œ ์—˜๋ผ์Šคํ‹ฑ๋„ท ํšŒ๊ท€ 

 

๐Ÿ’ก Elastic Net 

  • L2 ๊ทœ์ œ์™€ L1 ๊ทœ์ œ๋ฅผ ๊ฒฐํ•ฉํ•œ ํšŒ๊ท€ 
  • ๋ผ์˜ํšŒ๊ท€(L1) ์˜ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ๋†’์€ ํ”ผ์ฒ˜๋“ค์˜ ๊ฒฝ์šฐ์— ์ด๋“ค ์ค‘ ์ค‘์š” ํ”ผ์ฒ˜๋งŒ์„ ์„ ํƒํ•˜๊ณ  ๋‹ค๋ฅธ ํ”ผ์ฒ˜๋“ค์€ ๋ชจ๋‘ ๊ณ„์ˆ˜๋ฅผ 0์œผ๋กœ ๋งŒ๋“œ๋Š” ์„ฑ์งˆ ๋•Œ๋ฌธ์— alpha ๊ฐ’์— ๋”ฐ๋ผ ํšŒ๊ท€ ๊ณ„์ˆ˜์˜ ๊ฐ’์ด ๊ธ‰๊ฒฉํžˆ ๋ณ€๋™ํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ด๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด L2 ๊ทœ์ œ๋ฅผ ๋ผ์˜ ํšŒ๊ท€์— ์ถ”๊ฐ€ํ–ˆ๋‹ค. 
  • ๋‹จ์  : ์ˆ˜ํ–‰์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆผ 

 

💡 Parameters

  • alpha 👉 equals (a+b) in a*L1 + b*L2 👉 a is the alpha of the L1 penalty, b is the alpha of the L2 penalty. 
  • l1_ratio 👉 a/(a+b)  
  • If l1_ratio = 0 then a = 0, which is identical to pure L2 regularization; if 1 then b = 0, which is identical to pure L1 regularization.
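The a*L1 + b*L2 decomposition can be written out directly. This is only the penalty term, and it follows the form described above; note that scikit-learn's actual ElasticNet objective additionally halves the L2 part:

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    # a = alpha * l1_ratio        -> strength of the L1 part
    # b = alpha * (1 - l1_ratio)  -> strength of the L2 part
    # so alpha = a + b and l1_ratio = a / (a + b), as described above
    a = alpha * l1_ratio
    b = alpha * (1 - l1_ratio)
    return a * np.sum(np.abs(w)) + b * np.sum(w ** 2)

w = np.array([1.0, -2.0, 0.5])
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0))  # 3.5  (pure L1)
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0))  # 5.25 (pure L2)
```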

 

💡 Practice

 

👀 Training the model 

elastic_alphas = [0.07, 0.1, 0.5, 1, 3] 
coeff_elastic_df = get_linear_reg_eval('ElasticNet', params = elastic_alphas, X_data_n = X_data, y_target_n = y_target)

##### ElasticNet #####
# alpha 0.07 : 5-fold average RMSE : 5.542
# alpha 0.1 : 5-fold average RMSE : 5.526
# alpha 0.5 : 5-fold average RMSE : 5.467
# alpha 1 : 5-fold average RMSE : 5.597
# alpha 3 : 5-fold average RMSE : 6.068

 

👀 Examining the coefficients 

sort_column = 'alpha:'+str(elastic_alphas[0]) 
coeff_elastic_df.sort_values(by=sort_column, ascending=False)

Compared with Lasso, relatively fewer coefficients become 0

 

 

📌 Data transformations for linear regression models 

 

💡 In linear regression, normalizing the data distribution and choosing the encoding matter nearly as much as finding the optimal parameters. 

  • Linear regression models assume a 'linear' relationship between the features and the target 
  • They strongly prefer features and targets with a 'normal' distribution 👉 a severely skewed distribution is likely to hurt predictive performance, so scaling/normalization is performed before fitting the model. 

 

💡 Scaling/normalizing the features 

  • StandardScaler : transforms the dataset toward a standard normal distribution with mean 0 and variance 1 
  • MinMaxScaler : normalizes the values so the minimum is 0 and the maximum is 1 
  • If applying a scaler does not improve predictive performance, 'polynomial features' are sometimes applied on top of the scaled/normalized dataset; beware of overfitting when using polynomial features. 
  • Log transform : applying the log function makes the values follow a distribution closer to normal. ⭐⭐

 

 

💡 Scaling/normalizing the target 

  • A log transform is generally applied. (Normalizing the target would make it hard to recover the original target values.) 
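A sketch of that workflow on synthetic data: fit on the log1p-transformed target, then map predictions back to the original scale with np.expm1 (np.polyfit stands in for the regression model here; the data-generating numbers are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.rand(200)
# target built so that log1p(y) is exactly linear in X plus noise
y = np.expm1(1.0 + 2.0 * X + rng.normal(0.0, 0.1, 200))

# fit the model on the log-transformed target
log_y = np.log1p(y)
slope, intercept = np.polyfit(X, log_y, 1)

# predictions must be mapped back with expm1 before evaluating in original units
pred = np.expm1(intercept + slope * X)
print(round(slope, 2), round(intercept, 2))  # close to 2 and 1
```

A side benefit visible here: because expm1 of any real value is greater than -1 and the fitted values stay positive, the inverted predictions cannot go negative the way a raw linear fit on a price-like target can.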

 

💡 Practice 

 

👀 Building a data-transformation pipeline 

# choose among Standard, MinMax, and Log 
# p_degree applies when adding polynomial features; values above 2 are not used 

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures 

def get_scaled_data(method='None', p_degree=None, input_data = None) : 
  if method == 'Standard' : 
    scaled_data = StandardScaler().fit_transform(input_data)
  elif method == 'MinMax' : 
    scaled_data = MinMaxScaler().fit_transform(input_data) 
  elif method == 'Log' : 
    scaled_data = np.log1p(input_data) 
  else : 
    scaled_data = input_data 

  if p_degree is not None : 
    scaled_data = PolynomialFeatures(degree=p_degree, include_bias = False).fit_transform(scaled_data) 
  
  return scaled_data

 

👀 Training ridge models on the transformed data 

# ridge regression 
alphas = [0.1, 1, 10, 100] 

scale_methods = [(None, None), ('Standard', None), ('Standard',2), ('MinMax',None), ('MinMax',2), ('Log',None)]  

for scale_method in scale_methods : 
  X_data_scaled = get_scaled_data(method = scale_method[0], p_degree = scale_method[1], input_data = X_data) 

  print('\nTransformation : {0} , Polynomial : {1}'.format( scale_method[0],  scale_method[1]))

  get_linear_reg_eval('Ridge', params = alphas, X_data_n = X_data_scaled, y_target_n = y_target, verbose = False)

The Log transform gives the best performance. 

 

 

 

07. Logistic Regression 


📌 Overview 

 

💡 Logistic Regression 

 

  • An algorithm that applies the linear-regression approach to classification; it belongs to the family of linear-regression algorithms. 
  • As a member of that family, logistic regression can be affected by how normally the data is distributed, so it is best to first apply standard scaling to give the data a normal-distribution-like shape. 
  • Light and fast, it is widely used as a baseline model for binary classification, and because it also performs well on sparse datasets it is frequently used for text classification. 

 

๐Ÿง ์„ ํ˜•ํšŒ๊ท€์™€ ๋‹ค๋ฅธ์  

 

  • ํ•™์Šต์„ ํ†ตํ•ด ์„ ํ˜• ํ•จ์ˆ˜์˜ ํšŒ๊ท€ ์ตœ์ ์„ ์„ ์ฐพ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜ ์ตœ์ ์„ ์„ ์ฐพ๊ณ  ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜์˜ ๋ฐ˜ํ™˜ ๊ฐ’์„ ํ™•๋ฅ ๋กœ ๊ฐ„์ฃผ์— ํ™•๋ฅ ์— ๋”ฐ๋ผ ๋ถ„๋ฅ˜๋ฅผ ๊ฒฐ์ •ํ•œ๋‹ค. 
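The squashing step can be sketched in a few lines; the z values here stand in for arbitrary outputs of the linear part w*x + b:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# z stands for the value of the linear part w*x + b
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
probs = sigmoid(z)                  # squashed into the (0, 1) range
preds = (probs > 0.5).astype(int)   # classify as 1 when the probability exceeds 0.5
print(preds)  # [0 0 0 1 1]
```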

 

๋งŽ์€ ์‚ฌํšŒ ํ˜„์ƒ์—์„œ ํŠน์ • ๋ณ€์ˆ˜์˜ ํ™•๋ฅ ๊ฐ’์€ ์œ„์˜ S ์ž ์ปค๋ธŒ ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง„๋‹ค.

 

  • ์˜ˆ์‹œ. ์•…์„ฑ์ข…์–‘ ๋ถ„๋ฅ˜ ๋ฌธ์ œ 

์„ ํ˜• ํšŒ๊ท€์„ ์€ ์œ„์˜ ๋ถ„ํฌ๋ฅผ ์ •ํ™•ํ•˜๊ฒŒ ๋ถ„๋ฅ˜ํ•ด๋‚ด์ง€ ๋ชปํ•œ๋‹ค. S์ž ์ปค๋ธŒ ํ˜•ํƒœ์˜ ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ข€ ๋” ์ •ํ™•ํ•˜๊ฒŒ 0๊ณผ 1์— ๋Œ€ํ•ด ๋ถ„๋ฅ˜ํ•ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

 

 

💡 Key parameters 

 

  • penalty : sets the type of regularization; 'l1' applies L1 regularization, 'l2' applies L2 regularization
  • C : the inverse of the regularization strength alpha; the smaller the value, the stronger the regularization. 

 

📌 Practice

 

👀 Wisconsin breast cancer dataset 

import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline 

from sklearn.datasets import load_breast_cancer 
from sklearn.linear_model import LogisticRegression 

cancer = load_breast_cancer()

 

👀 Scaling the features

from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split 

scaler = StandardScaler() 
data_scaled = scaler.fit_transform(cancer.data) 

X_train, X_test, y_train, y_test = train_test_split(data_scaled, cancer.target, test_size = 0.3, random_state=0)

 

👀 Training the model and evaluating performance 

# train the model and evaluate performance 
from sklearn.metrics import accuracy_score, roc_auc_score 

# train and predict with logistic regression 
lr_clf = LogisticRegression() 
lr_clf.fit(X_train, y_train) 
lr_pred = lr_clf.predict(X_test) 

# measure accuracy and roc_auc 
print('accuracy : {:0.3f}'.format(accuracy_score(y_test, lr_pred))) 
print('roc_auc : {:0.3f}'.format(roc_auc_score(y_test, lr_pred)))


# accuracy : 0.977
# roc_auc : 0.972

 

👀 Hyperparameter tuning 

from sklearn.model_selection import GridSearchCV 

params = {'penalty' : ['l1','l2'], 'C' : [0.01, 0.1, 1, 1, 5, 10]}  

grid_clf = GridSearchCV(lr_clf, param_grid = params, scoring = 'accuracy', cv=3) 

grid_clf.fit(data_scaled, cancer.target) 
print('Best hyperparameters: {0}, best mean accuracy {1:.3f}'.format(grid_clf.best_params_, grid_clf.best_score_))

# Best hyperparameters: {'C': 1, 'penalty': 'l2'}, best mean accuracy 0.975

 

 

 

08. Regression Trees 


📌 Overview 

💡 Regression based on a 'regression function' vs. regression based on a 'tree' 

  • Linear regression assumes all coefficient relationships are linear: it builds a regression function that combines the coefficients linearly and predicts the outcome from the input independent variables. Non-linear regression combines the coefficients non-linearly and builds a non-linear regression function to predict the outcome. 

➕ On modeling non-linear regression functions and curves: https://blog.minitab.com/ko/adventures-in-statistics-2/what-is-the-difference-between-linear-and-nonlinear-equations-in-regression-analysis 

 

  • This part introduces tree-based regression, like decision trees, rather than regression based on a regression function. 

 

💡 Regression trees 

  • Compute the regression prediction as the mean of the target values of the data that belong to each leaf node. 

Each split region's prediction is the mean y value of the data that fall into it.

 

  • The hyperparameters are the same as those of the classification tree (Classifier). 

 

 

 

💡 CART (Classification And Regression Trees) 

  • Tree-based algorithms (decision trees, random forest, GBM, XGBoost, LightGBM, etc.) can perform regression as well as classification, because tree construction is based on the CART algorithm. 
Algorithm | Regression Estimator | Classification Estimator
Decision Tree | DecisionTreeRegressor | DecisionTreeClassifier
Gradient Boosting | GradientBoostingRegressor | GradientBoostingClassifier
XGBoost | XGBRegressor | XGBClassifier
RandomForest | RandomForestRegressor | RandomForestClassifier
LightGBM | LGBMRegressor | LGBMClassifier

 

 

💡 Practice

 

👀 Boston house price prediction example 

from sklearn.datasets import load_boston 
from sklearn.model_selection import cross_val_score 
from sklearn.ensemble import RandomForestRegressor 
import pandas as pd 
import numpy as np 

# load the boston dataset
boston = load_boston()

# convert the boston dataset to a DataFrame 
bostonDF = pd.DataFrame(boston.data , columns = boston.feature_names)

# the boston dataset's target array holds the house prices; add it to the DataFrame as the PRICE column 
bostonDF['PRICE'] = boston.target
y_target = bostonDF['PRICE']
X_data = bostonDF.drop(['PRICE'],axis=1,inplace=False)

 

👀 Training/evaluating a random forest regression model 

rf = RandomForestRegressor(random_state=0, n_estimators = 1000) 
neg_mse_scores = cross_val_score(rf, X_data, y_target, scoring = 'neg_mean_squared_error', cv=5) 
rmse_scores = np.sqrt(-1*neg_mse_scores) 
avg_rmse = np.mean(rmse_scores) 

print(np.round(neg_mse_scores,2))  # [ -7.88 -13.14 -20.57 -46.23 -18.88]
print(np.round(rmse_scores,2))  # [2.81 3.63 4.54 6.8  4.34]
print(avg_rmse) # 4.422538982804892

 

👀 Predicting with RandomForest, DecisionTree, GBM, XGBoost, and LightGBM regressors 

# define a helper function 
def get_model_cv_prediction(model, X_data, y_target) : 
  neg_mse_scores = cross_val_score(model, X_data, y_target, scoring = 'neg_mean_squared_error', cv=5) 
  rmse_scores = np.sqrt(-1*neg_mse_scores) 
  avg_rmse = np.mean(rmse_scores) 
  print('####', model.__class__.__name__, '####') 
  print('5-fold cross-validation average RMSE : {0:.3f}'.format(avg_rmse)) 
  
# apply it to each model 
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import GradientBoostingRegressor 
from xgboost import XGBRegressor 
from lightgbm import LGBMRegressor 

dt_reg = DecisionTreeRegressor(random_state = 0, max_depth = 4) 
rf_reg = RandomForestRegressor(random_state = 0, n_estimators = 1000) 
gb_reg = GradientBoostingRegressor(random_state = 0 , n_estimators = 1000) 
xgb_reg = XGBRegressor(n_estimators=1000) 
lgb_reg = LGBMRegressor(n_estimators=1000) 

# evaluate each tree-based regression model in a loop 

models = [dt_reg, rf_reg, gb_reg, xgb_reg, lgb_reg] 
for model in models : 
   get_model_cv_prediction(model, X_data, y_target)

XGBRegressor performed best

 

 

 

💡 Feature importances in regression trees

 

  • Regression-tree Regressor classes work differently from linear regression, so they have no coef_ attribute for regression coefficients. Use feature_importances_ instead to see the per-feature importances. 
import seaborn as sns 
%matplotlib inline 

rf_reg = RandomForestRegressor(n_estimators=1000) 

rf_reg.fit(X_data, y_target) 

feature_series = pd.Series(data = rf_reg.feature_importances_, index = X_data.columns) 
feature_series = feature_series.sort_values(ascending = False) 
sns.barplot(x = feature_series, y = feature_series.index)

 

 

 

 

💡 Comparing linear regression and regression trees 

 

 

 

  • Linear regression expresses the prediction as a straight regression line 
  • A regression tree creates branches at the data's split points, producing a step-shaped regression line. In the max_depth = 2 plot, each step segment returns the mean y value of that region as the prediction. 
  • With the maximum tree depth set to 7, the tree also learns the outliers in the training set and overfits. 
