๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
1๏ธโƒฃ AI•DS/๐Ÿ“• ๋จธ์‹ ๋Ÿฌ๋‹

[04. ๋ถ„๋ฅ˜] GBM, XGboost

by isdawell 2022. 3. 14.

05. GBM 


๐Ÿ“Œ ๊ฐœ์š” ๋ฐ ์‹ค์Šต 

๐Ÿ’ก ๋ถ€์ŠคํŒ… ์•Œ๊ณ ๋ฆฌ์ฆ˜ 

  • ์—ฌ๋Ÿฌ๊ฐœ์˜ ์•ฝํ•œ ํ•™์Šต๊ธฐ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ํ•™์Šต - ์˜ˆ์ธกํ•˜๋ฉด์„œ ์ž˜๋ชป ์˜ˆ์ธกํ•œ ๋ฐ์ดํ„ฐ์— ๊ฐ€์ค‘์น˜ ๋ถ€์—ฌ๋ฅผ ํ†ตํ•ด ์˜ค๋ฅ˜๋ฅผ ๊ฐœ์„ ํ•ด ๋‚˜๊ฐ€๋ฉด์„œ ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค. 
  • ๋Œ€ํ‘œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ : AdaBoost, Gradient Booting Machine(GBM), XGBoost, LightGBM, CatBoost 

 

 

1๏ธโƒฃ  AdaBoost

     → ์˜ค๋ฅ˜ ๋ฐ์ดํ„ฐ์— ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•˜๋ฉด์„œ ๋ถ€์ŠคํŒ…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋Œ€ํ‘œ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ (๊ต์žฌ ๊ทธ๋ฆผ ํ™•์ธ) 

 

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# assumes X_train, X_test, y_train, y_test from a train/test split (e.g. the iris split below)
clf = AdaBoostClassifier(n_estimators=30,
                         random_state=10,
                         learning_rate=0.1)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print('AdaBoost accuracy: {:.4f}'.format(accuracy_score(y_test, pred)))

 

2๏ธโƒฃ ๊ทธ๋ž˜๋””์–ธํŠธ ๋ถ€์ŠคํŠธ (GBM)

    → AdaBoost ์™€ ์œ ์‚ฌํ•˜๋‚˜ ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ๋ฅผ ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์„ ์ด์šฉํ•˜๋Š” ๊ฒƒ์ด ํฐ ์ฐจ์ด๋‹ค. 

    ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• : (์‹ค์ œ๊ฐ’-์˜ˆ์ธก๊ฐ’) ์„ ์ตœ์†Œํ™”ํ•˜๋„๋ก ํ•˜๋Š” ๋ฐฉํ–ฅ์„ฑ์„ ๊ฐ€์ง€๊ณ  ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ฐ€์ค‘์น˜ ๊ฐ’์„ ์—…๋ฐ์ดํŠธ ํ•œ๋‹ค. 

    → ๋ถ„๋ฅ˜์™€ ํšŒ๊ท€ ๋ชจ๋‘ ๊ฐ€๋Šฅํ•˜๋‹ค. 

 

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

gb_clf = GradientBoostingClassifier(random_state=0)
# default max_depth = 3

gb_clf.fit(X_train, y_train)

print(gb_clf.score(X_train, y_train))  # .score returns accuracy
# 1.0
print(gb_clf.score(X_test, y_test))
# 0.9777777777777777

from sklearn.metrics import accuracy_score

y_pred = gb_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
# 0.9777777777777777
  • ์ผ๋ฐ˜์ ์œผ๋กœ GBM ์ด ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ๋ณด๋‹ค๋Š” ์˜ˆ์ธก ์„ฑ๋Šฅ์ด ์กฐ๊ธˆ ๋›ฐ์–ด๋‚œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์•ฝํ•œ ํ•™์Šต๊ธฐ์˜ ์ˆœ์ฐจ์ ์ธ ์˜ˆ์ธก ์˜ค๋ฅ˜ ๋ณด์ • ๋•Œ๋ฌธ์— ์ˆ˜ํ–‰ ์‹œ๊ฐ„์ด ์˜ค๋ž˜๊ฑธ๋ฆฌ๊ณ , ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์— ๋Œ€ํ•œ ๋…ธ๋ ฅ๋„ ํ•„์š”ํ•˜๋‹ค. 

 

๐Ÿ“Œ GBM ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ 

Parameter ์„ค๋ช… 
n_estimator ์ƒ์„ฑํ•  ์•ฝํ•œ ํ•™์Šต๊ธฐ์˜ ๊ฐœ์ˆ˜ (๋””ํดํŠธ๋Š” 100) 
max_depth ํŠธ๋ฆฌ์˜ ์ตœ๋Œ€ ๊นŠ์ด (๋””ํดํŠธ๋Š” 3)
๊นŠ์–ด์ง€๋ฉด ๊ณผ์ ํ•ฉ ๋  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์ ์ ˆํ•œ ์ œ์–ด ํ•„์š” 
max_features ์ตœ์ ์˜ ๋ถ„ํ• ์„ ์œ„ํ•ด ๊ณ ๋ คํ•  ์ตœ๋Œ€ feature ๊ฐœ์ˆ˜ 
loss ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์—์„œ ์‚ฌ์šฉํ•  ๋น„์šฉํ•จ์ˆ˜ ์ง€์ • (๋””ํดํŠธ๋Š” deviance)
learning_rate ์•ฝํ•œ ํ•™์Šต๊ธฐ๊ฐ€ ์ˆœ์ฐจ์ ์œผ๋กœ ์˜ค๋ฅ˜ ๊ฐ’์„ ๋ณด์ •ํ•ด ๋‚˜์•„๊ฐ€๋Š”๋ฐ ์ ์šฉํ•˜๋Š” ๊ณ„์ˆ˜๋กœ 0~1 ์‚ฌ์ด์˜ ๊ฐ’ ์ง€์ • ๊ฐ€๋Šฅ. ๋””ํดํŠธ๋Š” 0.1 

๐Ÿ”ธ ๋„ˆ๋ฌด ์ž‘์€ ๊ฐ’์„ ์ง€์ •ํ•˜๋ฉด ์˜ˆ์ธก ์„ฑ๋Šฅ์€ ๋†’์•„์งˆ ์ˆ˜ ์žˆ์ง€๋งŒ ์ˆ˜ํ–‰์‹œ๊ฐ„์ด ์˜ค๋ž˜๊ฑธ๋ฆด ์ˆ˜ ์žˆ๊ณ  ๋ฐ˜๋ณต์ด ์™„๋ฃŒ๋˜์–ด๋„ ์ตœ์†Œ ์˜ค๋ฅ˜๊ฐ’์„ ์ฐพ์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ์Œ 

๐Ÿ”ธ ๋„ˆ๋ฌด ํฐ๊ฐ’์„ ์ง€์ •ํ•˜๋ฉด ๋น ๋ฅธ ์ˆ˜ํ–‰์€ ๊ฐ€๋Šฅํ•˜๋‚˜ ์ตœ์†Œ์˜ค๋ฅ˜๊ฐ’์„ ์ฐพ์ง€ ๋ชปํ•˜๊ณ  ์ง€๋‚˜์ณ ์˜ˆ์ธก ์„ฑ๋Šฅ์ด ์ €ํ•˜๋  ์ˆ˜ ์žˆ์Œ 
subsample ํ•™์Šต์— ์‚ฌ์šฉํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ ์ƒ˜ํ”Œ๋ง ๋น„์œจ๋กœ 0~1 ๊ฐ’ ์ง€์ • ๊ฐ€๋Šฅ (๋””ํดํŠธ๋Š” 1๋กœ ์ฆ‰ ์ „์ฒด ํ•™์Šต ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜) 

 

from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [100, 500],
    'learning_rate': [0.05, 0.1]
}

grid_cv = GridSearchCV(gb_clf, param_grid=params, cv=2, verbose=1)
# verbose : prints progress messages during the search
# verbose = 0 : no messages (default)
# verbose = 1 : brief messages, verbose = 2 : a message per candidate fit

grid_cv.fit(X_train, y_train)

print('Best hyperparameters : \n', grid_cv.best_params_)
# {'learning_rate': 0.05, 'n_estimators': 100}

print('Best CV accuracy : {0:.4f}'.format(grid_cv.best_score_))
# Best CV accuracy : 0.9334

 

gb_pred = grid_cv.best_estimator_.predict(X_test)  # predict with the best-tuned estimator
gb_accuracy = accuracy_score(y_test, gb_pred)
print('GBM accuracy : {0:.4f}'.format(gb_accuracy))

# GBM accuracy : 0.9778

 

06. XGBoost


๐Ÿ“Œ ๊ฐœ์š” ๋ฐ ์‹ค์Šต 

๐Ÿ’ก XGBoost 

  • ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜์˜ ์•™์ƒ๋ธ” ํ•™์Šต์—์„œ ๊ฐ€์žฅ ๊ฐ๊ด‘๋ฐ›๊ณ  ์žˆ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. 

๐Ÿ‘€ ์žฅ์  

  1. ๋ถ„๋ฅ˜์— ์žˆ์–ด ์ผ๋ฐ˜์ ์œผ๋กœ ๋‹ค๋ฅธ ๋จธ์‹ ๋Ÿฌ๋‹๋ณด๋‹ค ๋›ฐ์–ด๋‚œ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค. 
  2. GBM ์— ๊ธฐ๋ฐ˜ํ•˜๊ณ  ์žˆ์ง€๋งŒ, GBM ์˜ ๋‹จ์ ์ธ ๋Š๋ฆฐ ์ˆ˜ํ–‰์‹œ๊ฐ„ ๋ฐ ๊ณผ์ ํ•ฉ ๊ทœ์ œ ๋ถ€์žฌ ๋“ฑ์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์—ฌ ๋งค์šฐ ๊ฐ๊ด‘๋ฐ›๊ณ  ์žˆ๋‹ค. ๋ณ‘๋ ฌ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•ด ๊ธฐ์กด GBM ๋ณด๋‹ค ๋น ๋ฅด๊ฒŒ ํ•™์Šต์„ ์™„๋ฃŒํ•  ์ˆ˜ ์žˆ๋‹ค. 
  3. ๊ณผ์ ํ•ฉ ๊ทœ์ œ (Regularization) ๊ธฐ๋Šฅ์ด ์žˆ๋‹ค. (GBM ์€ ์—†์Œ) 
  4. Tree prunning (๋‚˜๋ฌด ๊ฐ€์ง€์น˜๊ธฐ) : max_depth ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋ถ„ํ•  ๊นŠ์ด๋ฅผ ์กฐ์ •ํ•˜๊ธฐ๋„ ํ•˜์ง€๋งŒ, tree prunning ์œผ๋กœ ๋” ์ด์ƒ ๊ธ์ • ์ด๋“์ด ์—†๋Š” ๋ถ„ํ• ์„ ๊ฐ€์ง€์น˜๊ธฐ ํ•ด์„œ ๋ถ„ํ•  ์ˆ˜๋ฅผ ๋” ์ค„์ด๋Š” ์ถ”๊ฐ€์ ์ธ ์žฅ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. 
  5. ์ž์ฒด ๋‚ด์žฅ๋œ ๊ต์ฐจ ๊ฒ€์ฆ : ๊ต์ฐจ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•ด ์ตœ์ ํ™”๋œ ๋ฐ˜๋ณต ์ˆ˜ํ–‰ ํšŸ์ˆ˜๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์œผ๋ฉฐ ํ‰๊ฐ€ ๊ฐ’์ด ์ตœ์ ํ™” ๋˜๋ฉด ๋ฐ˜๋ณต์„ ์ค‘๊ฐ„์— ๋ฉˆ์ถœ ์ˆ˜ ์žˆ๋Š” ์กฐ๊ธฐ ์ค‘๋‹จ ๊ธฐ๋Šฅ์ด ์žˆ๋‹ค. 
  6. ๊ท ํ˜•ํŠธ๋ฆฌ๋ถ„ํ•  : ์ตœ๋Œ€ํ•œ ๊ท ํ˜•์žกํžŒ ํŠธ๋ฆฌ๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ๋ถ„ํ• ํ•œ๋‹ค. (ํŠธ๋ฆฌ์˜ ๊นŠ์ด ์ตœ์†Œํ™”, ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€) 
  7. ๊ฒฐ์†๊ฐ’ ์ž์ฒด ์ฒ˜๋ฆฌ 

 

๐Ÿ“Œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ 

  • GBM ๊ณผ ์œ ์‚ฌํ•œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋™์ผํ•˜๊ฒŒ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ ์—ฌ๊ธฐ์— ์กฐ๊ธฐ์ค‘๋‹จ (early stopping) , ๊ณผ์ ํ•ฉ์„ ๊ทœ์ œํ•˜๊ธฐ ์œ„ํ•œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ๋“ฑ์ด ์ถ”๊ฐ€ ๋˜์—ˆ๋‹ค. 
  • XGBoost ์ž์ฒด์ ์œผ๋กœ ๊ต์ฐจ๊ฒ€์ฆ, ์„ฑ๋Šฅํ‰๊ฐ€, ํ”ผ์ฒ˜ ์ค‘์š”๋„ ๋“ฑ์˜ ์‹œ๊ฐํ™” ๊ธฐ๋Šฅ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. 

 

1๏ธโƒฃ ํŒŒ์ด์ฌ ๋ž˜ํผ XGBoost 

import xgboost

 

Parameter ์„ค๋ช…
eta GBM ์˜ ํ•™์Šต๋ฅ ๊ณผ ๊ฐ™์€ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ, 0์—์„œ 1 ์‚ฌ์ด์˜ ๊ฐ’์„ ์ง€์ •ํ•œ๋‹ค. ํŒŒ์ด์ฌ ๋ž˜ํผ ๊ธฐ๋ฐ˜์˜ ๋””ํดํŠธ๋Š” 0.3
num_boost_rounds GBM ์˜ n_estimators (์•ฝํ•œ ํ•™์Šต๊ธฐ์˜ ๊ฐœ์ˆ˜) ์™€ ๊ฐ™์€ ํŒŒ๋ผ๋ฏธํ„ฐ์ด๋‹ค. 
min_child_weight (๋””ํดํŠธ 1) ํŠธ๋ฆฌ์—์„œ ์ถ”๊ฐ€์ ์œผ๋กœ ๊ฐ€์ง€๋ฅผ ๋‚˜๋ˆŒ์ง€๋ฅผ ๊ฒฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•œ weight ์˜ ์ดํ•ฉ. ๊ฐ’์ด ํด์ˆ˜๋ก ๋ถ„ํ• ์„ ์ž์ œํ•œ๋‹ค. ๊ณผ์ ํ•ฉ์„ ์กฐ์ ˆํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•œ๋‹ค. 
gamma (๋””ํดํŠธ 0) ๋ฆฌํ”„๋…ธ๋“œ๋ฅผ ์ถ”๊ฐ€์ ์œผ๋กœ ๋‚˜๋ˆŒ์ง€๋ฅผ ๊ฒฐ์ •ํ•  ์ตœ์†Œ ์†์‹ค ๊ฐ์†Œ ๊ฐ’์ด๋‹ค. ํ•ด๋‹น ๊ฐ’๋ณด๋‹ค ํฐ ์†์‹ค์ด ๊ฐ์†Œ๋œ ๊ฒฝ์šฐ์— ๋ฆฌํ”„๋…ธ๋“œ๋ฅผ ๋ถ„๋ฆฌํ•œ๋‹ค. ๊ฐ’์ด ํด์ˆ˜๋ก ๊ณผ์ ํ•ฉ ๊ฐ์†Œ ํšจ๊ณผ๊ฐ€ ์žˆ๋‹ค. 
max_depth ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ max_depth ์™€ ๊ฐ™๋‹ค. 0์„ ์ง€์ •ํ•˜๋ฉด ๊นŠ์ด์— ์ œํ•œ์ด ์—†๋‹ค. ๋ณดํ†ต 3~10 ์‚ฌ์ด์˜ ๊ฐ’์„ ์ ์šฉํ•œ๋‹ค. 
sub_sample GBM ์˜ subsample ๊ณผ ๋™์ผํ•˜๋‹ค. ํŠธ๋ฆฌ๊ฐ€ ์ปค์ ธ์„œ ๊ณผ์ ํ•ฉ ๋˜๋Š” ๊ฒƒ์„ ์ œ์–ดํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ˜ํ”Œ๋งํ•˜๋Š” ๋น„์œจ์„ ์ง€์ •ํ•œ๋‹ค. 0.5๋กœ ์ง€์ •ํ•˜๋ฉด ์ „์ฒด ๋ฐ์ดํ„ฐ์˜ ์ ˆ๋ฐ˜์„ ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•œ๋‹ค. 0์—์„œ 1 ์‚ฌ์ด์˜ ๊ฐ’์ด ๊ฐ€๋Šฅํ•˜๋‚˜, ์ผ๋ฐ˜์ ์œผ๋กœ 0.5~1 ์‚ฌ์ด์˜ ๊ฐ’์„ ์‚ฌ์šฉํ•œ๋‹ค. 
lambda L2 ๊ทœ์ œ ์ ์šฉ๊ฐ’. ํ”ผ์ฒ˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„ ๊ฒฝ์šฐ ์ ์šฉ์„ ๊ฒ€ํ† ํ•˜๋ฉฐ, ๊ฐ’์ด ํด์ˆ˜๋ก ๊ณผ์ ํ•ฉ ๊ฐ์†Œ ํšจ๊ณผ๊ฐ€ ์žˆ๋‹ค. 
alpha L1 ๊ทœ์ œ ์ ์šฉ๊ฐ’. ํ”ผ์ฒ˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„ ๊ฒฝ์šฐ ์ ์šฉ์„ ๊ฒ€ํ† ํ•˜์—ฌ, ๊ฐ’์ด ํด์ˆ˜๋ก ๊ณผ์ ํ•ฉ ๊ฐ์†Œ ํšจ๊ณผ๊ฐ€ ์žˆ๋‹ค. 

๐Ÿ‘‰ ๊ณผ์ ํ•ฉ ๋ฌธ์ œ ๋ฐฉ์ง€ํ•˜๊ธฐ 

  • Lower eta to around 0.01~0.1. 
  • Increase n_estimators (num_boost_round in the Python wrapper). 
  • Lower max_depth. 
  • Raise min_child_weight. 
  • Raise gamma. 
  • Adjust subsample and colsample_bytree so trees do not grow too complex, preventing overfitting (a parameter sketch follows this list). 

โญ ์ž์ฒด์ ์œผ๋กœ ๊ต์ฐจ ๊ฒ€์ฆ, ์„ฑ๋Šฅ ํ‰๊ฐ€, ํ”ผ์ฒ˜ ์ค‘์š”๋„ ๋“ฑ์˜ ์‹œ๊ฐํ™” ๊ธฐ๋Šฅ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. 

โญ ์กฐ๊ธฐ์ค‘๋‹จ ๊ธฐ๋Šฅ์ด ์žˆ์–ด์„œ ๋ถ€์ŠคํŒ… ๋ฐ˜๋ณต ํšŸ์ˆ˜์— ๋„๋‹ฌํ•˜์ง€ ์•Š์•„๋„ ์˜ˆ์ธก ์˜ค๋ฅ˜๊ฐ€ ๊ฐœ์„ ๋˜์ง€ ์•Š์œผ๋ฉด ์ค‘๋‹จํ•œ๋‹ค. 

 

 

2๏ธโƒฃ ์‚ฌ์ดํ‚ท๋Ÿฐ ๋ž˜ํผ XGBoost 

from xgboost import XGBClassifier

 

Parameter ์„ค๋ช…
learning_rate ํ•™์Šต๋ฅ 
subsample ํŠธ๋ฆฌ๊ฐ€ ์ปค์ ธ์„œ ๊ณผ์ ํ•ฉ ๋˜๋Š” ๊ฒƒ์„ ์ œ์–ดํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ˜ํ”Œ๋งํ•˜๋Š” ๋น„์œจ์„ ์ง€์ •ํ•œ๋‹ค. 0.5๋กœ ์ง€์ •ํ•˜๋ฉด ์ „์ฒด ๋ฐ์ดํ„ฐ์˜ ์ ˆ๋ฐ˜์„ ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•œ๋‹ค. 0์—์„œ 1 ์‚ฌ์ด์˜ ๊ฐ’์ด ๊ฐ€๋Šฅํ•˜๋‚˜, ์ผ๋ฐ˜์ ์œผ๋กœ 0.5~1 ์‚ฌ์ด์˜ ๊ฐ’์„ ์‚ฌ์šฉํ•œ๋‹ค. 
reg_lambda L2 ๊ทœ์ œ ์ ์šฉ๊ฐ’. ํ”ผ์ฒ˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„ ๊ฒฝ์šฐ ์ ์šฉ์„ ๊ฒ€ํ† ํ•˜๋ฉฐ, ๊ฐ’์ด ํด์ˆ˜๋ก ๊ณผ์ ํ•ฉ ๊ฐ์†Œ ํšจ๊ณผ๊ฐ€ ์žˆ๋‹ค. 
reg_alpha L1 ๊ทœ์ œ ์ ์šฉ๊ฐ’. ํ”ผ์ฒ˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„ ๊ฒฝ์šฐ ์ ์šฉ์„ ๊ฒ€ํ† ํ•˜์—ฌ, ๊ฐ’์ด ํด์ˆ˜๋ก ๊ณผ์ ํ•ฉ ๊ฐ์†Œ ํšจ๊ณผ๊ฐ€ ์žˆ๋‹ค. 

 

โœ” ์กฐ๊ธฐ์ค‘๋‹จ ๊ด€๋ จ ํŒŒ๋ผ๋ฏธํ„ฐ : early_sropping_rounds ๋กœ fit() ์˜ ์ธ์ž๋กœ ์ž…๋ ฅํ•˜๋ฉด ๋œ๋‹ค. 

โœ” eval_metric : ์กฐ๊ธฐ์ค‘๋‹จ์„ ์œ„ํ•œ ํ‰๊ฐ€์ง€ํ‘œ

โœ” eval_set : ์„ฑ๋Šฅ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•  ๋ฐ์ดํ„ฐ์…‹ 

 

๐Ÿ“Œ ์œ„์Šค์ฝ˜์‹  ์œ ๋ฐฉ์•” ์˜ˆ์ธก 

โœ” ์•…์„ฑ์ข…์–‘(malignant) , ์–‘์„ฑ์ข…์–‘(benign) ์ธ์ง€ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋ฌธ์ œ 

โœ” ํ”ผ์ฒ˜ : ์ข…์–‘์˜ ํฌ๊ธฐ, ๋ชจ์–‘ ๋“ฑ ๋‹ค์–‘ํ•œ ์†์„ฑ๊ฐ’์ด ์กด์žฌ 

 

 

1๏ธโƒฃ ํŒŒ์ด์ฌ ๋ž˜ํผ XGBoost 

import xgboost as xgb
from xgboost import plot_importance
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# 01. Load the data
dataset = load_breast_cancer()
X_features = dataset.data
y_label = dataset.target

cancer_df = pd.DataFrame(data=X_features, columns=dataset.feature_names)
cancer_df['target'] = y_label

# 02. Check the target distribution
print(cancer_df['target'].value_counts())
### 1    357
### 0    212

# 03. Create the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_features, y_label, test_size=0.2, random_state=156)

## The Python wrapper requires the train and test sets to be wrapped in DMatrix objects.
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

# 04. Set the hyperparameters and train

# hyperparameters are passed as a dictionary
params = {
    'max_depth': 3,                  # maximum tree depth of 3
    'eta': 0.1,                      # learning rate of 0.1
    'objective': 'binary:logistic',  # 0/1 binary classification, so the objective is binary logistic
    'eval_metric': 'logloss'         # evaluation metric for the loss function
}

num_rounds = 400  # 400 boosting rounds

wlist = [(dtrain, 'train'), (dtest, 'eval')]

# early stopping belongs in xgb.train, not in params:
# stop if the 'eval' logloss fails to improve for 100 consecutive rounds
xgb_model = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_rounds,
                      early_stopping_rounds=100, evals=wlist)

# the training log shows the logloss decreasing steadily

 

 

๐Ÿง logloss ๋ž€?

๋ชจ๋ธ ์ „์ฒด์˜ Log loss ๋ฅผ ๊ตฌํ•˜๋Š” ์‹

  • ๋‹ค์ค‘ ํด๋ž˜์Šค ๋ถ„๋ฅ˜ ๋ชจ๋ธ (target ์ด 3๊ฐœ ์ด์ƒ) ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. 
  • ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ ํ™•๋ฅ ๊ฐ’์„ ์ง์ ‘์ ์œผ๋กœ ๋ฐ˜์˜ํ•˜์—ฌ ํ‰๊ฐ€ํ•œ๋‹ค. ์ฆ‰, ์ •๋‹ต์„ ๋งž์ท„๋‹ค๊ณ  ํ•ด๋„, ์ •๋‹ต์„ ๋” ๋†’์€ ํ™•๋ฅ ๋กœ ์˜ˆ์ธกํ•  ์ˆ˜๋ก ๋” ์ข‹์€ ๋ชจ๋ธ์ด๋ผ๊ณ  ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ
  • logloss ๊ฐ’์ด ์ž‘์„์ˆ˜๋ก (= pij ํ™•๋ฅ ์ด 1์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก = 0์— ๊ฐ€๊นŒ์šด ๊ฐ’์ด๋ฏ€๋กœ) ์ข‹์€ ๋ชจ๋ธ์ด๋‹ค. 
    • ex. 100% ํ™•๋ฅ (ํ™•์‹ )์œผ๋กœ ๋‹ต์„ ๊ตฌํ•œ ๊ฒฝ์šฐ์˜ logloss ๋Š” -log(1.0) = 0 ์ด๊ณ , 60%์˜ ํ™•๋ฅ ์˜ ๊ฒฝ์šฐ์—๋Š” -log(0.6) = 0.51082 ์ด๋‹ค. 
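
A quick sanity check of the -log(0.6) example against sklearn's log_loss (the labels argument is needed here because only a single sample is passed):

import numpy as np
from sklearn.metrics import log_loss

print(-np.log(0.6))  # ~0.5108, matching the hand computation above
# one sample whose true class is 1, predicted probabilities [P(0), P(1)] = [0.4, 0.6]
print(log_loss([1], [[0.4, 0.6]], labels=[0, 1]))  # same ~0.5108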

https://velog.io/@skyepodium/logloss-%EC%95%8C%EC%95%84%EB%B3%B4%EA%B8%B0

 

โž• https://seoyoungh.github.io/machine-learning/ml-logloss/ 

 

 

# 05. ์˜ˆ์ธก 

pred_probs = xgb_model.predict(dtest) 
print('predict ๊ฒฐ๊ณผ 10๊ฐœ ํ‘œ์‹œ, ์˜ˆ์ธก ํ™•๋ฅ ๊ฐ’์œผ๋กœ ํ‘œ์‹œ๋จ')
print(np.round(pred_probs[:10],3))

## [0.934 0.003 0.91  0.094 0.993 1.    1.    0.999 0.997 0.   ]


preds = [1 if x > 0.5 else 0 for x in pred_probs]
# 0.5 ๋ณด๋‹ค ํฌ๋ฉด 1, ์•„๋‹ˆ๋ฉด 0์œผ๋กœ ๋ ˆ์ด๋ธ”๋ง 

print('์˜ˆ์ธก๊ฐ’ 10๊ฐœ๋งŒ ํ‘œ์‹œ : ', preds[:10])
## ์˜ˆ์ธก๊ฐ’ 10๊ฐœ๋งŒ ํ‘œ์‹œ :  [1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
# 06. ์„ฑ๋Šฅํ‰๊ฐ€์ง€ํ‘œ

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, preds)

# array([[35,  2],
#        [ 1, 76]])

from sklearn.metrics import classification_report

print(classification_report(y_test, preds))

# 07. ๋ณ€์ˆ˜ ์ค‘์š”๋„ ์ถœ๋ ฅ 
from xgboost import plot_importance 
import matplotlib.pyplot as plt
%matplotlib inline 

fig, ax = plt.subplots(figsize=(10,12)) 
plot_importance(xgb_model, ax = ax)

# ๋ณ€์ˆ˜ ์ค‘์š”๋„ ์ถœ๋ ฅ : f1 ์Šค์ฝ”์–ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ ํ”ผ์ฒ˜์˜ ์ค‘์š”๋„๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค.

 

โญ https://wikidocs.net/22881 : ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€(์ด์ง„๋ถ„๋ฅ˜), ์ธ๊ณต ์ง€๋Šฅ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ํ•˜๋Š” ๊ฒƒ์€ ๊ฒฐ๊ตญ ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ์— ์ ํ•ฉํ•œ ๊ฐ€์ค‘์น˜ w์™€ b๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

2๏ธโƒฃ ์‚ฌ์ดํ‚ท๋Ÿฐ ๋ž˜ํผ XGBoost 

# ์‚ฌ์ดํ‚ท๋Ÿฐ ๋ž˜ํผ XGBoost ํด๋ž˜์Šค์ธ XGBClassifier ์ž„ํฌํŠธ 
from xgboost import XGBClassifier 

xgb_wrapper = XGBClassifier(n_estimators=400, learning_rate = 0.1, max_depth = 3) 

xgb_wrapper.fit(X_train, y_train) 
w_preds = xgb_wrapper.predict(X_test) 

from sklearn.metrics import classification_report

print(classification_report(y_test, w_preds))

# ์กฐ๊ธฐ์ค‘๋‹จ ์‚ฌ์šฉํ•ด๋ณด๊ธฐ 

from xgboost import XGBClassifier 

xgb_wrapper = XGBClassifier(n_estimators=400, learning_rate = 0.1, max_depth = 3) 
evals = [(X_test, y_test)] # ์„ฑ๋Šฅํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•  ๋ฐ์ดํ„ฐ์…‹ 
xgb_wrapper.fit(X_train, y_train, early_stopping_rounds = 100, eval_metric = 'logloss', eval_set = evals, verbose = True) 
ws100_preds = xgb_wrapper.predict(X_test) 

# 211 ๋ฒˆ ๋ฐ˜๋ณต์‹œ logloss ๊ฐ€ 0.085593 ์ธ๋ฐ, 311๋ฒˆ ๋ฐ˜๋ณต์‹œ logloss๊ฐ€ 0.085948๋กœ ์ง€์ •๋œ 100๋ฒˆ์˜ ๋ฐ˜๋ณต๋™์•ˆ ์„ฑ๋Šฅํ‰๊ฐ€์ง€์ˆ˜๊ฐ€ ํ–ฅ์ƒ๋˜์ง€ ์•Š์•˜๊ธฐ์— ์กฐ๊ธฐ์ข…๋ฃŒ

print(classification_report(y_test, ws100_preds))

xgb_wrapper.fit(X_train, y_train, early_stopping_rounds = 10, eval_metric = 'logloss', eval_set = evals, verbose = True) 
ws10_preds = xgb_wrapper.predict(X_test) 
print(classification_report(y_test, ws10_preds))

# ๋ณ€์ˆ˜ ์ค‘์š”๋„ 
from xgboost import plot_importance
import matplotlib.pyplot as plt 
%matplotlib inline 

fig, ax = plt.subplots(figsize=(10,12)) 
plot_importance(xgb_wrapper, ax = ax)


๋Œ“๊ธ€