
[04. Classification] LightGBM, Stacking Ensemble, CatBoost

by isdawell 2022. 3. 20.

07. LightGBM 


📌 Overview

💡 LightGBM

  • Prediction performance is similar to XGBoost, but training takes far less time, and it offers a wide range of additional features.
  • Automatically handles categorical features (no one-hot encoding required) and finds optimal splits.
  • Uses leaf-wise tree growth rather than level-wise (balanced) tree growth.
  • However, on small datasets (roughly 10,000 rows or fewer) it is prone to overfitting.

 

  • Leaf-wise tree growth: instead of keeping the tree balanced, it repeatedly splits the leaf node with the maximum loss. Over repeated training iterations, this can ultimately reduce the prediction error loss more than level-wise (balanced) growth.

 

 

๐Ÿ“Œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ 

  • LightGBM's parameters are very similar to XGBoost's, but note that because leaf nodes keep splitting and trees grow deep, parameters such as max_depth need extra care.
  • The parameters below are for the Python wrapper LightGBM.

  • num_iterations: default 100. The number of trees to build over the boosting iterations. Larger values can raise performance, but beware of overfitting. (= n_estimators)
  • learning_rate: default 0.1. A value between 0 and 1; the learning (update) rate applied at each boosting iteration.
  • max_depth: default -1. A value below 0 means no depth limit. Unlike the depth-wise methods introduced so far, LightGBM is leaf-wise, so its trees are relatively deeper.
  • boosting: gbdt builds standard gradient-boosted decision trees; rf builds trees random-forest style.
  • num_leaves: default 31. The maximum number of leaves a single tree may have; the main parameter controlling model complexity. Raising it improves accuracy, but trees grow deeper and more complex, increasing the impact of overfitting.

 

📌 Hands-on: Wisconsin breast cancer prediction

from lightgbm import LGBMClassifier

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
ftr = dataset.data
target = dataset.target

X_train,X_test,y_train,y_test = train_test_split(ftr, target, test_size=0.2, random_state=156)

lgbm_wrapper = LGBMClassifier(n_estimators = 400)

evals = [(X_test, y_test)] 

# in recent LightGBM versions, early stopping is passed via callbacks
from lightgbm import early_stopping
lgbm_wrapper.fit(X_train, y_train, eval_metric='logloss',
                 eval_set=evals, callbacks=[early_stopping(stopping_rounds=100)])

preds = lgbm_wrapper.predict(X_test)
from sklearn.metrics import classification_report

print(classification_report(y_test, preds))

from lightgbm import plot_importance 
import matplotlib.pyplot as plt 

fig, ax = plt.subplots(figsize=(10,12)) 
plot_importance(lgbm_wrapper, ax = ax)

 

 

 

10. Stacking Ensemble


📌 Overview

  • A method that takes the predictions of individual algorithms and turns them into a final "meta dataset", on which a separate ML algorithm (called the meta model) performs the final training, then makes the final prediction on the test data.
  • It shares with bagging and boosting the idea of combining several individual algorithms to produce a prediction, but the key difference is that it predicts again on top of data that the individual algorithms have already predicted.

 

 

 

 

 

📌 Hands-on: basic stacking model

* Load the Wisconsin breast cancer example data

import numpy as np

# algorithms to use in the stacking model

from sklearn.neighbors import KNeighborsClassifier # KNN
from sklearn.ensemble import RandomForestClassifier # random forest
from sklearn.ensemble import AdaBoostClassifier # boosting
from sklearn.tree import DecisionTreeClassifier # decision tree
from sklearn.linear_model import LogisticRegression # logistic regression (meta model)


# load the Wisconsin breast cancer example data
# use accuracy as the metric
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


cancer_data = load_breast_cancer() 

X_data = cancer_data.data 
y_label = cancer_data.target 

X_train, X_test, y_train, y_test = train_test_split(X_data, y_label, test_size=0.2)

 

๐Ÿ‘€ ๊ธฐ๋ฐ˜๋ชจ๋ธ๊ณผ ๋ฉ”ํƒ€ ๋ชจ๋ธ ๊ฐ์ฒด ์ƒ์„ฑ , ํ•™์Šต , ์˜ˆ์ธก 

 

# create the individual ML model objects (base models)
knn_clf = KNeighborsClassifier(n_neighbors=4) # n_neighbors : number of neighboring samples considered when classifying
rf_clf = RandomForestClassifier(n_estimators=100, random_state=30)
dt_clf = DecisionTreeClassifier()
ada_clf = AdaBoostClassifier(n_estimators=100)

# meta model (trains on and predicts from the stacked data)
lr_final = LogisticRegression(C=10) # C : regularization parameter; a larger value means weaker regularization (a more complex fit)


# ๊ฐœ๋ณ„ ๋ชจ๋ธ ํ•™์Šต 

knn_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)
dt_clf.fit(X_train, y_train)
ada_clf.fit(X_train, y_train)

# base-model predictions on the test set and their accuracies

knn_pred = knn_clf.predict(X_test)
rf_pred = rf_clf.predict(X_test)
dt_pred = dt_clf.predict(X_test)
ada_pred = ada_clf.predict(X_test)

print('KNN accuracy :', accuracy_score(y_test, knn_pred))
print('RF accuracy :', accuracy_score(y_test, rf_pred))
print('DT accuracy :', accuracy_score(y_test, dt_pred))
print('AdaBoost accuracy :', accuracy_score(y_test, ada_pred))

 

👀 Stack the base models' predictions

 

# stack the base models' predictions

stacked_pred = np.array([knn_pred, rf_pred, dt_pred, ada_pred])

# use transpose to swap rows and columns
# so each model's predictions become a feature column
stacked_pred = np.transpose(stacked_pred)
stacked_pred.shape # 4 : number of models, 114 : number of samples

# (114, 4)

 

๐Ÿ‘€ ๋ฉ”ํƒ€๋ชจ๋ธ์˜ ํ•™์Šต , ์ตœ์ข… ์ •ํ™•๋„ ๋„์ถœ 

 

# ๋ฉ”ํƒ€๋ชจ๋ธ์€ ๊ธฐ๋ฐ˜๋ชจ๋ธ์˜ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šต 

lr_final.fit(stacked_pred, y_test)  
final_pred = lr_final.predict(stacked_pred) 

print('์ตœ์ข… ๋ฉ”ํƒ€ ๋ชจ๋ธ์˜ ์ •ํ™•๋„ : ', accuracy_score(y_test, final_pred))

# ์ตœ์ข… ๋ฉ”ํƒ€ ๋ชจ๋ธ์˜ ์ •ํ™•๋„ :  0.9912280701754386

 

📌 CV-set-based stacking

 

(Reference)

 

[Python] Machine Learning Complete Guide - 04. Classification [Stacking Ensemble] (romg2.github.io)

  • In the basic stacking model above, the logistic regression meta model was ultimately trained on y_test (the test labels), so overfitting can occur. To mitigate this, when building the dataset for the final meta model, we instead use prediction results produced via cross validation.

 

โญ ๊ฐœ๋ณ„ ๋ชจ๋ธ์ด ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ†ตํ•ด์„œ ๋ฉ”ํƒ€ ๋ชจ๋ธ์— ์‚ฌ์šฉ๋˜๋Š” ํ•™์Šต, ํ…Œ์ŠคํŠธ์šฉ ์Šคํƒœํ‚น ๋ฐ์ดํ„ฐ์…‹์„ ์ƒ์„ฑํ•˜์—ฌ ์ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฉ”ํƒ€ ๋ชจ๋ธ์ด ํ•™์Šต๊ณผ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ์‹! 

 

1. Split the training set into N folds (assume N = 3).

2. Use 2 folds as training data and 1 fold as validation data.

3. Train each base model on the 2 folds, predict the validation fold, and store the results.

4. Repeat step 3 three times (rotating the validation fold); then produce a final test-set result as the average of the predictions on the test set.

5. Train the meta model on the final prediction results from step 4 and perform the final prediction.

 

 

👉 The key point is that both a train meta dataset and a test meta dataset are created!

 

 

 

* Wisconsin breast cancer example data

 

👀 Create the meta data

 

from sklearn.model_selection import KFold

# per-model meta-data generation
def get_stacking_base_datasets(model, X_train, y_train, X_test, n_folds):
    # create the KFold splitter
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    
    # base arrays for the meta data to return
    train_cnt = X_train.shape[0]
    test_cnt = X_test.shape[0]
    train_meta = np.zeros((train_cnt, 1))
    test_meta = np.zeros((test_cnt, n_folds))
    
    print(model.__class__.__name__, ' model started')
    
    # split the train data into folds and train / predict
    for i, (train_fold_idx, test_fold_idx) in enumerate(kf.split(X_train)):
        # create the train / validation folds
        print(f'\t fold set: {i+1} started')
        x_train_fold = X_train[train_fold_idx]
        y_train_fold = y_train[train_fold_idx]
        x_test_fold = X_train[test_fold_idx]
        
        # train on the train fold
        model.fit(x_train_fold, y_train_fold)
        
        # build the train meta data (predict x_test_fold)
        train_meta[test_fold_idx, :] = model.predict(x_test_fold).reshape(-1, 1)
        
        # build the test meta data (predict X_test) - before averaging
        test_meta[:, i] = model.predict(X_test)
            
    # build the test meta data - average across folds
    test_meta_mean = np.mean(test_meta, axis=1).reshape(-1, 1)
    
    # return the train / test meta data
    return train_meta, test_meta_mean

 

 

👀 Training

 

knn_train, knn_test = get_stacking_base_datasets(knn_clf, X_train, y_train, X_test, 7) # KNN
rf_train, rf_test = get_stacking_base_datasets(rf_clf, X_train, y_train, X_test, 7) # random forest
dt_train, dt_test = get_stacking_base_datasets(dt_clf, X_train, y_train, X_test, 7) # decision tree
ada_train, ada_test = get_stacking_base_datasets(ada_clf, X_train, y_train, X_test, 7) # AdaBoost

 

 

๐Ÿ‘€ ๋ชจ๋ธ๋ณ„ ํ•™์Šต, ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ํ•ฉ์น˜๊ธฐ 

 

Stack_final_X_train = np.concatenate((knn_train, rf_train, dt_train, ada_train), axis=1) 
Stack_final_X_test = np.concatenate((knn_test, rf_test, dt_test, ada_test), axis=1) 
print('์›๋ณธ ํ•™์Šต ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ shape', X_train.shape, '์›๋ณธ ํ…Œ์ŠคํŠธ ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ shape', X_test.shape) 
print('์Šคํƒœํ‚น ํ•™์Šต ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ shape:', Stack_final_X_train.shape, '์Šคํƒœํ‚น ํ…Œ์ŠคํŠธ ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ shape : ', Stack_final_X_test.shape)

 

👀 Final prediction

 

lr_final.fit(Stack_final_X_train, y_train) 
stack_final = lr_final.predict(Stack_final_X_test) 

print('Final meta-model prediction accuracy : ', accuracy_score(y_test, stack_final))

# Final meta-model prediction accuracy :  0.9736842105263158

 

 

 

CatBoost


📌 Overview

 

💡 The core idea of boosting = residual boosting

  • Compute the residuals, the differences between the actual values and their mean 👉 build a model that learns the residuals 👉 predict 👉 multiply the prediction by the learning rate and update the estimate (repeat).
  • This has the advantage of driving bias down, but by the bias-variance trade-off it tends toward high variance, i.e. overfitting is likely.
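The residual-boosting loop above can be sketched directly. As an assumption of this sketch, a depth-1 regression tree (a stump, via scikit-learn) stands in for the weak learner; the data is a synthetic sine curve, not from any real example in this post:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic 1-D regression data
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# start from the mean, then repeatedly fit a weak learner to the
# residuals and add a learning-rate-scaled fraction of its prediction
lr = 0.3
pred = np.full_like(y, y.mean())
for _ in range(50):
    residual = y - pred                        # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += lr * stump.predict(X)              # update toward the target

# training MSE after boosting (well below the variance of y)
print(round(float(np.mean((y - pred) ** 2)), 4))
```

Each round shrinks the residual a little, which is exactly why the bias keeps falling while variance (overfitting risk) grows with the number of rounds.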

 

💡 CatBoost

  • Based on the gradient boosting algorithm over decision trees.
  • Widely used for tasks such as search, recommender systems, and weather forecasting.
  • Performs well when most of the dataset consists of categorical variables.
  • Prevents overfitting caused by target leakage.

 

➕ Target leakage : https://m.blog.naver.com/hongjg3229/221811766581

 

 

Recap. GBM: an ensemble learning method that sequentially trains and predicts with multiple weak learners while updating weights via gradient descent.

 

CatBoost builds its trees with a "symmetric", balanced tree-split scheme.

  • The symmetric tree structure reduces prediction time and keeps the model fast.

 

 

💡 Characteristics

  • An algorithm designed to improve on GBM's weaknesses: overfitting, slow training, and performance that swings with the hyperparameters.
  • Categorical variables can be fed to the model as-is, with no one-hot, label, or other encoding work.
  • Training is slow on data with many numerical variables; it is best suited to data with many categorical variables.
  • Level-wise tree: all features form an identical, symmetric tree structure 👉 reduces prediction time.
  • Ordered boosting: where conventional boosting models compute residuals over all training data at once, CatBoost computes them on only a subset first, builds a model, and then uses that model's predictions as the residuals for the remaining data.

 

  • Random Permutation: if the data order were never shuffled during ordered boosting, residuals would be predicted in the same order every time; to account for this, the data are shuffled before sampling 👉 trees are built from multiple perspectives, preventing overfitting.
  • Ordered Target Encoding: when converting a categorical variable to a numeric one, an artificial time-series (sequential) idea is applied. For example, to encode the Cloudy value for Friday, the mean of the target values of the earlier Tuesday and Wednesday Cloudy rows is used: (15+14)/2 = 14.5. Likewise, Cloudy on Sunday becomes (15+14+20)/3 ≈ 16.3. In other words, past data are used to encode the present data, which prevents data leakage.

 

To prevent data leakage, Ordered Target Encoding is used instead of plain target encoding.

 

 

  • Categorical Feature Combination: handles categorical variables that carry duplicate information; two features with the same information are merged into one, easing the feature-selection burden during preprocessing.
  • One-Hot Encoding: categorical variables whose number of levels falls below a certain threshold are one-hot encoded instead, i.e. this is the scheme applied to low-cardinality categorical variables whose values repeat heavily.
  • Optimized Parameter Tuning: performance does not swing strongly with the hyperparameters.

 

 

  • CatBoost requires little attention to hyperparameter tuning. The usual reason for tuning boosting models is to manage tree diversity and overfitting, and CatBoost handles those through its internal algorithms. If you do tune, learning_rate, random_strength, and the L2 regularizer are about all that matter.

 

💡 Limitations

  • It cannot handle sparse matrices (i.e. it is a poor fit for datasets with very many missing values).
  • When most of the data is numerical, training is slower than LightGBM.

 

์ฐธ๊ณ ์ž๋ฃŒ 

โž• https://dailyheumsi.tistory.com/136

โž• https://techblog-history-younghunjo1.tistory.com/199

โž• https://hyewon328.tistory.com/entry/CatBoost-CatBoost-%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98%EC%97%90-%EB%8C%80%ED%95%9C-%EC%9D%B4%ED%95%B4

โž• https://www.kaggle.com/code/prashant111/catboost-classifier-in-python/notebook 

 

 

