
๋ฐ์ดํ„ฐ๋งˆ์ด๋‹ Classification (decision tree)

by isdawell 2023. 4. 15.

 

1. Basic Concepts 


 

โ‘   Definition 

 

•  Classification task 

 

 

 

•  Given a collection of records (the training set), we find a model for the class attribute as a function of the values of the other attributes. Each record contains a set of attributes, one of which is the class.

 

 

•  Previously unseen records (test set) should be assigned a class as accurately as possible

โ†ช A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
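The train/test division above can be sketched in a few lines of plain Python (a minimal illustration with a made-up record list; in practice a library routine such as scikit-learn's `train_test_split` is typically used):

```python
import random

def split_train_test(records, test_ratio=0.3, seed=0):
    """Shuffle the records, then hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_ratio)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset of 10 records: build the model on `train`,
# then measure accuracy on the held-out `test` records.
train, test = split_train_test(range(10))
print(len(train), len(test))  # 7 3
```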

 

 

โ‘ก  Example 

 

•  Predicting tumor cells as benign or malignant (tumor classification)

•  Classifying credit card transactions as legitimate or fraudulent (fraud detection)

•  Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil 

•  Categorizing news stories as finance, weather, entertainment, sports, etc

 

โ‘ข  Techniques 

 

 

 

 

2. Decision Tree Induction 


 

โ‘   Example 

 

•  Every leaf node has a single, uniform class label 

 

 

 

•  There could be more than one tree that fits the same data

 

 

 

โ‘ก  Decision tree classification 

 

•  Training and test datasets 

 

 

 

How do we build the tree?

 

 

•  Tree Induction: Greedy strategy 

   โ†ช Split the records based on an attribute test that optimizes certain criterion (gini index, gain ratio...)

   โ†ช ๋ฐ์ดํ„ฐ๋ฅผ split ํ•˜๋Š” ๋ฐฉ๋ฒ• 

       (1) attribute ์— ๋Œ€ํ•ด condition ์„ ์–ด๋–ป๊ฒŒ ๊ตฌ์ฒดํ™” ํ• ๊นŒ (test ์กฐ๊ฑด์„ ์–ด๋–ป๊ฒŒ ์„ค์ •ํ• ๊นŒ) 

       (2) best split ์„ ์–ด๋–ป๊ฒŒ ๊ฒฐ์ •ํ• ๊นŒ, ์–ธ์ œ split ์„ ๋ฉˆ์ถœ๊นŒ 

 

 

โ‘ข  How to Specify Test Condition 

 

•  How to Specify the Test Condition

   โ†ช (1) Depends on the attribute type : nominal, ordinal, continuous 

   โ†ช (2) Depends on the number of ways to split : 2-way split, multi-way split 

 

 

A. Splitting based on nominal attributes 

   โ†ช Multi-way split : use as many partitions as distinct values 

 

   โ†ช Binary split: Divide values into two subsets; need to find optimal partitioning 
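Finding the optimal binary partition of a nominal attribute means searching over all 2^(k-1) - 1 unordered two-subset splits of its k distinct values. A small sketch that enumerates them, using a hypothetical CarType-style attribute:

```python
from itertools import combinations

def binary_partitions(values):
    """Enumerate all ways to split k distinct nominal values into two
    non-empty, unordered subsets: 2^(k-1) - 1 candidates in total."""
    values = list(values)
    seen, out = set(), []
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            key = frozenset([left, right])  # unordered: {L,R} == {R,L}
            if key not in seen:
                seen.add(key)
                out.append((left, right))
    return out

# 3 distinct values -> 2^(3-1) - 1 = 3 candidate binary splits
print(len(binary_partitions(["Family", "Sports", "Luxury"])))  # 3
```

Each candidate partition would then be scored with the chosen impurity criterion to pick the best one.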

 

 

B. Splitting based on Ordinal attributes 

   โ†ช Multi-way split : Use as many partitions as distinct values

 

   โ†ช Binary split: Divide values into two subsets; need to find optimal partitioning

 

Ordinal attributes ์—์„œ binary split ์„ ํ•  ๋•Œ, ์•„๋ž˜์™€ ๊ฐ™์ด ๋‚˜๋ˆ„๋Š” ๊ฒƒ์€ ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค. {Small, Large} ๋Š” consecutive ํ•œ ๊ฐ’์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ฐ branch ์˜ child node ๋Š” ์˜๋ฏธ์žˆ๋Š” combination ์ด์—ฌ์•ผ ํ•œ๋‹ค. 

 

 

C. Splitting based on Continuous attributes 

 

Binary split ์˜ ๊ฒฝ์šฐ์—๋Š”, ๊ฐ€์žฅ ๋ถ„๋ฅ˜๋ฅผ ์ž˜ ํ•  ์ˆ˜ ์žˆ๋Š” best cut point ๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๊ณ , Multiway split ์˜ ๊ฒฝ์šฐ์—๋Š” discretization (์ด์‚ฐํ™”, ๊ตฌ๊ฐ„ํ™”) ๋ฅผ ์ž˜ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•œ๋‹ค. 

 

 

 

โ‘ฃ  How to Determine the Best Split

 

•  How to Determine the Best Split

 

 

   โ†ช Which test condition is best? 



   โ†ช When using a greedy approach, it is best to grow the tree toward nodes with a homogeneous class distribution. 

   โ†ช Non-Homogeneous  =  high degree of impurity 

   โ†ช Homogeneous  = low degree of impurity

 

 

•  Example 

 

 

   โ†ช M0 : node impurity before the split, M12 : node impurity after splitting on attribute A, M34 : node impurity after splitting on attribute B 

   โ†ช Gain = M0 - M12   vs.   M0 - M34  โ‡จ choose A if (M0 - M12) is larger; choose B if (M0 - M34) is larger 
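The gain comparison above can be sketched with GINI as the impurity measure (the class counts below are hypothetical, standing in for the figure):

```python
def gini_from_counts(counts):
    """GINI = 1 - sum_j p_j^2, computed from class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_impurity(children):
    """Record-weighted impurity of the child nodes after a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini_from_counts(c) for c in children)

# Hypothetical parent node: 6 records of class C0, 6 of class C1
m0 = gini_from_counts([6, 6])                  # 0.5

# Split on A -> children [4,2] and [2,4]; split on B -> [6,2] and [0,4]
m12 = weighted_impurity([[4, 2], [2, 4]])
m34 = weighted_impurity([[6, 2], [0, 4]])

gain_a, gain_b = m0 - m12, m0 - m34
print(gain_a < gain_b)  # True: B yields the larger gain, so B is chosen
```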

 

 

 

 

 

 

3. GINI Index (CART) 


 

โ‘   Measure of Impurity  : GINI

 

•  Computing the GINI index at a given node t : GINI(t) = 1 - ฮฃj [P(j | t)]² 

 

โ‡จ  P(j | t) = relative frequency of class j at node t 

โ‡จ  The smaller the GINI value, the better. 

โ‡จ  Maximum GINI = 1 - 1/nc, where nc = the number of classes : attained when the records are equally distributed across all classes, i.e., P(j | t) = 1/nc. This is the least interesting state (the worst case: no information is gained). 

โ‡จ  Minimum GINI = 0.0 : all records belong to a single class. This is the most interesting state (the case where the most information is gained). 
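The formula above translates directly into a small Python sketch, including the two extreme cases:

```python
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j P(j|t)^2, where P(j|t) is the relative
    frequency of class j among the labels at node t."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["C1"] * 6))               # 0.0  (minimum: one class only)
print(gini(["C1"] * 3 + ["C2"] * 3))  # 0.5  (maximum for nc=2: 1 - 1/2)
```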

 

 

•  Example 

 

 

 

•  How to split based on the GINI index 

 

โ†ช Used in CART (Classification and Regression Tree), SLIQ, SPRINT

โ†ช When splitting node p into k partitions (children), the quality of the split is measured as follows. 

 

 

โ†ช Let ni be the number of records in child i and n the number of records at node p. Compute the GINI index for each child node and take the weighted sum : GINI_split = ฮฃi (ni/n) · GINI(i). 
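The weighted sum above, sketched in Python for a hypothetical two-way split of 10 records with a Yes/No class label:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(i): the record-weighted
    GINI over the child partitions of a node."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini(child) for child in children)

# Hypothetical split: left child gets 5 Yes + 1 No, right gets 1 Yes + 3 No
left  = ["Yes"] * 5 + ["No"] * 1
right = ["Yes"] * 1 + ["No"] * 3
print(round(gini_split([left, right]), 3))  # 0.317
```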

 

 

 

โ‘ก  Attributes 

 

•  Binary 

โ†ช Splits the data into two partitions 

โ†ช Larger and purer partitions are sought for

 

 

 

•  Categorical 

โ†ช For each distinct value, we gather counts for each class in the dataset and use the count matrix to make decisions

 

 

 

 

 

•  Continuous 

โ†ช Several choices for the splitting value v

 

โ†ช Number of possible splitting values = number of distinct values 

 

 

โ†ช ์ตœ์„ ์˜ v ๋ฅผ ์„ ํƒํ•˜๋Š” ๋ฐฉ๋ฒ• : ๊ฐ v ๊ฐ’์— ๋Œ€ํ•ด ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ์Šค์บ”ํ•œ๋‹ค. count matrix ๋ฅผ ์„ค๊ณ„ํ•˜๊ณ  ์ง€๋‹ˆ๊ณ„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค → ๊ทธ๋Ÿฌ๋‚˜ ์ผ์ผํžˆ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ๊ณ„์‚ฐ์ ์œผ๋กœ ๋น„ํšจ์œจ์ ์ด๋‹ค. 

 

 

 

โ†ช ํšจ์œจ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ• : ๊ฐ attribute ์— ๋Œ€ํ•ด, value ๋ฅผ ์ •๋ ฌํ•˜๊ณ , ์„ ํ˜•์ ์œผ๋กœ ์ด value ๋“ค์„ ์Šค์บ”ํ•˜์—ฌ, ๊ฐ time ๋งˆ๋‹ค count matrix ์™€ ์ง€๋‹ˆ๊ณ„์ˆ˜๋ฅผ ์—…๋ฐ์ดํŠธํ•œ๋‹ค. ๊ฐ€์žฅ ์ตœ์†Œ์˜ ์ง€๋‹ˆ๊ณ„์ˆ˜ ๊ฐ’์„ ๊ฐ€์ง€๋Š” ์œ„์น˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ split ์„ ํ•œ๋‹ค. 

 

์˜ˆ๋ฅผ๋“ค์–ด ์ฒ˜์Œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ Taxable income ๋ณ€์ˆ˜๋Š” continous ๋ณ€์ˆ˜์— ํ•ด๋‹นํ•˜๋ฏ€๋กœ ์œ„์˜ ๋ฐฉ๋ฒ•์ฒ˜๋Ÿผ best split position ์„ ์ฐพ์•„๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ผ๋‹จ taxable income ๋ณ€์ˆ˜ ๊ฐ’๋“ค์„ ์ •๋ ฌํ•œ ํ›„์—, ๊ฐ ๊ฐ’๋“ค์˜ ์ค‘๊ฐ„ ๊ฐ’์„ split position ์œผ๋กœ ์žก๋Š”๋‹ค. ํ•ด๋‹น ์ค‘๊ฐ„ ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ์ดˆ๊ณผ, ์ดํ•˜๋กœ split ์„ ์žก์€ ํ›„์—, ๊ฐ split position ์„ ๊ธฐ์ค€์œผ๋กœ ์ง€๋‹ˆ๊ณ„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. cheat ๋ณ€์ˆ˜์˜ ๋ ˆ์ด๋ธ”์„ ๊ธฐ์ค€์œผ๋กœ Yes, No ์— ํ• ๋‹นํ•˜๋Š” ๊ฒฝ์šฐ์˜ ์ˆ˜๋ฅผ ์‰ฝ๊ฒŒ ๊ณ„์‚ฐํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. (์ง์ ‘ ํ•ด๋ณด๊ธฐ) 

 

 

 

โ‘ข  Drawback of the Gini Index 

 

•  ์ง€๋‹ˆ๊ณ„์ˆ˜์˜ ํ•œ๊ณ„์  

 

A. Bias towards equally-sized subsets

 

 

•  Because the algorithm works greedily, it picks the best split at the current step without considering later steps. So in the figure above, x6 is finally chosen for the left-hand split because it partitions the data in a more balanced (equally-sized) way. However, x1 leads to more homogeneous partitions in the subsequent splits, so the final impurity is actually better with x1. 

 

•  pure offspring 

 

์ด 7๊ฐœ์˜ class ๊ฐ€ ์กด์žฌํ•  ๋•Œ, class1 ๊ณผ class2 ์— ์˜ํ•ด ๋ฐ์ดํ„ฐ์…‹์ด ๋‚˜๋ˆ„์–ด์ง„๋‹ค๊ณ  ํ•œ๋‹ค๋ฉด 

 

- option 1 : class1 ๊ณผ 2๋กœ ๊ฐ€๋Šฅํ•œ ๋นจ๋ฆฌ ๋ถ„๋ฆฌ โ‡จ favored 

- option 2 : (class1,2) ์™€ ๋‹ค๋ฅธ class ๋“ค๋กœ ๋ถ„๋ฆฌ  โ‡จ could be better : ์ตœ์ข…์ ์ธ ๊ฒฐ๊ณผ์—์„œ๋Š” ๋” ์ข‹๊ฒ ์ง€๋งŒ, class1 ๊ณผ 2๊ฐ€ ์„ž์—ฌ ์žˆ๋Š” ๊ฒƒ์€ pure ํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ์„ ํ˜ธํ•˜์ง„ ์•Š์Œ 

 

 

 

 

โ‘ฃ  Algorithm CART 

 

1. ๊ฐ attribute์˜ best split ์„ ์ฐพ๋Š”๋‹ค : ๋…ธ๋“œ๊ฐ€ ๋ถ„๋ฆฌ๋  ๋•Œ ์ง€๋‹ˆ๊ณ„์ˆ˜๋ฅผ ๋‚ฎ์ถ”๋Š” point ๊ฐ€ best split point ์ด๋‹ค. 

 

2.  node ์˜ best split ์„ ์ฐพ๋Š”๋‹ค : step1 ์—์„œ ์ฐพ์€ best split ๊ฐ’๋“ค ์ค‘์— ์ง€๋‹ˆ ๊ณ„์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋‚ฎ์€ ๊ฒƒ์„ ์„ ํƒํ•œ๋‹ค. 

 

3. stopping rule ์ด ๋งŒ์กฑ๋˜์ง€ ์•Š์„ ๋•Œ๊นŒ์ง€ step2 ๋ฅผ ๋ฐ˜๋ณตํ•œ๋‹ค. 
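The three steps can be sketched as a tiny recursive CART-style builder (a simplified, hypothetical implementation: numeric attributes only, binary ≤-threshold splits, with node purity and a depth cap as the stopping rule):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Steps 1-2: for every attribute, try each threshold and keep the
    split with the lowest weighted GINI. Returns (gini, attr, thresh)."""
    best, n = None, len(rows)
    for a in range(len(rows[0])):
        for t in sorted({r[a] for r in rows}):
            l = [y for r, y in zip(rows, labels) if r[a] <= t]
            r_ = [y for r, y in zip(rows, labels) if r[a] > t]
            if not l or not r_:
                continue
            w = len(l) / n * gini(l) + len(r_) / n * gini(r_)
            if best is None or w < best[0]:
                best = (w, a, t)
    return best

def build(rows, labels, depth=0, max_depth=3):
    """Step 3: recurse until a stopping rule (pure node / depth) holds."""
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    found = best_split(rows, labels)
    if found is None:
        return Counter(labels).most_common(1)[0][0]
    _, a, t = found
    li = [i for i, r in enumerate(rows) if r[a] <= t]
    ri = [i for i, r in enumerate(rows) if r[a] > t]
    return (a, t,
            build([rows[i] for i in li], [labels[i] for i in li], depth + 1, max_depth),
            build([rows[i] for i in ri], [labels[i] for i in ri], depth + 1, max_depth))

def predict(tree, row):
    while isinstance(tree, tuple):
        a, t, left, right = tree
        tree = left if row[a] <= t else right
    return tree

# Tiny hypothetical dataset: (refund, income) -> cheat label
X = [(1, 125), (0, 100), (0, 70), (1, 120), (0, 95), (0, 60)]
y = ["No", "No", "No", "No", "Yes", "No"]
tree = build(X, y)
print(predict(tree, (0, 95)))  # Yes
```

In practice the same procedure is available off the shelf as scikit-learn's `DecisionTreeClassifier(criterion="gini")`.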

 

 

 

 

 

 
