๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
1๏ธโƒฃ AI•DS/๐ŸฅŽ Casual inference

์ธ๊ณผ์ถ”๋ก ์˜ ๋ฐ์ดํ„ฐ ๊ณผํ•™ - ์ธ๊ณผ์ถ”๋ก  ๊ด€์ ์—์„œ์˜ ํšŒ๊ท€๋ถ„์„

by isdawell 2023. 4. 21.
728x90

 

 

์ฐธ๊ณ ์˜์ƒ : Bootcamp 2-3. ์ธ๊ณผ์ถ”๋ก  ๊ด€์ ์—์„œ์˜ ํšŒ๊ท€๋ถ„์„ 

 

 

 

1. Casual Hierarchy 


 

 

•  ์–ด๋–ค ์ข…๋ฅ˜์˜ selection bias๋ฅผ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š”์ง€์— ๋Œ€ํ•œ ๊ธฐ์ค€ 

•  Selection on Observables strategies : ๊ด€์ฐฐ ๊ฐ€๋Šฅํ•œ ๋ณ€์ˆ˜๋“ค์— ์˜ํ•ด์„œ๋งŒ treatment์™€ control ์ด ์„ ํƒ๋œ๋‹ค๋Š” ๊ฐ€์ •์„ ๊ฐ€์ง€๊ณ  ๊ด€์ฐฐ ๊ฐ€๋Šฅํ•œ ๋ณ€์ˆ˜๋“ค๋งŒ์„ ๊ฐ€์ง€๊ณ  selection bias๋ฅผ ์„ค๋ช…ํ•˜๋ ค๋Š” ๊ฒฝํ–ฅ 

•  Selection on Unobservables strategies : ๊ด€์ฐฐ ๊ฐ€๋Šฅํ•˜์ง€ ์•Š์€ ๊ต๋ž€ ์š”์ธ๋“ค๋„ ์ ์ ˆํ•œ ์‹คํ—˜๋””์ž์ธ์„ ํ†ตํ•ด ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•˜๋Š” ์ „๋žต โ‡จ ์ข€ ๋” powerful ํ•œ ์ „๋žต 

 

 

 

 

 

2.  How to balance between treatment and control groups 


 

โ—ฏ  Regression adjustment 

 

 

•  ํ†ต์ œ๋ณ€์ˆ˜์˜ ํ™œ์šฉ์„ ํ†ตํ•ด selection bias ๋ฅผ ๋ชจ๋‘ ์„ค๋ช…ํ•˜๊ณ ์ž ํ•˜๋Š” ๋ฐฉ๋ฒ• 

•  ๊ฐ€์žฅ ๊ฐ„๋‹จํ•˜๊ณ  ์œ ์—ฐํ•œ ๋ชจ๋ธ์ด๋ผ, ์ธ๊ณผ์ถ”๋ก ์— ํ˜ผ๋™์„ ์ค„ ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์ด๊ธฐ๋„ ํ•˜๋‹ค. 

 

 

 

โ—ฏ  Matching 

 

 

•  treatment group ๊ณผ control group ์ด ์„œ๋กœ ๋น„๊ต ๊ฐ€๋Šฅํ•  ์ˆ˜ ์žˆ๋„๋ก ๊ด€์ฐฐ ๊ฐ€๋Šฅํ•œ ๋ณ€์ˆ˜๋“ค์˜ ๊ฐ’์ด ์„œ๋กœ ์œ ์‚ฌํ•œ ๋ฐ์ดํ„ฐ๋“ค๋ผ๋ฆฌ ๋งค์นญํ•˜๋Š” ์ „๋žต 

 

 

 

โ—ฏ  Weighting 

 

 

•  treatment ๋ฅผ ๋ฐ›์„ ํ™•๋ฅ ์˜ ์—ญ์ˆ˜๋งŒํผ์„ ๊ฐ ๋ฐ์ดํ„ฐ์— ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•ด, ๊ฒฐ๊ณผ์ ์œผ๋กœ random assignment ๊ฐ€ ๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๋ฐฉ๋ฒ• 

 

 

 

 

 

 

3.  Regression from the perspective of potential outcomes 


 

โ—ฏ  ์ „ํ†ต์ ์ธ ๋ฐฉ์‹์˜ ํšŒ๊ท€๋ถ„์„ 

 

 

•  ์ข…์†๋ณ€์ˆ˜๋ฅผ ์„ค๋ช…ํ•˜๋Š” ๋ชจ๋“  ๋…๋ฆฝ๋ณ€์ˆ˜๋ฅผ ์„ค๋ช…ํ•จ์œผ๋กœ์จ , ์ข…์†๋ณ€์ˆ˜๋ฅผ ์˜จ์ „ํžˆ ์„ค๋ช…ํ•˜๋Š” true model์„ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ๋ชฉ์  : R-squared 

•  EX. 4๊ฐ€์ง€์˜ ๋…๋ฆฝ๋ณ€์ˆ˜๋“ค์ด Studnet achievement ๋ฅผ ์ž˜ ์„ค๋ช…ํ•˜๊ณ  ์žˆ๋‹ค. 

 

 

โ—ฏ  ์ธ๊ณผ์ถ”๋ก  ๊ด€์ ์—์„œ์˜ ํ˜„๋Œ€ ํšŒ๊ท€๋ถ„์„ 

 

 

•  ์ธ๊ณผ์ถ”๋ก  ๊ด€์ ์—์„œ์˜ ํšŒ๊ท€๋ถ„์„์˜ ์—ญํ• ์€, selection bias ๋ฅผ ์•ผ๊ธฐํ•˜๋Š” confounding factor ๋ฅผ ํ†ต์ œํ•˜๊ณ  ํ•˜๋Š” ๊ฒƒ : control variable ํ†ต์ œ๋ณ€์ˆ˜ 

 

•  R-squared ๋ณด๋‹จ, ํ†ต์ œ๋ณ€์ˆ˜๊ฐ€ selection bias ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ž˜ ํ†ต์ œํ•˜๋Š๋ƒ๊ฐ€ ์ค‘์š”

 

 

 

โ—ฏ  ์ธ๊ณผ์ถ”๋ก  ๊ด€์ ์œผ๋กœ ํšŒ๊ท€๋ถ„์„์„ ์žฌ์ •์˜ํ•ด๋ณด๊ธฐ 

 

 

•  ํŽธ์˜์ƒ treatment variable ์ธ X๋ฅผ bianry variable ๋กœ ๋ฐ”๊ฟ”๋ณด์ž. True causal coefficient ๋ฅผ β๋ผ ํ•˜์ž. 

•  α : ์ „์ฒด ์ƒ˜ํ”Œ์— ๋Œ€ํ•œ ํ‰๊ท  

 

 

•  Potential outcome ๊ด€์ ์—์„œ Regression model ์„ ๋‹ค์‹œ ์จ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

•  ε1i ์™€ ε0i ์˜ ์ฐจ์ด๊ฐ€ ํ•จ๊ป˜ ์ถ”์ •๋จ → β ๋ฅผ ์ œ์™ธํ•œ ์ฐจ์ด (treatment ์—ฌ๋ถ€๋ฅผ ์ œ์™ธํ•œ ์ฐจ์ด) ์ด๋ฏ€๋กœ (ε1i -  ε0i) ๋Š” selection bias ๋ฅผ ๋œปํ•œ๋‹ค. 

 

 

 

•  ATE ๋ฅผ ์ •๋ฆฌํ•˜๋ฉด ์šฐ๋ฆฌ๊ฐ€ ๊ตฌํ•˜๊ณ ์ž ํ•˜๋Š” True causal effect ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, selection bias ๊ฐ€ ๊ทธ๋Œ€๋กœ ๊ปด์„œ ๋‚˜์˜จ๋‹ค. selection bias = 0 ์ด๋ฉด ๋น„๋กœ์†Œ causal effect ๋ผ๊ณ  ๋ถ€๋ฅผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

•  selection bias ๋ฅผ ๋ชจ๋‘ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š” ๋ณ€์ˆ˜ (control variable) Ci ๊ฐ€ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ด๋ณด์ž (Selection on Observables strategies). ์ด Ci ์™€ selection bias ๊ฐ€ ์„ ํ˜• ๊ด€๊ณ„์‹์œผ๋กœ ์ •์˜๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฐ€์ •์„ ์ถ”๊ฐ€ํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

•  ei ์™€ X=0 ์ผ ๋•Œ์˜ control variable ์˜ ๊ฐ’ C0i, X=1 ์ผ ๋•Œ์˜ control variable ์˜ ๊ฐ’ C1i 

 

 

•  Control variable ์„ ํฌํ•จํ•ด์„œ ๋‹ค์‹œ ์‹์„ ์žฌ์ •์˜ํ•˜๋ฉด, treatment variable ์ธ X ์˜ coefficient ์— ๋” ์ด์ƒ selection bias ๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์Œ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

•  ๊ทธ๋Ÿฌ๋‚˜ ์—ฌ์ „ํžˆ ATE ๋Š” ์˜จ์ „ํ•œ β ๊ฐ€ ๋˜์ง€ ๋ชปํ•˜๊ณ , Control variable ์—์„œ์˜ ์ฐจ์ด๊นŒ์ง€ ๊ณ„์‚ฐ๋œ๋‹ค. ๋งŒ์•ฝ control variable ์ด conditioning ํ•œ๋‹ค๋ฉด (๊ณ ์ •ํ•œ๋‹ค๋ฉด), potential outcome ์˜ ์ฐจ์ด๊ฐ€ ์‚ฌ๋ผ์งˆ ์ˆ˜ ์žˆ๋‹ค. Conditional ATE ๊ฐ€ ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” True casual effect ๊ฐ€ ๋˜๋Š” ๊ฒƒ์ด๋‹ค โ‡จ Regression ์—์„œ selection bias ๋ฅผ ์ œ๊ฑฐํ•˜๋Š” ๋ฐฉ๋ฒ• 

 

 

โ—ฏ [์ •๋ฆฌ]  Control variable ๋กœ selection bias ๋ฅผ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•œ 2๊ฐ€์ง€ ๊ฐ€์ • 

 

1. selection bias ๋ฅผ ๋ชจ๋‘ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š” control variable ์„ ์•Œ๊ณ  ์žˆ์–ด์•ผ ํ•œ๋‹ค. 

2. ๊ทธ ๊ด€๊ณ„๊ฐ€ ์–ด๋–ค functional form ์„ ๊ฐ€์ง€๊ณ  ์žˆ์–ด์•ผ ํ•œ๋‹ค. (์œ„์—์„œ๋Š” linear function ์„ ๊ฐ€์ •ํ•จ) 

 

โ‡จ Control variable ์„ ๋ช…์‹œ์ ์œผ๋กœ ํšŒ๊ท€๋ถ„์„์—์„œ ํ†ต์ œํ•จ์œผ๋กœ์จ ํšŒ๊ท€๋ถ„์„์„ ํ†ตํ•ด ์ธ๊ณผ๊ด€๊ณ„๋ฅผ ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

 

โ—ฏ Conditional Independence 

 

 

•  Identification assumption : ์ธ๊ณผ์ถ”๋ก ์ด ๊ฐ€๋Šฅํ•œ ๊ฐ€์ •์„ Identification ๊ฐ€์ •์ด๋ผ ๋ถ€๋ฅธ๋‹ค. 

•  Conditional independence : control variable C ๊ฐ€ conditioning ๋˜์–ด์žˆ๋Š” ์ƒํƒœ์—์„œ ์›์ธ ๋ณ€์ˆ˜์ธ X ์—ฌ๋ถ€์— ์ƒ๊ด€์—†์ด error term ์ธ e์˜ ํ‰๊ท ๊ฐ’์ด ๋™์ผํ•ด์•ผ ํ•œ๋‹ค. ์ฆ‰, control variable ์ด conditioning ๋˜์–ด์žˆ๋Š” ์ƒํƒœ์—์„œ ์›์ธ๋ณ€์ˆ˜์ธ X์™€ error term ๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ์—†์–ด์•ผ ํ•œ๋‹ค. control variable ์€ selection bias ๋ฅผ ์„ค๋ช…ํ•˜๋ฉด์„œ conditional independence ๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•œ๋‹ค. 

 

•  R-squared ๊ฐ€ ๋†’๊ณ  ๋‚ฎ์Œ์€ ์ธ๊ณผ์ถ”๋ก ๊ณผ ์ƒ๊ด€์—†๋‹ค. X์˜ ์ธ๊ณผ์ ์ธ ํšจ๊ณผ์— ๊ด€์‹ฌ์ด ์žˆ๊ณ , Control variable ์˜ ์—ญํ• ์€ R-square ๋ฅผ ๋†’์ด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, Control variable  ์ด treatment ๊ฐ€ 1์ผ๋•Œ์™€ 0์ผ๋•Œ์˜ ์ฐจ์ด์ธ selection bias ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ž˜ ์„ค๋ช…ํ•˜๋Š”์ง€ ์—ฌ๋ถ€๊ฐ€ ์ธ๊ณผ์ถ”๋ก ์ด ๊ฐ€๋Šฅํ•œ์ง€์— ๋Œ€ํ•œ ์—ฌ๋ถ€๊ฐ€ ๋œ๋‹ค. 

 

 

 

โ—ฏ [์ •๋ฆฌ] ์ธ๊ณผ์ถ”๋ก ์„ ์œ„ํ•ด ํšŒ๊ท€๋ถ„์„์—์„œ ๊ธฐ์–ตํ•ด์•ผ ํ•  3๊ฐ€์ง€

 

1. There should be a clear distinction between causes and controls : Regression ์— ์žˆ๋Š” ์šฐํ•ญ์— ์กด์žฌํ•˜๋Š” ๋ชจ๋“  ๋…๋ฆฝ๋ณ€์ˆ˜๋“ค์ด ๋™์ผํ•œ ์—ญํ• ์„ ํ•˜๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค. ์ธ๊ณผ๊ด€๊ณ„๋ฅผ ์ถ”์ •ํ•˜๊ณ ์ž ํ•˜๋Š” ์›์ธ๋ณ€์ˆ˜์™€ ๋‚˜๋จธ์ง€ ํ†ต์ œ ๋ณ€์ˆ˜์˜ ์—ญํ• ์„ ๋ช…ํ™•ํžˆ ๊ตฌ๋ถ„ํ•˜๋Š”๊ฒŒ ์ค‘์š”ํ•˜๋‹ค. 

 

2. The role of control variables is to account for the selection bias : ๊ตฌ๋ถ„ํ•˜๋Š” ๋ชฉ์ ์€ ํ†ต์ œ๋ณ€์ˆ˜์˜ ์—ญํ• ์— ๋Œ€ํ•ด ํŒ๋‹จํ•˜๊ธฐ ์œ„ํ•จ์ธ๋ฐ, ํ†ต์ œ๋ณ€์ˆ˜์˜ ์—ญํ• ์€ selection bias ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ž˜ ์„ค๋ช…ํ•˜๋Š๋ƒ์ด๋‹ค. ํ†ต์ œ๋ณ€์ˆ˜๋ฅผ conditioning ํ•œ ์ƒํƒœ์—์„œ, treatment ๊ทธ๋ฃน๊ณผ Control ๊ทธ๋ฃน ๊ฐ„์˜ ์ฐจ์ด๊ฐ€ ์—†์–ด์„œ  apples vs apples (Ceteris Paribus ๋ฅผ ๋œปํ•˜๋Š” ์šฉ์–ด) ๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ํ†ต์ œ๋ณ€์ˆ˜์˜ ์—ญํ• ์ด๋‹ค.

 

3. Don't interpret the coefficients of controls in a causal manner : Control variable ์— ๋Œ€ํ•ด์„œ๋Š” ์ธ๊ณผ์ ์ธ ํšจ๊ณผ๋กœ ํ•ด์„ํ•˜์ง€ ์•Š๋„๋ก ์ฃผ์˜ํ•ด์•ผ ํ•œ๋‹ค. control variable ์˜ ์—ญํ• ์€ ์›์ธ๋ณ€์ˆ˜์ธ X์— ๋Œ€ํ•ด์„œ conditional independence ๋ฅผ ๋งŒ์กฑ์‹œํ‚ค๊ธฐ ์œ„ํ•œ ๋ณด์กฐ์ ์ธ ๋„๊ตฌ์— ๋ถˆ๊ณผํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. 

 

 

 

 

 

4.  Regression is Analogous to Matching 


 

ํšŒ๊ท€๋ถ„์„์€ matching ๊ณผ ๋™์ผํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ํšŒ๊ท€๋ถ„์„์€ "automated matchmaker" ์ด๋‹ค. 

 

728x90

๋Œ“๊ธ€