
[The Brave and True] 10. Matching

by isdawell 2023. 7. 11.

 

 

 

👀 This post is for my personal study of causal inference. Please see the attached link for the source!

 

 

 


 

 

 

📜 Summary

 

•   Regression: partitions the data into cells, computes the ATE within each cell, and then combines the cell-level ATEs into a single ATE for the whole dataset

•   Matching estimator

 

 

 

 

 

 

①  What is Regression Doing After All?


 

◯   Regression analysis

 

•  Regression lets us control for additional variables when comparing the treatment group with the control group. In other words, by controlling for X we can identify the ATE: (Y0, Y1) ⊥ T | X  ⇨ the conditional independence assumption

 

•  Regression and matching differ only in whether a functional form is assumed: regression imposes one (for example, linearity), whereas matching does not.
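As a rough illustration of what "controlling for X" buys us, here is a minimal sketch on simulated data. Everything in it is an assumption made for the example (the variable names `x`, `t`, `y`, the true effect of 2.0, and the confounder coefficient of 3.0); it is not taken from the book's datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: x is a confounder that drives both treatment and outcome.
x = rng.normal(size=n)
t = (x + rng.normal(size=n) > 0).astype(float)   # units with high x are treated more often
y = 2.0 * t + 3.0 * x + rng.normal(size=n)       # true treatment effect is 2.0

# Naive comparison of means: biased because treated units have higher x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Regression of y on [1, t, x]: the coefficient on t is the effect holding x fixed.
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = coef[1]

print(f"naive difference: {naive:.2f}, regression-adjusted: {adjusted:.2f}")
```

The regression only recovers the effect here because the simulated outcome really is linear in `x`; that is exactly the functional-form assumption matching avoids.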

 

 

 

 

 

②  The Subclassification Estimator


 

 

•  When a confounder X makes the causal effect of interest hard to estimate, the treatment and control groups should be compared within subgroups in which the confounder is held fixed. If the conditional independence assumption holds, the ATE can be computed as follows.

 

 

↪  Suppose X takes K distinct cells {X1, X2, ... , Xk}. We compute the treatment effect within each cell and combine the cell-level effects into the ATE, weighting each cell by its share of the sample.
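The procedure described above corresponds to the standard subclassification formula, ATE = Σ_k (Ȳ_k1 − Ȳ_k0) · N_k / N, where Ȳ_k1 and Ȳ_k0 are the mean outcomes of treated and untreated units in cell k and N_k is the cell size. A minimal sketch on a made-up toy table (the column names `cell`, `t`, `y` and all values are hypothetical) might look like this:

```python
import pandas as pd

# Hypothetical toy data: two cells of the confounder, binary treatment t, outcome y.
df = pd.DataFrame({
    "cell": ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"],
    "t":    [1,   1,   0,   0,   1,   1,   0,   0,   0,   0],
    "y":    [10., 12., 8.,  8.,  5.,  7.,  4.,  5.,  5.,  6.],
})

# Mean outcome per (cell, treatment), reshaped so columns are t=0 and t=1.
cell_means = df.groupby(["cell", "t"])["y"].mean().unstack("t")

# Treatment effect inside each cell: treated mean minus untreated mean.
cell_effect = cell_means[1] - cell_means[0]

# Weight of each cell: its share of the full sample (N_k / N).
cell_weight = df.groupby("cell").size() / len(df)

# Weighted combination of the cell-level effects.
ate = (cell_effect * cell_weight).sum()
print(cell_effect.to_dict(), ate)
```

Cell "a" has an effect of 3 with weight 0.4 and cell "b" an effect of 1 with weight 0.6, so the combined estimate is 1.8.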

 

 

 

 

 

 

 

③  Matching Estimator


 

◯   Example 1

 

•  ex. Estimating the effect of a training program on earnings

 

 

...

 

•  A simple comparison of means suggests that trainees earn less than people who did not take the training.

 

trainee.query("trainees==1")["earnings"].mean() - trainee.query("trainees==0")["earnings"].mean()

# -4297.49373433584

 

 

•  However, the table shows that trainees are much younger than non-trainees. To account for this, we match on age. For example, unit 1 and unit 27, who are both 28 years old, can be matched.

 

•  When more than one unit matches, we can pick one of them at random.

 

 

# make dataset where no one has the same age
unique_on_age = (trainee
                 .query("trainees==0")
                 .drop_duplicates("age"))

matches = (trainee
           .query("trainees==1")
           .merge(unique_on_age, on="age", how="left", suffixes=("_t_1", "_t_0"))
           .assign(t1_minuts_t0 = lambda d: d["earnings_t_1"] - d["earnings_t_0"]))

matches.head(7)

๊ฐ™์€ age ๋ฅผ ๊ฐ€์ง„ unit ๋ผ๋ฆฌ matching

 

 

•   Taking the mean of the last column (t1_minuts_t0, treated earnings minus matched control earnings) gives an estimate of the ATET while controlling for age.

 

matches["t1_minuts_t0"].mean()

# 2457.8947368421054

 

 

•   In practice, however, matching usually involves more than one variable, and units rarely agree exactly on every value. We therefore need a measure of how close units are to each other, and Euclidean distance is the usual choice. Because Euclidean distance is sensitive to units of measurement, the variables must first be rescaled to roughly the same scale.
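To see why scaling matters, consider the following sketch. The units, the two features (age in years, income in dollars), and the helper `nearest` are all hypothetical and only serve to illustrate the point: without scaling, the feature with the largest numeric range decides the match.

```python
import numpy as np

# Hypothetical units described by (age in years, income in dollars).
treated = np.array([30.0, 50_000.0])
controls = np.array([
    [31.0, 58_000.0],   # almost the same age, somewhat higher income
    [60.0, 50_500.0],   # twice the age, almost the same income
    [45.0, 20_000.0],
    [25.0, 90_000.0],
    [50.0, 70_000.0],
])

def nearest(query, pool):
    """Return the row index in pool with the smallest Euclidean distance to query."""
    dist = np.sqrt(((pool - query) ** 2).sum(axis=1))
    return int(dist.argmin())

# On the raw scale, income differences swamp age differences.
raw_match = nearest(treated, controls)

# Standardise each feature with the mean and std of all units, then match again.
all_units = np.vstack([treated, controls])
mean, std = all_units.mean(axis=0), all_units.std(axis=0)
scaled_match = nearest((treated - mean) / std, (controls - mean) / std)

print(raw_match, scaled_match)
```

On the raw scale the 60-year-old (index 1) is picked because a 500-dollar gap looks smaller than an 8,000-dollar gap; after standardisation the 31-year-old (index 0) becomes the nearest neighbour.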

 

 

 

◯   Example 2

 

•  ex. Measuring the effect of a medication by how many days patients take to recover

 

import numpy as np
import pandas as pd
med = pd.read_csv("./data/medicine_impact_recovery.csv")
med.head()

 

 

•  The simple difference in means is E(Y|T=1) - E(Y|T=0); computing it gives the result that patients who took the medication need on average 16.9 more days to recover. This (counterintuitive) result is presumably caused by confounding.

 

med.query("medication==1")["recovery"].mean() - med.query("medication==0")["recovery"].mean()

# 16.895799546498726

 

•  To adjust for this bias, we control for X. First, we scale the features.

 

# scale features
X = ["severity", "age", "sex"]
y = "recovery"

med = med.assign(**{f: (med[f] - med[f].mean())/med[f].std() for f in X})
med.head()

 

 

 

•   We do the matching with the KNN algorithm. mt0 is fitted on the untreated observations and mt1 on the treated observations; with n_neighbors=1, predict returns the outcome of the single nearest neighbour in the group the model was fitted on.

 

 

from sklearn.neighbors import KNeighborsRegressor

treated = med.query("medication==1")
untreated = med.query("medication==0")

mt0 = KNeighborsRegressor(n_neighbors=1).fit(untreated[X], untreated[y])
mt1 = KNeighborsRegressor(n_neighbors=1).fit(treated[X], treated[y])

predicted = pd.concat([
    # find matches for the treated looking at the untreated knn model
    treated.assign(match=mt0.predict(treated[X])),
    
    # find matches for the untreated looking at the treated knn model
    untreated.assign(match=mt1.predict(untreated[X]))
])

predicted.head()

 

 

•   With the matches in hand, the ATE can be computed.

 

np.mean((2*predicted["medication"] - 1)*(predicted["recovery"] - predicted["match"]))

# -0.9954

 

Matching shows that, controlling for X, the medication shortens recovery time by about 1 day on average.
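The expression `(2*medication - 1) * (recovery - match)` used above is a compact way of flipping the sign for untreated units: for a treated unit the effect is its own outcome minus the matched outcome, for an untreated unit it is the matched outcome minus its own. A tiny sketch with made-up numbers (the arrays `t`, `y`, `match` are hypothetical) makes this explicit:

```python
import numpy as np

# Hypothetical units: treatment flag, observed outcome, outcome of the matched
# unit from the opposite group.
t = np.array([1, 1, 0, 0])
y = np.array([10.0, 12.0, 9.0, 7.0])
match = np.array([8.0, 9.0, 11.0, 10.0])

# 2*t - 1 maps treated units to +1 and untreated units to -1.
sign = 2 * t - 1

# Treated: y - match. Untreated: match - y. Both are "treated minus untreated".
unit_effects = sign * (y - match)   # [2., 3., 2., 3.]

ate = unit_effects.mean()           # 2.5
print(unit_effects, ate)
```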

 

 

 

 

 

 

④ Matching bias


 

◯   Bias correction with OLS

 

•   The matching estimator of the ATET is biased. (Proof omitted.)

•   The bias arises when the matching discrepancy, the gap between a unit's covariates and those of its match, is large. To reduce it, a bias-correction term is subtracted from each matched difference. The term is built from μ^0(x), an estimate of E[Y|X,T=0], for example a linear regression fitted on the untreated sample.

 

 

•   OLS plays only a secondary role in this estimator: matching does the main work and the regression only corrects for the remaining discrepancy.
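Written out, the bias-corrected ATET estimator described above takes the following form, where j(i) denotes the untreated unit matched to treated unit i and N_T is the number of treated units. The notation follows the text; treat the exact typesetting as my reconstruction rather than a quotation.

```latex
\widehat{ATET} = \frac{1}{N_T} \sum_{i:\, T_i = 1}
  \Big[ \big( Y_i - Y_{j(i)} \big)
      - \big( \hat{\mu}_0(X_i) - \hat{\mu}_0(X_{j(i)}) \big) \Big]
```

When the match is exact, X_i = X_{j(i)}, the correction term vanishes and the estimator reduces to the plain matched difference.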

 

from sklearn.linear_model import LinearRegression

# fit the linear regression model to estimate mu_0(x)
ols0 = LinearRegression().fit(untreated[X], untreated[y])
ols1 = LinearRegression().fit(treated[X], treated[y])

# find the units that match to the treated
treated_match_index = mt0.kneighbors(treated[X], n_neighbors=1)[1].ravel()

# find the units that match to the untreated
untreated_match_index = mt1.kneighbors(untreated[X], n_neighbors=1)[1].ravel()

predicted = pd.concat([
    (treated
     # find the Y match on the other group
     .assign(match=mt0.predict(treated[X])) 
     
     # build the bias correction term
     .assign(bias_correct=ols0.predict(treated[X]) - ols0.predict(untreated.iloc[treated_match_index][X]))),
    (untreated
     .assign(match=mt1.predict(untreated[X]))
     .assign(bias_correct=ols1.predict(untreated[X]) - ols1.predict(treated.iloc[untreated_match_index][X])))
])

predicted.head()

 

 

 

•   Matching is also a non-parametric estimator. It assumes neither linearity nor any other kind of parametric model. It is therefore more flexible than linear regression and can work in situations where non-linearity is very strong.

 

 

np.mean((2*predicted["medication"] - 1)*((predicted["recovery"] - predicted["match"])-predicted["bias_correct"]))


## -7.36266090614141

 

 

 

 

◯   Estimation with CausalModel

 

from causalinference import CausalModel

cm = CausalModel(
    Y=med["recovery"].values, 
    D=med["medication"].values, 
    X=med[["severity", "age", "sex"]].values
)

cm.est_via_matching(matches=1, bias_adj=True)

print(cm.estimates)

 

⇨  We can say with confidence that the medication does in fact shorten patients' hospital stays.

 

 

 

 

 

 

⑤ The Curse of Dimensionality


 

•  Bias arises when matched observations are not similar. The more variables there are, the greater the distance between an observation and its match tends to be. This is the curse of dimensionality.

 

•  Linear regression copes with this problem well, because in effect it projects all the variables X onto the single dimension Y and then compares treated and control units on that projection.
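The growth in matching distance can be checked with a small simulation. This is only a sketch under assumed conditions: features are independent standard normals, there are 200 units per group, and the helper `mean_nn_distance` is a name invented for this example.

```python
import numpy as np

def mean_nn_distance(n_dims, n_units=200, seed=0):
    """Average Euclidean distance from each 'treated' point to its nearest 'control' point."""
    rng = np.random.default_rng(seed)
    treated = rng.normal(size=(n_units, n_dims))
    control = rng.normal(size=(n_units, n_dims))
    # Pairwise distances between every treated and every control unit.
    diffs = treated[:, None, :] - control[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Distance to the closest control for each treated unit, averaged.
    return dists.min(axis=1).mean()

# Same sample size, increasing number of matching variables.
distances = {d: mean_nn_distance(d) for d in (1, 3, 10, 30)}
for d, dist in distances.items():
    print(f"{d:>2} features: mean nearest-neighbour distance = {dist:.3f}")
```

With the sample size held fixed, the nearest match gets steadily farther away as features are added, which is the matching discrepancy that drives the bias.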

 

 

 

 
