
[The Brave and True] 12. Doubly Robust Estimation

by isdawell 2023. 7. 14.
👀 This is a personal study post on causal inference. Please see the attached link for the original source!
📜 Summary

•   Doubly robust estimator = linear regression + propensity score
•   Even if one of the two models is misspecified, we can still obtain a reasonable estimate.

โ‘   Introduction 


 

โ—ฏ   Doubly Robust Estimation 

 

•   To estimate E[Y|T=1, X] - E[Y|T=0, X], we previously learned linear regression and propensity score weighting.

•   Doubly Robust Estimation is a method that combines the two.
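Written out, this combined estimator is the standard doubly robust (AIPW) form of the ATE, which matches the code used later in this post:

```latex
\widehat{ATE} =
\frac{1}{N}\sum_{i=1}^{N}\left[\frac{T_i\,(Y_i-\hat{\mu}_1(X_i))}{\hat{P}(X_i)}+\hat{\mu}_1(X_i)\right]
-\frac{1}{N}\sum_{i=1}^{N}\left[\frac{(1-T_i)\,(Y_i-\hat{\mu}_0(X_i))}{1-\hat{P}(X_i)}+\hat{\mu}_0(X_i)\right]
```

Here P^(x) is the estimated propensity score (e.g. from logistic regression) and μ0^(x), μ1^(x) are outcome regressions fit on the control and treated groups.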

 

 

 

◯   Example

 

•   Same example as in chapter 11.

•   Before the analysis, convert the categorical variables to dummy variables.

 

 

import pandas as pd

# `data` is the learning mindset dataset already loaded in chapter 11
categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=False)  # categorical features converted to dummies
], axis=1)

 

 

 

 

 

 

โ‘ก  Doubly robust estimation 


 

โ—ฏ   ATE

 

[figure: doubly_robust ATE estimator formula]

 

•   Estimation code ⭐

 

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def doubly_robust(df, X, T, Y):
    # propensity score model: P(T=1|X)
    ps = LogisticRegression(C=1e6, max_iter=1000).fit(df[X], df[T]).predict_proba(df[X])[:, 1]
    # outcome models fitted separately on control and treated units
    mu0 = LinearRegression().fit(df.query(f"{T}==0")[X], df.query(f"{T}==0")[Y]).predict(df[X])
    mu1 = LinearRegression().fit(df.query(f"{T}==1")[X], df.query(f"{T}==1")[Y]).predict(df[X])
    return (
        np.mean(df[T]*(df[Y] - mu1)/ps + mu1) -
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    )

T = 'intervention'
Y = 'achievement_score'
X = data_with_categ.columns.drop(['schoolid', T, Y])

doubly_robust(data_with_categ, X, T, Y)


## 0.38822197203025405

 

↪  The doubly robust estimator says that students who attended the seminar are expected to score 0.388 higher in achievement than students who could not attend. To build a confidence interval, we can apply the bootstrap.

 

 

from joblib import Parallel, delayed # for parallel processing

np.random.seed(88)

# run 1000 bootstrap samples

bootstrap_sample = 1000
ates = Parallel(n_jobs=4)(delayed(doubly_robust)(data_with_categ.sample(frac=1, replace=True), X, T, Y)
                          for _ in range(bootstrap_sample))
ates = np.array(ates)


print("ATE 95% CI:", (np.percentile(ates, 2.5), np.percentile(ates, 97.5)))

## ATE 95% CI: (0.35365379802081925, 0.41978432347111305)

 

 

 

 

◯   Why the doubly robust estimator is good

 

•   It is called doubly robust because the estimator works well as long as either the propensity score estimate P^(x) or the regression estimate μ^(x) is well specified; only one of the two needs to be right.

 

 

•   If μ1^(x) is estimated correctly, the residuals Yi - μ1^(Xi) average to zero, so the 1/P^(x)-weighted correction term drops out and that part of the estimator reduces to the mean of μ1^(x); the propensity score model then no longer matters.
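A quick numerical check of this point (a sketch on simulated data, not from the book): with an intercept, OLS forces its in-sample residuals to sum to exactly zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data; the exact data-generating process doesn't matter here.
rng = np.random.RandomState(1)
x = rng.normal(size=(500, 1))
y = 3 + 2 * x[:, 0] + rng.normal(size=500)

# OLS with an intercept makes the in-sample residuals sum to zero, so the
# 1/P(x)-weighted correction term in the DR formula vanishes on average
# when mu_hat is fit on the same units it corrects for.
residuals = y - LinearRegression().fit(x, y).predict(x)
print(residuals.mean())  # ~0 up to floating-point error
```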

 

 

 

•   As in the code below, even if we undermine the propensity score model by replacing the logistic regression with uniform random numbers between 0.1 and 0.9, the doubly robust estimator still produces a similar estimate. Checking the variance of the estimate with the bootstrap shows it is only slightly larger.

 

 

import numpy as np
from sklearn.linear_model import LinearRegression

def doubly_robust_wrong_ps(df, X, T, Y):
    # wrong PS model: random noise instead of a fitted propensity score
    np.random.seed(654)
    ps = np.random.uniform(0.1, 0.9, df.shape[0])  # 🟡
    mu0 = LinearRegression().fit(df.query(f"{T}==0")[X], df.query(f"{T}==0")[Y]).predict(df[X])
    mu1 = LinearRegression().fit(df.query(f"{T}==1")[X], df.query(f"{T}==1")[Y]).predict(df[X])
    return (
        np.mean(df[T]*(df[Y] - mu1)/ps + mu1) -
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    )


doubly_robust_wrong_ps(data_with_categ, X, T, Y)  # 0.3796984428841887

 

 

 

 

•   Even if μ1^(x) is set up wrong (in the code below, the regression models are replaced with standard normal random numbers), the doubly robust estimate still lands near 0.39, and its variance is only slightly higher (checked by bootstrap).

 

import numpy as np
from sklearn.linear_model import LogisticRegression

def doubly_robust_wrong_model(df, X, T, Y):
    np.random.seed(654)
    ps = LogisticRegression(C=1e6, max_iter=1000).fit(df[X], df[T]).predict_proba(df[X])[:, 1]

    # wrong mu(x) model: random noise instead of fitted regressions
    mu0 = np.random.normal(0, 1, df.shape[0])
    mu1 = np.random.normal(0, 1, df.shape[0])
    return (
        np.mean(df[T]*(df[Y] - mu1)/ps + mu1) -
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    )


doubly_robust_wrong_model(data_with_categ, X, T, Y)  # 0.3981405305433191
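To see the doubly robust property against a known ground truth, here is a small simulation (hypothetical data, not the mindset dataset): the true ATE is 1.0 by construction, the outcome models are correctly specified, and the propensity score is pure noise, yet the estimate stays close to the truth.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
n = 10_000
x = np.random.normal(size=n)
# confounding: treatment probability depends on x
t = (np.random.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(int)
# true treatment effect is exactly 1.0
y = 1.0 * t + 2.0 * x + np.random.normal(size=n)

X = x.reshape(-1, 1)
mu0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)
mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)
ps = np.random.uniform(0.1, 0.9, n)  # deliberately wrong propensity score

ate = (
    np.mean(t * (y - mu1) / ps + mu1)
    - np.mean((1 - t) * (y - mu0) / (1 - ps) + mu0)
)
print(ate)  # close to the true ATE of 1.0 despite the broken PS model
```

The symmetric experiment (correct PS, noise for μ0, μ1) gives the same conclusion, mirroring the two wrong-model checks on the real data above.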

 

 

 

 

 

 

 

 

 

 
