
[The Brave and True] 12. Doubly Robust Estimation

by isdawell 2023. 7. 14.
👀 This is a personal study post on causal inference. Please see the attached link for the original source!
📜 Summary

•   Doubly robust estimator = linear regression + propensity score
•   Even if one of the two models is misspecified, we can still obtain a reasonable estimate.

โ‘   Introduction 


 

โ—ฏ   Doubly Robust Estimation 

 

•   To estimate E[Y|T=1, X] - E[Y|T=0, X], we previously learned linear regression and propensity score weighting.

•   Doubly Robust Estimation is a method that combines the two.
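Written out, this combined estimator is the standard doubly robust (AIPW) form of the ATE, which matches the code used later in this post:

```latex
\widehat{ATE} =
\frac{1}{N}\sum_{i=1}^{N}\left[\frac{T_i\,(Y_i-\hat{\mu}_1(X_i))}{\hat{P}(X_i)}+\hat{\mu}_1(X_i)\right]
-\frac{1}{N}\sum_{i=1}^{N}\left[\frac{(1-T_i)\,(Y_i-\hat{\mu}_0(X_i))}{1-\hat{P}(X_i)}+\hat{\mu}_0(X_i)\right]
```

Here P^(x) is the estimated propensity score (e.g. from logistic regression) and μ0^(x), μ1^(x) are outcome regressions fit on the control and treated groups.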

 

 

 

◯   Example

 

•   Same example as in chapter 11.

•   Before the analysis, convert the categorical variables to dummy variables.

 

 

import pandas as pd

# `data` is the learning mindset dataset already loaded in chapter 11
categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=False)  # categorical features converted to dummies
], axis=1)

 

 

 

 

 

 

โ‘ก  Doubly robust estimation 


 

โ—ฏ   ATE

 

[figure: doubly_robust ATE estimator formula]

 

•   Estimation code ⭐

 

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def doubly_robust(df, X, T, Y):
    # propensity score model: P(T=1|X)
    ps = LogisticRegression(C=1e6, max_iter=1000).fit(df[X], df[T]).predict_proba(df[X])[:, 1]
    # outcome models fitted separately on control and treated units
    mu0 = LinearRegression().fit(df.query(f"{T}==0")[X], df.query(f"{T}==0")[Y]).predict(df[X])
    mu1 = LinearRegression().fit(df.query(f"{T}==1")[X], df.query(f"{T}==1")[Y]).predict(df[X])
    return (
        np.mean(df[T]*(df[Y] - mu1)/ps + mu1) -
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    )

T = 'intervention'
Y = 'achievement_score'
X = data_with_categ.columns.drop(['schoolid', T, Y])

doubly_robust(data_with_categ, X, T, Y)


## 0.38822197203025405

 

↪  The doubly robust estimator says that students who attended the seminar are expected to score 0.388 higher in achievement than students who could not attend. To build a confidence interval, we can apply the bootstrap.

 

 

from joblib import Parallel, delayed # for parallel processing

np.random.seed(88)

# run 1000 bootstrap samples

bootstrap_sample = 1000
ates = Parallel(n_jobs=4)(delayed(doubly_robust)(data_with_categ.sample(frac=1, replace=True), X, T, Y)
                          for _ in range(bootstrap_sample))
ates = np.array(ates)


print("ATE 95% CI:", (np.percentile(ates, 2.5), np.percentile(ates, 97.5)))

## ATE 95% CI: (0.35365379802081925, 0.41978432347111305)

 

 

 

 

◯   Why the doubly robust estimator is good

 

•   It is called doubly robust because the estimator works well as long as either the propensity score estimate P^(x) or the regression estimate μ^(x) is well specified; only one of the two needs to be right.

 

 

•   If μ1^(x) is estimated correctly, the residuals Yi - μ1^(Xi) average to zero, so the 1/P^(x)-weighted correction term drops out and that part of the estimator reduces to the mean of μ1^(x); the propensity score model then no longer matters.
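A quick numerical check of this point (a sketch on simulated data, not from the book): with an intercept, OLS forces its in-sample residuals to sum to exactly zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data; the exact data-generating process doesn't matter here.
rng = np.random.RandomState(1)
x = rng.normal(size=(500, 1))
y = 3 + 2 * x[:, 0] + rng.normal(size=500)

# OLS with an intercept makes the in-sample residuals sum to zero, so the
# 1/P(x)-weighted correction term in the DR formula vanishes on average
# when mu_hat is fit on the same units it corrects for.
residuals = y - LinearRegression().fit(x, y).predict(x)
print(residuals.mean())  # ~0 up to floating-point error
```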

 

 

 

•   As in the code below, even if we undermine the propensity score model by replacing the logistic regression with uniform random numbers between 0.1 and 0.9, the doubly robust estimator still produces a similar estimate. Checking the variance of the estimate with the bootstrap shows it is only slightly larger.

 

 

import numpy as np
from sklearn.linear_model import LinearRegression

def doubly_robust_wrong_ps(df, X, T, Y):
    # wrong PS model: random noise instead of a fitted propensity score
    np.random.seed(654)
    ps = np.random.uniform(0.1, 0.9, df.shape[0])  # 🟡
    mu0 = LinearRegression().fit(df.query(f"{T}==0")[X], df.query(f"{T}==0")[Y]).predict(df[X])
    mu1 = LinearRegression().fit(df.query(f"{T}==1")[X], df.query(f"{T}==1")[Y]).predict(df[X])
    return (
        np.mean(df[T]*(df[Y] - mu1)/ps + mu1) -
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    )


doubly_robust_wrong_ps(data_with_categ, X, T, Y)  # 0.3796984428841887

 

 

 

 

•   Even if μ1^(x) is set up wrong (in the code below, the regression models are replaced with standard normal random numbers), the doubly robust estimate still lands near 0.39, and its variance is only slightly higher (checked by bootstrap).

 

import numpy as np
from sklearn.linear_model import LogisticRegression

def doubly_robust_wrong_model(df, X, T, Y):
    np.random.seed(654)
    ps = LogisticRegression(C=1e6, max_iter=1000).fit(df[X], df[T]).predict_proba(df[X])[:, 1]

    # wrong mu(x) model: random noise instead of fitted regressions
    mu0 = np.random.normal(0, 1, df.shape[0])
    mu1 = np.random.normal(0, 1, df.shape[0])
    return (
        np.mean(df[T]*(df[Y] - mu1)/ps + mu1) -
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    )


doubly_robust_wrong_model(data_with_categ, X, T, Y)  # 0.3981405305433191
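To see the doubly robust property against a known ground truth, here is a small simulation (hypothetical data, not the mindset dataset): the true ATE is 1.0 by construction, the outcome models are correctly specified, and the propensity score is pure noise, yet the estimate stays close to the truth.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
n = 10_000
x = np.random.normal(size=n)
# confounding: treatment probability depends on x
t = (np.random.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(int)
# true treatment effect is exactly 1.0
y = 1.0 * t + 2.0 * x + np.random.normal(size=n)

X = x.reshape(-1, 1)
mu0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)
mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)
ps = np.random.uniform(0.1, 0.9, n)  # deliberately wrong propensity score

ate = (
    np.mean(t * (y - mu1) / ps + mu1)
    - np.mean((1 - t) * (y - mu0) / (1 - ps) + mu0)
)
print(ate)  # close to the true ATE of 1.0 despite the broken PS model
```

The symmetric experiment (correct PS, noise for μ0, μ1) gives the same conclusion, mirroring the two wrong-model checks on the real data above.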

 

 

 

 

 

 

 

 

 

 
