Matching is a class of observational-study methods that reduces covariate bias by pairing each treated unit with one or more control units. In the presence of many confounding variables, matching is facilitated by the propensity score, a balancing score that summarizes all measured covariates into a single probability of a unit being in the treatment group. From there, units can be matched in a variety of ways: nearest-neighbor matching, full matching, and Mahalanobis distance matching, among others. In each of these methods, the goal is to “match” subjects on observable characteristics, making the treatment and control groups as similar as possible.
Matching on observable characteristics is crucial to achieving a balanced treatment and control sample, but the trade-offs of the matching method should be considered. For one, the study sample is now a fraction of the overall data: only units with a match are included in the analyses. Additionally, the Average Treatment Effect on the Treated (ATT) is, under these methods, only internally valid for the matched sample. Moreover, the fact that some subjects are more likely than others to be placed in the treatment group is inappropriately ignored. Such trade-offs might be addressed with a more nuanced treatment of the propensity score.
Using data from NYC public schools, we compare the capabilities of various weighting methods, noting the technical foundations and general limitations these methods share with weighting as used in the survey literature.
head(working_data)
#> # A tibble: 6 × 15
#> ...1 DBN TotalEnrollment ENI PercentBlack PercentSWD PercentPoverty
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 01M015 190 0.89 0.274 0.258 0.847
#> 2 2 01M019 257 0.679 0.191 0.35 0.77
#> 3 3 01M020 497 0.8 0.103 0.237 0.736
#> 4 4 01M034 333 0.937 0.318 0.372 0.979
#> 5 5 01M063 203 0.762 0.182 0.33 0.818
#> 6 6 01M064 245 0.882 0.208 0.298 0.922
#> # … with 8 more variables: self_contained <dbl>, AllStudents_CA <dbl>,
#> # AllStudents_PA <dbl>, SWD_CA <dbl>, SWD_PA <dbl>, gifted <dbl>,
#> # gifted_only <dbl>, treatment <dbl>
In the sample analysis, the statistical quantity of interest is the causal effect of a treatment (inclusion models) on percent attendance (PA) and chronic absenteeism (CA) for third-grade public school students in NYC. In the sample dataset above, each school is identified by a unique six-character DBN. The additional variables, such as Percent Black students and Percent Students in Poverty, serve as pre-treatment covariates. In what follows, I will use the MatchIt package along with various weighting packages to compare how estimates of the ATE and the ATT vary by weighting scheme.
The 623 schools without inclusion models had an average attendance of 92.92% and a mean chronic absenteeism rate of 24.91%, compared with the 147 inclusion schools, which had a mean attendance of 94.55% and a chronic absenteeism rate of about 15.5%.
working_data %>%
  group_by(treatment) %>%
  summarise(mean_attendance = mean(AllStudents_PA),
            mean_chronic_absent = mean(AllStudents_CA),
            n = n())
#> # A tibble: 2 × 4
#> treatment mean_attendance mean_chronic_absent n
#> <dbl> <dbl> <dbl> <int>
#> 1 0 92.9 24.9 623
#> 2 1 94.6 15.5 147
This is only a naive comparison, however, since the treatment (inclusion) is not randomized. Low-income, high-minority schools are more likely to have self-contained classrooms than high-income, low-minority schools. Additionally, schools with more students with special needs are more likely to have a self-contained option. These differences, rather than the effect of inclusion itself, could be driving the attendance difference.
working_data %>%
  group_by(treatment) %>%
  summarise(mean_PercentBlack = mean(PercentBlack),
            mean_PercentSWD = mean(PercentSWD),
            mean_PercentPoverty = mean(PercentPoverty),
            mean_TotalEnrollment = mean(TotalEnrollment),
            mean_ENI = mean(ENI))
#> # A tibble: 2 × 6
#> treatment mean_PercentBlack mean_PercentSWD mean_PercentPove… mean_TotalEnrol…
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0.276 0.218 0.805 655.
#> 2 1 0.162 0.188 0.544 476.
#> # … with 1 more variable: mean_ENI <dbl>
Simple Regression
A simple unmatched regression shows the treatment effect of inclusion as a 1.63 percentage-point increase in attendance rates for all students.
### unmatched effect
mod1 <- lm(AllStudents_PA ~ treatment, data = working_data)
summary(mod1)
#>
#> Call:
#> lm(formula = AllStudents_PA ~ treatment, data = working_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -8.9225 -1.6225 0.2775 1.6775 5.6775
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 92.92247 0.09386 989.966 < 2e-16 ***
#> treatment 1.62923 0.21483 7.584 9.69e-14 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.343 on 768 degrees of freedom
#> Multiple R-squared: 0.06967, Adjusted R-squared: 0.06846
#> F-statistic: 57.52 on 1 and 768 DF, p-value: 9.693e-14
Propensity Score Matching
To isolate the effect of inclusion, we create an artificial control group that resembles the treatment group in terms of the five key covariates. One way to create this group is to calculate the propensity score by hand, through a logistic regression model that predicts treatment from the five covariates, yielding a propensity score for each subject (each school). The propensity score represents the probability that the school is in the treatment group.
m_ps <- glm(treatment ~ PercentBlack + PercentSWD + PercentPoverty + TotalEnrollment + ENI,
            family = binomial(), data = working_data)
summary(m_ps)
#>
#> Call:
#> glm(formula = treatment ~ PercentBlack + PercentSWD + PercentPoverty +
#> TotalEnrollment + ENI, family = binomial(), data = working_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.15968 -0.49568 -0.28753 -0.09002 2.99702
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 7.462e+00 7.884e-01 9.465 < 2e-16 ***
#> PercentBlack -1.657e+00 5.572e-01 -2.974 0.00294 **
#> PercentSWD -9.987e+00 2.116e+00 -4.720 2.36e-06 ***
#> PercentPoverty -1.301e+01 2.010e+00 -6.472 9.66e-11 ***
#> TotalEnrollment -5.979e-03 6.926e-04 -8.633 < 2e-16 ***
#> ENI 8.768e+00 1.951e+00 4.494 6.98e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 750.81 on 769 degrees of freedom
#> Residual deviance: 477.26 on 764 degrees of freedom
#> AIC: 489.26
#>
#> Number of Fisher Scoring iterations: 6
# the PS is the predicted probability
prs_df <- data.frame(pr_score = predict(m_ps, type = "response"),
                     treatment = m_ps$model$treatment)
head(prs_df)
#> pr_score treatment
#> 1 0.51944953 1
#> 2 0.12431495 0
#> 3 0.35216718 0
#> 4 0.03577569 1
#> 5 0.21224056 1
#> 6 0.16978541 1
The MatchIt package facilitates the estimation of this propensity score and the subsequent matching of groups. Our control group (originally N = 623) shrinks to the size of the treatment group (N = 147), so only observations with similar features remain.
## propensity score model
prop_matched <- matchit(treatment ~ PercentBlack + PercentSWD + PercentPoverty + TotalEnrollment + ENI,
                        family = binomial(), data = working_data)
prop_matched2 <- match.data(prop_matched)
# simple comparison
prop_matched2 %>%
  group_by(treatment) %>%
  summarise(mean_attendance = mean(AllStudents_PA),
            mean_chronic_absent = mean(AllStudents_CA),
            n = n())
#> # A tibble: 2 × 4
#> treatment mean_attendance mean_chronic_absent n
#> <dbl> <dbl> <dbl> <int>
#> 1 0 93.3 22.5 147
#> 2 1 94.6 15.5 147
Balance on the propensity score can be further assessed through the cobalt package. Simple propensity score matching reduces the difference in the distributions of all covariates from baseline, but tweaking the ratio and caliper of the matching method will improve this balance.
library(cobalt)
bal.tab(prop_matched, m.threshold = 0.1)
#> Call
#> matchit(formula = treatment ~ PercentBlack + PercentSWD + PercentPoverty +
#> TotalEnrollment + ENI, data = working_data, family = binomial())
#>
#> Balance Measures
#> Type Diff.Adj M.Threshold
#> distance Distance 0.6218
#> PercentBlack Contin. -0.1707 Not Balanced, >0.1
#> PercentSWD Contin. -0.2868 Not Balanced, >0.1
#> PercentPoverty Contin. -0.5000 Not Balanced, >0.1
#> TotalEnrollment Contin. -0.0509 Balanced, <0.1
#> ENI Contin. -0.4758 Not Balanced, >0.1
#>
#> Balance tally for mean differences
#> count
#> Balanced, <0.1 1
#> Not Balanced, >0.1 4
#>
#> Variable with the greatest mean difference
#> Variable Diff.Adj M.Threshold
#> PercentPoverty -0.5 Not Balanced, >0.1
#>
#> Sample sizes
#> Control Treated
#> All 623 147
#> Matched 147 147
#> Unmatched 476 0
bal.plot(prop_matched, var.name = 'TotalEnrollment', which = "both")
bal.plot(prop_matched, var.name = 'PercentPoverty', which = "both")
# p3<- bal.plot(prop_matched, var.name = 'PercentSWD', which = "both")
# p4<- bal.plot(prop_matched, var.name = 'PercentBlack', which = "both")
# p5<- bal.plot(prop_matched, var.name = 'ENI', which = "both")
# gl<- list(p1,p2, p3,p4, p5)
# grid.arrange(
# grobs = gl,
# top=textGrob("Propensity Score Matched"))
Nearest Neighbor Matching
By increasing the ratio of matches to four control units per treatment unit, and setting the caliper (the maximum allowed distance for a match) to 0.25, we improve the balance between groups for all covariates. All five covariates now fall below the 0.1 standardized mean difference threshold.
school_nearest <- matchit(formula = treatment ~ PercentBlack + PercentSWD + PercentPoverty +
                            TotalEnrollment + ENI,
                          data = working_data,
                          method = "nearest",
                          family = "binomial",
                          caliper = 0.25,
                          ratio = 4)
library(cobalt)
bal.tab(school_nearest, m.threshold = 0.1)
#> Call
#> matchit(formula = treatment ~ PercentBlack + PercentSWD + PercentPoverty +
#> TotalEnrollment + ENI, data = working_data, method = "nearest",
#> caliper = 0.25, ratio = 4, family = "binomial")
#>
#> Balance Measures
#> Type Diff.Adj M.Threshold
#> distance Distance 0.0595 Balanced, <0.1
#> PercentBlack Contin. 0.0058 Balanced, <0.1
#> PercentSWD Contin. 0.0148 Balanced, <0.1
#> PercentPoverty Contin. -0.0328 Balanced, <0.1
#> TotalEnrollment Contin. -0.0661 Balanced, <0.1
#> ENI Contin. -0.0445 Balanced, <0.1
#>
#> Balance tally for mean differences
#> count
#> Balanced, <0.1 6
#> Not Balanced, >0.1 0
#>
#> Variable with the greatest mean difference
#> Variable Diff.Adj M.Threshold
#> TotalEnrollment -0.0661 Balanced, <0.1
#>
#> Sample sizes
#> Control Treated
#> All 623. 147
#> Matched (ESS) 177.34 102
#> Matched (Unweighted) 262. 102
#> Unmatched 361. 45
bal.plot(school_nearest, var.name = 'TotalEnrollment', which = "both")
bal.plot(school_nearest, var.name = 'PercentPoverty', which = "both")
# p3<- bal.plot(school_nearest, var.name = 'PercentSWD', which = "both")
# p4<- bal.plot(school_nearest, var.name = 'PercentBlack', which = "both")
# p5<- bal.plot(school_nearest, var.name = 'ENI', which = "both")
# gl<- list(p1,p2)
# grid.arrange(
# grobs = gl,
# top=textGrob("Nearest Neighbor with Caliper Matched"))
Matched Regression
Compared to the simple regression above, the matched regression yields a smaller estimated treatment effect. The treated sample has been reduced to 102 units; therefore our quantity of interest is the ATT, the Average Treatment Effect on the Treated: the effect on percent attendance where the target population is the treated population. Compare this to the Average Treatment Effect (ATE), where the target population is the whole population, as in a randomized study, which the unmatched regression in the previous section approximates.
# create the matched set; only 364 schools are matched
nearest_matched <- match.data(school_nearest)
## estimating treatment effects
mod3 <- lm(AllStudents_PA ~ treatment, data = nearest_matched)
summary(mod3)
#>
#> Call:
#> lm(formula = AllStudents_PA ~ treatment, data = nearest_matched)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -8.9656 -1.5906 0.3344 1.7344 5.5344
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 92.9656 0.1474 630.767 < 2e-16 ***
#> treatment 1.2432 0.2784 4.465 1.07e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.386 on 362 degrees of freedom
#> Multiple R-squared: 0.0522, Adjusted R-squared: 0.04958
#> F-statistic: 19.94 on 1 and 362 DF, p-value: 1.071e-05
Weighting by the propensity score helps retain all of the subjects in the analysis. By assigning a higher weight to subjects with a low propensity score, we give more influence to units that are unlikely to be in the treatment group. Subjects who are unlikely to be treated are rare, and therefore valuable. On the other hand, units that are more likely to be treated carry largely redundant information, and these observations are down-weighted. In the spirit of the manipulation that underlies much of causal inference, weighting helps create a more balanced “pseudo-population” in which to proceed with a balanced analysis.
Weighting by the propensity score directly parallels the Horvitz-Thompson estimator (Horvitz and Thompson (1952)) in the survey literature. The Horvitz-Thompson estimator performs inverse probability weighting (giving each observation a weight that is the inverse of its probability of inclusion) and provides an unbiased estimator of the population total and population mean under unequal-probability sampling. Using such survey weights, the estimate of the population mean becomes
\[\hat{y}_{HT} = \frac{1}{N}\sum\limits_{i=1}^{n} w_i y_i \]
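As a quick numerical sketch of this estimator (with made-up outcomes and inclusion probabilities, and the population size estimated by the sum of the weights, i.e., the Hajek variant):

```r
# Toy illustration (hypothetical data): Horvitz-Thompson-style estimate of a
# population mean under unequal-probability sampling.
y  <- c(90, 92, 95, 97)        # observed outcomes for 4 sampled units
pi <- c(0.8, 0.8, 0.2, 0.2)    # each unit's probability of inclusion
w  <- 1 / pi                   # inverse-probability weights
N  <- sum(w)                   # estimated population size (Hajek variant)
y_ht <- sum(w * y) / N         # weighted estimate of the population mean
y_ht                           # 95
```

Note how the two rarely-sampled units (inclusion probability 0.2) pull the estimate well above the unweighted sample mean of 93.5.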
Similarly, Inverse Probability of Treatment Weighting (Rosenbaum (1987)) assigns a higher weight to units that are less likely to be treated, just as units underrepresented in the sample relative to the population are assigned a higher weight in the Horvitz-Thompson approach. With propensity score \(p(x)=P(T=1 \mid X=x)\), each subject is weighted by the inverse of the probability of the treatment it actually received:
\[ \hat{w}_{i,IPTW} = \begin{cases} \dfrac{1}{p(x_i)} & \text{for } T_i = 1\\[6pt] \dfrac{1}{1-p(x_i)} & \text{for } T_i = 0 \end{cases} \]
A second type of propensity score weighting, standardized IP weighting, addresses an issue with IPTW: individuals with propensity scores close to 0 (those extremely unlikely to be treated) end up with very large weights, and large weights result in unstable estimators. Standardized weights use the marginal probability of treatment, rather than 1, in the weight's numerator. This parallels standardized weighting in the survey literature, where such techniques address low response rates caused by incomplete survey coverage and unit nonresponse.
Referring back to our population of NYC public schools, the set of five measured covariates informs the weights. In this case, a very small number of schools (high-poverty, high-minority schools) are unlikely to be in the treatment condition. When these schools do receive the treatment, weighting gives their information as much influence as possible. Recall that IPTW addresses one of the main limitations of matching: the reduction in sample size. Using IPTW, all subjects (schools) remain in the analysis. The same is true of overlap weights.
Inverse Probability Treatment Weighting
Given the non-linear relationship of three covariates with treatment assignment, I use a generalized additive model (Hastie and Tibshirani (2017)) to estimate the propensity score. IPTW and stabilized (SW) weights are then calculated and attached to each observation. Note that the full sample (N = 770: 623 control, 147 treated) is retained.
library(gam)
mod <- gam(as.factor(treatment) ~ s(PercentBlack) + s(PercentSWD) + PercentPoverty +
             s(TotalEnrollment) + ENI,
           data = working_data, family = "binomial")
par(mfrow=c(3,2))
plot(mod, residuals = TRUE, se= TRUE, pch = ".")
working_data <- working_data %>%
  mutate(propensity_gam = predict(mod, type = "response"))
## iptw weights
working_data$treatment_identifier <- ifelse(working_data$treatment == 1, "inclusion", "non-inclusion")
working_data$iptw <- ifelse(working_data$treatment_identifier == "inclusion",
                            1 / working_data$propensity_gam,
                            1 / (1 - working_data$propensity_gam))
# stabilized weights
working_data$stable.iptw <- ifelse(working_data$treatment_identifier == "inclusion",
                                   mean(working_data$propensity_gam) / working_data$propensity_gam,
                                   mean(1 - working_data$propensity_gam) / (1 - working_data$propensity_gam))
working_data %>%
  group_by(treatment) %>%
  summarise(iptw = weighted.mean(AllStudents_PA, iptw),
            stable_iptw = weighted.mean(AllStudents_PA, stable.iptw),
            n = n())
#> # A tibble: 2 × 4
#> treatment iptw stable_iptw n
#> <dbl> <dbl> <dbl> <int>
#> 1 0 93.0 93.0 623
#> 2 1 94.1 94.1 147
mod4 <- lm(AllStudents_PA ~ treatment, weights = working_data$iptw, data = working_data)
summary(mod4)
#>
#> Call:
#> lm(formula = AllStudents_PA ~ treatment, data = working_data,
#> weights = working_data$iptw)
#>
#> Weighted Residuals:
#> Min 1Q Median 3Q Max
#> -15.898 -1.829 0.382 1.958 17.220
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 93.0280 0.1123 828.709 < 2e-16 ***
#> treatment 1.0730 0.1550 6.925 9.24e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.065 on 768 degrees of freedom
#> Multiple R-squared: 0.05877, Adjusted R-squared: 0.05754
#> F-statistic: 47.95 on 1 and 768 DF, p-value: 9.235e-12
Weight Trimming
One concern regarding propensity score weighting is that observations with extremely large weights might over-influence results and yield estimates with high variance. A common recommendation in that case is to trim observations with large weights. Again, the suggestion of weight trimming comes directly from survey design methods, such as Potter (1993), where weight trimming has been shown to reduce the sampling variance estimate. The trade-offs identified in the survey literature parallel those for weight trimming in PS methods: despite the overall decrease in variance, larger sampling variance or substantial bias might result for some estimates.
Trimmed weighting generally takes two approaches. One approach, Crump et al. (2009), employs a “min-max” rule, excluding subjects whose propensity score falls outside a fixed range. Another approach is to exclude subjects who fall beyond a certain quantile of the propensity score distribution in either the treatment or control group. For instance, Stürmer et al. (2010) trim treated patients at the lower end of the propensity score distribution and untreated patients at the higher end. In either approach, weight trimming improves the accuracy and precision of the final parameter estimates, as in Lee, Lessler, and Stuart (2011).
\[ \hat{w}_{i,IPW\text{-}T} = \hat{w}_{i,IPW} \cdot \mathbf{1}\{\hat{w}_{i,IPW} < c\} \]
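A minimal sketch of the quantile-based approach, using simulated propensity scores and hypothetical 5th/95th-percentile cutoffs (the actual thresholds in Stürmer et al. differ):

```r
# Sketch of quantile-based PS trimming (hypothetical cutoffs): drop treated
# units below a low quantile of the *treated* PS distribution and control
# units above a high quantile of the *control* PS distribution.
set.seed(1)
ps  <- runif(1000)                     # stand-in propensity scores
trt <- rbinom(1000, 1, ps)             # simulated treatment assignment
lo  <- quantile(ps[trt == 1], 0.05)    # lower cutoff from treated group
hi  <- quantile(ps[trt == 0], 0.95)    # upper cutoff from control group
keep <- (trt == 1 & ps >= lo) | (trt == 0 & ps <= hi)
table(keep)                            # roughly 5% of each group trimmed
```

The trimmed analysis would then proceed on the `keep` subset, with the caveats about the resulting target population discussed below.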
Determining the optimal amount of trimming is unfortunately arbitrary, and bias is likely to result. Analysts are advised to scrutinize the procedures that generated the weights, i.e., proper specification of the PS model. Better specification of the PS model, often via machine learning methods such as Hill (2011), is one option. In both trimming examples above, the choice of threshold is essentially arbitrary; more importantly, trimming can increase bias in the estimates as well as reduce the sample size.
As a final conceptual challenge, trimming results in an ambiguous target population that is difficult for stakeholders to interpret. In the NYC schools example, would generalizing the effect of gifted/talented programs to “schools that fall within the 85th quantile of the propensity score” be interpretable to policy makers? Translating the generalizability of an effect defined on a certain subset of the PS distribution is quantitatively convenient but conceptually daunting at the policy level. The problem is especially acute under heterogeneous treatment effects: when treatment effects are constant across the distribution of the covariates, weight trimming may be acceptable, but when they vary across that distribution (for example, lower-income schools that benefit more from GT programs), weight trimming is less than ideal.
In the distribution of the weights below, the points to the far right correspond to schools extremely unlikely to adopt inclusion. We might remove these by truncating the weights at a maximum of 10.
## distribution of weights
ggplot(working_data, aes(x = iptw, fill = as.factor(treatment))) +
geom_density(alpha = 0.5, colour = "grey50") +
geom_rug() +
scale_x_log10(breaks = c(1, 5, 10, 20, 40)) +
theme(panel.background = element_rect(fill='white'),
legend.position = "none")+
ggtitle("Distribution of inverse probability weights: \n full data")
# truncate weights
trimmed_data <- working_data %>%
  filter(iptw < 10)
## distribution of truncated weights
ggplot(trimmed_data, aes(x = iptw, fill = as.factor(treatment))) +
geom_density(alpha = 0.5, colour = "grey50") +
geom_rug() +
scale_x_log10(breaks = c(1, 5, 10, 20, 40)) +
theme(panel.background = element_rect(fill='white'),
legend.position = "none")+
ggtitle("Distribution of inverse probability weights: \n trimmed data")
#
# gl<- list(p1,p2)
# grid.arrange(
# grobs = gl,
# top=textGrob("Weight Comparison"))
mod4 <- lm(AllStudents_PA ~ treatment, weights = trimmed_data$iptw, data = trimmed_data)
summary(mod4)
#>
#> Call:
#> lm(formula = AllStudents_PA ~ treatment, data = trimmed_data,
#> weights = trimmed_data$iptw)
#>
#> Weighted Residuals:
#> Min 1Q Median 3Q Max
#> -15.7828 -1.7621 0.3892 1.9763 10.0759
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 93.0213 0.1042 892.821 < 2e-16 ***
#> treatment 1.0341 0.1854 5.577 3.41e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.82 on 752 degrees of freedom
#> Multiple R-squared: 0.03972, Adjusted R-squared: 0.03844
#> F-statistic: 31.11 on 1 and 752 DF, p-value: 3.409e-08
Overlap Weighting
An advantage of overlap weighting is that the overlap weights lead to exact balance on the mean of every measured covariate when the PS is estimated by logistic regression. All weights are bounded between 0 and 1 by design, eliminating the need for weight trimming. The approach may be particularly useful in the era of big data, where inclusion criteria are defined more broadly; large data sources with many possible covariates also sharpen the need for best practices in handling extreme propensity scores.
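The exact-balance property can be checked in a small simulation (simulated covariates and a logistic-regression PS; the overlap weight is 1 − PS for treated units and PS for controls):

```r
# Sketch: with a logistic-regression PS, overlap weights (treated: 1 - ps,
# control: ps) balance the weighted mean of each model covariate exactly,
# up to the numerical tolerance of the glm fit.
set.seed(42)
n   <- 500
x1  <- rnorm(n); x2 <- rnorm(n)
trt <- rbinom(n, 1, plogis(0.5 * x1 - 0.5 * x2))
ps  <- fitted(glm(trt ~ x1 + x2, family = binomial()))
ow  <- ifelse(trt == 1, 1 - ps, ps)            # overlap weights
m1  <- weighted.mean(x1[trt == 1], ow[trt == 1])
m0  <- weighted.mean(x1[trt == 0], ow[trt == 0])
abs(m1 - m0)                                   # effectively zero
```

The balance follows from the logistic score equations: the fitted PS satisfies sum(x * (trt - ps)) = 0 for each covariate in the model, which makes the overlap-weighted numerators (and denominators) of the two group means identical.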
A limitation, as with all other PS methods, is that researchers are at the mercy of which covariates are available and included in the model. Additionally, the ATE is now in reference to a new target population: the overlap population. The estimated treatment effect may therefore describe a sub-population that does not reflect those who receive the treatment in routine practice. In the NYC schools example, we might be estimating an effect for schools that have essentially no chance of receiving the treatment.
ps.formula <- trt ~ PercentBlack + PercentSWD + PercentPoverty + TotalEnrollment + ENI
msstat <- SumStat(ps.formula, trtgrp = "1", data = data1,
                  weight = c("IPW", "overlap", "treated", "entropy", "matching"))
plot(msstat, type="balance", metric = "PSD")
### PSweight to identify ATE for Overlap
ate.any <- PSweight(ps.formula = ps.formula,
                    yname = "Y", data = data1,
                    weight = "overlap")
ate.any
#> Original group value: 0, 1
#>
#> Point estimate:
#> 93.3388, 94.0635
summary(ate.any)
#>
#> Closed-form inference:
#>
#> Original group value: 0, 1
#>
#> Contrast:
#> 0 1
#> Contrast 1 -1 1
#>
#> Estimate Std.Error lwr upr Pr(>|z|)
#> Contrast 1 0.72475 0.17300 0.38568 1.0638 2.798e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Weighting methods improve upon naive matching, as we retain a majority of the sample. However, weighting methods are still vulnerable to model misspecification: if the causal model is improperly specified, weighting will not do much to help the final causal estimate.
The doubly robust approach requires only one of the models (either the outcome model or the treatment model) to be correctly specified. The doubly robust estimator is given by
\[ ATE = \frac{1}{N}\sum\left(\frac{T_i(Y_i-\hat\mu_1(X_i))}{\hat{P}(X_i)} +\hat\mu_1(X_i)\right) - \frac{1}{N}\sum\left(\frac{(1-T_i)(Y_i - \hat\mu_0(X_i))}{1-\hat{P}(X_i)} + \hat\mu_0(X_i)\right) \]
where \(\hat{P}\) represents the propensity score estimate, and \(\hat{\mu}_1\) and \(\hat{\mu}_0\) are outcome-model estimates of \(E[Y \mid X, T=1]\) and \(E[Y \mid X, T=0]\). If the outcome model is correctly specified, then the expectation of \(T_i(Y_i-\hat\mu_1(X_i))\) is 0: among the treated (\(T_i = 1\)), the residuals from \(\hat\mu_1\) average to 0, so with the numerator vanishing we correctly estimate the treated-potential-outcome mean with \(\hat\mu_1(X_i)\). When the outcome model is wrong but the propensity score model is correct, the equation again reduces, this time to the inverse-probability-weighted estimator. In either case, as long as one of the two models is correctly specified, the doubly robust estimator still recovers the ATE.
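The estimator can be sketched directly on simulated data (hypothetical data-generating process with a true ATE of 2; here both working models happen to be correctly specified):

```r
# Sketch of the doubly robust (AIPW) estimator on simulated data.
set.seed(7)
n    <- 5000
x    <- rnorm(n)
tr   <- rbinom(n, 1, plogis(0.6 * x))     # treatment depends on x
y    <- 2 * tr + x + rnorm(n)             # true ATE = 2
p_hat <- fitted(glm(tr ~ x, family = binomial()))   # treatment model
# outcome models fit separately in the treated and control groups
mu1 <- predict(lm(y ~ x, subset = tr == 1), newdata = data.frame(x = x))
mu0 <- predict(lm(y ~ x, subset = tr == 0), newdata = data.frame(x = x))
# plug into the doubly robust formula above
ate_dr <- mean(tr * (y - mu1) / p_hat + mu1) -
  mean((1 - tr) * (y - mu0) / (1 - p_hat) + mu0)
ate_dr   # should land close to the true ATE of 2
```

A naive difference in group means, by contrast, would be biased upward here because high-x units are both more likely to be treated and have higher outcomes.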