This vignette demonstrates how to estimate the controlled direct effect of some treatment, net the effect of some mediator using sequential g-estimation approach as described by Acharya, Blackwell, and Sen (2016). With standard regression and matching approaches, estimating the controlled direct effect is difficult because it often requires adjusting for covariates that causally affect the mediator and are causally affected by the treatment. Including such variables in a regression model can lead to post-treatment bias when estimating the direct effect of treatment. A large class of models and estimation approaches have been developed to avoid this problem, but here we focus on one called sequential g-estimation. This approach uses a standard linear model to estimate the effect of the mediator, and then removes that effect from effect from the dependent variable to estimate the direct effect of treatment without post-treatment bias.

Here we briefly introduce some of the statistical theory behind estimating direct effects. The main quantity of interest is the average controlled direct effect or ACDE. This can be illustrated from the following causal structure (Acharya, Blackwell, and Sen 2016, fig. 3):

The average controlled direct effect defined for a given treatment (\(A_i = a\) vs. \(A_i = a^\prime\)) and a given value of the mediator (\(M_i = m\)) is \[ACDE(a, a^\prime, m) = E[Y_i(a, m) - Y_i(a^\prime, m)]\] and is the total of the dashed lines in the figure above. Thus, we hold the mediator constant when comparing the outcome under each treatment. The ACDE is useful for discriminating between causal mechanisms, because the average total effect of treatment, \(\tau(a, a^\prime) \equiv E[Y_i(a) - Y_i(a^\prime)]\), can be decomposed as the sum of three quantities: (1) the ACDE, (2) the average natural indirect effect, and (3) a reference interaction that measures of how much the direct effect of \(A_i\) depends on a particular \(M_i = m\). Thus, if the ACDE is non-zero, it is evidence that the effect of \(A_i\) is not entirely explained by \(M_i\). We illustrate this with an empirical example. For more information on this, see Acharya, Blackwell, and Sen (2016).

This vignette introduces a new way to design, specify, and implement various estimators for controlled direct effects using a coding style that we refer to as the “assembly line.” Users can specify the various components of the estimators sequentially and easily swap out implementation details such as what exact estimators are used for the estimation of each nuisance function or what arguments to pass to these functions. A unified structure and output across different research designs and estimators should make comparing estimators very easy. Finally, this implementation is built around the idea of cross-fitting for nuisance function estimation.

Let’s see how this works in practice. We first load a subset of the
`jobcorps`

data from the package:

```
data("jobcorps")
jc <- jobcorps[
1:200,
c("treat", "female", "age_cat", "work2year2q", "pemplq4", "emplq4", "exhealth30")
]
```

This is the Job Corps experiment data used in Huber (2014) to estimate the effect of a randomized job training program on self-assessed health holding constant the intermediate outcome of employment. We use a subset of the variables from their study, but the basic variable we use here:

- \(Y_i\) is an indicator for whether
participants reported “very good” health after 2.5 years out after
randomization (
`exhealth30`

) - \(A_i\) is an indicator for
assignment to the job training program (
`treat`

) - \(M_i\) is an indicator for
employment 1 to 1.5 years after assignment
(
`work2year2q`

) - \(Z_i\) are the post-treatment,
pre-mediator intermediate confounders (
`emplq4`

,`pemplq4`

) - \(X_i\) are the pre-treatment
characteristics (
`female`

,`age_cat`

)

In context of our CDE estimators, we refer to be \(A_i\) (the treatment) and \(M_i\) (the mediator) as *treatment
variables* because we are interested in causal effects that contrast
different combinations of these two variables.

The first step in our assembly line is to choose the type of CDE
estimator (or really estimation strategy) that we want to pursue. These
choices include IPW (`cde_ipw()`

), regression imputation
(`cde_reg_impute()`

), and augmented IPW
(`cde_aipw()`

) among others. Some of these different
functions will take different arguments to specify different aspects of
the estimator. The AIPW estimator is useful to demonstrate because it
uses many of the features of the assembly line approach.

```
my_aipw <- cde_aipw() |>
set_treatment(treat, ~ female + age_cat) |>
treat_model(engine = "logit") |>
outreg_model(engine = "lm") |>
set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
treat_model(engine = "logit") |>
outreg_model(engine = "lm") |>
estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)
```

The call here consists of several steps connected by R’s pipe
operator. The first call `cde_aipw()`

simply initializes the
estimator as taking the AIPW form, meaning that it will use both
propensity score models and outcome regressions to estimate the causal
effects.

Following this we have two blocks of 3 lines each that look similar
with a few differences. These “treatment blocks” specify the causal
variables of interest **in their causal ordering**. In
`set_treatment()`

we set the name of the treatment variable
and give a formula describing the pre-treatment covariates for this
treatment. This formula, unless overridden by the next two functions,
will be used in the fitting of the propensity score and outcome
regression models. `treat_model()`

allows the user to specify
how to fit the propensity score model for this treatment with a choice
of engine and optional arguments to pass to each engine.
`outreg_model()`

does the same for the outcome regression. We
repeat this specification for the mediator.

The `estimate()`

function finally implements the
specifications given in the previous commands. Users use a formula to
set the outcome and which of the treatment variables we want effects for
and indicate the data frame where all of these variables can be found.
By default, the nuisance functions (the propensity score model and the
outcome regression) are fit using cross-fitting.

We can either use the `summary()`

or `tidy()`

functions to get a summary of the different effects being estimated:

```
##
## aipw CDE Estimator
## ---------------------------
## Causal variable: treat
##
## Treatment model: treat ~ female + age_cat
## Treatment engine: logit
##
## Outcome model: ~female + age_cat
## Outcome engine: lm
## ---------------------------
## Causal variable: work2year2q
##
## Treatment model: work2year2q ~ emplq4 + pemplq4 + female + age_cat
## Treatment engine: logit
##
## Outcome model: ~emplq4 + pemplq4 + female + age_cat
## Outcome engine: lm
## ---------------------------
## Cross-fit: TRUE
## Number of folds: 5
##
## Estimated Effects:
## Estimate Std. Error t value Pr(>|t|)
## treat [(1, 0) vs. (0, 0)] 0.030412 0.02259 1.34644 0.17824
## treat [(1, 1) vs. (0, 1)] 0.029877 0.01396 2.14080 0.03233
## work2year2q [(0, 1) vs. (0, 0)] 0.001241 0.02077 0.05974 0.95237
## work2year2q [(1, 1) vs. (1, 0)] -0.001672 0.01655 -0.10101 0.91955
## CI Lower CI Upper DF
## treat [(1, 0) vs. (0, 0)] -0.013871 0.07469 3818
## treat [(1, 1) vs. (0, 1)] 0.002518 0.05724 6207
## work2year2q [(0, 1) vs. (0, 0)] -0.039475 0.04196 3991
## work2year2q [(1, 1) vs. (1, 0)] -0.034120 0.03078 6034
```

```
## term estimate std.error
## treat_1_0 treat [(1, 0) vs. (0, 0)] 0.030411569 0.02258657
## treat_1_1 treat [(1, 1) vs. (0, 1)] 0.029876867 0.01395590
## work2year2q_0_1 work2year2q [(0, 1) vs. (0, 0)] 0.001240634 0.02076711
## work2year2q_1_1 work2year2q [(1, 1) vs. (1, 0)] -0.001671883 0.01655227
## statistic p.value conf.low conf.high df
## treat_1_0 1.34644455 0.17823912 -0.013871340 0.07469448 3818
## treat_1_1 2.14080471 0.03232863 0.002518461 0.05723527 6207
## work2year2q_0_1 0.05974031 0.95236546 -0.039474510 0.04195578 3991
## work2year2q_1_1 -0.10100627 0.91954883 -0.034120251 0.03077648 6034
```

Because the `estimate()`

function included both
`treat`

and `work2year2q`

in the formula, the
output includes both the controlled direct effects of the treatment and
the conditional average treatment effect of the mediator. Furthermore,
by default, the output contains estimates for an effect for each
combination of other treatment variables. This can be changed by setting
the `eval_vals`

argument in the relevant
`set_treatment()`

call.

The modular structure of the call allow users to easily swap out parts of the models. For example, we could easily alter the above call to use lasso versions of the regression and logit model.

```
my_aipw_lasso <- cde_aipw() |>
set_treatment(treat, ~ female + age_cat) |>
treat_model(engine = "rlasso_logit") |>
outreg_model(engine = "rlasso") |>
set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
treat_model(engine = "rlasso_logit") |>
outreg_model(engine = "rlasso") |>
estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)
```

Here we use a “rigorous” version of the lasso from the
`hdm`

package to speed up computation. There are a number of
different engines for different outcome and treatment types
including:

- GLMs:
`lm`

(OLS),`logit`

(logistic regression),`multinom`

(mulitnomial logit) - LASSO estimators (from
`glmnet`

):`lasso`

,`lasso_logit`

,`lasso_multinom`

- Rigorous LASSO estimator (from
`hdm`

):`rlasso`

,`rlasso_logit`

Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016.
“Explaining Causal Findings Without Bias: Detecting and Assessing
Direct Effects.” *American Political Science Review* 110
(3): 512–29. https://doi.org/10.1017/S0003055416000216.

Huber, Martin. 2014. “Identifying Causal Mechanisms (Primarily)
Based on Inverse Probability Weighting.” *Journal of Applied
Econometrics* 29 (6): 920–43.