The code below demonstrates how to load two sample datasets, select the sets of confounders and effect modifiers and a variety of estimation methods and provide Figure 1 comparing the various estimates of the (C)ATEs. The descriptions of the input arguments in the function RCTREP() are as follows:
source.data <- RCTrep::source.data
target.data <- RCTrep::target.data
output <- RCTREP(TEstimator = "G_computation", SEstimator = "Exact",
outcome_method = "BART",
source.data = RCTrep::source.data,
target.data = RCTrep::target.data,
vars_name = list(outcome_predictors =
c("x1","x2","x3","x4","x5","x6"),
treatment_name = c('z'),
outcome_name = c('y')),
selection_predictors = c("x2","x6"),
stratification = c("x1","x3","x4","x5"),
stratification_joint = TRUE)
fusion <- Fusion$new(output$target.obj,
output$source.obj,
output$source.rep.obj)
fusion$plot()
In the above example, we use G_computation method to adjust the treatment allocation mechanism and use the exact matching method to adjust the sampling mechanism. We use Bayesian additive regression trees (BART) to model the outcome. We select x1,x2,x3,x4,x5,x6 as X and x2,x6 as Xs. In this example, since x2,x6 are the only effect modifiers that are predictive of the selection probability, they are the minimal set of selection_predictors that allows for the validation.
In the set-selection step, we identify two covariates sets from all pre-treatment outcome predictors:
In the Estimation step, two sub-steps are summarized, namely, estimation of the (C)ATEs in TEstimator, and estimation of the weighted (C)ATE in SEstimator. In the first sub-step, we use one method to adjust for the treatment allocation mechanism of Sobs in class TEstimator, namely, G-computation method, and one method to derive the unbiased estimate of the truth of Srct in class TEstimator, namely, Crude method; we use one method to adjust for the sampling mechanism of Sobs in class SEstimator, namely, exact matching. We first estimate the CATEs using Sobs. ### Step 2.1 Estimation of the (C)ATEs In this step, we estimate the (C)ATEs in TEstimator. We start out by instantiating objects of class TEstimator using Sobs and Srct. We call TEstimator_wrapper() function to initialize the object source.obj and target.obj using Sobs and Srct respectively:
source.obj <- TEstimator_wrapper(
Estimator = "G_computation",
data = source.data,
name = "RWD",
vars_name = vars_name,
outcome_method = "glm",
outcome_formula = y ~ x1 + x2 + x3 + z + z:x1 + z:x2 +z:x3+ z:x6,
data.public = TRUE
)
target.obj <- TEstimator_wrapper(
Estimator = "Crude",
data = target.data,
name = "RCT",
vars_name = vars_name,
data.public = TRUE,
isTrial = TRUE
)
We specify the following arguments to instantiate source.obj and target.obj:
source.obj.rep <- SEstimator_wrapper(Estimator = "Exact",
target.obj = target.obj,
source.obj = source.obj,
selection_predictors = c("x2","x6"))
source.obj.rep$EstimateRep(stratification = c("x1","x3","x4","x5"))
The arguments list for the function SEstimator_wrapper is:
Then we call EstimateRep() - the core function of the instantiated object source.obj.rep. The function is to estimate the weighted ATEs of the target population and subsets using Sobs in source.obj. The weighted distribution of counfounders_sampling_name in source.obj and the distribution of counfounders_sampling_name in target.obj should be balanced. Two optional arguments for the function EstimateRep() are specified:
On completion of all class instantiations, we need to diagnose assumptions for object source.obj of class TEstimator, and we need to diagnose assumptions for object source.obj.rep of class SEstimator:
Lastly, we compute the validation metric in equation 3 on population and sub-population levels. We initialize a class Fusion as an object fusion and assign source.obj, target.obj, and source.obj.r ep to fusion. fusion combines estimates from the objects and validates the average treatment effects of the target population Pθ and sub-populations. The subpopulations are selected according to stratification and stratification_joint specified in source.obj.rep$Estim ateRep(). fusion validates estimates in source.obj and source.obj.rep using four metrics, i.e., pseudo mean squared error (mse), length of confidence interval (len_ci), estimate agreement (agg.est), and regulatory agreement (agg.reg):
RCTrep provides a dashboard that allows users to present all necessary results generated from the four steps and provides users with the flexibility to select sub-population(s) based on which RCTrep validates estimates of the average treatment effect. The dashboard can be launched by calling the function:
source.data <- RCTrep::source.data
target.data <- RCTrep::target.data
vars_name <- list(outcome_predictors = c("x1","x2","x3","x4","x5","x6"),
treatment_name = c('z'),
outcome_name = c('y')
)
source.obj.gc <- TEstimator_wrapper(
Estimator = "G_computation",
data = source.data,
name = "RWD",
vars_name = vars_name,
outcome_method = "glm",
outcome_formula = y ~ x1 + x2 + x3 + z + z:x1 + z:x2 +z:x3+ z:x6,
data.public = TRUE
)
source.obj.ipw <- TEstimator_wrapper(
Estimator = "IPW",
data = source.data,
name = "RWD",
vars_name = vars_name,
treatment_method = "glm",
treatment_formula = z ~ x1 + x2 + x3 + x4 + x5 + x6 + x1:x2 + x3:x4,
data.public = TRUE
)
source.obj.dr <- TEstimator_wrapper(
Estimator = "DR",
data = source.data,
name = "RWD",
vars_name = vars_name,
outcome_method = "glm",
outcome_formula = y ~ x1 + x2 + x3 + z + z:x1 + z:x2 +z:x3+ z:x6,
treatment_method = "glm",
treatment_formula = z ~ x1 + x2 + x3 + x4 + x5 + x6 + x1:x2 + x3:x4,
data.public = TRUE
)
target.obj <- TEstimator_wrapper(
Estimator = "Crude",
data = target.data,
name = "RCT",
vars_name = vars_name,
data.public = TRUE,
isTrial = TRUE
)
strata <- c("x1","x4")
selection_predictors <- c("x2","x6")
source.gc.exact <- SEstimator_wrapper(Estimator = "Exact",
target.obj = target.obj,
source.obj = source.obj.gc,
selection_predictors =
selection_predictors)
source.gc.exact$EstimateRep(stratification = strata,
stratification_joint = TRUE)
source.gc.isw <- SEstimator_wrapper(Estimator = "ISW",
target.obj = target.obj,
source.obj = source.obj.gc,
selection_predictors =
selection_predictors,
method = "glm")
source.gc.isw$EstimateRep(stratification = strata,
stratification_joint = TRUE)
source.gc.subclass <- SEstimator_wrapper(Estimator = "Subclass",
target.obj = target.obj,
source.obj = source.obj.gc,
selection_predictors =
selection_predictors)
source.gc.subclass$EstimateRep(stratification = strata,
stratification_joint = TRUE)
source.ipw.exact <- SEstimator_wrapper(Estimator = "Exact",
target.obj = target.obj,
source.obj = source.obj.ipw,
selection_predictors =
selection_predictors)
source.ipw.exact$EstimateRep(stratification = strata,
stratification_joint = TRUE)
source.ipw.isw <- SEstimator_wrapper(Estimator = "ISW",
target.obj = target.obj,
source.obj = source.obj.ipw,
selection_predictors =
selection_predictors,
method = "glm")
source.ipw.isw$EstimateRep(stratification = strata,
stratification_joint = TRUE)
source.ipw.subclass <- SEstimator_wrapper(Estimator = "Subclass",
target.obj = target.obj,
source.obj = source.obj.ipw,
selection_predictors =
selection_predictors)
source.ipw.subclass$EstimateRep(stratification = strata,
stratification_joint = TRUE)
source.dr.exact <- SEstimator_wrapper(Estimator = "Exact",
target.obj = target.obj,
source.obj = source.obj.dr,
selection_predictors =
selection_predictors)
source.dr.exact$EstimateRep(stratification = strata,
stratification_joint = TRUE)
source.dr.isw <- SEstimator_wrapper(Estimator = "ISW",
target.obj = target.obj,
source.obj = source.obj.dr,
selection_predictors =
selection_predictors,
method = "glm")
source.dr.isw$EstimateRep(stratification = strata,
stratification_joint = TRUE)
source.dr.subclass <- SEstimator_wrapper(Estimator = "Subclass",
target.obj = target.obj,
source.obj = source.obj.dr,
selection_predictors =
selection_predictors)
source.dr.subclass$EstimateRep(stratification = strata,
stratification_joint = TRUE)
fusion <- Fusion$new(target.obj,
source.gc.exact,
source.gc.isw,
source.gc.subclass,
source.ipw.exact,
source.ipw.isw,
source.ipw.subclass,
source.dr.exact,
source.dr.isw,
source.dr.subclass)
fusion$plot()
fusion$evaluate()
source.data <- RCTrep::source.data
target.data <- RCTrep::target.data
# Identification
vars_name <- list(outcome_predictors = c("x1","x2","x3","x4","x5","x6"),
treatment_name = c('z'),
outcome_name = c('y')
)
selection_predictors <- c("x2","x6")
# Estimate conditional average treatment effect
source.obj <- TEstimator_wrapper(
Estimator = "G_computation",
data = source.data,
vars_name = vars_name,
outcome_method = "glm",
outcome_form=y ~ x1 + x2 + x3 + z + z:x1 + z:x2 +z:x3+ z:x6,
name = "RWD",
data.public = FALSE
)
target.obj <- TEstimator_wrapper(
Estimator = "Crude",
data = target.data,
vars_name = vars_name,
name = "RCT",
data.public = FALSE,
isTrial = TRUE
)
head(source.obj$data)
# Estimate the weighted conditional average treatment effect of source.obj
strata <- c("x1","x4")
source.rep.obj <- SEstimator_wrapper(Estimator = "Exact",
target.obj = target.obj,
source.obj = source.obj,
selection_predictors =
selection_predictors)
source.rep.obj$EstimateRep(stratification = strata, stratification_joint = TRUE)
# Validate
fusion <- Fusion$new(target.obj,
source.obj,
source.rep.obj)
fusion$plot()
fusion$print()
fusion$evaluate()
In example 2 we demonstrate the validation approach using aggregated data of sub-populations from Srct and Sobs. However, in practice, we rarely have access to such aggregated RCT data. In most cases, we only have aggregated data of each variable and average treatment effects of sub-populations stratified by the variable individually. In example 3 we demonstrate using the marginal distribution of variables and estimates of sub-populations stratified by these variables individually to generate synthetic RCT data for validation. First, for a demonstrative purpose, we instantiate an object of class Crude using full RCT data. We derive the marginal distributions of variables of RCT data as descriptive statistics of the target population, and derive the estimates of the average treatment effects of populations stratified by the variables individually as the truth for validation:
library(dplyr)
source.data <- RCTrep::source.data
target.data <- RCTrep::target.data
# Identification
vars_name <- list(outcome_predictors = c("x1","x2","x3","x4","x5","x6"),
treatment_name = c('z'),
outcome_name = c('y')
)
# Generate target.obj using full dataset
target.obj <- TEstimator_wrapper(
Estimator = "Crude",
data = target.data,
vars_name = vars_name,
name = "RCT",
data.public = FALSE,
isTrial = TRUE
)
# Get unbiased estimates of conditional average treatment effect
vars_rct <- c("x1","x2","x3","x4","x5","x6")
RCT.estimates <- list(ATE_mean = target.obj$estimates$ATE$est,
ATE_se = target.obj$estimates$ATE$se,
CATE_mean_se = target.obj$get_CATE(vars_rct,FALSE))
Then we generate a synthetic RCT dataset synthetic.data using the marginal distributions of the variables Xk by calling the RCTrep function GenerateSyntheticData(). In the function, we specify a marginal distribution of each variable and pairwise correlations between the variables. Then the function generates the synthetic data of the RCT accordingly:
emp.p1 <- mean(target.data$x1)
emp.p2 <- mean(target.data$x2)
emp.p3 <- mean(target.data$x3)
emp.p4 <- mean(target.data$x4)
emp.p5 <- mean(target.data$x5)
emp.p6 <- mean(target.data$x6)
t.d <- target.data[,vars_rct]
n <- dim(source.data)[1]
pw.cor <- gdata::upperTriangle(cor(t.d), diag = FALSE, byrow = TRUE)
synthetic.data <- RCTrep::GenerateSyntheticData(
margin_dis="bernoulli",
N = n,
margin = list(emp.p1, emp.p2, emp.p3, emp.p4, emp.p5, emp.p6),
var_name = vars_rct,
pw.cor = pw.cor)
Then we instantiate target.obj of class TEstimator_Synthetic. We initialize the public field data by assigning synthetic.data to df; initialize the public field estimates by assigning RCT.estimates to estimates; initialize the public field outcome_predictors by assigning c(“x1”,“x2”,“x3”,“x4”,“x5”,“x6”) to vars_name. Note that synthetic.data might slightly shift from the true target population.
synthetic.data <- semi_join(synthetic.data, source.data, by = vars_rct)
target.obj <- TEstimator_Synthetic$new(data = synthetic.data,
estimates=RCT.estimates,
vars_name = vars_name,
name = "RCT",
isTrial = TRUE,
data.public = TRUE)
# Estimate conditional average treatment effect
source.data <- semi_join(source.data, synthetic.data, by = vars_rct)
source.obj <- TEstimator_wrapper(
Estimator = "G_computation",
data = source.data,
vars_name = vars_name,
outcome_method = "glm",
outcome_form=y ~ x1 + x2 + x3 + z + z:x1 + z:x2 +z:x3+ z:x6,
name = "RWD",
data.public = TRUE
)
# Estimate weighted conditional average treatment effect
source.rep.obj <- SEstimator_wrapper(Estimator="Exact",
target.obj=target.obj,
source.obj=source.obj,
selection_predictors=c("x2","x6"))
source.rep.obj$EstimateRep(stratification = vars_rct,
stratification_joint = FALSE)
# Combine objects and validate estimates
fusion <- Fusion$new(target.obj,
source.obj,
source.rep.obj)
fusion$plot()
fusion$evaluate()