--- title: "Targeted Learning with Stochastic Treatment Regimes" author: "[Nima Hejazi](https://nimahejazi.org), [Jeremy Coyle](https://github.com/jeremyrcoyle), and [Mark van der Laan](https://vanderlaan-lab.org)" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteIndexEntry{Targeted Learning with Stochastic Treatment Regimes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r options, echo=FALSE} options(scipen=999) ``` ## Introduction Stochastic treatment regimes present a relatively simple manner in which to assess the effects of continuous treatments by way of parameters that examine the effects induced by the counterfactual shifting of the observed values of a treatment of interest. Here, we present an implementation of a new algorithm for computing targeted minimum loss-based estimates of treatment shift parameters defined based on a shifting function $d(A,W)$. For a technical presentation of the algorithm, the interested reader is invited to consult @diaz2018stochastic. For additional background on Targeted Learning and previous work on stochastic treatment regimes, please consider consulting @vdl2011targeted, @vdl2018targeted, and @diaz2012population. To start, let's load the packages we'll use and set a seed for simulation: ```{r setup, message=FALSE, warning=FALSE} library(data.table) library(sl3) library(tmle3) library(tmle3shift) set.seed(429153) ``` ## Data and Notation Consider $n$ observed units $O_1, \ldots, O_n$, where each random variable $O = (W, A, Y)$ corresponds to a single observational unit. Let $W$ denote baseline covariates (e.g., age, sex, education level), $A$ an intervention variable of interest (e.g., nutritional supplements), and $Y$ an outcome of interest (e.g., disease status). Though it need not be the case, let $A$ be continuous-valued, i.e. $A \in \mathbb{R}$. Let $O_i \sim \mathcal{P} \in \mathcal{M}$, where $\mathcal{M}$ is the nonparametric statistical model defined as the set of continuous densities on $O$ with respect to some dominating measure. To formalize the definition of stochastic interventions and their corresponding causal effects, we introduce a nonparametric structural equation model (NPSEM), based on @pearl2000causality, to define how the system changes under posited interventions: \begin{align*}\label{eqn:npsem} W &= f_W(U_W) \\ A &= f_A(W, U_A) \\ Y &= f_Y(A, W, U_Y), \end{align*} We denote the observed data structure $O = (W, A, Y)$ Letting $A$ denote a continuous-valued treatment, we assume that the distribution of $A$ conditional on $W = w$ has support in the interval $(l(w), u(w))$ -- for convenience, let this support be _a.e._ That is, the minimum natural value of treatment $A$ for an individual with covariates $W = w$ is $l(w)$; similarly, the maximum is $u(w)$. Then, a simple stochastic intervention, based on a shift $\delta$, may be defined \begin{equation}\label{eqn:shift} d(a, w) = \begin{cases} a - \delta & \text{if } a > l(w) + \delta \\ a & \text{if } a \leq l(w) + \delta, \end{cases} \end{equation} where $0 \leq \delta \leq u(w)$ is an arbitrary pre-specified value that defines the degree to which the observed value $A$ is to be shifted, where possible. 
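To make the definition above concrete, consider a minimal R sketch of the
shift function $d(a, w)$; here, `d_shift` and `lw` are purely illustrative
names (with `lw` a hypothetical user-supplied function returning the lower
support bound $l(w)$), not part of the package:

```{r shift_sketch}
# minimal sketch of the shift d(a, w) defined above; `lw` is a
# hypothetical function returning the lower support bound l(w) and is
# not part of tmle3shift
d_shift <- function(a, w, delta, lw) {
  ifelse(a > lw(w) + delta, a - delta, a)
}

# example: shift treatment values down by delta = 0.5, assuming a
# constant lower support bound of zero
d_shift(a = c(0.2, 1.7, 3.1), w = 0, delta = 0.5, lw = function(w) 0)
```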
For the purpose of using such a shift in practice, the present software
provides the functions `shift_additive` and `shift_additive_inv`, which
define a variation of this shift under the assumption that the density of
treatment $A$, conditional on the covariates $W$, has support _a.e._

### Simulate Data

```{r sim_data}
# simulate simple data for tmle-shift sketch
n_obs <- 1000 # number of observations
n_w <- 1 # number of baseline covariates
tx_mult <- 2 # multiplier for the effect of W = 1 on the treatment

## baseline covariates -- simple, binary
W <- as.numeric(replicate(n_w, rbinom(n_obs, 1, 0.5)))

## create treatment based on baseline W
A <- as.numeric(rnorm(n_obs, mean = tx_mult * W, sd = 1))

## create outcome as a linear function of A, W + white noise
Y <- A + W + rnorm(n_obs, mean = 0, sd = 0.5)
```

The above composes our observed data structure $O = (W, A, Y)$. To formally
express this fact using the `tlverse` grammar introduced by the [`tmle3`
package](https://github.com/tlverse/tmle3), we create a single data object
and specify the functional relationships between the nodes in the _directed
acyclic graph_ (DAG) via _nonparametric structural equation models_ (NPSEMs),
reflected in the node list that we set up:

```{r data_nodes}
# organize data and nodes for tmle3
data <- data.table(W, A, Y)
node_list <- list(W = "W", A = "A", Y = "Y")
head(data)
```

We now have an observed data structure (`data`) and a specification of the
role that each variable in the data set plays as the nodes in a DAG.

## Methodology

To start, we will initialize a specification for the TMLE of our parameter of
interest (called a `tmle3_Spec` in the `tlverse` nomenclature) simply by
calling `tmle_shift`. We specify the argument `shift_val = 0.5` when
initializing the `tmle3_Spec` object to communicate that we're interested in
a shift of $0.5$ on the scale of the treatment $A$ -- that is, we specify
$\delta = 0.5$ (note that this is an arbitrarily chosen value for this
example).

```{r spec_init}
# initialize a tmle specification
tmle_spec <- tmle_shift(
  shift_val = 0.5,
  shift_fxn = shift_additive,
  shift_fxn_inv = shift_additive_inv
)
```

As seen above, the `tmle_shift` specification object (like all `tmle3_Spec`
objects) does _not_ store the data for our specific analysis of interest.
Later, we'll see that passing a data object directly to the `tmle3` wrapper
function, alongside the instantiated `tmle_spec`, will serve to construct a
`tmle3_Task` object internally (see the [`tmle3`
documentation](https://tlverse.org/tmle3) for details).

Note that, by default, the `tmle_spec` object is set up to facilitate
cross-validated estimation of likelihood components, ensuring that certain
empirical process conditions may be circumvented by reducing the contribution
of an empirical process term to the estimated influence function
[@zheng2011cross]. In practice, this automatic incorporation of
cross-validation (CV-TMLE) means that the user need not be concerned with
these theoretical conditions being satisfied; moreover, cross-validated
estimation of the efficient influence function is expected to control the
estimated variance.
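Before proceeding to estimation, note that, because the data above were
simulated, a ground-truth value of the target parameter is available for
comparison. The sketch below is a quick Monte Carlo check under two stated
assumptions: that `shift_additive` encodes the simple additive shift
$a \mapsto a + \delta$, and that the shift applies to all units (i.e.,
ignoring truncation near the support bounds, which is innocuous here since
the simulated treatment has unbounded support):

```{r true_psi_check}
# sanity check: under the linear data-generating mechanism above, the
# counterfactual mean under an everywhere-applied additive shift is
# E[Y_{A + delta}] = E[A] + delta + E[W] = 1 + 0.5 + 0.5 = 2
delta <- 0.5
n_sim <- 1e6
W_sim <- rbinom(n_sim, 1, 0.5)
A_sim <- rnorm(n_sim, mean = tx_mult * W_sim, sd = 1)
mean((A_sim + delta) + W_sim) # noise term has mean zero, so omitted
```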
### _Interlude:_ Constructing Optimal Stacked Regressions with `sl3`

To easily incorporate ensemble machine learning into the estimation
procedure, we rely on the facilities provided in the [`sl3` R
package](https://tlverse.org/sl3). For a complete guide on using the `sl3` R
package, consider consulting https://tlverse.org/sl3, or https://tlverse.org
for the [`tlverse` ecosystem](https://github.com/tlverse), of which `sl3` is
a core component.

Using the framework provided by the [`sl3` package](https://tlverse.org/sl3),
the nuisance parameters of the TML estimator may be fit with ensemble
learning, using the cross-validation framework of the Super Learner algorithm
of @vdl2007super. To estimate the treatment mechanism (denoted $g$), we must
make use of learning algorithms specifically suited to conditional density
estimation; a list of such learners may be extracted from `sl3` using
`sl3_list_learners()`:

```{r sl3_density_lrnrs_search}
sl3_list_learners("density")
```

To proceed, we'll select two of the above learners: `Lrnr_haldensify`, which
uses the highly adaptive lasso for conditional density estimation, based on
an algorithm given by @diaz2011super, and `Lrnr_density_semiparametric`, an
approach for semiparametric conditional density estimation:

```{r sl3_density_lrnrs}
# learners used for conditional density regression (i.e., propensity score)
haldensify_lrnr <- Lrnr_haldensify$new(
  n_bins = 3, grid_type = "equal_mass",
  lambda_seq = exp(seq(-1, -9, length = 100))
)
hse_lrnr <- Lrnr_density_semiparametric$new(mean_learner = Lrnr_glm$new())
mvd_lrnr <- Lrnr_density_semiparametric$new(
  mean_learner = Lrnr_glm$new(),
  var_learner = Lrnr_mean$new()
)
sl_lrn_dens <- Lrnr_sl$new(
  learners = list(haldensify_lrnr, hse_lrnr, mvd_lrnr),
  metalearner = Lrnr_solnp_density$new()
)
```

We also require an approach for estimating the outcome regression (denoted
$Q$). For this, we build a Super Learner composed of an intercept model, a
main terms GLM, and the [xgboost algorithm](https://xgboost.ai/) for gradient
boosting:

```{r sl3_regression_lrnrs}
# learners used for conditional expectation regression (e.g., outcome)
mean_lrnr <- Lrnr_mean$new()
glm_lrnr <- Lrnr_glm$new()
xgb_lrnr <- Lrnr_xgboost$new()
sl_lrn <- Lrnr_sl$new(
  learners = list(mean_lrnr, glm_lrnr, xgb_lrnr),
  metalearner = Lrnr_nnls$new()
)
```

We can make the above explicit with respect to standard notation by bundling
the ensemble learners into a `list` object below:

```{r make_lrnr_list}
# specify outcome and treatment regressions and create learner list
Q_learner <- sl_lrn
g_learner <- sl_lrn_dens
learner_list <- list(Y = Q_learner, A = g_learner)
```

The `learner_list` object above specifies the role that each of the ensemble
learners we've generated is to play in computing initial estimators to be
used in building a TMLE for the parameter of interest. In particular, it
makes explicit the fact that our `Q_learner` is used in fitting the outcome
regression, while our `g_learner` is used in fitting the treatment mechanism
regression.

### Targeted Estimation of Stochastic Intervention Effects

Having constructed the TMLE specification, the data object, the node list,
and the learner list, we need only pass these to the `tmle3` wrapper function
to fit the TML estimator:

```{r fit_tmle}
tmle_fit <- tmle3(tmle_spec, data, node_list, learner_list)
tmle_fit
```

The `print` method of the resultant `tmle_fit` object conveniently displays
the results from computing our TML estimator.
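Beyond the printed output, the point estimate, its standard error, and the
reported confidence interval may also be extracted programmatically. The
following is a minimal sketch, assuming the fit object exposes a `summary`
`data.table` with `tmle_est` and `se` columns (as in recent `tmle3` releases;
column names may vary across versions). The Wald-type interval it
reconstructs is justified in the next section:

```{r extract_summary}
# extract the summary table from the fitted object (assumes a `summary`
# data.table with `tmle_est` and `se` columns, as in recent tmle3 releases)
est_tab <- tmle_fit$summary
psi_hat <- est_tab$tmle_est
se_hat <- est_tab$se

# reconstruct the reported 95% Wald-type confidence interval by hand
psi_hat + c(-1, 1) * qnorm(0.975) * se_hat
```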
### Statistical Inference for Targeted Maximum Likelihood Estimates

Recall that the asymptotic distribution of TML estimators has been studied
thoroughly:
$$\psi_n - \psi_0 = (P_n - P_0) \cdot D(\bar{Q}_n^{\star}, g_n) +
R(\hat{P}^{\star}, P_0),$$
which, provided the following two conditions

1. $D(\bar{Q}_n^{\star}, g_n)$ converges to $D(P_0)$ in $L_2(P_0)$ norm, and
2. the size of the class of functions considered for estimation of
   $\bar{Q}_n^{\star}$ and $g_n$ is bounded (technically, $\exists
   \mathcal{F}$ s.t. $D(\bar{Q}_n^{\star}, g_n) \in \mathcal{F}$ *__whp__*,
   where $\mathcal{F}$ is a Donsker class),

readily admits the conclusion that
$$\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + R(\hat{P}^{\star}, P_0).$$

Under the additional condition that the remainder term
$R(\hat{P}^{\star}, P_0)$ decays as
$o_P\left(\frac{1}{\sqrt{n}}\right)$, we have that
$$\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) +
o_P\left(\frac{1}{\sqrt{n}}\right),$$
which, by a central limit theorem, establishes a Gaussian limiting
distribution for the estimator:
$$\sqrt{n}(\psi_n - \psi) \to N(0, V(D(P_0))),$$
where $V(D(P_0))$ is the variance of the efficient influence curve (canonical
gradient) when $\psi$ admits an asymptotically linear representation.

The above implies that $\psi_n$ is a $\sqrt{n}$-consistent estimator of
$\psi$, that it is asymptotically normal (as given above), and that it is
locally efficient. This allows us to build Wald-type confidence intervals in
a straightforward manner:
$$\psi_n \pm z_{1 - \alpha/2} \cdot \frac{\sigma_n}{\sqrt{n}},$$
where $\sigma_n^2$ is an estimator of $V(D(P_0))$. The estimator
$\sigma_n^2$ may be obtained using the bootstrap or computed directly via the
following:
$$\sigma_n^2 = \frac{1}{n} \sum_{i = 1}^{n}
D^2(\bar{Q}_n^{\star}, g_n)(O_i).$$

Having now re-examined these facts, let's simply examine the results of
computing our TML estimator:

```{r tmle_inference}
tmle_fit
```

## References