--- title: "Variable Importance Analysis with Stochastic Interventions" author: "[Nima Hejazi](https://nimahejazi.org), [Jeremy Coyle](https://github.com/jeremyrcoyle), and [Mark van der Laan](https://vanderlaan-lab.org)" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteIndexEntry{Variable Importance Analysis with Stochastic Interventions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r options, echo=FALSE, eval=FALSE} options(scipen=999) ``` ## Introduction Stochastic treatment regimes present a relatively simple manner in which to assess the effects of continuous treatments by way of parameters that examine the effects induced by the counterfactual shifting of the observed values of a treatment of interest. Here, we present an implementation of a new algorithm for computing targeted minimum loss-based estimates of treatment shift parameters defined based on a shifting function $d(A,W)$. For a technical presentation of the algorithm, the interested reader is invited to consult @diaz2018stochastic. For additional background on Targeted Learning and previous work on stochastic treatment regimes, please consider consulting @vdl2011targeted, @vdl2018targeted, and @diaz2012population. To start, let's load the packages we'll use and set a seed for simulation: ```{r setup, message=FALSE, warning=FALSE} library(data.table) library(sl3) library(tmle3) library(tmle3shift) set.seed(429153) ``` ## Data and Notation Consider $n$ observed units $O_1, \ldots, O_n$, where each random variable $O = (W, A, Y)$ corresponds to a single observational unit. Let $W$ denote baseline covariates (e.g., age, sex, education level), $A$ an intervention variable of interest (e.g., nutritional supplements), and $Y$ an outcome of interest (e.g., disease status). Though it need not be the case, let $A$ be continuous-valued, i.e. $A \in \mathbb{R}$. Let $O_i \sim \mathcal{P} \in \mathcal{M}$, where $\mathcal{M}$ is the nonparametric statistical model defined as the set of continuous densities on $O$ with respect to some dominating measure. To formalize the definition of stochastic interventions and their corresponding causal effects, we introduce a nonparametric structural equation model (NPSEM), based on @pearl2000causality, to define how the system changes under posited interventions: \begin{align*}\label{eqn:npsem} W &= f_W(U_W) \\ A &= f_A(W, U_A) \\ Y &= f_Y(A, W, U_Y), \end{align*} We denote the observed data structure $O = (W, A, Y)$ Letting $A$ denote a continuous-valued treatment, we assume that the distribution of $A$ conditional on $W = w$ has support in the interval $(l(w), u(w))$ -- for convenience, let this support be _a.e._ That is, the minimum natural value of treatment $A$ for an individual with covariates $W = w$ is $l(w)$; similarly, the maximum is $u(w)$. Then, a simple stochastic intervention, based on a shift $\delta$, may be defined \begin{equation}\label{eqn:shift} d(a, w) = \begin{cases} a - \delta & \text{if } a > l(w) + \delta \\ a & \text{if } a \leq l(w) + \delta, \end{cases} \end{equation} where $0 \leq \delta \leq u(w)$ is an arbitrary pre-specified value that defines the degree to which the observed value $A$ is to be shifted, where possible. ### Simulate Data ```{r sim_data} # simulate simple data for tmle-shift sketch n_obs <- 1000 # number of observations n_w <- 1 # number of baseline covariates tx_mult <- 2 # multiplier for the effect of W = 1 on the treatment # baseline covariates -- simple, binary W <- as.numeric(replicate(n_w, rbinom(n_obs, 1, 0.5))) # create treatment based on baseline W A <- as.numeric(rnorm(n_obs, mean = tx_mult * W, sd = 1)) # create outcome as a linear function of A, W + white noise Y <- A + W + rnorm(n_obs, mean = 0, sd = 0.5) ``` The above composes our observed data structure $O = (W, A, Y)$. To formally express this fact using the `tlverse` grammar introduced by the [`tmle3` package](https://github.com/tlverse/tmle3), we create a single data object and specify the functional relationships between the nodes in the _directed acyclic graph_ (DAG) via _nonparametric structural equation models_ (NPSEMs), reflected in the node list that we set up: ```{r data_nodes} # organize data and nodes for tmle3 data <- data.table(W, A, Y) node_list <- list(W = "W", A = "A", Y = "Y") head(data) ``` We now have an observed data structure (`data`) and a specification of the role that each variable in the data set plays as the nodes in a DAG. ## Methodology ### Defining a grid of counterfactual interventions In order to specify a _grid_ of shifts $\delta$ to be used in defining a set of stochastic intervention policies in an _a priori_ manner, let us consider an arbitrary scalar $\delta$ that defines a counterfactual outcome $\psi_n = Q_n(d(A, W), W)$, where, for simplicity, let $d(A, W) = A + \delta$. A simplified expression of the auxiliary covariate for the TML estimator of $\psi$ is $H_n = \frac{g^{\star}(a \mid w)}{g(a \mid w)}$, where $g^{\star}(a \mid w)$ defines the treatment mechanism with the stochastic intervention implemented. To ascertain whether a given choice of the shift $\delta$ is _admissable_ -- that is, whether such an intervention may be implemented while avoiding violations of the positivity assumption -- define a bound $C(\delta) := \frac{g^{\star}(a \mid w)}{g(a \mid w)} \leq M$, where $g^{\star}(a \mid w)$ is a function of $\delta$ in part, and $M$ is a potentially user-specified upper bound of $C(\delta)$. Then, $C(\delta)$ may be interpreted as a measure of the influence of a given observation providing a way to limit the maximum influence of a given observation through a choice of the shift $\delta$ and the setting of the bound $M$. We formalize and extend the procedure to determine an acceptable set of values for the shift $\delta$ in the sequel. Specifically, let there be a shift $d(a, w) = a + \delta$, where the shift $\delta$ is defined \begin{equation} \delta(a, w) = \begin{cases} \delta, & \delta_{\text{min}}(a,w) \leq \delta \leq \delta_{\text{max}}(a,w) \\ \delta_{\text{max}}(a,w), & \delta \geq \delta_{\text{max}}(a,w) \\ \delta_{\text{min}}(a,w), & \delta \leq \delta_{\text{min}}(a,w) \\ \end{cases}, \end{equation} where $$\delta_{\text{max}}(a, w) = \text{argmax}_{\left\{\delta \geq 0, \frac{g(a - \delta \mid w)}{g(a \mid w)} \leq M \right\}} \frac{g(a - \delta \mid w)}{g(a \mid w)}$$ and $$\delta_{\text{min}}(a, w) = \text{argmin}_{\left\{\delta \leq 0, \frac{g(a - \delta \mid w)}{g(a \mid w)} \leq M \right\}} \frac{g(a - \delta \mid w)}{g(a \mid w)}.$$ The above provides a strategy for implementing a shift at the level of a given observation $(a, w)$, thereby allowing for all observations to be shifted to an appropriate value -- whether $\delta_{\text{min}}$, $\delta$, or $\delta_{\text{max}}$. For the purpose of using such a shift in practice, the present software provides the functions `shift_additive_bounded` and `shift_additive_bounded_inv`, which define a variation of this shift: \begin{equation} \delta(a, w) = \begin{cases} \delta, & C(\delta) \leq M \\ 0, \text{otherwise} \\ \end{cases}, \end{equation} which corresponds to an intervention in which the natural value of treatment of a given observational unit is shifted by a value $\delta$ in the case that the ratio of the intervened density $g^{\star}(a \mid w)$ to the natural density $g(a \mid w)$ (that is, $C(\delta)$) does not exceed a bound $M$. In the case that the ratio $C(\delta)$ exceeds the bound $M$, the stochastic intervention policy does not apply to the given unit and they remain at their natural value of treatment $a$. ### _Interlude:_ Constructing Optimal Stacked Regressions with `sl3` To easily incorporate ensemble machine learning into the estimation procedure, we rely on the facilities provided in the [`sl3` R package](https://tlverse.org/sl3). For a complete guide on using the `sl3` R package, consider consulting https://tlverse.org/sl3, or https://tlverse.org for the [`tlverse` ecosystem](https://github.com/tlverse), of which `sl3` is a core engine. Using the framework provided by the [`sl3` package](https://tlverse.org/sl3), the nuisance parameters of the TML estimator may be fit with ensemble learning, using the cross-validation framework of the Super Learner algorithm of @vdl2007super. To estimate the treatment mechanism (often denoted "g" in the targeted learning literature), we must make use of learning algorithms specifically suited to conditional density estimation; a list of such learners may be extracted from `sl3` by using `sl3_list_learners()`: ```{r sl3_density_lrnrs_search} sl3_list_learners("density") ``` To proceed, we'll select two of the above learners, `Lrnr_haldensify` for using the highly adaptive lasso for conditional density estimation, based on an algorithm given by @diaz2011super, and `Lrnr_density_semiparametric`, an approach for semiparametric conditional density estimation: ```{r sl3_density_lrnrs} # learners used for conditional density regression (i.e., propensity score) haldensify_lrnr <- Lrnr_haldensify$new( n_bins = 3, grid_type = "equal_mass", lambda_seq = exp(seq(-1, -9, length = 100)) ) hse_lrnr <- Lrnr_density_semiparametric$new(mean_learner = Lrnr_glm$new()) mvd_lrnr <- Lrnr_density_semiparametric$new(mean_learner = Lrnr_glm$new(), var_learner = Lrnr_mean$new()) sl_lrn_dens <- Lrnr_sl$new( learners = list(haldensify_lrnr, hse_lrnr, mvd_lrnr), metalearner = Lrnr_solnp_density$new() ) ``` We also required an approach for estimating the outcome regression (often denoted "Q" in the targeted learning literature). For this, we build a Super Learner composed of an intercept model, a GLM, and the [xgboost algorithm](https://xgboost.ai/) for gradient boosting: ```{r sl3_regression_lrnrs} # learners used for conditional expectation regression (e.g., outcome) mean_lrnr <- Lrnr_mean$new() glm_lrnr <- Lrnr_glm$new() xgb_lrnr <- Lrnr_xgboost$new() sl_lrn <- Lrnr_sl$new( learners = list(mean_lrnr, glm_lrnr, xgb_lrnr), metalearner = Lrnr_nnls$new() ) ``` We can make the above explicit with respect to standard notation by bundling the ensemble learners into a `list` object below. ```{r make_lrnr_list} # specify outcome and treatment regressions and create learner list Q_learner <- sl_lrn g_learner <- sl_lrn_dens learner_list <- list(Y = Q_learner, A = g_learner) ``` The `learner_list` object above specifies the role that each of the ensemble learners we've generated is to play in computing initial estimators to be used in building a TMLE for the parameter of interest here. In particular, it makes explicit the fact that our `Q_learner` is used in fitting the outcome regression while our `g_learner` is used in fitting our treatment mechanism regression. ### Initializing `vimshift` through its `tmle3_Spec` To start, we will initialize a specification for the TMLE of our parameter of interest (called a `tmle3_Spec` in the `tlverse` nomenclature) simply by calling `tmle_shift`. We specify the argument `shift_grid = seq(-1, 1, by = 1)` when initializing the `tmle3_Spec` object to communicate that we're interested in assessing the mean counterfactual outcome over a grid of shifts `r seq(-1, 1, by = 1)` on the scale of the treatment $A$ (note that the numerical choice of shift is an arbitrarily chosen set of values for this example). ```{r vim_spec_init} # what's the grid of shifts we wish to consider? delta_grid <- seq(-1, 1, 1) # initialize a tmle specification tmle_spec <- tmle_vimshift_delta(shift_fxn = shift_additive_bounded, shift_fxn_inv = shift_additive_bounded_inv, shift_grid = delta_grid, max_shifted_ratio = 2) ``` As seen above, the `tmle_vimshift` specification object (like all `tmle3_Spec` objects) does _not_ store the data for our specific analysis of interest. Later, we'll see that passing a data object directly to the `tmle3` wrapper function, alongside the instantiated `tmle_spec`, will serve to construct a `tmle3_Task` object internally (see the `tmle3` documentation for details). ### Targeted Estimation of Stochastic Interventions Effects One may walk through the step-by-step procedure for fitting the TML estimator of the mean counterfactual outcome under each shift in the grid, using the machinery exposed by the [`tmle3` R package](https://tlverse.org/tmle3) (see below); however, the step-by-step procedure is more often not of interest. ```{r fit_tmle_manual, eval=FALSE} # NOT RUN -- SEE NEXT CODE CHUNK # define data (from tmle3_Spec base class) tmle_task <- tmle_spec$make_tmle_task(data, node_list) # define likelihood (from tmle3_Spec base class) likelihood_init <- tmle_spec$make_initial_likelihood(tmle_task, learner_list) # define update method (fluctuation submodel and loss function) updater <- tmle_spec$make_updater() likelihood_targeted <- Targeted_Likelihood$new(likelihood_init, updater) # invoke params specified in spec tmle_params <- tmle_spec$make_params(tmle_task, likelihood_targeted) updater$tmle_params <- tmle_params # fit TML estimator update tmle_fit <- fit_tmle3(tmle_task, likelihood_targeted, tmle_params, updater) # extract results from tmle3_Fit object tmle_fit ``` Instead, one may invoke the `tmle3` wrapper function (a user-facing convenience utility) to fit the series of TML estimators (one for each parameter defined by the grid delta) in a single function call: ```{r fit_tmle_auto} # fit the TML estimator tmle_fit <- tmle3(tmle_spec, data, node_list, learner_list) tmle_fit ``` _Remark_: The `print` method of the resultant `tmle_fit` object conveniently displays the results from computing our TML estimator. ### Inference with Marginal Structural Models In the directly preceding section, we consider estimating the mean counterfactual outcome $\psi_n$ under several values of the intervention $\delta$, taken from the aforementioned $\delta$-grid. We now turn our attention to an approach for obtaining inference on a single summary measure of these estimated quantities. In particular, we propose summarizing the estimates $\psi_n$ through a marginal structural model (MSM), obtaining inference by way of a hypothesis test on a parameter of this working MSM. For a data structure $O = (W, A, Y)$, let $\psi_{\delta}(P_0)$ be the mean outcome under a shift $\delta$ of the treatment, so that we have $\vec{\psi}_{\delta} = (\psi_{\delta}: \delta)$ with corresponding estimators $\vec{\psi}_{n, \delta} = (\psi_{n, \delta}: \delta)$. Further, let $\beta(\vec{\psi}_{\delta}) = \phi((\psi_{\delta}: \delta))$. For a given MSM $m_{\beta}(\delta)$, we have that $$\beta_0 = \text{argmin}_{\beta} \sum_{\delta}(\psi_{\delta}(P_0) - m_{\beta}(\delta))^2 h(\delta),$$ which is the solution to $$u(\beta, (\psi_{\delta}: \delta)) = \sum_{\delta}h(\delta) \left(\psi_{\delta}(P_0) - m_{\beta}(\delta) \right) \frac{d}{d\beta} m_{\beta}(\delta) = 0.$$ This then leads to the following expansion $$\beta(\vec{\psi}_n) - \beta(\vec{\psi}_0) \approx -\frac{d}{d\beta} u(\beta_0, \vec{\psi}_0)^{-1} \frac{d}{d\psi} u(\beta_0, \psi_0)(\vec{\psi}_n - \vec{\psi}_0),$$ where we have $$\frac{d}{d\beta} u(\beta, \psi) = -\sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta)^t \frac{d}{d\beta} m_{\beta}(\delta) -\sum_{\delta} h(\delta) m_{\beta}(\delta) \frac{d^2}{d\beta^2} m_{\beta}(\delta),$$ which, in the case of an MSM that is a linear model (since $\frac{d^2}{d\beta^2} m_{\beta}(\delta) = 0$), reduces simply to $$\frac{d}{d\beta} u(\beta, \psi) = -\sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta)^t \frac{d}{d\beta} m_{\beta}(\delta),$$ and $$\frac{d}{d\psi}u(\beta, \psi)(\psi_n - \psi_0) = \sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta) (\psi_n - \psi_0)(\delta),$$ which we may write in terms of the efficient influence function (EIF) of $\psi$ by using the first order approximation $(\psi_n - \psi_0)(\delta) = \frac{1}{n}\sum_{i = 1}^n \text{EIF}_{\psi_{\delta}}(O_i)$, where $\text{EIF}_{\psi_{\delta}}$ is the efficient influence function (EIF) of $\vec{\psi}$. Now, say, $\vec{\psi} = (\psi(\delta): \delta)$ is d-dimensional, then we may write the efficient influence function of the MSM parameter $\beta$ (assuming a linear MSM) as follows $$\text{EIF}_{\beta}(O) = \left(\sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta) \frac{d}{d\beta} m_{\beta}(\delta)^t \right)^{-1} \cdot \sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta) \text{EIF}_{\psi_{\delta}}(O),$$ where the first term is of dimension $d \times d$ and the second term is of dimension $d \times 1$. In an effort to generalize still further, consider the case where $\psi_{\delta}(P_0) \in (0, 1)$ -- that is, $\psi_{\delta}(P_0)$ corresponds to the probability of some event of interest. In such a case, it would be more natural to consider a logistic MSM $$m_{\beta}(\delta) = \frac{1}{1 + \exp(-f_{\beta}(\delta))},$$ where $f_{\beta}$ is taken to be linear in $\beta$ (e.g., $f_{\beta} = \beta_0 + \beta_1 \delta + \ldots$). In such a case, we have the parameter of interest $$\beta_0 = \text{argmax}_{\beta} \sum_{\delta} \left(\psi_{\delta}(P_0) \text{log} m_{\beta}(\delta) + (1 - \psi_{\delta}(P_0))\log(1 - m_{\beta}(\delta))\right)h(\delta),$$ where $\beta_0$ solves the following $$ \sum_{\delta} h(\delta) \frac{d}{d\beta} f_{\beta}(\delta) (\psi_{\delta}(P_0) - m_{\beta}(\delta)) = 0.$$ Inference from a working MSM is rather straightforward. To wit, the limiting distribution for $m_{\beta}(\delta)$ may be expressed $$\sqrt{n}(\beta_n - \beta_0) \to N(0, \Sigma),$$ where $\Sigma$ is the empirical covariance matrix of $\text{EIF}_{\beta}(O)$. #### Directly Targeting the MSM Parameter $\beta$ Note that in the above, a working MSM is fit to the individual TML estimates of the mean counterfactual outcome under a given value of the shift $\delta$ in the supplied grid. The parameter of interest $\beta$ of the MSM is asymptotically linear (and, in fact, a TML estimator) as a consequence of its construction from individual TML estimators. In smaller samples, it may be prudent to perform a TML estimation procedure that targets the parameter $\beta$ directly, as opposed to constructing it from several independently targeted TML estimates. An approach for constructing such an estimator is proposed in the sequel. Let $C = \left(\sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta) \frac{d}{d\beta} m_{\beta}(\delta)^t \right)$, then $$\text{EIF}_{\beta}(O) = C^{-1} \cdot \sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta)(Y - \overline{Q}(A,W) + C^{-1} \sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta) \left(\int \overline{Q}(a,w) g_{\delta}^0(a \mid w) - \Psi_{\delta}\right).$$ Suppose a simple working MSM $\mathbb{E}Y_{g^0_{\delta}} = \beta_0 + \beta_1 \delta$, then a TML estimator targeting $\beta_0$ and $\beta_1$ may be constructed as $$\overline{Q}_{n, \epsilon}(A,W) = \overline{Q}_n(A,W) + \epsilon (H_1(g), H_2(g),$$ for all $\delta$, where $H_1(g)$ is the auxiliary covariate for $\beta_0$ and $H_2(g)$ is the auxiliary covariate for $\beta_1$. To construct a targeted maximum likelihood estimator that directly targets the parameters of the working marginal structural model, we may use the `tmle_vimshift_msm` Spec (instead of the `tmle_vimshift_delta` Spec that appears above): ```{r vim_targeted_msm_fit} # what's the grid of shifts we wish to consider? delta_grid <- seq(-1, 1, 1) # initialize a tmle specification tmle_msm_spec <- tmle_vimshift_msm(shift_fxn = shift_additive_bounded, shift_fxn_inv = shift_additive_bounded_inv, shift_grid = delta_grid, max_shifted_ratio = 2) # fit the TML estimator and examine the results tmle_msm_fit <- tmle3(tmle_msm_spec, data, node_list, learner_list) tmle_msm_fit ``` ## References