tmle3
Framework OverviewThe tmle3
package differs from previous TMLE software
efforts in that it attempts to directly model the key objects defined in
the mathematical and theoretical framework of Targeted Minimum
Loss-Based Estimation (TMLE). That is, rather than focus on implementing
a specific TML estimator, or a small set of related estimators, the
focus is on modeling the TMLE framework itself.
Therefore, we explicitly define objects to model the NPSEM, the
factorized likelihood, counterfactual interventions, parameters, and
TMLE update procedures. The hope is that, in so doing, it will be
possible to support a substantial subset of the vast array of TML
estimators currently present in the literature, as well as those that
have yet to be developed. In this vignette, we describe these
mathematical objects, their software analogs in tmle3
, and
illustrate with a motivating example, described below. At the end, we
describe how these objects can be bundled into a complete specification
of a TML estimation procedure that can be easily applied by an end
user.
We use data from the Collaborative Perinatal Project (CPP), available
in the sl3
package. To simplify this example, we define a
binary intervention variable, parity01
– an indicator of
having one or more children before the current child and a binary
outcome, haz01
– an indicator of having an above average
height for age.
TMLE requires the specification of a Nonparametric Structural Equation Model (NPSEM), which specifies our knowledge of relationships between the variables.
We start with a set of endogenous variables, X = (X1, …, XJ), that we want to model the relationship between. Each Xj is at least partially observed in the dataset. The NPSEM defines each variable (Xj) by a deterministic function (fXj) of its parent nodes (Pa(Xj)) and an exogenous random variable (UXj):
Xj = fXj(Pa(Xj), UXj), j ∈ {1, …, J}
The exact functional form of the functions fXj is left unspecified at this step. If there is a priori knowledge for some of these functions, that can be specified during the likelihood step below.
The collection of exogenous random variables defined by the NPSEM is
U = (UX1, …, UXJ).
Typically, non-testable assumptions about the joint distribution of
U are necessary for
identifiability of causal parameters with statistical parameters of the
observed data. These assumptions are not managed in the
tmle3
framework, which instead focus on the statistical
estimation problem. Therefore, those developing tools for end users need
to be clear about the additional causal assumptions necessary for causal
interpretation of estimates.
In the case of our CPP example, we use the classic point treatment
NPSEM which defines three nodes: X = (W, A, Y),
where W is a set of baseline
covariates, A is our exposure
of interest (parity01
), and Y is our outcome of interest
(haz01
). We define the following SCM:
W = fW(UW) A = fA(W, UA) Y = fY(W, UY)
In tmle3
, this is done using the
define_node
function for each node.
define_node
allows a user to specify the node_name, which
columns in the data comprise the node, and a list of parent nodes.
npsem <- list(
define_node("W", c(
"apgar1", "apgar5", "gagebrth", "mage",
"meducyrs", "sexn"
)),
define_node("A", c("parity01"), c("W")),
define_node("Y", c("haz01"), c("A", "W"))
)
Nodes also track information about the data types of the variables (continuous, categorical, binomial, etc). Here, that information is being estimated automatically from the data. In the future, each node will also contain information about censoring indicators, where applicable, but this is not yet implemented.
tmle3_Task
A tmle3_Task
is an object comprised of observed data,
and the NPSEM defined above:
This task object contains methods to help subset the data as needed for various steps in the TMLE process:
## [1] 1 1 1 0 0 1
## An sl3 Task with 1441 obs and these nodes:
## $covariates
## A W1 W2 W3 W4 W5 W6
## "parity01" "apgar1" "apgar5" "gagebrth" "mage" "meducyrs" "sexn"
##
## $outcome
## [1] "haz01"
##
## $id
## NULL
##
## $weights
## NULL
##
## $offset
## NULL
##
## $time
## NULL
A tmle3_Task
is a special kind of sl3_Task
that can be used to estimate factors of a likelihood from data. The
process of defining and estimating a likelihood is described next.
Having defined the NPSEM, we can now define a joint likelihood (probability density function) over the observed variables X:
P(X1, …, XJ ∈ D) = ∫DfX1, …, XJ(x1, …, xJ)dx1, …, dxJ
This can then be factorized into a series of conditional densities according to the NPSEM: $$f_{X_1, \ldots, X_J} = \prod_j^J f_{X_j \mid Pa(X_j)}(x \mid Pa(x_j))$$
Where each fXj ∣ Pa(Xj) is a conditional pdf (or probability mass function for discrete Xj), where the conditioning set is all parent nodes as defined in the NPSEM. We refer to these objects as likelihood factors.
TMLE depends on estimates (or a priori knowledge) of the functional form of these likelihood factors. However, not all factors of the likelihood are always necessary for estimation, and only those necessary will be estimated.
tmle3
models this likelihood as a list of likelihood
factor objects, where each likelihood factor object describes either
a priori knowledge or an estimation strategy for the
corresponding likelihood factor. These objects all inherit from the
LF_base
base class, and there are different types depending
on which of a range of estimation strategies or a priori
knowledge is appropriate.
In some cases, a full conditional density for a particular factor is not necessary. Instead, a conditional mean – a much easier quantity to estimate – is all that’s required. Although conditional means are not truly likelihood factors, conditional means are also modeled using using likelihood factor objects.
LF_emp
LF_emp
represents a likelihood factor to be estimated
using nonparametric maximum likelihood estimation (NP-MLE). That is,
probability mass $\frac{1}{n}$ is
placed on each observation in the observed dataset:
$$f_{X_j}(x_j) = \frac{1}{n}\mathbb{I}(x_j \in X_{n,j})$$
Going forward, weights will be used if specified, although this is
not yet supported. LF_emp
only supports marginal densities.
That is, the conditioning set, Pa(Xj)
must be empty. Therefore, it is only appropriate for estimation of the
marginal density of baseline covariates.
LF_fit
LF_fit
represents a likelihood factor to be estimated
using the sl3
framework. Based on the learner type used,
this can fit a pmf (for binomial or categorical data, see
sl3_list_learners("binomial")
and
sl3_list_learners("categorical")
for lists), a conditional
mean (most learners), or a conditional density (e.g., using a
semiparametric conditional density estimator via
Lrnr_density_semiparametric
). LF_fit
takes a
sl3
learner object as an argument, which is fit to the data
in the tmle3_Task
automatically. Details for specifying
different kinds of learners in sl3
may be found at http://tlverse.org/sl3/articles/intro_sl3.html
The above to likelihood factor types, LF_fit
, and
LF_emp
, are both likelihood factors where the factor is
estimated from data. In some cases, users may have a priori
knowledge of a likelihood factor. For instance, in an RCT, there might
be an unconditional probability of treatment of p = 0.5. Additional likelihood
factor types need to be create to accommodate this type of
knowledge.
Going back to our CPP data example, we will estimate the marginal
likelihood of W, using NP-MLE,
the conditional density of A
given W using a GLM fit via
sl3
and the conditional mean of Y given A and W using another GLM fit via
sl3
:
# set up sl3 learners for tmle3 fit
lrnr_glm_fast <- make_learner(Lrnr_glm_fast)
lrnr_mean <- make_learner(Lrnr_mean)
# define and fit likelihood
factor_list <- list(
define_lf(LF_emp, "W"),
define_lf(LF_fit, "A", lrnr_glm_fast),
define_lf(LF_fit, "Y", lrnr_glm_fast, type="mean")
)
The particular likelihood factors and estimation strategies to use
will of course depend on the parameter of interest. Once this list of
likelihood factors is defined, we can construct a
Likelihood
object and train it on the data contained in
tmle_task
:
likelihood_def <- Likelihood$new(factor_list)
likelihood <- likelihood_def$train(tmle_task)
print(likelihood)
## W: Lf_emp
## A: LF_fit
## Y: LF_fit
A tmle3
Likelihood
is actually a special
type of sl3
learner, so the syntax to train it on data is
analogous.
Having fit the likelihood, we can now get likelihood values for any
tmle3_Task
:
## [1] 0.5792991 0.5792991 0.6909451 0.6909451 0.6909451 0.4523370
In tmle3
, interventions are modeled by likelihoods where
one or more likelihood factors is replaced with a counterfactual version
representing some intervention.
tmle3
defines the CF_Likelihood
class,
which inherits from Likelihood
, and takes an
observed_likelihood
and an
intervention_list
.
Below, we describe some examples of additional likelihood factors
intended to be used to describe interventions. We expect this list to
grow as tmle3
is extended to additional use-cases.
LF_static
Likelihood factor for a static intervention, where all observations are set do a single intervention value x′:
fXj ∣ Pa(Xj)(xj ∣ Pa(xj)) = I(xj = x′)
Additional likelihood factor types need to be defined for other types
of interventions, such as dynamic rules and stochastic interventions.
Currently, a prototype version of a stochastic shift intervention exists
in LF_shift
.
For our CPP example, we’ll define a simple intervention where we set all treatment A = 1:
We can then use this to construct a counterfactual likelihood:
A cf_likelihood
is a likelihood object, and so has the
same behavior as the observed likelihood object defined above, but with
the observed likelihood factors being replaced by the defined
intervention likelihood factors.
In particular, we can get likelihood values under the counterfactual likelihood:
## [1] 1 1 0 0 0 1
We see that the likelihood values for the A node are all either 0 or 1, as would be expected from an indicator likelihood function. In addition, the likelihood values for the non-intervention nodes have not changed.
Each CF_Likelihood
can generate one or more
counterfactual tasks. These are tmle3_Task
s in which
observed values are replaced with counterfactual values according to the
specified intervention distribution. For deterministic interventions,
only one task will be generated. However, stochastic interventions, when
implemented, will generate several such tasks, one for each combination
of possible values of the intervention node(s).
To enumerate these tasks, use enumerate_cf_tasks
:
cf_likelihood_tasks <- cf_likelihood$enumerate_cf_tasks(tmle_task)
head(cf_likelihood_tasks[[1]]$data)
## apgar1 apgar5 gagebrth mage meducyrs sexn parity01 haz01
## <num> <num> <num> <int> <num> <int> <num> <num>
## 1: 8 9 287 21 12 1 1 1
## 2: 8 9 287 21 12 1 1 1
## 3: 8 9 280 15 0 1 1 1
## 4: 8 9 280 15 0 1 1 0
## 5: 8 9 280 15 0 1 1 0
## 6: 9 9 266 23 0 1 1 1
In this case, you can see that parity01
has been set to
1 for all observations, consistent with a static intervention on this
node.
In the TMLE framework, we define a target parameter Ψ(P) as a mapping from a probability distribution P ∈ ℳ to a set of real numbers ℝd. Here ℳ is implied by the NPSEM we defined above.
In tmle3
, we define parameter objects as objects
inheriting from the Param_base
class, which keep track of
not only the mapping from a probability distribution to a parameter
value, but also the corresponding EIF of the parameter, and the “clever
covariates” needed to calculate a TMLE update to the likelihood.
Here, we define a treatment-specific mean (TSM) parameter based on the intervention we defined previously:
## Warning in super$initialize(observed_likelihood, ..., outcome_node =
## outcome_node): Parameter was passed a non-Targeted Likelihood object so
## estimates cannot be updated from initial
TODO: provide details about parameter definition
The update procedure component of tmle3
is currently in
flux. The current structure is as follows:
We have an object, tmle3_Update
, which calculates the
individual update steps using tmle3_Update$update_step
.
This adds to a Likelihood$update_list
, so that future calls
to Likelihood$get_likelihoods
will return updated
likelihood values. However, likelihood values are generally recomputed
at each step, which requires applying all past updates. This is
ridiculously inefficient.
Instead, we need to do what previous TMLE implementations have done, which is enumerate a list of required likelihood values, and update those values as we go (as opposed to updating the function and recalculating the value each time they are needed). This requires the ability to have the parameters enumerate which likelihood values they will need for defining the clever covariate, as well as parameter mapping and the EIF. This has not yet been implemented.
Therefore, the update procedure, as well as the structure of the
Param_base
parameter objects are subject to substantial
changes in the near future.
Currently, the tmle3_Update
object also has a hard-coded
submodel (logistic), loss function (log-likelihood), and solver (GLM).
These need to be generalized so updates can be done for a range of
submodels, loss functions, and solvers.
Current Usage:
In the TMLE framework, we define a target parameter Ψ(P) as a mapping from a probability distribution P ∈ ℳ to a set of real numbers ℝd. Here ℳ is implied by the NPSEM we defined above.
In tmle3
, we define parameter objects as objects
inheriting from the Param_base
class, which keep track of
not only the mapping from a probability distribution to a parameter
value, but also the corresponding EIF of the parameter, and the “clever
covariates” needed to calculate a TMLE update to the likelihood.
Here, we define a treatment specific mean (TSM) parameter based on the intervention we defined previously:
## Warning in super$initialize(observed_likelihood, ..., outcome_node =
## outcome_node): Parameter was passed a non-Targeted Likelihood object so
## estimates cannot be updated from initial
TODO: provide details about parameter definition
tmle3_Fit
- Putting it all togetherNow that we have specified all the components required for the TMLE procedure, we can generate an object that manages all the components and finally calculate an appropriate TML estimator.
## Warning in learner$predict_fold(learner_task, fold_number):
## Lrnr_glm_fast_TRUE_Cholesky is not cv-aware: self$predict_fold reverts to
## self$predict
## Warning in learner$predict_fold(learner_task, fold_number):
## Lrnr_glm_fast_TRUE_Cholesky is not cv-aware: self$predict_fold reverts to
## self$predict
## Warning in learner$predict_fold(learner_task, fold_number):
## Lrnr_glm_fast_TRUE_Cholesky is not cv-aware: self$predict_fold reverts to
## self$predict
## A tmle3_Fit that took 1 step(s)
## type param init_est tmle_est se lower upper
## <char> <char> <num> <num> <num> <num> <num>
## 1: TSM E[Y_{A=1}] 0.5280522 0.5280522 0.01464371 0.4993511 0.5567533
## psi_transformed lower_transformed upper_transformed
## <num> <num> <num>
## 1: 0.5280522 0.4993511 0.5567533
The tmle3
framework described above is completely
general, and allows most components of the TMLE procedure to be
specified in a modular way. However, most end users will not be
interested in manually specifying all of these components. Therefore,
tmle3
implements a tmle3_Spec
object that
bundles a set of components into a specification that, with
minimal additional detail, can be run by an end-user:
nodes <- list(W = c("apgar1", "apgar5", "gagebrth", "mage", "meducyrs",
"sexn"),
A = "parity01",
Y = "haz01")
lrnr_glm_fast <- make_learner(Lrnr_glm_fast)
lrnr_mean <- make_learner(Lrnr_mean)
learner_list <- list(Y = lrnr_mean, A = lrnr_glm_fast)
# make a new copy to deal with data.table weirdness
cpp2 <- data.table::copy(cpp)
tmle_fit_from_spec <- tmle3(tmle_TSM_all(), cpp2, nodes, learner_list)
## Warning in learner$predict_fold(learner_task, fold_number):
## Lrnr_glm_fast_TRUE_Cholesky is not cv-aware: self$predict_fold reverts to
## self$predict
## Warning in learner$predict_fold(learner_task, fold_number): Lrnr_mean is not
## cv-aware: self$predict_fold reverts to self$predict
## Warning in learner$predict_fold(learner_task, fold_number): Lrnr_mean is not
## cv-aware: self$predict_fold reverts to self$predict
## Warning in learner$predict_fold(learner_task, fold_number): Lrnr_mean is not
## cv-aware: self$predict_fold reverts to self$predict
## Warning in learner$predict_fold(learner_task, fold_number):
## Lrnr_glm_fast_TRUE_Cholesky is not cv-aware: self$predict_fold reverts to
## self$predict
## Warning in learner$predict_fold(learner_task, fold_number):
## Lrnr_glm_fast_TRUE_Cholesky is not cv-aware: self$predict_fold reverts to
## self$predict
## A tmle3_Fit that took 1 step(s)
## type param init_est tmle_est se lower upper
## <char> <char> <num> <num> <num> <num> <num>
## 1: TSM E[Y_{A=0}] 0.55517 0.6979470 0.04692307 0.6059795 0.7899146
## 2: TSM E[Y_{A=1}] 0.55517 0.5267412 0.01473676 0.4978577 0.5556247
## psi_transformed lower_transformed upper_transformed
## <num> <num> <num>
## 1: 0.6979470 0.6059795 0.7899146
## 2: 0.5267412 0.4978577 0.5556247
Currently, this is effectively a hard-coded list of those details:
the structure of the NPSEM, the parameters, and the update procedure are
coded into the specification. Only the data, the roles of the variables,
and the sl3
learners to use for likelihood estimation.
Ideally, instead a tmle3_Spec
would represent a set of
reasonable defaults for a particular TMLE, that experienced users could
override where appropriate.
Obviously, there’s a lot more to do:
tmle3_Update
tmpe3_Spec