User:Ostitelman/Draft of Targeted Maximum Likelihood Estimation (TMLE)

Targeted Maximum Likelihood Estimation (TMLE) is a parameter estimation method that specifically targets a parameter of interest by constructing estimators that perform optimally in terms of the bias/variance trade-off for that parameter. Targeted maximum likelihood estimators (TMLE) were originally proposed by van der Laan and Rubin in a 2006 paper. For many commonly implemented parameter estimation methods the parameter being estimated is either a convenient remnant of an a priori assumed model (e.g. linear regression, logistic regression, Cox proportional hazards regression) or the parameter is estimated without concern for the quality of the estimate. In most cases parameter estimation is performed with a misspecified model, and it is known that the resulting estimates, confidence intervals, and p-values for the target parameters will be biased. TMLE is a two-stage method that addresses these drawbacks by estimating target parameters that focus directly on the question of interest while attending to the optimality of the final estimate.

Parameter of Interest
TMLE allows one to estimate parameters which may be written as functions of the distribution of the underlying data under a desired intervention of interest. Defining parameters in this manner gives practitioners great flexibility to construct target parameters that directly answer the underlying question of interest. Suppose one observes a censored data structure $$O = \Phi(C,X)$$ of the full data $$X$$ and censoring variable $$C$$, where $$O$$ has probability distribution $$P_{0}$$. Let $$\mathcal{M}$$ be a semiparametric model for the probability distribution $$P_{0}$$. By assuming coarsening at random (CAR), the likelihood factors as $$P_{0}(O) = Q_{0}(O)g_{0}(O|X)$$, where $$Q_{0}$$ is the part of the likelihood associated with the full data, $$X$$, and $$g_{0}$$ is the conditional distribution of the observed data, $$O$$, given the full data. $$g_{0}$$ typically comprises the distributions of censoring and treatment variables, both of which act to coarsen the full data. The factorization of the density implies that the model $$\mathcal{M}$$ may be partitioned into a model $$\mathcal{Q}$$ for the full data distribution, $$Q_{0}$$, and a model $$\mathcal{G}$$ for the censoring and treatment mechanism, $$g_{0}$$. Thus the probability distribution $$P_{0}$$ may be indexed as $$P_{Q_{0},g_{0}}$$. One is typically interested in estimating a parameter, $$\Psi(P_{0})$$, which is a function of the true data generating distribution. Often the parameter of interest is $$\Psi(Q_{0})$$, a function of the true full data generating distribution absent coarsening.
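
As a simple illustration (this is the point-treatment data structure taken up in the Example below), suppose the observed data are $$O = (W,A,Y)$$, where $$W$$ is a vector of baseline covariates, $$A$$ a binary treatment, and $$Y$$ an outcome of interest; here treatment assignment plays the role of the coarsening variable, and the factorization reads:
 * $$P_{0}(O) = \underbrace{p(W)p(Y|A,W)}_{Q_{0}}\,\underbrace{p(A|W)}_{g_{0}}$$.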

Obtaining the Targeted Maximum Likelihood Estimate of a Parameter of Interest
The TMLE methodology is a two-stage process which results in a substitution estimator of $$\Psi(Q_{0})$$. In the first stage an initial estimate, $$Q_n^0$$, of the underlying data distribution $$Q_0$$ is obtained using data adaptive methods under a non-parametric model. In the second stage this initial estimate is fluctuated to reduce bias in the estimate of the parameter of interest. The method also requires an estimate $$g_n$$ of the $$g_0$$ part of the likelihood. The bias is reduced by ensuring that the efficient influence curve equation is solved by the targeted maximum likelihood solution. This is achieved by finding the targeted maximum likelihood estimate of $$Q_{0}$$, $$Q^{*}_{n,g}$$, with a parametric fluctuation model, $$Q_n(\epsilon)$$, whose score (derivative of the log-likelihood) at the initial estimator (i.e., at zero fluctuation, $$\epsilon = 0$$) equals the efficient influence curve of the parameter of interest. This may be done by specifying a univariate regression model for the outcome of interest on a covariate, $$h(Q_n^0,g_n)$$, specifically chosen to yield the appropriate score, while using the initial estimator $$Q_n^0$$ as an offset. The coefficient $$\epsilon$$ in front of this clever covariate is then estimated using standard parametric maximum likelihood. This is known as the first targeted maximum likelihood step and yields $$Q_n^1$$, the first-step targeted maximum likelihood estimate of $$Q_0$$. The targeted maximum likelihood step is then iterated using $$Q_n^1$$ as the initial estimator of $$Q_0$$, with the estimate of $$g_0$$ unchanged. This process is iterated until $$\epsilon$$ converges to zero, resulting in the targeted maximum likelihood estimate of $$Q_0$$, denoted $$Q^{*}_{n,g}$$; $$\Psi(Q^{*}_{n,g})$$ is then the targeted maximum likelihood estimate of the parameter $$\Psi(Q_{0})$$. For many common parameters of interest the full bias reduction is achieved in a single updating step. Note that $$Q^{*}_{n,g}$$ is indexed by the treatment and censoring mechanisms, $$g$$. This is because the TMLE, through $$h$$, makes use of the fact that the observed data are generated according to a censored data structure, as dictated by the efficient influence curve for $$\Psi(Q_{0})$$.
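
In symbols, the score condition above requires the fluctuation model to satisfy:
 * $$\frac{d}{d\epsilon} \log Q_n(\epsilon)(O) \Big|_{\epsilon = 0} = D^{*}(Q_n^0,g_n)(O)$$,
where $$D^{*}$$ denotes the efficient influence curve, and once the iteration has converged the resulting TMLE solves the efficient influence curve equation:
 * $$\sum_{i=1}^{n} D^{*}(Q^{*}_{n,g},g_n)(O_i) = 0$$.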

Influence Curves and the Efficient Influence Curve
The efficient influence curve is an important object in semiparametric estimation, and it is necessary to know the efficient influence curve in order to construct the TMLE. The efficient influence curve is uniquely determined for each target parameter. Suppose one observes n i.i.d. random vectors $$X_1,...,X_n$$ from distribution $$P_0$$ and is interested in obtaining an estimate $$\Theta_n$$ of a particular target parameter, $$\Theta$$. In general there are many possible choices of $$\Theta_n$$ that vary in how well they estimate $$\Theta$$. An influence curve, $$D(X_i)$$, is a random function of the observed data for which the following expansion may be written for $$\Theta_n$$:


 * $$n^{1/2}(\Theta_n - \Theta) = n^{-1/2} \sum_{i=1}^n D(X_i) + o_p(1) $$,

where the $$o_p(1)$$ term converges in probability to zero. Each estimator $$\Theta_n$$ of $$\Theta$$ has its own influence curve, $$D(X_i)$$. However, there is a unique influence curve, $$D^*(X_i)$$, that attains the semi-parametric efficiency bound; $$D^*(X_i)$$ is termed the efficient influence curve (or efficient influence function). The $$\Theta_n$$ that corresponds with $$D^*(X_i)$$ attains the semi-parametric efficiency bound for $$\Theta$$ and thus exhibits optimal properties for estimating $$\Theta$$; such estimators also display additional robustness properties. For an in-depth discussion of estimators that solve the efficient influence curve equation, and of efficient influence curves in general, see van der Laan and Robins (1996) and van der Vaart (2000).
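
By the central limit theorem, the expansion above implies that:
 * $$n^{1/2}(\Theta_n - \Theta) \rightarrow_{d} N(0, \operatorname{Var}\, D(X))$$,
so the asymptotic variance of an estimator is the variance of its influence curve; $$D^*(X_i)$$ is the influence curve of minimal variance, and that minimal variance is the semi-parametric efficiency bound.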

The efficient influence curve, $$D^*(X_i)$$, for a particular target parameter may be constructed by projecting any influence curve $$D(X_i)$$ for $$\Theta$$ onto the tangent space $$T_Q$$: $$\Pi(D(X_i)|T_Q)$$. Here $$T_Q$$ is the space spanned by the scores of parametric submodels for $$P_0$$, $$P_0(\epsilon)$$. One easily attainable influence curve that may be used for the projection is the influence curve of the inverse probability weighted estimator, $$D_{IPW}$$.

Properties of Targeted Maximum Likelihood Estimators
Targeted maximum likelihood estimators are referred to as doubly robust because they are consistent when either the $$Q$$ or the $$g$$ part of the likelihood is estimated consistently. Furthermore, since targeted maximum likelihood estimators solve the efficient influence curve equation, they are locally efficient and attain the semiparametric efficiency bound when both $$Q$$ and $$g$$ are estimated consistently. Another class of estimators, the Augmented Inverse Probability Weighted (AIPW) estimators, also exhibits these two properties. However, AIPW estimators are constructed as solutions to an estimating equation, and as a result the TMLE exhibits several advantages over AIPW methods:


 * 1) The TMLE respects the global constraints of the model (e.g. when estimating a probability the estimate is guaranteed to fall between 0 and 1). This advantage also contributes to finite sample gains in efficiency.
 * 2) The TMLE provides a loss function with which to assess different estimates of $$g$$ according to how they affect the final estimate. Traditionally the fit of $$g$$ has been assessed using a loss function for the global density.
 * 3) The TMLE can produce estimates even when the efficient influence curve cannot be written as an estimating function in terms of the parameter of interest.
 * 4) In contrast to estimating equations, which may admit multiple solutions, the TMLE does not.

In addition, since the TMLE solves the efficient influence curve equation, a natural consequence of which is asymptotic linearity, confidence intervals and p-values may be easily constructed using the empirical variance of the estimated efficient influence curve, as in the sketch below.
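
For instance, a minimal sketch in Python (the inputs psi_hat, the TMLE point estimate, and eic, the estimated efficient influence curve evaluated at each observation, are assumed to come from a fitted TMLE such as the one constructed in the Example below):

```python
import numpy as np
from scipy import stats

def tmle_inference(psi_hat, eic, level=0.95):
    """Wald-type confidence interval and p-value for a TMLE, based on
    the empirical variance of the estimated efficient influence curve."""
    n = len(eic)
    se = np.sqrt(np.var(eic, ddof=1) / n)           # standard error of psi_hat
    z = stats.norm.ppf(1 - (1 - level) / 2)         # normal quantile
    ci = (psi_hat - z * se, psi_hat + z * se)
    p_value = 2 * stats.norm.sf(abs(psi_hat) / se)  # test of H0: psi = 0
    return ci, p_value
```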

Example
Suppose one observes n i.i.d. observations of $$O = (W,A,Y)$$ from probability distribution $$P_0$$, where $$W$$ is a vector of baseline covariates, $$A$$ is a binary variable indicating whether an individual was treated, and $$Y$$ is a binary outcome of interest. Assume the following time ordering of the variables: $$W \rightarrow A \rightarrow Y$$. This time ordering implies the following factorization of the likelihood:
 * $$p(O)=p(W)p(A|W)p(Y|A,W)$$.

This factorization of the likelihood corresponds to a $$g$$ part, $$g(A|W) = p(A|W)$$, and a $$Q$$ part, which is composed of $$Q(W) = p(W)$$ and $$Q(A,W) = p(Y|A,W)$$. Next, the g-computation formula provides the distribution of the data under a specified intervention, such as setting $$A$$ to a specified level $$a$$:
 * $$Q_a = p(W)p(Y|A=a,W)$$.

Now that the distribution under intervention is available via the g-computation formula, the parameter of interest may be defined as a mapping from $$Q_0$$ into the real line, $$\Psi(Q_0)$$. One interesting parameter is the causal effect of treatment $$A$$ on the outcome $$Y$$, $$\Psi(Q_0)$$:


 * $$E_W[E[Y_{A=1}|W]-E[Y_{A=0}|W]]$$,

where $$Y_{A=a}$$ is the counterfactual outcome of $$Y$$ that would have been observed had treatment been set to level $$a$$. By using the empirical distribution for $$W$$ and an estimate $$Q_n$$ of $$p(Y|A,W)$$, one may then map these estimates into $$\Psi(Q_n)$$, an estimate of $$\Psi(Q_0)$$:


 * $$\frac{1}{n}\sum_{i=1}^n [ Q_{n}(1,W_i) - Q_{n}(0,W_i)]$$.

This is a standard maximum likelihood substitution estimate of the causal effect of treatment $$A$$. However, the estimate of $$Q_0$$ was selected based on how well it estimates the entire distribution and not for its ability to estimate $$\Psi(Q_0)$$ as well as possible. Thus, the next step is to update $$Q_n$$ to $$Q^*_n$$, a distribution targeted toward estimating $$\Psi(Q_0)$$ as well as possible, yielding the estimate $$\Psi(Q^*_n)$$. To accomplish this, the efficient influence curve, $$D^*$$, for the causal effect parameter defined above is needed. The $$D_{IPW}$$ for the target parameter is:
 * $$D_{IPW} = \frac{I(A=1)Y}{g(1|W)}-\frac{I(A=0)Y}{g(0|W)} - \Psi(Q_0)$$.

The efficient influence curve for this target parameter may be obtained by projecting $$D_{IPW}$$ onto the tangent space for $$Q$$:
 * $$D^* = \frac{I(A=1)-I(A=0)}{g(A|W)}[Y-Q(A,W)] + Q(1,W) - Q(0,W) - \Psi(Q)$$.

Now that the efficient influence curve is available, the next step is to construct a parametric submodel:


 * $$logit(Q^*_n) = logit(Q_n) + \epsilon h(A,W)$$,

where $$h(A,W)$$ is chosen such that the score (derivative of the log-likelihood) at $$\epsilon = 0$$ is equal to the efficient influence curve. For this parameter of interest $$h(A,W)$$ is equal to:


 * $$\frac{I(A=1)-I(A=0)}{g(A|W)}$$.
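
To verify this choice, note that the score of the logistic submodel at $$\epsilon = 0$$ is:
 * $$\frac{d}{d\epsilon}\Big[Y \log Q^*_n(A,W) + (1-Y)\log\big(1-Q^*_n(A,W)\big)\Big]\Big|_{\epsilon=0} = h(A,W)\,[Y - Q_n(A,W)]$$,
which is exactly the first term of $$D^*$$ above. The remaining term, $$Q(1,W) - Q(0,W) - \Psi(Q)$$, has empirical mean zero by construction once $$\Psi(Q^*_n)$$ is computed as the empirical average of $$Q^*_n(1,W_i) - Q^*_n(0,W_i)$$, so the full efficient influence curve equation is solved.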

$$h(A,W)$$ is then evaluated at $$g_n$$, an estimate of $$g_0$$, and $$Q^*_n$$ is obtained by fitting $$\epsilon$$ with standard logistic regression according to the parametric submodel above, using $$logit(Q_n)$$ as an offset. Finally, $$\Psi(Q^*_n)$$, the TMLE of the target parameter, is obtained by applying the mapping above to $$Q^*_n$$, as illustrated in the sketch below.
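
The full procedure for this example can be sketched in Python. This is a minimal illustration under simplifying assumptions, not a definitive implementation: plain logistic regressions stand in for the data-adaptive first-stage estimators the method calls for, and all function and variable names here are chosen for exposition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_ate(W, A, Y, bound=0.025):
    """TMLE of E_W[E(Y|A=1,W) - E(Y|A=0,W)] for binary A and Y.

    W: (n, p) array of baseline covariates; A, Y: length-n binary arrays.
    """
    n = len(Y)
    X = np.column_stack([A, W])

    # Stage 1: initial estimates Q_n(A,W) = p(Y=1|A,W) and g_n(1|W) = p(A=1|W).
    Q_fit = LogisticRegression().fit(X, Y)
    QAW = Q_fit.predict_proba(X)[:, 1]
    Q1W = Q_fit.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]
    Q0W = Q_fit.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1]
    g_fit = LogisticRegression().fit(W, A)
    g1W = np.clip(g_fit.predict_proba(W)[:, 1], bound, 1 - bound)  # keep g away from 0, 1

    # Clever covariate h(A,W) = (I(A=1) - I(A=0)) / g(A|W).
    h = A / g1W - (1 - A) / (1 - g1W)

    # Stage 2: fit epsilon by maximum likelihood in the submodel
    # logit Q* = logit Q + epsilon * h (Newton-Raphson on the binomial log-likelihood).
    eps = 0.0
    for _ in range(100):
        Q_eps = expit(logit(QAW) + eps * h)
        score = np.sum(h * (Y - Q_eps))               # d(log-lik)/d(eps)
        hess = -np.sum(h ** 2 * Q_eps * (1 - Q_eps))  # d^2(log-lik)/d(eps)^2
        step = score / hess
        eps -= step
        if abs(step) < 1e-10:
            break

    # Targeted estimates and the substitution estimator of the causal effect.
    Q1W_star = expit(logit(Q1W) + eps / g1W)        # h(1,W) =  1 / g(1|W)
    Q0W_star = expit(logit(Q0W) - eps / (1 - g1W))  # h(0,W) = -1 / g(0|W)
    psi = np.mean(Q1W_star - Q0W_star)

    # Estimated efficient influence curve and standard error for inference.
    QAW_star = expit(logit(QAW) + eps * h)
    eic = h * (Y - QAW_star) + Q1W_star - Q0W_star - psi
    se = np.sqrt(np.var(eic, ddof=1) / n)
    return psi, se
```

A call such as psi, se = tmle_ate(W, A, Y) returns the point estimate together with its influence-curve-based standard error, from which confidence intervals and p-values may be formed as described in the Properties section above.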