
The backfitting algorithm is a simple iterative procedure used to fit a generalized additive model. It was introduced in 1985 by Leo Breiman and Jerome Friedman along with generalized additive models. In most cases, the backfitting algorithm is equivalent to the Gauss–Seidel method, an algorithm for solving a certain linear system of equations.

Algorithm
Generalized additive models are a class of non-parametric regression models of the form:

$$ Y = \alpha + \sum_{i=1}^p f_i(X_i) + \epsilon $$

where each $$X_1, X_2, \ldots, X_p$$ is a variable in our $$p$$-dimensional predictor $$X$$, and $$Y$$ is our outcome variable. $$\epsilon$$ represents our inherent error, which is assumed to have mean zero. The $$f_i$$ represent unspecified smooth functions of a single $$X_i$$. Given the flexibility in the $$f_i$$, we typically do not have a unique solution: $$\alpha$$ is left unidentifiable, since one can add any constant to one of the $$f_i$$ and subtract it from $$\alpha$$. It is common to rectify this by constraining

$$\sum_{i=1}^N f_j(x_{ij}) = 0 \text{ for all } j,$$

leaving

$$\alpha = \frac{1}{N} \sum_{i=1}^N y_i$$

necessarily, since each fitted function then averages to zero over the data and the intercept must absorb the mean of the response.

The backfitting algorithm is then:

   Initialize $$\hat{\alpha} = \frac{1}{N} \sum_{i=1}^N y_i$$, $$\hat{f_j} \equiv 0$$ for all $$j$$
   Do until the $$\hat{f_j}$$ converge:
      For each predictor $$j$$:
         (a) $$ \hat{f_j} \leftarrow \text{Smooth}[\lbrace y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f_k}(x_{ik}) \rbrace_{i=1}^N ]$$ (backfitting step)
         (b) $$ \hat{f_j} \leftarrow \hat{f_j} - \frac{1}{N} \sum_{i=1}^N \hat{f_j}(x_{ij})$$ (mean centering of the estimated function)

where $$\text{Smooth}$$ is our smoothing operator. This is typically chosen to be a cubic spline smoother but can be any other appropriate fitting operation, such as:


 * local polynomial regression
 * kernel smoothing methods
 * more complex operators, such as surface smoothers for second and higher-order interactions
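
As an illustration, the backfitting loop above can be sketched in Python as follows. This is a minimal sketch, not from the original source, using a simple Gaussian kernel smoother in place of the cubic spline smoother; the helper names `kernel_smooth` and `backfit` are illustrative only.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    # Gaussian kernel smoother of the values r against a single predictor x:
    # a simple stand-in for the cubic spline smoother used in practice.
    d = x[:, None] - x[None, :]
    W = np.exp(-0.5 * (d / bandwidth) ** 2)
    return (W @ r) / W.sum(axis=1)

def backfit(X, y, n_iter=100, tol=1e-6):
    # Fit the additive model y = alpha + sum_j f_j(X[:, j]) by backfitting.
    N, p = X.shape
    alpha = y.mean()                  # alpha-hat = mean of y
    f = np.zeros((N, p))              # each f_j-hat initialized to 0
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # (a) smooth the partial residuals against the j-th predictor
            partial_residual = y - alpha - (f.sum(axis=1) - f[:, j])
            f[:, j] = kernel_smooth(X[:, j], partial_residual)
            # (b) mean-center the estimated function over the data
            f[:, j] -= f[:, j].mean()
        if np.abs(f - f_old).max() < tol:   # stop once the f_j stabilize
            break
    return alpha, f

# Example usage on synthetic additive data
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 1.0 + np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)
alpha_hat, f_hat = backfit(X, y)
```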

Motivation
Consider the problem of minimizing the expected squared error:

$$\min E\left[Y - \left(\alpha + \sum_{j=1}^p f_j(X_j)\right)\right]^2$$

There exists a unique solution, by the theory of projections, given by:

$$f_i(X_i) = E\left[Y - \left(\alpha + \sum_{j \neq i} f_j(X_j)\right) \Big| X_i\right]$$

for all $$i = 1, 2, \ldots, p$$.

This gives the matrix interpretation:

$$
\begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots  & P_2 \\ \vdots & &  \ddots & \vdots \\ P_p & \cdots & P_p & I \end{pmatrix}
\begin{pmatrix} f_1(X_1)\\ f_2(X_2)\\ \vdots \\ f_p(X_p) \end{pmatrix}
= \begin{pmatrix} P_1 Y\\ P_2 Y\\ \vdots \\ P_p Y \end{pmatrix}
$$

where $$P_i(\cdot) = E(\cdot \mid X_i)$$. In this context we can imagine a smoother matrix, $$S_i$$, which approximates our $$P_i$$ and gives an estimate, $$S_i Y$$, of $$E(Y \mid X_i)$$:

$$
\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots  & S_2 \\ \vdots & &  \ddots & \vdots \\ S_p & \cdots & S_p & I \end{pmatrix}
\begin{pmatrix} f_1\\ f_2\\ \vdots \\ f_p \end{pmatrix}
= \begin{pmatrix} S_1 Y\\ S_2 Y\\ \vdots \\ S_p Y \end{pmatrix}
$$

or in abbreviated form

$$ \hat{S}f = QY $$

where $$\hat{S}$$ denotes the block smoother matrix on the left-hand side and $$QY = (S_1 Y, \ldots, S_p Y)^T$$ is the stacked right-hand side.

An exact solution of this system is infeasible to calculate for large $$np$$, so the iterative technique of backfitting is used. We take initial guesses $$\hat{f_j}^{(0)}$$ and update each $$\hat{f_j}$$ in turn to be the smoothed fit for the residuals of all the others:

$$ \hat{f_j} \leftarrow \text{Smooth}[\lbrace y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f_k}(x_{ik}) \rbrace_{i=1}^N ]$$

Looking at the abbreviated form, it is easy to see the backfitting algorithm as equivalent to the Gauss–Seidel method for linear smoothing operators $$S$$.
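
Indeed, solving the $$i$$-th block row of the system for $$f_i$$, holding the other components at their current values, gives

$$ f_i \leftarrow S_i\Big(Y - \sum_{j \neq i} f_j\Big), $$

which is exactly the backfitting update for $$f_i$$; cycling through $$i = 1, \dots, p$$, always using the most recently updated components, is precisely one Gauss–Seidel sweep for $$\hat{S}f = QY$$.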

Explicit Derivation for Two Dimensions
For the two-dimensional case, we can formulate the backfitting algorithm explicitly. We have:

$$ f_1 = S_1(Y-f_2), f_2 = S_2(Y-f_1) $$

If we denote $$\hat{f}_1^{(i)}$$ as the estimate of $$f_1$$ in the $$i$$-th updating step, the backfitting steps are

$$ \hat{f}_1^{(i)} = S_1[Y - \hat{f}_2^{(i-1)}], \quad \hat{f}_2^{(i)} = S_2[Y - \hat{f}_1^{(i)}] $$
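
Substituting the second update (at step $$i-1$$) into the first gives the one-step recursion

$$ \hat{f}_1^{(i)} = S_1[Y - S_2(Y - \hat{f}_1^{(i-1)})] = S_1(I - S_2)Y + S_1 S_2 \hat{f}_1^{(i-1)} $$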

By induction we get

$$ \hat{f}_1^{(i)} = Y - \sum_{\alpha = 0}^{i-1}(S_1 S_2)^\alpha(I-S_1)Y - (S_1 S_2)^{i -1} S_1\hat{f}_2^{(0)} $$

and

$$ \hat{f}_2^{(i)} = S_2 \sum_{\alpha = 0}^{i-1}(S_1 S_2)^\alpha(I-S_1)Y + S_2(S_1 S_2)^{i -1} S_1\hat{f}_2^{(0)} $$

If we assume our constant $$\alpha$$ is zero and set $$\hat{f}_2^{(0)} = 0$$, then we get

$$ \hat{f}_1^{(i)} = [I - \sum_{\alpha = 0}^{i-1}(S_1 S_2)^\alpha(I-S_1)]Y $$

$$ \hat{f}_2^{(i)} = [S_2 \sum_{\alpha = 0}^{i-1}(S_1 S_2)^\alpha(I-S_1)]Y $$

This converges if $$ \|S_1 S_2\| < 1 $$.
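
Since $$\|S_1 S_2\| < 1$$, the geometric series converges, and letting $$i \to \infty$$ in the expressions above gives the limits

$$ \hat{f}_1^{(\infty)} = [I - (I - S_1 S_2)^{-1}(I - S_1)]Y, \qquad \hat{f}_2^{(\infty)} = S_2 (I - S_1 S_2)^{-1}(I - S_1) Y $$

The following is a minimal numerical sketch, not part of the original derivation, that checks the closed-form expressions and this limit. It assumes the smoothers are supplied as explicit $$N \times N$$ matrices; here they are random symmetric matrices with eigenvalues in $$(0, 0.9)$$, so that $$\|S_1 S_2\| < 1$$ holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
I = np.eye(N)

def random_smoother(n):
    # Random symmetric matrix with eigenvalues in (0, 0.9), standing in
    # for a generic linear smoother so that ||S1 S2|| < 1 is guaranteed.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q @ np.diag(rng.uniform(0.0, 0.9, n)) @ Q.T

S1, S2, y = random_smoother(N), random_smoother(N), rng.standard_normal(N)

# Backfitting iterations with alpha = 0 and f2^(0) = 0, as in the text.
f1, f2 = np.zeros(N), np.zeros(N)
for i in range(1, 201):
    f1 = S1 @ (y - f2)          # f1^(i) = S1 [Y - f2^(i-1)]
    f2 = S2 @ (y - f1)          # f2^(i) = S2 [Y - f1^(i)]
    if i == 5:                  # check the closed forms at a finite step
        G5 = sum(np.linalg.matrix_power(S1 @ S2, a) for a in range(i)) @ (I - S1)
        assert np.allclose(f1, (I - G5) @ y)
        assert np.allclose(f2, S2 @ G5 @ y)

# Geometric-series limit, valid because ||S1 S2|| < 1.
G = np.linalg.inv(I - S1 @ S2) @ (I - S1)
print(np.linalg.norm(S1 @ S2, 2))        # spectral norm, < 1 here
print(np.abs(f1 - (I - G) @ y).max())    # ~ 0: iterates reach the limit
print(np.abs(f2 - S2 @ G @ y).max())     # ~ 0
```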

Issues
The choice of when to stop the algorithm is arbitrary, and it is hard to know a priori how long reaching a specific convergence threshold will take. Also, the final model depends on the order in which the predictor variables $$X_i$$ are fit.

In addition, the solution found by the backfitting procedure is non-unique. If $$b$$ is a vector such that $$\hat{S}b = 0$$ from above, and $$\hat{f}$$ is a solution, then $$\hat{f} + \alpha b$$ is also a solution for any $$\alpha \in \mathbb{R}$$. A modification of the backfitting algorithm involving projections onto the eigenspaces of the $$S_i$$ can remedy this problem.

Modified Algorithm
We can modify the backfitting algorithm to make it easier to provide a unique solution. Let $$ \mathcal{V}_1(S_i) $$ be the space spanned by all the eigenvectors of $$S_i$$ that correspond to eigenvalue 1. (For example, a smoother that reproduces constant vectors leaves them unchanged, so the constants lie in $$\mathcal{V}_1(S_i)$$; this corresponds to the non-uniqueness noted above from shifting constants between the fitted components.) Then any $$b$$ satisfying $$\hat{S}b = 0$$ has $$ b_i \in \mathcal{V}_1(S_i) $$ for all $$ i=1,\dots,p $$ and $$ \sum_{i=1}^p b_i = 0.$$ Now if we take $$ A $$ to be a matrix that projects orthogonally onto $$ \mathcal{V}_1(S_1) + \dots + \mathcal{V}_1(S_p) $$, we get the following modified backfitting algorithm:

   Initialize $$\hat{\alpha} = \frac{1}{N} \sum_{i=1}^N y_i$$, $$\hat{f_j} \equiv 0$$ for all $$j$$, and $$\hat{f}_+ = \hat{\alpha} + \hat{f_1} + \dots + \hat{f_p} $$
   Do until the $$\hat{f_j}$$ converge:
      Regress $$ Y - \hat{f}_+ $$ onto the space $$ \mathcal{V}_1(S_1) + \dots + \mathcal{V}_1(S_p) $$, setting $$ a = A(Y - \hat{f}_+)$$
      For each predictor $$j$$:
         Apply the backfitting update to $$(Y - a)$$ using the smoothing operator $$(I - A_j)S_j$$, where $$A_j$$ projects orthogonally onto $$\mathcal{V}_1(S_j)$$, yielding new estimates for $$\hat{f_j}$$