User:Zkquan/sandbox

The generalized linear model (GLM) extends regression modeling to different types of data through the choice of response distribution. The link function connects the linear predictor to the mean of that distribution. A misspecified link function can cause considerable bias in the estimated regression coefficients and in the estimated mean response.

Canonical Link Function
The canonical link function is derived from the natural parameter of the exponential family.

Let $$\eta=X^T\beta$$ denote the linear predictor, and suppose the response $$Y $$ follows an exponential family distribution in canonical form with canonical parameter $$\theta $$:


 * $$ f(y,\theta,\phi)=\exp\{\frac{y\theta-b(\theta)}{\phi}+c(y,\phi)\}\,$$ with known functions $$ b(.)$$, $$ c(.)$$ and $$\phi>0$$.

The canonical link function $$ g(\mu)$$ is the link for which the linear predictor equals the canonical parameter, $$ \eta=\theta$$. Since $$\mu=b'(\theta)$$, it is given by $$ g(\cdot)=(b')^{-1}(\cdot)$$.
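As a rough numerical check of this definition, the Python sketch below takes the standard cumulant functions of the Bernoulli and Poisson families ($$b(\theta)=\log(1+e^\theta)$$ and $$b(\theta)=e^\theta$$, both with $$\phi=1$$) and confirms that applying the logit and log links to $$\mu=b'(\theta)$$ recovers $$\theta$$. The function names are illustrative and are not taken from any library.

```python
import math

# Mean functions mu = b'(theta) for two exponential-family members (phi = 1).
def bernoulli_mean(theta):
    # b(theta) = log(1 + exp(theta))  =>  b'(theta) = exp(theta) / (1 + exp(theta))
    return 1.0 / (1.0 + math.exp(-theta))

def poisson_mean(theta):
    # b(theta) = exp(theta)  =>  b'(theta) = exp(theta)
    return math.exp(theta)

# Candidate canonical links g = (b')^{-1}.
def logit(mu):
    return math.log(mu / (1.0 - mu))

def log_link(mu):
    return math.log(mu)

def max_roundtrip_error(mean_fn, link_fn, thetas):
    """Largest |g(b'(theta)) - theta| over a grid of theta values."""
    return max(abs(link_fn(mean_fn(t)) - t) for t in thetas)

thetas = [-3.0, -1.0, 0.0, 0.5, 2.0]
bernoulli_err = max_roundtrip_error(bernoulli_mean, logit, thetas)
poisson_err = max_roundtrip_error(poisson_mean, log_link, thetas)
```

Both errors should be at the level of floating-point rounding, which is what $$g=(b')^{-1}$$ predicts.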

In most cases, the canonical link function is preferred because:

(1) Minimal sufficient statistics for the regression parameters exist when the canonical link is used with an exponential family;

(2) In iteratively reweighted least squares, Fisher scoring coincides with Newton-Raphson because the expected and observed information are equal: $$E(H)=H \text{ and } I_{obs}=I $$;

(3) The log-likelihood is strictly concave under the canonical link when $$\phi>0$$ (given a full-rank design matrix), so the MLE is unique when it exists;

(4) It aids interpretation of the results (e.g. the logit link gives a simple representation of the odds in logistic regression).
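Point (2) can be illustrated with a small sketch, assuming a single Bernoulli observation with one scalar covariate. It approximates the observed information (the negative second derivative of the log-likelihood) by central differences. Under the canonical logit link the value does not depend on the observed $$y$$, so it equals its expectation; under the probit link it does depend on $$y$$. All names and the values of `beta` and `x` are hypothetical.

```python
import math
from statistics import NormalDist

_norm = NormalDist()

def expit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def probit_inverse(eta):
    """Inverse of the probit link (standard normal CDF)."""
    return _norm.cdf(eta)

def loglik(beta, x, y, inverse_link):
    """Bernoulli log-likelihood of one observation with linear predictor beta * x."""
    p = inverse_link(beta * x)
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def observed_information(beta, x, y, inverse_link, h=1e-4):
    """Negative second derivative of the log-likelihood by central differences."""
    upper = loglik(beta + h, x, y, inverse_link)
    middle = loglik(beta, x, y, inverse_link)
    lower = loglik(beta - h, x, y, inverse_link)
    return -(upper - 2.0 * middle + lower) / (h * h)

beta, x = 0.7, 1.5  # arbitrary illustrative values
logit_info = [observed_information(beta, x, y, expit) for y in (0, 1)]
probit_info = [observed_information(beta, x, y, probit_inverse) for y in (0, 1)]
```

Here the two entries of `logit_info` agree up to numerical error, while the two entries of `probit_info` differ, so only the logit case has observed information equal to expected information.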

In some cases, a non-canonical link is more suitable for the data:

(1) For a dichotomous response in binary data, if the underlying variable is assumed to be normally distributed, the probit link is preferred.

(2) Unlike the logit and probit links, the complementary log-log link is asymmetric, so it can be used when the probability of an event is very small or very large.

(3) For the exponential and gamma distributions, the canonical link $$\eta=-1/\mu$$ would give an impossible negative mean whenever the linear predictor is positive.
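The following sketch checks points (2) and (3) numerically. It measures asymmetry as $$g(p)+g(1-p)$$, which is zero for a link that is symmetric about $$p=0.5$$, and it inverts the canonical gamma link at a positive linear predictor. The helper names are illustrative assumptions.

```python
import math
from statistics import NormalDist

_norm = NormalDist()

def logit(p):
    return math.log(p / (1.0 - p))

def probit(p):
    return _norm.inv_cdf(p)

def cloglog(p):
    return math.log(-math.log(1.0 - p))

def asymmetry(link, p):
    """link(p) + link(1 - p); zero for a link that is symmetric about p = 0.5."""
    return link(p) + link(1.0 - p)

p = 0.1
asym = {name: asymmetry(fn, p)
        for name, fn in [("logit", logit), ("probit", probit), ("cloglog", cloglog)]}

def gamma_canonical_mean(eta):
    """Inverse of the canonical gamma link eta = -1/mu."""
    return -1.0 / eta

negative_mean = gamma_canonical_mean(2.0)  # a positive linear predictor
```

The logit and probit entries of `asym` are zero up to rounding, the complementary log-log entry is not, and `negative_mean` is negative, which is not a valid gamma mean.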

Goodness of Link Test
The goodness-of-link test for generalized linear models proposed by Pregibon embeds the assumed and true link functions in a parametric family of link functions. Consider the parametric family $$G=\{g(u,\psi),\psi\in\Psi\}$$, where $$g(u,\psi^*)$$ is the correct but unknown link function.

The null and alternative hypotheses for the assumed link function $$g_0(u)=g(u,\psi_0)$$ are:


 * $$H_0:\psi=\psi_0\text{ and }H_a:\psi\neq\psi_0\,$$

A first-order Taylor series expansion about $$\psi_0$$ gives:


 * $$g(u,\psi)\approx g(u,\psi_0)+(\psi-\psi_0)\frac{\partial g(u,\psi)}{\partial \psi}|_{\psi=\psi_0}=X^T\beta+Z^T\gamma\,$$ where $$Z^T=\frac{\partial g(u,\psi)}{\partial \psi}|_{\psi=\psi_0}$$ and $$\gamma=(\psi-\psi_0)$$.

Therefore, testing the link reduces to testing whether the additional term $$Z^T\gamma$$ is significant. McCullagh and Nelder suggested the likelihood ratio test to see whether a different link leads to a significant improvement in fit. If the additional term $$Z^T\gamma$$ is significant, then $$H_0$$ is rejected.
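A minimal sketch of the likelihood ratio step is given below. It assumes a scalar $$\psi$$, so that one constructed variable is added, and it uses the usual asymptotic chi-squared reference distribution with 1 degree of freedom. The function names and the log-likelihood values in the call are hypothetical and only show the calling convention.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def goodness_of_link_lrt(loglik_null, loglik_extended):
    """Likelihood ratio test for one added constructed variable.

    loglik_null:     maximized log-likelihood with linear predictor X^T beta
    loglik_extended: maximized log-likelihood with X^T beta + Z^T gamma
    """
    statistic = 2.0 * (loglik_extended - loglik_null)
    # Guard against tiny negative values caused by optimizer tolerance.
    return statistic, chi2_sf_1df(max(statistic, 0.0))

# Hypothetical log-likelihood values, chosen only to show the call.
stat, p_value = goodness_of_link_lrt(-120.0, -117.5)
```

A small `p_value` would be read as evidence against $$H_0:\psi=\psi_0$$, that is, against the assumed link.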

The goodness-of-link test implicitly assumes that the parametric family containing the true link function has been chosen correctly. However, even when the true link family has not been selected, the method can still be used to improve the fit for a given link function and its generalized family.

Link Function Selection in Binary Data
For binary data, the logistic regression model with the canonical logit link is popular because it gives a simple interpretation in terms of odds. However, a non-canonical link such as the complementary log-log link can sometimes improve the fit significantly. In this case, the goodness-of-link test can be applied with the Aranda-Ordaz link family:


 * $$g(u_i,\alpha)=g(\pi_i,\alpha)=\log(\frac{(1-\pi_i)^{-\alpha}-1}{\alpha})\,$$

where $$g(\pi_i,1)=\log(\frac{\pi_i}{1-\pi_i})$$ is the logit link and $$\lim_{\alpha\to 0}g(\pi_i,\alpha)=\log(-\log(1-\pi_i))$$ is the complementary log-log link.

For $$H_0: \alpha=\alpha_0$$, by the first-order Taylor series expansion:


 * $$g(\pi_i,\alpha)\approx g(\pi_i,\alpha_0)+(\alpha-\alpha_0)\frac{\partial g(\pi_i,\alpha)}{\partial \alpha}|_{\alpha=\alpha_0}=\eta_i+(\alpha-\alpha_0)\gamma_i\,$$ where $$\gamma_i=\frac{\log(1-\hat\pi_i)}{(1-\hat\pi_i)^{\alpha_0}-1}-\frac{1}{\alpha_0}$$.

If the goodness-of-link test rejects $$H_0$$, the link function can be updated to $$g(u,\hat\alpha_0)$$, where $$\hat\alpha_0-\alpha_0$$ equals the fitted regression coefficient of $$\gamma_i$$.
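The Aranda-Ordaz family and the constructed variable $$\gamma_i$$ can be sketched in Python as follows. The code compares the closed-form $$\gamma_i$$ with a central-difference derivative of the link in $$\alpha$$, assuming $$\alpha_0\neq 0$$. The function names and the values of `pi_hat` and `alpha0` are illustrative assumptions.

```python
import math

def aranda_ordaz_link(pi, alpha):
    """g(pi, alpha) = log(((1 - pi)^(-alpha) - 1) / alpha), with the alpha -> 0 limit."""
    if abs(alpha) < 1e-12:
        return math.log(-math.log(1.0 - pi))  # complementary log-log limit
    return math.log(((1.0 - pi) ** (-alpha) - 1.0) / alpha)

def constructed_variable(pi_hat, alpha0):
    """gamma_i: derivative of g(pi, alpha) in alpha, evaluated at alpha0 (alpha0 != 0)."""
    return (math.log(1.0 - pi_hat) / ((1.0 - pi_hat) ** alpha0 - 1.0)
            - 1.0 / alpha0)

pi_hat, alpha0, h = 0.3, 1.0, 1e-6  # arbitrary illustrative values
logit_value = math.log(pi_hat / (1.0 - pi_hat))
cloglog_value = math.log(-math.log(1.0 - pi_hat))

# Central-difference approximation of the derivative in alpha at alpha0.
numeric_derivative = (aranda_ordaz_link(pi_hat, alpha0 + h)
                      - aranda_ordaz_link(pi_hat, alpha0 - h)) / (2.0 * h)
gamma_i = constructed_variable(pi_hat, alpha0)
```

In an actual test, $$\gamma_i$$ would be evaluated at the fitted probabilities from the model under $$H_0$$ and added to that model as an extra covariate.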

In addition to the Aranda-Ordaz link family, one can also utilize these testing methods with other potential link families:


 * Copenhaver and Mielke (one-parameter link families)


 * Guerrero and Johnson (similar to power transform)


 * Czado (two-parameter link families)

Link Function Selection in Count Data
The Poisson regression model with the canonical log link is commonly used for count data. However, the negative binomial model allows more general variance functions, which reduces overdispersion, since overdispersion can be seen as a lack of fit. For a given link family, the goodness-of-link test can be used to determine an appropriate link function and the dispersion parameters.
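As a rough illustration of overdispersion, the sketch below computes the Pearson dispersion statistic for an intercept-only Poisson fit, where the fitted mean is the sample mean, and contrasts the Poisson variance function with a negative binomial one. The counts are made-up illustrative data, and the parameterization $$V(\mu)=\mu+\mu^2/k$$ is an assumption, not something fixed by the text above.

```python
def poisson_variance(mu):
    """Poisson variance function V(mu) = mu."""
    return mu

def negative_binomial_variance(mu, k):
    """Assumed NB variance function V(mu) = mu + mu^2 / k, with k > 0."""
    return mu + mu * mu / k

def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-squared statistic divided by the residual degrees of freedom."""
    chi2 = sum((y - m) ** 2 / poisson_variance(m) for y, m in zip(counts, fitted))
    return chi2 / (len(counts) - n_params)

# Made-up counts, used only to illustrate the calculation.
counts = [0, 1, 0, 7, 2, 12, 0, 3, 1, 9]
mu_hat = sum(counts) / len(counts)  # intercept-only Poisson fit
dispersion = pearson_dispersion(counts, [mu_hat] * len(counts), n_params=1)
```

A value of `dispersion` well above 1 suggests more variability than the Poisson variance function allows, which is the situation in which the extra term in the negative binomial variance function can help.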

Residuals and Goodness of Fit
Goodness-of-fit diagnostics based on GLM residuals can be used to select the link function. If the residual plot shows no significant pattern, there is no significant indication of lack of fit.


 * Pearson residuals: $$r_{i,P}=\frac{y_i-\hat{y}_i}{\sqrt{V(\hat\mu_i)\phi}}$$ where $$\hat{y}_i=\hat\mu_i=g^{-1}(X_i\hat\beta)$$.


 * Deviance residuals: $$r_{i,D}=\text{sign}(y_i-\hat\mu_i)\sqrt{|d_i|}$$ with deviance $$D^*(y,\hat\mu)=\sum_{i=1}^nd_i=2\{l(y,y)-l(\hat\mu,y)\}$$ where $$l$$ is the log-likelihood.
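Both residual types can be sketched for the Poisson case, where $$V(\mu)=\mu$$, $$\phi=1$$ and the unit deviance is $$d_i=2\{y_i\log(y_i/\hat\mu_i)-(y_i-\hat\mu_i)\}$$, taking $$y\log y=0$$ at $$y=0$$. The function names are illustrative, and the fitted means are assumed to come from a model fitted elsewhere.

```python
import math

def pearson_residual(y, mu, phi=1.0):
    """Pearson residual with the Poisson variance function V(mu) = mu."""
    return (y - mu) / math.sqrt(mu * phi)

def unit_deviance(y, mu):
    """Poisson unit deviance d_i, using the convention y * log(y) = 0 at y = 0."""
    term = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (term - (y - mu))

def deviance_residual(y, mu):
    """Signed square root of the unit deviance."""
    d = unit_deviance(y, mu)
    return math.copysign(math.sqrt(abs(d)), y - mu)

def total_deviance(ys, mus):
    """Deviance as the sum of unit deviances over all observations."""
    return sum(unit_deviance(y, m) for y, m in zip(ys, mus))
```

For example, `deviance_residual(0, 2.0)` is negative because the observation lies below its fitted mean, and both residuals vanish when `y` equals `mu`.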

These two types of residuals can also be used to validate the link choice through:
 * Hosmer–Lemeshow test (in binary data), Pearson's chi-squared test (in categorical data)
 * Akaike information criterion, Deviance information criterion
 * Peirce's criterion (robust statistics)

Link Function Selection in Quasi-likelihood Models
For a quasi-likelihood model, the discrepancy matrix can be viewed as an indicator of the goodness of link:


 * $$ D(\beta)=E(\frac{\partial H}{\partial \beta})^2+E(\frac{\partial^2 H}{\partial \beta^2}) \,$$

Quasi-likelihood models with different links can be compared using the information matrix test, through the following procedures:


 * Asymptotic Wald test with dimension reduction technique provided by Cheng and Wu


 * Logarithm of the “in-and-out-sample” (IOS) likelihood ratio test proposed by Presnell and Boos


 * Information Ratio Test by Zhou and Song