Beta regression

Beta regression is a form of regression used when the response variable, $$y$$, takes values within $$(0, 1)$$ and can be assumed to follow a beta distribution. It can be generalised to variables which take values in an arbitrary open interval $$(a, b)$$ through transformations. Beta regression was developed in the early 2000s by two sets of statisticians: Kieschnick and McCullough in 2003 and Ferrari and Cribari-Neto in 2004.
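
As an illustration of such a transformation (the text does not say which one is intended, so this is an assumption), a variable on $$(a, b)$$ can be linearly rescaled to $$(0, 1)$$, where $$\tilde{y}$$ is notation introduced here for the rescaled response:

$$\tilde{y} = \frac{y - a}{b - a}$$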

Description
The modern beta regression process is based on the mean/precision parameterisation of the beta distribution. Here the variable is assumed to be distributed according to $$B(\mu, \phi)$$, where $$\mu$$ is the mean and $$\phi$$ is the precision. As the mean of the distribution, $$\mu$$ is constrained to fall within $$(0, 1)$$, whereas $$\phi$$ need only be positive. For a given value of $$\mu$$, higher values of $$\phi$$ result in a beta distribution with a lower variance, hence its description as a precision parameter.
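
The relationship between the two parameterisations can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual correspondence in which the shape parameters are $$\mu\phi$$ and $$(1-\mu)\phi$$; the function names are hypothetical and are not taken from any library.

```python
def shape_from_mean_precision(mu, phi):
    """Convert (mean, precision) to the usual beta shape parameters (p, q).

    Assumes the correspondence p = mu * phi and q = (1 - mu) * phi.
    """
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    if phi <= 0.0:
        raise ValueError("phi must be positive")
    return mu * phi, (1.0 - mu) * phi


def beta_variance(p, q):
    """Variance of a beta distribution with shape parameters p and q."""
    return p * q / ((p + q) ** 2 * (p + q + 1.0))


if __name__ == "__main__":
    # Holding the mean fixed, a larger precision gives a smaller variance.
    for phi in (2.0, 8.0, 50.0):
        p, q = shape_from_mean_precision(0.25, phi)
        print(phi, p, q, beta_variance(p, q))
```

Running the loop with the mean fixed at 0.25 shows the variance shrinking as $$\phi$$ grows, which is the sense in which $$\phi$$ acts as a precision.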

Beta regression has three major motivations. Firstly, beta-distributed variables are usually heteroscedastic, with greater scatter close to the mean value and less in the tails, whereas linear regression assumes homoscedasticity. Secondly, while transformations are available to handle beta-distributed dependent variables within the generalised linear regression framework, these transformations mean that the regression models a transformed variable $$y'$$ rather than $$y$$, so results are interpreted in terms of the mean of $$y'$$ rather than the mean of $$y$$, which is more awkward. Thirdly, values within $$(0, 1)$$ generally come from skewed distributions.

The basic algebra of the beta regression is linear in terms of the link function, but even in the equal dispersion case presented below, it is not a special case of generalised linear regression:

$$g(\mu_i) = x_i^T\beta = \eta_i,$$

where $$g$$ is a link function.
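
To make the model concrete, the following sketch evaluates the log-likelihood of an equal dispersion beta regression for given coefficients. It assumes a logit link (one common choice, not the only one) and the shape parameters $$\mu_i\phi$$ and $$(1-\mu_i)\phi$$; all function names are hypothetical and the code is meant as an illustration rather than a fitting routine.

```python
import math


def logit(mu):
    """Logit link: maps a mean in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse logit: maps a linear predictor back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_regression_loglik(y, X, beta, phi):
    """Log-likelihood of responses y given covariate rows X.

    beta is the coefficient vector (shared by all observations) and
    phi is the common precision. Assumes a logit link.
    """
    total = 0.0
    for y_i, x_i in zip(y, X):
        eta_i = sum(x * b for x, b in zip(x_i, beta))  # linear predictor
        mu_i = inv_logit(eta_i)
        p, q = mu_i * phi, (1.0 - mu_i) * phi          # beta shape parameters
        total += (
            math.lgamma(phi) - math.lgamma(p) - math.lgamma(q)
            + (p - 1.0) * math.log(y_i)
            + (q - 1.0) * math.log(1.0 - y_i)
        )
    return total
```

A fitting procedure would maximise this function over `beta` and `phi`; the sketch only shows how the link function connects the linear predictor to the mean of each observation.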

It is also notable that the variance of $$y$$ is dependent on $$\mu$$ in the model, so beta regressions are naturally heteroscedastic.
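
This dependence can be written explicitly. Assuming the shape parameters of the beta distribution are $$\mu_i\phi$$ and $$(1-\mu_i)\phi$$ (the usual correspondence for the mean/precision parameterisation), the variance of each observation is

$$\operatorname{Var}(y_i) = \frac{\mu_i(1 - \mu_i)}{1 + \phi},$$

so the variance changes with $$\mu_i$$ even when $$\phi$$ is held constant, and is largest when $$\mu_i = 1/2$$.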

Variable dispersion beta regression
There is also variable dispersion beta regression, where $$\phi$$ is modelled separately for each observation rather than being held constant. Likelihood ratio tests can be "interpreted as testing the null hypothesis of equidispersion against a specific alternative of variable dispersion" by comparing an equidispersion model with a variable dispersion model. For example, within the R programming language, the formula "$$y \sim x_1 + x_2$$" describes an equidispersion model, which might be compared to any of the following three specific variable dispersion alternatives, where the terms after the vertical bar are the covariates used to model the precision:


 * $$y\sim x_1 + x_2 | z_1$$
 * $$y\sim x_1 + x_2 | z_2$$
 * $$y\sim x_1 + x_2 | z_1 + z_2$$
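
The likelihood ratio comparison between an equidispersion model and one of these alternatives can be sketched as follows. This is a minimal illustration, assuming the maximised log-likelihoods of both fitted models are already available and that the statistic is referred to a chi-squared distribution with degrees of freedom equal to the number of extra precision coefficients. The function names are hypothetical, and the tail probability is only implemented for one or an even number of degrees of freedom, which covers the three alternatives listed above.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variable.

    Only df = 1 and even df are handled, using closed forms.
    """
    if x < 0.0:
        raise ValueError("x must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df > 0 and df % 2 == 0:
        half = x / 2.0
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    raise NotImplementedError("only df = 1 or even df are supported")


def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return (statistic, p-value) for nested models.

    loglik_null: maximised log-likelihood of the equidispersion model.
    loglik_alt:  maximised log-likelihood of the variable dispersion model.
    df:          number of additional parameters in the alternative.
    """
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(max(stat, 0.0), df)
```

A small p-value would count as evidence against equidispersion in favour of the specific alternative being tested.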

The Breusch-Pagan test can be used to identify $$z$$ variables.

The choice of link equation can render the need for variable dispersion irrelevant, at least when judged in terms of model fit.

A quasi-RESET diagnostic test (inspired by RESET, i.e. the regression specification error test) is available for considering misspecification, particularly in the context of link equation choice. If a power of the fitted mean or linear predictor is added as a covariate and the result is a better model than the same formula without the power term, then the original model formula is misspecified. This quasi-RESET diagnostic procedure may also be considered graphically, for example by plotting the absolute raw residuals of the two models against each other as $$(x, y)$$ values; the model that has the smaller absolute residual more often is to be preferred.
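
The graphical comparison has a simple numerical counterpart, sketched below. It assumes the raw residuals of the two candidate models are already available as equal-length sequences in the same observation order; the function name is hypothetical.

```python
def share_smaller_abs_residual(residuals_a, residuals_b):
    """Return the fraction of observations on which each model has the
    strictly smaller absolute raw residual, as a pair (share_a, share_b).

    Ties count for neither model, so the two shares need not sum to 1.
    """
    if len(residuals_a) != len(residuals_b):
        raise ValueError("residual sequences must have the same length")
    n = len(residuals_a)
    if n == 0:
        raise ValueError("at least one observation is required")
    pairs = list(zip(residuals_a, residuals_b))
    wins_a = sum(abs(a) < abs(b) for a, b in pairs)
    wins_b = sum(abs(b) < abs(a) for a, b in pairs)
    return wins_a / n, wins_b / n
```

The model with the larger share is the one that would be preferred on this criterion.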

In general, the closer the observed $$y$$ values are to the $$(a, b)$$ extremes, the more significant the choice of link function.

The link function can also affect whether the maximum likelihood estimation (MLE) procedure that statistical programs use to fit beta regressions converges. Furthermore, the MLE procedure can tend to underestimate the standard errors and therefore distort significance inferences in beta regression. In practice, however, Bias Correction (BC) and Bias Reduction (BR) are used essentially as diagnostic steps, i.e. the analyst compares the model with neither BC nor BR to two models, each implementing one of BC and BR.

The assumptions of beta regression are:


 * link appropriateness ("deviance residuals vs. indices of observation", at least for the logit link)
 * homogeneous residuals ("deviance residuals vs. linear predictor")
 * normality ("half-normal plot of deviance residuals")
 * no outliers ("Cook's distance to determine outliers")