User:Sleepophil/sandbox

Bayesian hierarchical modelling is a statistical model written in multiple levels (hierarchical form) that estimates the parameters of the posterior distribution using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as the updated probability estimate, as additional evidence on the prior distribution is acquired.

Frequentist statistics may yield conclusions seemingly incompatible with those offered by Bayesian statistics due to the Bayesian treatment of the parameters as random variables and its use of subjective information in establishing assumptions on these parameters. As the approaches answer different questions the formal results aren't technically contradictory but the two approaches disagree over which answer is relevant to particular applications. Bayesians argue that relevant information regarding decision making and updating beliefs cannot be ignored and that hierarchical modeling has the potential to overrule classical methods in applications where respondents give multiple observational data. Moreover, the model has proven to be robust, with the posterior distribution less sensitive to the more flexible hierarchical priors.

Hierarchical modeling is used when information is available on several different levels of observational units. The hierarchical form of analysis and organization helps in the understanding of multiparameter problems and also plays an important role in developing computational strategies.

Philosophy
Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies dependence of the joint probability model for these parameters. Individual degrees of belief, expressed in the form of probabilities, come with uncertainty. Amidst this is the change of the degrees of belief over time. As was stated by Professor José M. Bernardo and Professor Adrian F. Smith, “The actuality of the learning process consists in the evolution of individual and subjective beliefs about the reality.” These subjective probabilities are more directly involved in the mind rather than the physical probabilities. Hence, it is with this need of updating beliefs that Bayesians have formulated an alternative statistical model which takes into account the prior occurrence of a particular event.

Bayes' theorem
The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options.

Suppose in a study of the effectiveness of cardiac treatments, with the patients in hospital j having survival probability $$\theta_j$$, the survival probability will be updated with the occurrence of y, the event in which a controversial serum is created which, as believed by some, increases survival in cardiac patients.

In order to make updated probability statements about $$\theta_j$$, given the occurrence of event y, we must begin with a model providing a joint probability distribution for $$\theta_j$$ and y. This can be written as a product of the two distributions that are often referred to as the prior distribution $$P(\theta)$$ and the sampling distribution $$P(y\mid\theta)$$ respectively:


 * $$P(\theta, y) = P(\theta)P(y\mid\theta)$$

Using the basic property of conditional probability, the posterior distribution will yield:


 * $$P(\theta\mid y)=\frac{P(\theta,y)}{P(y)} = \frac{P(y\mid\theta)P(\theta)}{P(y)}$$

This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference which aims to incorporate the updated belief, $$P(\theta\mid y)$$, in appropriate and solvable ways.

Exchangeability
The usual starting point of a statistical analysis is the assumption that the n values $$y_1, y_2, \ldots, y_n$$ are exchangeable. If no information – other than data y – is available to distinguish any of the $$\theta_j$$’s from any others, and no ordering or grouping of the parameters can be made, one must assume symmetry among the parameters in their prior distribution. This symmetry is represented probabilistically by exchangeability. Generally, it is useful and appropriate to model data from an exchangeable distribution as independently and identically distributed, given some unknown parameter vector $$\theta$$, with distribution $$P(\theta)$$.

Finite exchangeability
For a fixed number n, the set $$y_1, y_2, \ldots, y_n$$ is exchangeable if the joint probability $$P(y_1, y_2, \ldots, y_n)$$ is invariant under permutations of the indices. That is, for every permutation $$\pi$$ or $$(\pi_1, \pi_2, \ldots, \pi_n)$$  of $$(1,2,...,n)$$, $$P(y_1, y_2, \ldots, y_n)= P(y_{\pi_1}, y_{\pi_2}, \ldots, y_{\pi_n}).$$

Following is an exchangeable, but not independent and identical (iid), example: Consider an urn with a red ball and a blue ball inside, with probability $$\frac{1}{2}$$ of drawing either. Balls are drawn without replacement, i.e. after one ball is drawn from the n balls, there will be n &minus; 1 remaining balls left for the next draw.


 * $$\text{Let }

Y_i = \begin{cases} 1, & \text{if the }i\text{th ball is red},\\ 0, & \text{otherwise}. \end{cases} $$

Since the probability of selecting a red ball in the first draw and a blue ball in the second draw is equal to the probability of selecting a blue ball on the first draw and a red on the second draw, both of which are equal to 1/2 (i.e. $$[P(y_1 = 1, y_2 =0) = P(y_1=0,y_2=1)= \frac{1}{2}]$$), then $$y_1$$ and $$y_2$$ are exchangeable.

But the probability of selecting a red ball on the second draw given that the red ball has already been selected in the first draw is 0, and is not equal to the probability that the red ball is selected in the second draw which is equal to 1/2 (i.e. $$ [P(y_2=1\mid y_1=1)=0 \ne P(y_2=1)= \frac{1}{2}]$$). Thus, $$y_1$$ and $$y_2$$ are not independent.

If $$x_1, \ldots, x_n$$ are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true.

Infinite exchangeability
Infinite exchangeability is the property that every finite subset of an infinite sequence $$y_1$$, $$y_2, \ldots$$ is exchangeable. That is, for any n, the sequence $$y_1, y_2, \ldots, y_n$$ is exchangeable.

Components
Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely:


 * 1) Hyperparameters: parameters of the prior distribution
 * 2) Hyperpriors: distributions of Hyperparameters

Suppose a random variable Y follows a normal distribution with parameter θ as the mean and 1 as the variance, that is $$Y\mid \theta \sim N(\theta,1)$$. The tilde relation $$\sim$$ can be read as "has the distribution of" or "is distributed as". Suppose also that the parameter $$\theta$$ has a distribution given by a normal distribution with mean $$\mu$$ and variance 1, i.e. $$\theta\mid\mu \sim N(\mu,1)$$. Furthermore, $$\mu$$ follows another distribution given, for example, by the standard normal distribution, $$\text{N}(0,1)$$. The parameter $$\mu$$ is called the hyperparameter, while its distribution given by $$\text{N}(0,1)$$ is an example of a hyperprior distribution. The notation of the distribution of Y changes as another parameter is added, i.e. $$Y \mid \theta,\mu \sim N(\theta,1)$$. If there is another stage, say, follows another nor$$\mu$$mal distribution with mean $$\beta$$ and variance $$\epsilon$$, meaning $$\mu \sim N(\beta,\epsilon)$$, $$ \mbox { }$$$$\beta$$ and $$\epsilon$$ can also be called hyperparameters while their distributions are hyperprior distributions as well.

Framework
Let $$y_j$$ be an observation and $$\theta_j$$ a parameter governing the data generating process for $$y_j$$. Assume further that the parameters $$\theta_1, \theta_2, \ldots, \theta_j$$ are generated exchangeably from a common population, with distribution governed by a hyperparameter $$\phi$$. The Bayesian hierarchical model contains the following stages:


 * $$\text{Stage I: } y_j\mid\theta_j,\phi \sim P(y_j\mid\theta_j,\phi)$$


 * $$\text{Stage II: } \theta_j\mid\phi \sim P(\theta_j\mid\phi)$$
 * $$\text{Stage III: } \phi \sim P(\phi)$$

The likelihood, as seen in stage I is $$P(y_j\mid\theta_j,\phi)$$, with $$P(\theta_j,\phi)$$ as its prior distribution. Note that the likelihood depends on $$\phi$$ only through $$\theta_j$$.

The prior distribution from stage I can be broken down into:


 * $$P(\theta_j,\phi) = P(\theta_j\mid\phi)P(\phi) $$ [from the definition of conditional probability]

With $$\phi$$ as its hyperparameter with hyperprior distribution, $$P(\phi)$$.

Thus, the posterior distribution is proportional to:


 * $$P(\phi,\theta_j\mid y) \propto P(y_j \mid\theta_j,\phi) P(\theta_j,\phi)$$ [using Bayes' Theorem]


 * $$P(\phi,\theta_j\mid y) \propto P(y_j\mid\theta_j ) P(\theta_j \mid\phi ) P(\phi) $$

Example
To further illustrate this, consider the example: A teacher wants to estimate how well a student did on the SAT. The teacher uses information on the student’s high school grades and current grade point average (GPA) to come up with an estimate. The student's current GPA, denoted by $$Y$$, has a likelihood given by some probability function with parameter $$\theta$$, i.e. $$Y\mid\theta \sim P(Y\mid\theta)$$. This parameter $$\theta$$ is the SAT score of the student. The SAT score is viewed as a sample coming from a common population distribution indexed by another parameter $$\phi$$, which is the high school grade of the student (freshman, sophomore, junior or senior). That is, $$\theta\mid\phi \sim P(\theta\mid\phi)$$. Moreover, the hyperparameter $$\phi$$ follows its own distribution given by $$P(\phi)$$, a hyperprior. To solve for the SAT score given information on the GPA,


 * $$P(\theta,\phi\mid Y) \propto P(Y\mid\theta,\phi)P(\theta,\phi)$$
 * $$P(\theta,\phi\mid Y) \propto P(Y\mid\theta)P(\theta\mid\phi)P(\phi)$$

All information in the problem will be used to solve for the posterior distribution. Instead of solving only using the prior distribution and the likelihood function, the use of hyperpriors gives more information to make more accurate beliefs in the behavior of a parameter.

2-stage hierarchical model
In general, the joint posterior distribution of interest in 2-stage hierarchical models is:


 * $$P(\theta,\phi\mid Y) = {P(Y\mid\theta,\phi) P(\theta,\phi) \over P(Y)} = {P(Y\mid\theta)P(\theta\mid\phi)P(\phi) \over P(Y)}$$


 * $$P(\theta,\phi\mid Y) \propto P(Y\mid\theta)P(\theta\mid\phi)P(\phi)$$

3-stage hierarchical model
For 3-stage hierarchical models, the posterior distribution is given by:


 * $$P(\theta,\phi, X\mid Y) = {P(Y\mid\theta)P(\theta\mid\phi)P(\phi\mid X)P(X) \over P(Y)}$$


 * $$P(\theta,\phi, X\mid Y) \propto P(Y\mid\theta)P(\theta\mid\phi)P(\phi\mid X)P(X)$$

8 schools hierarchical model (a 2-stage example)
A concrete example of a hierarchical model is the "8 schools" model, where 8 different schools organized coaching programs for the SAT test with various results. The improvement in students' SAT scores after finishing the coaching programs from 8 different schools is summarized as follows :

Namely, the scores of students in each school follows a normal distribution with mean $$y_i$$ and standard error $$\sigma_i$$, assuming the score distribution in each school is close enough to normal distributions so that the statistics can be effectively summarized by the two number $$y_i$$ and $$\sigma_i$$. The goal is to determine how effective are coaching programs at improving SAT scores in general, based on the data we obtain from these 8 schools.

To achieve this, we construct a hierarchical model with parameters $$\theta_i (i \in 1,\text{...},8)$$, which represents the true mean of the score improvement in each school, and hyperparameters $$\mu$$ and $$\tau$$, which reflects the effectiveness of coaching programs in general. Assuming that standard errors of the effect, $$\sigma_i$$, is well-constrained by the student score data, which can be achieved when the sample sizes are large, we have$$y_i | \theta_i \sim \mathcal{N}(\theta_ i,\sigma _i ^2)$$where $\mathcal{N}(\theta_i,\sigma _i ^2)$ is a normal distribution with mean $$\theta _i$$ and standard deviation $$\sigma_i$$. Assuming also that effectiveness of individual schools follows a normal distribution with mean $$\mu$$ and standard deviation $$\tau$$, i.e., $$ \theta _i | \mu, \tau \sim \mathcal{N} (\mu, \tau ^2) $$so this model is a 2-stage hierarchical model with parameters $$\theta _i$$ and hyperparameters $$\mu$$ and $$\tau$$.

Applying Bayes' theorem to this model gives

$$P(\mu,\tau | \{ y_i \}) \propto \prod_{i=1}^8 P(y_i | \theta _i,\sigma _i) P(\theta_i | \mu, \tau) P(\mu, \tau) $$where $$P(y_i | \theta _i,\sigma _i) $$ and $$P(\theta_i | \mu, \tau) $$ are both likelihood functions given by the normal distribution, and $$P(\mu, \tau) $$ is the prior of the hyperparameters. Prior to this experiments there is no information of how these coaching programs perform, so uniform distributions are used for priors of $$\mu $$ and $$\tau $$. This means that $$P(\mu, \tau) = P(\mu) P(\tau) $$ where $$P(\mu) $$ is a uniform distribution symmetric about 0 and $$P(\tau) $$ is a uniform distribution only on positive $$\tau$$ values. At this stage, the right-hand side of the equation above is completely specified, and Metropolis-Hastings algorithm can then be used to numerically integrate the posterior distribution of the hyperparameters. The result of a simulation using two separate Markov chains are plotted in the image to the right. From the posterior distributions shown, $$\mu $$ is center around values between 5 to 10, values greater than 0, meaning that coaching programs in general is effective at improving students' SAT score based on the data obtained from the 8 schools listed above.