Allele frequency spectrum

In population genetics, the allele frequency spectrum, sometimes called the site frequency spectrum, is the distribution of the allele frequencies of a given set of loci (often SNPs) in a population or sample. Because an allele frequency spectrum is usually computed from, or compared with, a sequenced sample of the population, it is a histogram whose size depends on the number of sequenced chromosomes. Each entry in the frequency spectrum records the total number of loci with the corresponding derived allele frequency. Loci contributing to the frequency spectrum are assumed to change in frequency independently of one another. Furthermore, loci are assumed to be biallelic (that is, with exactly two alleles present), although extensions for multiallelic frequency spectra exist.

Many summary statistics of observed genetic variation are themselves summaries of the allele frequency spectrum, including estimates of $$ \theta $$ such as Watterson's $$ \theta_W $$ and Tajima's $$ \theta_\pi $$, Tajima's D, Fay and Wu's H, and the fixation index $$ F_{ST} $$.
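As a rough illustration of how such statistics reduce to weighted sums over the spectrum, the sketch below computes Watterson's $$ \theta_W $$, Tajima's $$ \theta_\pi $$, and the unnormalized differences that underlie Tajima's D and Fay and Wu's H. The function names and the input vector are illustrative assumptions, and the variance normalization used in the full Tajima's D statistic is deliberately omitted.

```python
from math import comb

def theta_watterson(sfs):
    """Watterson's estimator: number of segregating sites S divided by
    a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(sfs) + 1                      # sfs holds x_1 .. x_{n-1}
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n

def theta_pi(sfs):
    """Tajima's estimator: mean number of pairwise differences."""
    n = len(sfs) + 1
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)

def theta_h(sfs):
    """Estimator weighted towards high-frequency derived alleles
    (the second term in Fay and Wu's H)."""
    n = len(sfs) + 1
    return sum(i * i * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)

# Illustrative unfolded spectrum for n = 6 chromosomes (an assumed input).
sfs = [4, 2, 1, 0, 1]

tajima_d_numerator = theta_pi(sfs) - theta_watterson(sfs)  # unnormalized
fay_wu_h = theta_pi(sfs) - theta_h(sfs)

print(theta_watterson(sfs), theta_pi(sfs), tajima_d_numerator, fay_wu_h)
```

Each estimator weights the entries $$ x_i $$ differently, which is why departures from the neutral expectation shift them by different amounts.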

Example
The allele frequency spectrum from a sample of $$ n $$ chromosomes is calculated by counting, for each $$ 1 \leq i \leq n-1 $$, the number of sites at which the derived allele appears on exactly $$ i $$ of the sampled chromosomes. For example, consider a sample of $$ n=6 $$ chromosomes with eight observed variable sites. If the data are tabulated by chromosome and site, a 1 indicates that the derived allele is observed at that site, while a 0 indicates that the ancestral allele was observed. The allele frequency spectrum can be written as the vector $$ \mathbf{x} = (x_1,x_2,x_3,x_4,x_5) $$, where $$ x_i $$ is the number of observed sites with derived allele frequency $$ i $$. In this example, the observed allele frequency spectrum is $$ (4,2,1,0,1) $$: four sites carry a single copy of the derived allele, two sites carry two copies, one site carries three copies, no site carries four, and one site carries five.
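The counting procedure can be sketched in a few lines of Python. The 0/1 matrix below is hypothetical: it was constructed only so that its column totals reproduce the spectrum $$ (4,2,1,0,1) $$, and it is not the original data table. The function name `unfolded_sfs` is likewise an illustrative choice.

```python
def unfolded_sfs(sites, n):
    """Count sites by derived-allele count.

    sites: list of sites, each a list of n entries (1 = derived, 0 = ancestral).
    Returns [x_1, ..., x_{n-1}].
    """
    sfs = [0] * (n - 1)
    for site in sites:
        count = sum(site)
        if 0 < count < n:          # skip sites that are not variable in the sample
            sfs[count - 1] += 1
    return sfs

# Hypothetical data: 8 variable sites scored on n = 6 chromosomes.
sites = [
    [1, 0, 0, 0, 0, 0],  # derived allele seen once
    [0, 1, 0, 0, 0, 0],  # once
    [0, 0, 0, 1, 0, 0],  # once
    [0, 0, 0, 0, 0, 1],  # once
    [1, 1, 0, 0, 0, 0],  # twice
    [0, 0, 1, 0, 1, 0],  # twice
    [0, 1, 1, 0, 0, 1],  # three times
    [1, 1, 1, 0, 1, 1],  # five times
]

spectrum = unfolded_sfs(sites, n=6)
print(spectrum)  # [4, 2, 1, 0, 1]
```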

Calculation
The expected allele frequency spectrum may be calculated using either a coalescent or diffusion approach. The demographic history of a population and natural selection affect allele frequency dynamics, and these effects are reflected in the shape of the allele frequency spectrum. For the simple case of selectively neutral alleles segregating in a population that has reached demographic equilibrium (that is, without recent population size changes or gene flow), the expected allele frequency spectrum $$ \mathbf{x} = (x_1,\ldots,x_{n-1}) $$ for a sample of size $$ n $$ is given by

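$$ x_i = \theta \frac{1}{i}, $$ where $$ \theta = 4N\mu $$ is the population-scaled mutation rate for a diploid population of size $$ N $$ with per-generation mutation rate $$ \mu $$ (for a haploid population, $$ \theta = 2N\mu $$). Deviations from demographic equilibrium or neutrality will change the shape of the expected frequency spectrum.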
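A minimal sketch of this expectation, assuming an arbitrary illustrative value of $$ \theta = 4 $$ and a sample of $$ n = 6 $$ chromosomes (neither value comes from the text):

```python
def expected_neutral_sfs(theta, n):
    """Expected unfolded spectrum under neutrality and demographic
    equilibrium: x_i = theta / i for i = 1 .. n-1."""
    return [theta / i for i in range(1, n)]

theta = 4.0   # illustrative value, not an estimate from real data
n = 6

expected = expected_neutral_sfs(theta, n)
expected_segregating_sites = sum(expected)   # equals theta * sum_{i=1}^{n-1} 1/i

print(expected)
print(expected_segregating_sites)
```

The entries fall off as $$ 1/i $$, so singletons are expected to be the most common class, and the total is the expected number of segregating sites in the sample.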

Calculating the frequency spectrum from observed sequence data requires one to be able to distinguish the ancestral and derived (mutant) alleles, often by comparing to an outgroup sequence. For example, in human population genetic studies, the homologous chimpanzee reference sequence is typically used to estimate the ancestral allele. However, sometimes the ancestral allele cannot be determined, in which case the folded allele frequency spectrum may be calculated instead. The folded frequency spectrum stores the observed counts of the minor (less common) allele at each site. The folded spectrum can be calculated by binning together the $$ i $$th and $$ (n-i) $$th entries from the unfolded spectrum, where $$ n $$ is the number of sampled chromosomes.
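The folding step can be sketched as follows. The function name `fold_sfs` is illustrative, and the input is the unfolded spectrum $$ (4,2,1,0,1) $$ for $$ n = 6 $$. When $$ n $$ is even, the middle entry $$ i = n/2 $$ is its own partner and is kept as is rather than counted twice.

```python
def fold_sfs(sfs):
    """Fold an unfolded spectrum [x_1, ..., x_{n-1}] into minor-allele
    counts [y_1, ..., y_{floor(n/2)}], where y_i = x_i + x_{n-i}."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        if i == j:
            folded.append(sfs[i - 1])          # middle class when n is even
        else:
            folded.append(sfs[i - 1] + sfs[j - 1])
    return folded

unfolded = [4, 2, 1, 0, 1]
folded = fold_sfs(unfolded)
print(folded)  # [5, 2, 1]
```

Folding preserves the total number of variable sites; only the information about which allele is derived is discarded.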

Multi-population allele frequency spectrum
The joint allele frequency spectrum (JAFS) is the joint distribution of allele frequencies across two or more related populations. The JAFS for $$ d $$ populations, with $$ n_j $$ sampled chromosomes in the $$ j $$th population, is a $$ d $$-dimensional histogram, in which each entry stores the total number of segregating sites at which the derived allele is observed with the corresponding frequency in each population. Each axis of the histogram corresponds to a population, and the index along the $$ j $$th axis runs over $$ 0 \leq i \leq n_j $$.

Example
Suppose we sequence diploid individuals from two populations, 4 individuals from population 1 and 2 individuals from population 2. This gives 8 sampled chromosomes from population 1 and 4 from population 2, so the JAFS would be a $$ 9\times5 $$ matrix, indexed from zero. The $$ [3,2] $$ entry would record the number of observed polymorphic loci with derived allele frequency 3 in population 1 and frequency 2 in population 2. The $$ [1,0] $$ entry would record those loci with observed frequency 1 in population 1, and frequency 0 in population 2. The $$ [8,3] $$ entry would record those loci with the derived allele fixed in population 1 (seen in all chromosomes), and with frequency 3 in population 2.
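A small sketch of how such a matrix might be filled in, using the same sample sizes (8 and 4 chromosomes). The four sites below are hypothetical and were chosen only to populate the $$ [3,2] $$, $$ [1,0] $$, and $$ [8,3] $$ entries. The function name `joint_sfs` and the per-site 0/1 input layout are assumptions for illustration.

```python
def joint_sfs(pop1_sites, pop2_sites, n1, n2):
    """Two-population joint spectrum as an (n1+1) x (n2+1) nested list.

    pop1_sites[k] and pop2_sites[k] hold the 0/1 derived-allele states of
    site k on the n1 and n2 chromosomes sampled from each population.
    """
    jafs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for s1, s2 in zip(pop1_sites, pop2_sites):
        i, j = sum(s1), sum(s2)
        if 0 < i + j < n1 + n2:    # keep only sites segregating in the joint sample
            jafs[i][j] += 1
    return jafs

# Hypothetical data: 4 sites, 8 chromosomes in population 1, 4 in population 2.
pop1_sites = [
    [1, 1, 1, 0, 0, 0, 0, 0],   # frequency 3
    [1, 0, 0, 0, 0, 0, 0, 0],   # frequency 1
    [1, 1, 1, 1, 1, 1, 1, 1],   # fixed in population 1
    [0, 1, 0, 0, 0, 0, 0, 0],   # frequency 1
]
pop2_sites = [
    [1, 1, 0, 0],               # frequency 2
    [0, 0, 0, 0],               # absent
    [1, 1, 1, 0],               # frequency 3
    [0, 0, 0, 0],               # absent
]

jafs = joint_sfs(pop1_sites, pop2_sites, n1=8, n2=4)
print(len(jafs), len(jafs[0]))             # 9 5
print(jafs[3][2], jafs[1][0], jafs[8][3])  # 1 2 1
```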

Applications
The shape of the allele frequency spectrum is sensitive to demography, such as population size changes, migration, and substructure, as well as natural selection. By comparing observed data summarized in a frequency spectrum to the expected frequency spectrum calculated under a given demographic and selection model, one can assess the goodness of fit of that model to the data, and use likelihood theory to estimate the best-fit parameters of the model.
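As a toy illustration of this comparison, the sketch below fits a single parameter, $$ \theta $$, under the neutral equilibrium expectation $$ x_i = \theta/i $$. It assumes that each entry of the spectrum is an independent Poisson count with mean equal to its expectation; that likelihood, the grid search, and the observed vector are all assumptions made for the example, not a method prescribed by the text.

```python
from math import lgamma, log

def neutral_expected(theta, n):
    """Expected spectrum under neutrality and equilibrium: theta / i."""
    return [theta / i for i in range(1, n)]

def poisson_log_likelihood(observed, expected):
    """Log-likelihood treating each entry as an independent Poisson count."""
    return sum(x * log(m) - m - lgamma(x + 1) for x, m in zip(observed, expected))

observed = [4, 2, 1, 0, 1]     # illustrative observed spectrum, n = 6
n = len(observed) + 1

# Coarse grid search over theta in [0.10, 10.00] in steps of 0.01.
grid = [k / 100 for k in range(10, 1001)]
best_theta = max(
    grid,
    key=lambda t: poisson_log_likelihood(observed, neutral_expected(t, n)),
)

# Under this model the exact maximizer is S / a_n, Watterson's estimator.
a_n = sum(1.0 / i for i in range(1, n))
watterson = sum(observed) / a_n

print(best_theta, watterson)
```

For this one-parameter model the grid maximum lands next to Watterson's estimator, which is the closed-form solution. Richer demographic models replace `neutral_expected` with a spectrum computed by coalescent or diffusion methods and are typically fit with a numerical optimizer rather than a grid.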

For example, suppose a population experienced a recent period of exponential growth, and that $$ n $$ sequences were sampled from the population at the end of the growth period. The observed (data) allele frequency spectrum is then calculated from putatively neutral variation. The demographic model would have parameters for the exponential growth rate $$ \rho $$, the time $$ T $$ for which the growth occurred, and a reference population size $$ N_{ref} $$, assuming that the population was at equilibrium when the growth began. The expected frequency spectrum for a given parameter set $$ (\rho,T,N_{ref}) $$ can be obtained using either diffusion or coalescent theory, and compared to the data frequency spectrum. The best-fit parameters can be found using maximum likelihood.

This approach has been used to infer demographic and selection models for many species, including humans. For example, Marth et al. (2004) used the single-population allele frequency spectra for samples of Africans, Europeans, and Asians to show that population bottlenecks have occurred in the Asian and European demographic histories, but not in the African one. More recently, Gutenkunst et al. (2009) used the joint allele frequency spectrum for these same three populations to infer the time at which the populations diverged and the amount of subsequent ongoing migration between them (see out of Africa hypothesis). Additionally, these methods may be used to estimate patterns of selection from allele frequency data. For example, Boyko et al. (2008) inferred the distribution of fitness effects for newly arising mutations from human polymorphism data while controlling for the effects of non-equilibrium demography.