User:Citing/sandbox3

Measures
Population structure is a complex phenomenon and no single measure captures it entirely. Understanding a population's structure requires a combination of methods and measures.

Heterozygosity
One of the results of population structure is a reduction in heterozygosity. When populations split, alleles have a higher chance of reaching fixation within subpopulations, especially if the subpopulations are small or have been isolated for long periods. This reduction in heterozygosity can be thought of as an extension of inbreeding, with individuals in subpopulations being more likely to share a recent common ancestor. The scale is important: an individual with both parents born in the United Kingdom is not inbred relative to that country's population, but is inbred relative to the world population, because the parents are more closely related than two humans selected at random from the entire world. This motivates the derivation of Wright's F-statistics (also called "fixation indices"), which measure inbreeding through observed versus expected heterozygosity. For example, $$F_{IS}$$ measures the inbreeding coefficient at a single locus for an individual $$I$$ relative to some subpopulation $$S$$:

$$F_{IS} = 1 - \frac{H_I}{H_S}$$

Here, $$H_I$$ is the observed fraction of individuals in subpopulation $$S$$ that are heterozygous. Assuming there are two alleles, $$A_1$$ and $$A_2$$, that occur at respective frequencies $$p_S$$ and $$q_S = 1 - p_S$$, subpopulation $$S$$ is expected under random mating to have a heterozygosity of $$H_S = 2p_S(1-p_S) = 2 p_S q_S$$. Then:

$$F_{IS} = 1 - \frac{H_I}{2 p_S q_S}$$

Similarly, for the total population $$T$$, we can define the expected heterozygosity $$H_T = 2 p_T q_T$$, which allows us to compare the expected heterozygosity of subpopulation $$S$$ against that of the total population and to compute $$F_{ST}$$ as:

$$F_{ST} = 1 - \frac{H_S}{H_T} = 1 - \frac{2p_S q_S}{2 p_T q_T}$$

If $$F_{ST}$$ is 0, then the allele frequencies between populations are identical, suggesting no structure. The theoretical maximum value of 1 is attained when an allele reaches total fixation within the subpopulation ($$H_S = 0$$) while the total population remains polymorphic, but most observed maximum values are far lower. $$F_{ST}$$ is one of the most common measures of population structure, and there are several different formulations depending on the number of populations and the alleles of interest. Although it is sometimes used as a genetic distance between populations, it does not always satisfy the triangle inequality and thus is not a metric. It also depends on within-population diversity, which makes interpretation and comparison difficult.
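The definitions above can be illustrated with a minimal sketch that computes $$F_{IS}$$ and $$F_{ST}$$ from genotype counts at one bi-allelic locus. The function names and the example counts are hypothetical, and averaging $$H_S$$ over subpopulations weighted by sample size is an assumption of this sketch (one of several possible formulations), not something specified in the text.

```python
def allele_frequency(n11, n12, n22):
    """Frequency of allele A1 from counts of A1A1, A1A2 and A2A2 genotypes."""
    n = n11 + n12 + n22
    return (2 * n11 + n12) / (2 * n)


def f_is(n11, n12, n22):
    """F_IS = 1 - H_I / H_S for a single subpopulation.

    H_I is the observed fraction of heterozygotes and H_S = 2pq is the
    heterozygosity expected under random mating. Undefined (division by
    zero) if the subpopulation is fixed for one allele.
    """
    n = n11 + n12 + n22
    h_i = n12 / n
    p = allele_frequency(n11, n12, n22)
    h_s = 2 * p * (1 - p)
    return 1 - h_i / h_s


def f_st(subpops):
    """F_ST = 1 - H_S / H_T over a list of (n11, n12, n22) tuples.

    Assumption of this sketch: H_S is the sample-size-weighted mean of the
    expected heterozygosities, and p_T is the pooled allele frequency.
    """
    sizes = [sum(counts) for counts in subpops]
    total = sum(sizes)
    freqs = [allele_frequency(*counts) for counts in subpops]
    h_s = sum(n * 2 * p * (1 - p) for n, p in zip(sizes, freqs)) / total
    p_t = sum(n * p for n, p in zip(sizes, freqs)) / total
    h_t = 2 * p_t * (1 - p_t)
    return 1 - h_s / h_t


# Hypothetical counts: two subpopulations of 100 individuals each, with
# A1 frequencies 0.7 and 0.3, both in Hardy-Weinberg proportions.
example = [(49, 42, 9), (9, 42, 49)]
print(f_is(*example[0]), f_st(example))
```

In this example each subpopulation mates randomly, so $$F_{IS}$$ is approximately 0, while the difference in allele frequencies between the two subpopulations gives an $$F_{ST}$$ of approximately 0.16.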

Admixture inference
An individual's genotype can be modelled as an admixture between K discrete clusters of populations. Each cluster is defined by its allele frequencies, and the contribution of a cluster to an individual's genotypes is measured via an estimator. In 2000, Jonathan K. Pritchard and colleagues introduced the STRUCTURE algorithm to estimate these proportions via Markov chain Monte Carlo. Since then, algorithms (such as ADMIXTURE) have been developed using other estimation techniques. Estimated proportions can be visualized using bar plots: each bar represents an individual, and is subdivided to represent the proportion of an individual's genetic ancestry from each of the K populations.

Varying K can illustrate different scales of population structure; using a small K for the entire human population will subdivide people roughly by continent, while using large K will partition populations into finer subgroups. Though clustering methods are popular, they are open to misinterpretation: for non-simulated data, there is never a true value of K, but rather an approximation considered useful for a given question. They are sensitive to sampling strategies, sample size, and close relatives in data sets; there may be no discrete populations at all; and there may be hierarchical structure where subpopulations are nested. Clusters may be admixed themselves, and may not have a useful interpretation as source populations.
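As a rough illustration of the admixture model, the sketch below scores candidate ancestry proportions for one individual using a binomial genotype likelihood, in which the individual's allele frequency at each locus is a mixture of the cluster frequencies. This likelihood form, the grid search, and all names and numbers are assumptions made for illustration; it is not the STRUCTURE or ADMIXTURE implementation, which also estimate the cluster frequencies themselves.

```python
import math

# Number of ways to obtain g copies of the allele from two chromosomes.
BINOMIAL_COEFF = (1, 2, 1)


def mixed_frequencies(proportions, cluster_freqs):
    """Allele frequency at each locus for an individual whose ancestry is a
    mixture of K clusters: sum over k of q_k * f_{k,l}."""
    num_loci = len(cluster_freqs[0])
    return [
        sum(q * freqs[l] for q, freqs in zip(proportions, cluster_freqs))
        for l in range(num_loci)
    ]


def log_likelihood(genotypes, proportions, cluster_freqs):
    """Log-likelihood of genotypes (0, 1 or 2 copies per locus), assuming
    independent loci and cluster frequencies strictly between 0 and 1."""
    total = 0.0
    for g, f in zip(genotypes, mixed_frequencies(proportions, cluster_freqs)):
        total += math.log(BINOMIAL_COEFF[g])
        total += g * math.log(f) + (2 - g) * math.log(1 - f)
    return total


def best_proportion_two_clusters(genotypes, cluster_freqs, steps=100):
    """Grid search over q (the share of ancestry from cluster 0) for K = 2."""
    grid = [i / steps for i in range(steps + 1)]
    return max(
        grid,
        key=lambda q: log_likelihood(genotypes, (q, 1 - q), cluster_freqs),
    )


# Hypothetical data: two clusters with strongly differing frequencies at
# two loci, and an individual whose genotypes resemble cluster 0.
clusters = [[0.9, 0.1], [0.1, 0.9]]
individual = [2, 0]
print(best_proportion_two_clusters(individual, clusters))
```

A grid search is used here only because it is easy to follow for K = 2; it does not scale to larger K or to jointly estimating cluster frequencies.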

Dimensionality reduction


Genetic data are high dimensional, and dimensionality reduction techniques can capture population structure. Principal component analysis (PCA) was first applied in population genetics in 1978 by Cavalli-Sforza and colleagues, and saw renewed use with the advent of high-throughput sequencing.

Initially, PCA was used on allele frequencies at known genetic markers for populations, though later it was found that by coding SNPs as integers (for example, as the number of non-reference alleles) and normalizing the values, PCA could be applied at the level of individuals. One formulation considers $$N$$ individuals and $$S$$ bi-allelic SNPs. For each individual $$i$$, the value at locus $$l$$ is $$g_{i,l}$$, the number of non-reference alleles (one of $$0, 1, 2$$). If the allele frequency at $$l$$ is $$p_{l}$$, then the resulting $$N \times S$$ matrix of normalized genotypes has entries:

$$\frac{g_{i,l} - 2p_{l}}{\sqrt{2p_{l} (1-p_{l})}}$$

PCA transforms data to maximize variance; given enough data, when each individual is visualized as a point on a plot, discrete clusters can form. Individuals with admixed ancestries will tend to fall between clusters, and when there is homogeneous isolation by distance in the data, the top PC vectors will reflect geographic variation. The eigenvectors generated by PCA can be explicitly written in terms of the mean coalescent times for pairs of individuals, making PCA useful for interpreting population histories of groups in a given sample. PCA cannot, however, distinguish between different processes that lead to the same mean coalescent times.
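The normalization and projection described above can be sketched in a few lines. The example below estimates $$p_l$$ from the sample, drops monomorphic SNPs (for which the denominator is zero), and finds the leading principal component by power iteration on the $$N \times N$$ matrix of inner products between individuals. The function names, the toy genotypes, and the choice of power iteration are assumptions made for illustration; practical analyses use dedicated linear algebra libraries.

```python
import math


def normalize_genotypes(genotypes):
    """Return rows of (g - 2p) / sqrt(2p(1 - p)), one row per individual.

    genotypes: list of N rows, each with S counts in {0, 1, 2}.
    The allele frequency p of each SNP is estimated from the sample, and
    SNPs with p equal to 0 or 1 are skipped.
    """
    n = len(genotypes)
    columns = []
    for l in range(len(genotypes[0])):
        column = [row[l] for row in genotypes]
        p = sum(column) / (2 * n)
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNP: no variance to normalize
        scale = math.sqrt(2 * p * (1 - p))
        columns.append([(g - 2 * p) / scale for g in column])
    return [[column[i] for column in columns] for i in range(n)]


def leading_pc(matrix, iterations=200):
    """Unit-length coordinates of individuals on the first principal
    component (up to sign), by power iteration on the Gram matrix."""
    n = len(matrix)
    gram = [
        [sum(a * b for a, b in zip(matrix[i], matrix[j])) for j in range(n)]
        for i in range(n)
    ]
    # A constant start vector would be annihilated because the columns are
    # centred, so the iteration starts from 1, 2, ..., N instead.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variance left after normalization")
        v = [x / norm for x in w]
    return v


# Hypothetical genotypes: individuals 0 and 1 resemble each other, as do
# individuals 2 and 3; the last SNP is monomorphic and is dropped.
toy = [
    [2, 2, 0, 0, 0],
    [2, 1, 0, 0, 0],
    [0, 0, 2, 2, 0],
    [0, 1, 2, 1, 0],
]
print(leading_pc(normalize_genotypes(toy)))
```

In this toy data set the first two individuals receive coordinates of one sign and the last two of the opposite sign, which is the kind of separation that appears as clusters when individuals are plotted.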

Multidimensional scaling and discriminant analysis have been used to study differentiation and population assignment, and to analyze genetic distances. Neighborhood graph approaches like t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) can visualize continental and subcontinental structure in human data. With larger datasets, UMAP better captures multiple scales of population structure; fine-scale patterns can be hidden or split by other methods, and these patterns are of interest when there are many diverse or admixed populations, or when examining relationships between genotypes, phenotypes, and geography. Variational autoencoders can generate artificial genotypes with structure representative of the input data.

In humans

 * Analysis of structure can re-construct the histories of populations
 * History has been shaped by migrations, population bottlenecks, admixture. Models that re-create the structure from such events are useful.
 * Commercial testing and genetic ancestry?


 * Ancient stuff
 * Medical
 * Population histories
 * Descriptive
 * Genetic ancestry

Ancient/archaic

 * Models of archaic admixture and recent history from two-locus statistics Archaic refs 43-48
 * Outstanding questions in the study of archaic hominin admixture
 * Reconstructing Prehistoric African Population Structure
 * Ancient Structure in Africa Unlikely to Explain Neanderthal and Non-African Genetic Similarity
 * Effect of ancient population structure on the degree of polymorphism shared between modern human populations and ancient hominins
 * The IICR and the non-stationary structured coalescent: towards demographic inference with arbitrary changes in population structure
 * Distinguishing Recent Admixture from Ancestral Population Structure
 * Tracking human population structure through time from whole genome sequences


 * Population Structure, Stratification, and Introgression of Human Structural Variation


 * Insights into human genetic variation and population history from 929 diverse genomes
 * Multiple Deeply Divergent Denisovan Ancestries in Papuans


 * Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture
 * Origins of modern human ancestry
 * Ref 27 on inferred N and structure

Not explicitly ancient/archaic

 * Population structure of modern-day Italians reveals patterns of ancient and archaic ancestries in Southern Europe
 * Genetic Consequences of the Transatlantic Slave Trade in the Americas
 * Clustering of 770,000 genomes reveals post-colonial population structure of North America
 * Exploring Cuba’s population structure and demographic history using genome-wide data
 * Population structure in Argentina
 * Toward a fine-scale population health monitoring system
 * What is ancestry?

Genetic epidemiology
Population structure can be a problem for association studies, such as case-control studies, because it can produce a spurious association between the trait of interest and a locus. As an example, in a study population of Europeans and East Asians, an association study of chopstick usage may "discover" a gene in the East Asian individuals that leads to chopstick use. However, this is a spurious relationship: the genetic variant is simply more common in East Asians than in Europeans. Conversely, real genetic associations may be overlooked if the variant is less prevalent in the population from which the case subjects are chosen. For this reason, it was common in the 1990s to use family-based data, where the effect of population structure can be controlled for using methods such as the transmission disequilibrium test (TDT).
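The confounding described above can be made concrete with a small worked example. In the sketch below, carrier status and the trait are independent within each of two populations, yet pooling the populations produces a large association statistic, because both the variant and the trait are more common in one population. All counts are invented for illustration, and the Pearson chi-square test on a 2×2 table stands in for whatever test a real study would use.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the table

                    trait   no trait
        carrier       a        b
        non-carrier   c        d
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Invented counts for 1,000 individuals per population.
# Population A: 80% carriers, 90% with the trait, independent of each other.
population_a = (720, 80, 180, 20)
# Population B: 20% carriers, 10% with the trait, independent of each other.
population_b = (20, 180, 80, 720)
# Ignoring population labels and pooling everyone into one table.
pooled = tuple(x + y for x, y in zip(population_a, population_b))

print(chi_square_2x2(*population_a))  # no association within population A
print(chi_square_2x2(*population_b))  # no association within population B
print(chi_square_2x2(*pooled))        # strong apparent association when pooled
```

The statistic is zero within each population and roughly 460.8 in the pooled table, even though the variant has no effect on the trait in either population.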

Phenotypes (measurable traits), such as height or risk for heart disease, are the product of some combination of genes and environment. These traits can be predicted using polygenic scores, which seek to isolate and estimate the contribution of genetics to a trait by summing the effects of many individual genetic variants. To construct a score, researchers first enrol participants in an association study to estimate the contribution of each genetic variant. Then, they can use the estimated contributions of each genetic variant to calculate a score for the trait for an individual who was not in the original association study. If structure in the study population is correlated with environmental variation, then the polygenic score is no longer measuring the genetic component alone.
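A polygenic score of the kind described above is, in its simplest additive form, a weighted sum of an individual's allele counts. The sketch below assumes that form; the effect sizes and genotypes are hypothetical values chosen for illustration, not estimates from any study.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Additive polygenic score: the sum over variants of
    (estimated effect size) * (number of effect alleles carried).

    allele_counts: counts in {0, 1, 2}, one per variant, for one individual.
    effect_sizes: per-variant effects estimated in an association study.
    """
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("one effect size is required per variant")
    return sum(beta * g for beta, g in zip(effect_sizes, allele_counts))


# Hypothetical effect sizes from an association study, applied to an
# individual who was not part of that study.
effects = [0.5, -0.25, 0.125]
new_individual = [2, 1, 0]
print(polygenic_score(new_individual, effects))
```

If the effect sizes were estimated in a structured sample, any environmental difference that tracks ancestry is absorbed into them, and the sum then carries that bias into every score computed with them.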

Several methods can at least partially control for this confounding effect. The genomic control method was introduced in 1999 and is a relatively nonparametric method for controlling the inflation of test statistics. It is also possible to use unlinked genetic markers to estimate each individual's ancestry proportions from some K subpopulations, which are assumed to be unstructured. More recent approaches make use of principal component analysis (PCA), as demonstrated by Alkes Price and colleagues, or by deriving a genetic relationship matrix (also called a kinship matrix) and including it in a linear mixed model (LMM).
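As an illustration of the genomic control idea, the sketch below estimates an inflation factor from the median of the association test statistics and divides every statistic by it. It assumes 1-degree-of-freedom chi-square statistics and the commonly used median-based estimator, with the factor floored at 1; these choices, the function names, and the example values are assumptions of this sketch rather than details given in the text.

```python
import statistics

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Ratio of the observed median statistic to the median expected when
    there is no association, floored at 1 so that statistics are never
    inflated by the correction."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)


def genomic_control(chi2_stats):
    """Divide every test statistic by the estimated inflation factor."""
    lam = inflation_factor(chi2_stats)
    return [stat / lam for stat in chi2_stats]


# Hypothetical chi-square statistics from five markers.
observed = [0.2, 0.5, 0.91, 1.5, 12.0]
print(inflation_factor(observed))
print(genomic_control(observed))
```

Because a single factor is applied to all markers, this correction addresses uniform inflation only; markers that differ unusually strongly in frequency between subpopulations can remain under-corrected.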

PCA and LMMs have become the most common methods to control for confounding from population structure. Though they are likely sufficient for avoiding false positives in association studies, they are still vulnerable to overestimating effect sizes of marginally associated variants and can substantially bias estimates of polygenic scores and trait heritability. If environmental effects are related to a variant that exists in only one specific region (for example, a pollutant is found in only one city), it may not be possible to correct for this population structure effect at all. For many traits, the role of structure is complex and not fully understood, and incorporating it into genetic studies remains a challenge and is an active area of research.

In other organisms
In non-human organisms, population structure is used to study diversity in crops, which can reveal potential vulnerabilities to disease, and the genetic history of cultivars can in turn be used to trace human population histories. It can also be used to examine the evolution of microscopic organisms and pathogens. In animals, population structure is a useful tool for tracing the origins of disease vectors such as mosquitoes and for studying the origins and distributions of endangered species.


 * In non-human animals, plants, bacteria, etc
 * Conservation
 * Fighting disease (pests, vectors, agriculture)


 * Mosquito


 * Rhino

Refs

Abandoned refs
Might be useful: