Phylogenetic invariants

Phylogenetic invariants are polynomial relationships between the frequencies of various site patterns in an idealized DNA multiple sequence alignment. They have received substantial study in the field of biomathematics, and they can be used to choose among phylogenetic tree topologies in an empirical setting. The primary advantage of phylogenetic invariants relative to other methods of phylogenetic estimation, such as maximum likelihood or Bayesian MCMC analyses, is that invariants can yield information about the tree without requiring the estimation of branch lengths or model parameters. The idea of using phylogenetic invariants was introduced independently by James Cavender and Joseph Felsenstein and by James A. Lake in 1987.

At this point the number of programs that allow empirical datasets to be analyzed using invariants is limited. However, phylogenetic invariants may provide solutions to other problems in phylogenetics and they represent an area of active research for that reason. Felsenstein stated it best when he said, "invariants are worth attention, not for what they do for us now, but what they might lead to in the future." (p. 390)

If we consider a multiple sequence alignment with t taxa and no gaps or missing data (i.e., an idealized multiple sequence alignment), there are $$4^t$$ possible site patterns. For example, there are 256 possible site patterns for four taxa, with frequencies $$f_{AAAA}, f_{AAAC}, f_{AAAG}, \ldots, f_{TTTT}$$ that can be written as a vector. This site pattern frequency vector has 255 degrees of freedom because the frequencies must sum to one. However, any set of site pattern frequencies that resulted from some specific process of sequence evolution on a specific tree must obey many constraints, and therefore has many fewer degrees of freedom. Thus, there should be polynomials involving those frequencies that take on a value of zero if the DNA sequences were generated on a specific tree given a particular substitution model.
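The site pattern frequency vector described above can be tabulated directly from an alignment. The following is a minimal sketch, assuming an idealized alignment given as equal-length strings over A, C, G, and T; the function name `site_pattern_frequencies` and the toy alignment are illustrative and do not come from any particular software package.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def site_pattern_frequencies(alignment):
    """Return a dict mapping each of the 4**t site patterns to its observed frequency.

    alignment: list of t equal-length strings over ACGT (one per taxon).
    """
    t = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    # Each alignment column is one site pattern, e.g. 'AAAT'.
    counts = Counter("".join(column) for column in zip(*alignment))
    if any(set(pattern) - set(BASES) for pattern in counts):
        raise ValueError("gaps and ambiguity codes are not handled in this sketch")
    # Enumerate every possible pattern so that unobserved patterns get frequency 0.
    patterns = ["".join(p) for p in product(BASES, repeat=t)]
    return {p: counts[p] / length for p in patterns}

# Hypothetical four-taxon alignment with six sites.
toy_alignment = ["ACGTAC", "ACGTAA", "ACGAAC", "ACGTTC"]
freqs = site_pattern_frequencies(toy_alignment)
```

For four taxa the returned dictionary has 256 entries whose values sum to one, which is why the vector has only 255 degrees of freedom.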

Invariants are formulas in the expected pattern frequencies, not the observed pattern frequencies. When they are computed using the observed pattern frequencies, we will usually find that they are not precisely zero even when the model and tree topology are correct. By testing whether such polynomials for various trees are 'nearly zero' when evaluated on the observed frequencies of patterns in real sequence data, one should be able to infer which tree best explains the data.

Some invariants are straightforward consequences of symmetries in the model of nucleotide substitution and they will take on a value of zero regardless of the underlying tree topology. For example, if we assume the Jukes-Cantor model of sequence evolution and a four-taxon tree we expect:

$$f_{ACAT}-f_{CGCA}=0$$

This is a simple outgrowth of the fact that base frequencies are constrained to be equal under the Jukes-Cantor model; invariants of this kind are therefore called symmetry invariants. The equation shown above is only one of a large number of symmetry invariants for the Jukes-Cantor model; in fact, there are a total of 241 symmetry invariants for that model on a four-taxon tree. Symmetry invariants are non-phylogenetic in nature: they take on the expected value of zero regardless of the tree topology. However, they make it possible to determine whether a particular multiple sequence alignment fits the Jukes-Cantor model of evolution (i.e., by testing whether the site patterns of the appropriate types are present in equal numbers).

More general tests for the best-fitting model using invariants are also possible. For example, Kedzierska et al. (2012) used invariants to select the best-fitting model from a specific model set: JC69*, K80*, K81*, SSM, and GMM. The asterisk after the JC69, K80, and K81 models is used to emphasize the non-homogeneous nature of the models that can be examined using invariants. These non-homogeneous models include the commonly used continuous-time JC69, K80, and K81 models as submodels. The SSM (strand-specific model), also called the CS05 model, is a generalized non-homogeneous version of the HKY (Hasegawa-Kishino-Yano) model constrained to have an equal distribution of the pairs of bases A,T and C,G at each node of the tree, with no assumption regarding a stable base distribution. All models listed above are submodels of the general Markov model (GMM). The ability to perform tests using non-homogeneous models represents a major benefit of invariants methods relative to the more commonly used maximum likelihood methods for phylogenetic model testing.
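The count of 241 symmetry invariants for the Jukes-Cantor model on four taxa can be checked by brute force. The sketch below assumes that two site patterns have equal expected frequency under Jukes-Cantor whenever one can be turned into the other by relabeling the bases, so that each class of k interchangeable patterns contributes k - 1 independent equalities. The helper name `pattern_class` and the labels x, y, z, w are illustrative.

```python
from itertools import product

BASES = "ACGT"

def pattern_class(pattern):
    """Relabel bases by order of first appearance, e.g. 'CGCA' -> 'xyxz'."""
    labels = {}
    for base in pattern:
        # Assign the next unused label the first time a base is seen.
        labels.setdefault(base, "xyzw"[len(labels)])
    return "".join(labels[base] for base in pattern)

# Group all 256 four-taxon patterns into classes of interchangeable patterns.
classes = {}
for p in product(BASES, repeat=4):
    classes.setdefault(pattern_class(p), []).append("".join(p))

n_classes = len(classes)                    # number of distinct pattern types
n_symmetry_invariants = 4 ** 4 - n_classes  # one equality fewer than members, per class
```

Both ACAT and CGCA fall into the class `xyxz`, which is why their expected frequencies are equal in the example equation.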

Phylogenetic invariants, which are defined as the subset of invariants that take on a value of zero only when the sequences were (or were not) generated on a specific topology, are likely to be the most useful invariants for phylogenetic studies.

Lake's linear invariants
Lake's invariants (which he called "evolutionary parsimony") provide an excellent example of phylogenetic invariants. Lake's invariants involve quartets, two of which (the incorrect topologies) yield values of zero and one of which yields a value greater than zero. This can be used to construct a test based on the following invariant relationship, which holds for the two incorrect trees when sites evolve under the Kimura two-parameter model of sequence evolution:

$$f_{1133}+f_{1234}=f_{1233}+f_{1134}$$

The indices of these site pattern frequencies indicate the bases scored relative to the base in the first taxon (which we call taxon A). If base 1 is a purine, then base 2 is the other purine and bases 3 and 4 are the pyrimidines. If base 1 is a pyrimidine, then base 2 is the other pyrimidine and bases 3 and 4 are the purines.

We will call the three possible quartet trees TX [((A,B),(C,D)); in Newick format], TY [((A,C),(B,D)); in Newick format], and TZ [((A,D),(B,C)); in Newick format]. We can calculate three values from the data to identify the best topology given the data:

$$X=N_{1133}-N_{1233}-N_{1134}+N_{1234}$$

$$Y=N_{1313}-N_{1323}-N_{1314}+N_{1324}$$

$$Z=N_{1331}-N_{1332}-N_{1341}+N_{1342}$$

Lake broke these values up into a "parsimony-like term" ($$P=N_{1133}+N_{1234}$$ for TX) and a "background term" ($$B=N_{1233}+N_{1134}$$ for TX), and suggested testing for deviation from zero by calculating $$\chi^2=(P-B)^2/(P+B)$$ and performing a $$\chi^2$$ test with one degree of freedom. Similar $$\chi^2$$ tests can be performed for Y and Z. If one of the three values is significantly different from zero, the corresponding topology is the best estimate of phylogeny. The advantage of using Lake's invariants relative to maximum likelihood or neighbor joining of Kimura two-parameter distances is that the invariants should hold regardless of the model parameters, branch lengths, or patterns of among-sites rate heterogeneity.
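The calculation of X, Y, and Z and the associated tests can be sketched in a few lines. This is an illustrative sketch rather than Lake's own implementation. It assumes a gap-free four-taxon alignment with taxa in the order A, B, C, D, and it assumes that label 3 goes to whichever transversion base appears first in a column (the text above does not say which of the two is 3). The parsimony-like and background terms for TY and TZ are formed by analogy with those given for TX. The names `lake_code`, `TERMS`, and `lake_test` are hypothetical.

```python
import math
from collections import Counter

PURINES, PYRIMIDINES = "AG", "CT"

# (parsimony-like patterns, background patterns) for each quartet topology.
TERMS = {
    "TX": (("1133", "1234"), ("1233", "1134")),  # ((A,B),(C,D))
    "TY": (("1313", "1324"), ("1323", "1314")),  # ((A,C),(B,D))
    "TZ": (("1331", "1342"), ("1332", "1341")),  # ((A,D),(B,C))
}

def lake_code(column):
    """Recode a four-base column relative to the base in taxon A (the first base)."""
    if len(column) != 4 or any(b not in "ACGT" for b in column):
        raise ValueError("expected four unambiguous bases")
    first = column[0]
    same_class = PURINES if first in PURINES else PYRIMIDINES
    code = {first: "1", same_class.replace(first, ""): "2"}
    for base in column:
        if base not in code:
            # Assumption: the first transversion base seen is 3, the other is 4.
            code[base] = "3" if "3" not in code.values() else "4"
    return "".join(code[b] for b in column)

def lake_test(alignment):
    """Return P - B, chi-squared, and a one-degree-of-freedom p-value per topology."""
    counts = Counter(lake_code(col) for col in zip(*alignment))
    results = {}
    for tree, (parsimony, background) in TERMS.items():
        P = sum(counts[k] for k in parsimony)
        B = sum(counts[k] for k in background)
        chi2 = (P - B) ** 2 / (P + B) if P + B else 0.0
        # Upper tail of the chi-squared distribution with one degree of freedom.
        p_value = math.erfc(math.sqrt(chi2 / 2))
        results[tree] = {"value": P - B, "P": P, "B": B, "chi2": chi2, "p": p_value}
    return results

# Hypothetical toy data, built column by column and then transposed into sequences.
columns = ["AATT"] * 10 + ["AGTT"] * 2 + ["AATC", "ACAC"]
alignment = ["".join(col[i] for col in columns) for i in range(4)]
results = lake_test(alignment)
```

On this toy alignment only X is far from zero, which would point toward TX, although the sample is far too small for the test to be meaningful.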

A classic study by John Huelsenbeck and David Hillis found that Lake's invariants converge on the true tree over all of the branch length space they examined when the underlying model of evolution is the Kimura two-parameter model. However, they also found that Lake's invariants are very inefficient (large amounts of data are necessary to converge on the correct tree). This inefficiency has caused most empiricists to abandon the use of Lake's invariants. Also, because Lake's invariants are based on the Kimura two-parameter model, phylogenetic estimation using Lake's invariants may not yield the true tree when the process that generated the data strongly violates that model.

Modern approaches using phylogenetic invariants
The low efficiency of Lake's invariants reflects the fact that they use a limited set of generators for the phylogenetic invariants. Casanellas et al. introduced methods to derive a much larger set of generators for DNA data, and this has led to the development of invariants methods that are as efficient as maximum likelihood methods. Several of these methods have implementations that are practical for analyses of empirical datasets.

Eriksson proposed an invariants method for the general Markov model based on singular value decomposition (SVD) of matrices generated by "flattening" the nucleotides associated with each of the leaves (i.e., the site pattern frequency spectrum). A different flattening matrix is produced for each topology. However, comparisons of the original Eriksson SVD method (ErikSVD) to neighbor joining and the maximum likelihood approach implemented in the PHYLIP program dnaml were mixed: ErikSVD underperformed the other two methods when used with simulated data, but it appeared to perform better than dnaml when applied to an empirical mammalian dataset based on an early release of data from the ENCODE project. The original ErikSVD method was improved by Fernández-Sánchez and Casanellas, who proposed a normalization they called Erik+2. The original ErikSVD method is statistically consistent (it converges on the true tree as the empirical distribution approaches the theoretical distribution); the Erik+2 normalization improves the performance of the method given finite datasets. It has been implemented in the software package PAUP* as an option for the SVDquartets method.
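The flattening idea can be illustrated for a single quartet. The sketch below is not the ErikSVD, Erik+2, or SVDquartets implementation. It assumes NumPy is available, stores the 256 pattern frequencies as a 4×4×4×4 array indexed by the bases at taxa A, B, C, and D, and scores each split by the Frobenius distance from its 16×16 flattening to the nearest rank-4 matrix, which is one simple way to measure how close the flattening is to the low rank expected for the true split. The function names and the Jukes-Cantor-like toy parameters are illustrative assumptions.

```python
import numpy as np

# Axis orders that put the two "row" taxa first for each possible split.
SPLITS = {"AB|CD": (0, 1, 2, 3), "AC|BD": (0, 2, 1, 3), "AD|BC": (0, 3, 1, 2)}

def flattening(f, order):
    """Reshape a (4, 4, 4, 4) pattern-frequency array into a 16 x 16 matrix."""
    return np.transpose(f, order).reshape(16, 16)

def svd_score(f, order, rank=4):
    """Frobenius distance from the flattening to the nearest matrix of the given rank."""
    singular_values = np.linalg.svd(flattening(f, order), compute_uv=False)
    return float(np.sqrt(np.sum(singular_values[rank:] ** 2)))

def jc_matrix(p):
    """Toy 4 x 4 transition matrix: probability p of changing to each other base."""
    return np.full((4, 4), p) + np.eye(4) * (1 - 4 * p)

def quartet_pattern_probs(p_leaf, p_internal):
    """Expected pattern frequencies on ((A,B),(C,D)) with uniform base frequencies."""
    M = jc_matrix(p_leaf)                 # same substitution matrix on every leaf edge
    joint = 0.25 * jc_matrix(p_internal)  # joint distribution of the two internal nodes
    # Sum over internal states u (parent of A, B) and v (parent of C, D).
    return np.einsum("uv,ua,ub,vc,vd->abcd", joint, M, M, M, M)

f = quartet_pattern_probs(p_leaf=0.05, p_internal=0.05)
scores = {split: svd_score(f, order) for split, order in SPLITS.items()}
best_split = min(scores, key=scores.get)
```

With exact expected frequencies the score for the generating split AB|CD is zero up to rounding error, while the two other splits give clearly positive scores. With observed frequencies from finite data all three scores are positive and the smallest one is preferred.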

"Squangles" (stochastic quartet tangles ) represents another example of an invariants method hat has been implemented in software package that is practical to be used with empirical datasets. Squangles permit the choice among the three possible quartets assuming that DNA sequences have evolved under the general Markov model; the quartets can then be assembled using a supertree method. There are three squangles that are useful for differentiating among quartets, which can be denoted as q1(f), q2(f), and q3(f) (f is a 256 element vector containing the site frequency spectrum). Each q has 66,744 terms and together they satisfy the linear relation q1 + q2 + q3 = 0 (i.e., up to linear dependence there are only two q values). Each possible quartet has different expected values for q1, q2, and q3: The expected values q1, q2, and q3 are all zero on the star topology (a quartet with an internal branch length of zero). For practicality, Holland et al. used least squares to solve for the q values. Empirical tests of the squangles method have been limited but they appear to be promising.