User:Manudouz/sandbox/Patristic distance

A patristic distance is the sum of all branch lengths connecting two leaves of a phylogenetic tree.

Definition
Patristic distance: the sum of the lengths of the branches that connect two nodes in a phylogenetic tree, where those nodes are typically terminal nodes representing extant taxa. Because it is inferred from the tree and takes multiple substitutions into account, it is generally greater than the uncorrected distance computed directly from the number of differences observed between the two corresponding sequences in the alignment.
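The definition can be illustrated with a minimal Python sketch. The function name patristic_distance, the edge-list representation and the four-leaf tree with its branch lengths are all hypothetical, chosen only for illustration; they do not come from any particular package.

```python
from collections import defaultdict


def patristic_distance(edges, a, b):
    """Sum the branch lengths on the unique path between nodes a and b.

    edges: iterable of (node_u, node_v, branch_length) describing a tree.
    """
    # Build an undirected adjacency list from the branch list.
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, length))
        adj[v].append((u, length))

    # Depth-first walk from a; in a tree every node is reached by one path only.
    stack = [(a, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == b:
            return dist
        for nxt, length in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + length))
    raise ValueError(f"{b!r} is not reachable from {a!r}")


# Hypothetical tree ((A,B),(C,D)) with internal nodes x and y.
toy_edges = [
    ("A", "x", 0.1),
    ("B", "x", 0.2),
    ("x", "y", 0.3),
    ("C", "y", 0.4),
    ("D", "y", 0.5),
]
d_ab = patristic_distance(toy_edges, "A", "B")  # path A-x-B
d_ac = patristic_distance(toy_edges, "A", "C")  # path A-x-y-C
d_xy = patristic_distance(toy_edges, "x", "y")  # internal nodes work as well
```

The same function handles leaves and internal nodes, since it only follows the single path that joins the two nodes.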

Etymology
The name derives from the Ancient Greek adjective πατρίς, -ίδος (patris, -idos), meaning "from the ancestors". It refers to the fact that, when a patristic distance is computed, the two leaves are joined by moving along the branches that connect each of them to their most recent common ancestral node.

Misc

 * 1) Path-length distance.
 * 2) Connection of internal nodes (not only leaves).
 * 3) Link with patrocladogram.

Working example
Let us assume that we have five elements $$(a,b,c,d,e)$$ and the following matrix $$D_1$$ of pairwise distances between them:

Upper right half-matrix: observed distances. Lower left half-matrix: patristic distances (yellow cells).



Use to build trees
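Patristic distances are used by the Fitch-Margoliash least-squares method, which looks for the tree whose patristic distances deviate as little as possible from the observed pairwise distances.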
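As a sketch of the least-squares criterion, write $$D_{ij}$$ for the observed distance between taxa $$i$$ and $$j$$, $$d_{ij}$$ for the corresponding patristic distance on a candidate tree, and $$P$$ for the weighting power (these symbols are introduced here for illustration only). The quantity to minimise can then be written as

$$SS = \sum_{i<j} \frac{(D_{ij} - d_{ij})^2}{D_{ij}^{P}}$$

where $$P = 2$$ corresponds to the weighting originally proposed by Fitch and Margoliash, and $$P = 0$$ gives the unweighted criterion.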

Software
In PAUP*, uncorrected distances under the unweighted least-squares criterion are selected with:

 dset distance=p objective=lsfit power=0

The dset command sets various options for the distance-based methods.

The option "distance=p" specifies the use of uncorrected sequence distances, i.e., the observed distances are not corrected for multiple substitutions. These distances are reported as "substitutions per site", meaning that the number of differences has been divided by the length of the sequence. You can think of this distance as the fraction of sites that differ between two sequences.

The option "objective=lsfit" specifies that trees are reconstructed using the least-squares optimality criterion. Recall that under least squares we are trying to find the tree that has the smallest possible deviation between the observed pairwise distances and the pairwise distances measured along the tree. (The distance between two taxa measured along the tree is called the "patristic" distance.) The overall fit of the tree is found by (1) computing the difference between each observed distance and the corresponding patristic distance, (2) squaring this difference, so that the result is never negative regardless of whether the observed or the patristic distance is bigger, and (3) adding all the squared differences.

The option "power=0" specifies that the squared differences are not weighted by the size of the observed distances when computing this fit.

For each possible tree topology, PAUP* finds the best set of branch lengths and then computes the "sum of squared errors" as a measure of how well the patristic distances fit the observed pairwise sequence distances. The best tree is the one with the smallest sum of squared errors. At the end of the run, PAUP* outputs a histogram giving the distribution of the sum of squares for all trees (only three in this case).
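The three-step fit computation described above can be sketched in Python as follows. This is not PAUP* code: the function name least_squares_fit, the dictionary layout and all distance values are hypothetical, and the weight is assumed to take the form 1 / (observed distance) ** power, so that power=0 leaves every squared difference unweighted.

```python
def least_squares_fit(observed, patristic, power=0):
    """Sum of (optionally weighted) squared differences between two distance tables.

    observed, patristic: dicts mapping a pair of taxa, e.g. ("A", "B"), to a distance.
    power: exponent of the assumed weight 1 / observed**power; 0 means unweighted.
    """
    total = 0.0
    for pair, d_obs in observed.items():
        diff = d_obs - patristic[pair]      # step 1: observed minus patristic
        squared = diff ** 2                 # step 2: square, so the term is never negative
        weight = 1.0 if power == 0 else 1.0 / d_obs ** power
        total += weight * squared           # step 3: accumulate the sum
    return total


# Hypothetical observed distances for four taxa.
observed = {
    ("A", "B"): 0.30, ("A", "C"): 0.80, ("A", "D"): 0.90,
    ("B", "C"): 0.95, ("B", "D"): 1.00, ("C", "D"): 0.90,
}
# Hypothetical patristic distances read off one candidate tree.
patristic = {
    ("A", "B"): 0.30, ("A", "C"): 0.80, ("A", "D"): 0.90,
    ("B", "C"): 0.90, ("B", "D"): 1.00, ("C", "D"): 0.90,
}

ss_unweighted = least_squares_fit(observed, patristic, power=0)
ss_weighted = least_squares_fit(observed, patristic, power=2)
```

In a tree search, this value would be computed once per candidate topology, after its branch lengths have been optimised, and the topology with the smallest value would be retained.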
 * 1) Computation in R.
 * 2) PAUP*
 * 3) PATRISTIC