User:Manudouz/sandbox/NJ

Working example


The working example is based on a JC69 genetic distance matrix computed from the 5S ribosomal RNA sequence alignment of five bacteria: Bacillus subtilis ($$a$$), Bacillus stearothermophilus ($$b$$), Lactobacillus viridescens ($$c$$), Acholeplasma modicum ($$d$$), and Micrococcus luteus ($$e$$).

First step
Let us assume that we have five elements $$(a,b,c,d,e)$$ and the following matrix $$D_1$$ of pairwise distances between them:
 * First clustering

For each element $$i$$, we calculate $$u_i$$, the sum of its distances to all other elements divided by $$n-2$$:
 * $$u_i = \frac{1}{n-2} \sum_{j \neq i} d(i,j)$$    (where $$i, j \in \{a,b,c,d,e\}$$ and $$n = 5$$)

For example:
 * $$u_a = \frac{1}{n-2} \sum_{j \neq a} d(a,j) = \frac{1}{n-2} (d(a,b)+d(a,c)+d(a,d)+d(a,e)) = \frac{17+21+31+23}{5-2} = \frac{92}{3} \approx 30.7$$
 * $$u_b = \frac{1}{n-2} \sum_{j \neq b} d(b,j) = \frac{1}{n-2} (d(b,a)+d(b,c)+d(b,d)+d(b,e)) = \frac{17+30+34+21}{5-2} = \frac{102}{3} = 34.0$$

and so on for $$c$$, $$d$$, and $$e$$.
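The same calculation can be sketched in Python for all five elements. The distances in rows $$a$$ and $$b$$ are the ones quoted in the formulas above; the three remaining entries ($$c$$-$$d$$, $$c$$-$$e$$, $$d$$-$$e$$) are not stated explicitly and are inferred here from the $$u_c$$, $$u_d$$ and $$u_e$$ values used in the next subsection, so they should be read as an assumption. The names `D1`, `dist` and `u` are illustrative only.

```python
# Pairwise distances. Rows a and b are quoted in the text; the c-d, c-e and
# d-e entries are inferred from the u_c, u_d, u_e values (assumption).
D1 = {
    ("a", "b"): 17, ("a", "c"): 21, ("a", "d"): 31, ("a", "e"): 23,
    ("b", "c"): 30, ("b", "d"): 34, ("b", "e"): 21,
    ("c", "d"): 28, ("c", "e"): 39, ("d", "e"): 43,
}
labels = ["a", "b", "c", "d", "e"]


def dist(i, j):
    """Symmetric lookup with a zero diagonal."""
    if i == j:
        return 0
    return D1[(i, j)] if (i, j) in D1 else D1[(j, i)]


n = len(labels)
# u_i = (sum of distances from i to every other element) / (n - 2)
u = {i: sum(dist(i, j) for j in labels) / (n - 2) for i in labels}

for i in labels:
    print(f"u_{i} = {u[i]:.1f}")
```

With these inputs the printed values are 30.7, 34.0, 39.3, 45.3 and 42.0 for $$a$$ to $$e$$ respectively.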


 * First joining

We calculate the values of the $$Q_1$$ matrix:
 * $$Q_1(i,j) = d(i,j) - u_i - u_j $$

For example, for element $$a$$:
 * $$Q_1(a,b) = d(a,b) - u_a - u_b = 17 - 30.7 - 34.0 = -47.7$$
 * $$Q_1(a,c) = d(a,c) - u_a - u_c = 21 - 30.7 - 39.3 = -49.0$$
 * $$Q_1(a,d) = d(a,d) - u_a - u_d = 31 - 30.7 - 45.3 = -45.0$$
 * $$Q_1(a,e) = d(a,e) - u_a - u_e = 23 - 30.7 - 42.0 = -49.7$$
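The definition of $$Q_1$$ in terms of $$u_i$$ is a rescaled version of the form written with row sums, which is the form used for $$Q_3$$ in the final step. Multiplying by $$n-2$$ and substituting the definition of $$u_i$$ gives:
 * $$(n-2) \left[ d(i,j) - u_i - u_j \right] = (n-2)\,d(i,j) - \sum_{k} d(i,k) - \sum_{k} d(j,k)$$

Since $$n-2 > 0$$, both forms rank the pairs identically and are minimized by the same pair.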

We obtain the following values for the $$Q_1$$ matrix (the diagonal elements of the matrix are not used and are omitted here):

In the example above, $$Q_1(c,d)=-56.7$$. This is the smallest value of $$Q_1$$, so we join elements $$c$$ and $$d$$.
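As a rough check of this selection step, the sketch below computes every off-diagonal $$Q_1$$ value and picks the minimum. It assumes the same distance matrix as in the $$u_i$$ calculation: rows $$a$$ and $$b$$ are quoted in the text, while the $$c$$-$$d$$, $$c$$-$$e$$ and $$d$$-$$e$$ entries are inferred from the stated $$u$$ values. Variable names are illustrative.

```python
from itertools import combinations

labels = ["a", "b", "c", "d", "e"]
# Rows a and b are quoted in the text; c-d, c-e, d-e are inferred (assumption).
D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
n = len(labels)
# The diagonal is zero, so a plain row sum equals the sum over j != i.
u = [sum(row) / (n - 2) for row in D1]

# Q_1(i, j) = d(i, j) - u_i - u_j for every unordered pair i < j.
Q1 = {(i, j): D1[i][j] - u[i] - u[j] for i, j in combinations(range(n), 2)}

best = min(Q1, key=Q1.get)
print(labels[best[0]], labels[best[1]], round(Q1[best], 1))  # c d -56.7
```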


 * First branch length estimation

Let $$u$$ denote the new node. By the branch length equation above, the branches joining $$a$$ and $$b$$ to $$u$$ then have lengths:


 * $$\delta(a,u)=\frac{1}{2}d(a,b)+\frac{1}{2(5-2)} \left [ \sum_{k=1}^5 d(a,k) - \sum_{k=1}^5 d(b,k) \right ] \quad =\frac{5}{2} + \frac{31-34}{6} = 2$$
 * $$\delta(b,u)=d(a,b)-\delta(a,u) \quad = 5-2 = 3$$
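The two branch lengths can be reproduced with a small helper. This is only a sketch: the function name `branch_lengths` and its parameters are hypothetical, and the inputs are the values that appear in the formulas above ($$d(a,b)=5$$, row sums 31 and 34, $$n=5$$).

```python
def branch_lengths(d_fg, sum_f, sum_g, n):
    """Return (delta_f, delta_g): distances from joined nodes f and g
    to their new parent node, given d(f,g) and the two row sums."""
    delta_f = 0.5 * d_fg + (sum_f - sum_g) / (2 * (n - 2))
    delta_g = d_fg - delta_f
    return delta_f, delta_g


# Values quoted above: d(a,b) = 5, row sums 31 (a) and 34 (b), n = 5.
delta_a, delta_b = branch_lengths(5, 31, 34, 5)
print(delta_a, delta_b)  # 2.0 3.0
```

The same helper applies to later joins by passing the current matrix size as `n`.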


 * First distance matrix update

We then update the initial distance matrix $$D$$ into a new distance matrix $$D_1$$ (see below), reduced in size by one row and one column because $$a$$ and $$b$$ have been joined into the new node $$u$$. Using the distance update equation above, we compute the distance from $$u$$ to each of the remaining nodes $$c$$, $$d$$, and $$e$$:


 * $$d(u,c)=\frac{1}{2} [d(a,c)+d(b,c)-d(a,b)] = \frac{9+10-5}{2} = 7$$
 * $$d(u,d)=\frac{1}{2} [d(a,d)+d(b,d)-d(a,b)] = \frac{9+10-5}{2} = 7$$
 * $$d(u,e)=\frac{1}{2} [d(a,e)+d(b,e)-d(a,b)] = \frac{8+9-5}{2} = 6$$

The resulting distance matrix $$D_1$$ is:

Bold values in $$D_1$$ correspond to the newly calculated distances, whereas italicized values are not affected by the matrix update as they correspond to distances between elements not involved in the first joining of taxa.
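The update rule for the new node can be sketched as follows. The helper name `update_distances` is hypothetical; the inputs are the distances from $$a$$ and $$b$$ to $$c$$, $$d$$ and $$e$$ that appear in the three formulas above.

```python
def update_distances(d_f, d_g, d_fg):
    """Distance from the new node to each remaining node k:
    (d(f,k) + d(g,k) - d(f,g)) / 2."""
    return {k: (d_f[k] + d_g[k] - d_fg) / 2 for k in d_f}


# Distances from a and from b to the remaining nodes, and d(a,b) = 5.
d_a = {"c": 9, "d": 9, "e": 8}
d_b = {"c": 10, "d": 10, "e": 9}

d_u = update_distances(d_a, d_b, 5)
print(d_u)  # {'c': 7.0, 'd': 7.0, 'e': 6.0}
```

Distances between nodes that were not joined are copied over unchanged, so only the row and column for $$u$$ need to be recomputed.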

Second step

 * Second joining

The corresponding $$Q_2$$ matrix is:

We may choose either to join $$u$$ and $$c$$, or to join $$d$$ and $$e$$; both pairs have the minimal $$Q_2$$ value of $$-28$$, and either choice leads to the same result. For concreteness, let us join $$u$$ and $$c$$ and call the new node $$v$$.
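The tie can be checked with a short sketch that evaluates $$Q_2$$ for every pair, using the row-sum form $$(n-2)\,d(i,j) - \sum_k d(i,k) - \sum_k d(j,k)$$ that appears in the final step. The $$u$$ row comes from the first update, $$d(c,d)=8$$ and $$d(c,e)=7$$ appear in the second update formulas, and $$d(d,e)=3$$ is inferred from the row sums in the final step, so that last value is an assumption.

```python
from itertools import combinations

labels = ["u", "c", "d", "e"]
# d(d,e) = 3 is inferred from the row sums in the final step (assumption).
D = {
    ("u", "c"): 7, ("u", "d"): 7, ("u", "e"): 6,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}


def d(i, j):
    """Symmetric lookup; missing keys (the diagonal) count as zero."""
    return D.get((i, j), D.get((j, i), 0))


n = len(labels)
row_sum = {i: sum(d(i, k) for k in labels) for i in labels}
Q2 = {(i, j): (n - 2) * d(i, j) - row_sum[i] - row_sum[j]
      for i, j in combinations(labels, 2)}

q_min = min(Q2.values())
ties = [pair for pair, q in Q2.items() if q == q_min]
print(q_min, ties)  # -28 [('u', 'c'), ('d', 'e')]
```

An implementation has to break such ties by some rule; taking the first minimal pair in iteration order, as `min` would, corresponds to the choice of $$u$$ and $$c$$ made here.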


 * Second branch length estimation

The lengths of the branches joining $$u$$ and $$c$$ to $$v$$ can be calculated:


 * $$\delta(u,v)=\frac{1}{2}d(u,c)+\frac{1}{2(4-2)} \left [ \sum_{k=1}^4 d(u,k) - \sum_{k=1}^4 d(c,k) \right ] \quad =\frac{7}{2} + \frac{20-22}{4} = 3$$
 * $$\delta(v,c)=d(u,c)-\delta(u,v) \quad = 7-3 = 4$$

Together, the joined elements and the calculated branch lengths allow the neighbor joining tree to be drawn, as shown in the figure.


 * Second distance matrix update

The updated distance matrix $$D_2$$ for the remaining 3 nodes, $$v$$, $$d$$, and $$e$$, is now computed:


 * $$d(v,d)=\frac{1}{2} [d(u,d)+d(c,d)-d(u,c)] = \frac{7+8-7}{2} = 4$$
 * $$d(v,e)=\frac{1}{2} [d(u,e)+d(c,e)-d(u,c)] = \frac{6+7-7}{2} = 3$$

Final step
The tree topology is fully resolved at this point. However, for clarity, we can calculate the $$Q_3$$ matrix. For example:


 * $$Q_3(v,e) = (3-2)d(v,e) - \sum_{k=1}^3 d(v,k) - \sum_{k=1}^3 d(e,k) = 3-7-6 = -10$$

For concreteness, let us join $$v$$ and $$d$$ and call the last node $$w$$. The lengths of the three remaining branches can be calculated:


 * $$\delta(v,w)=\frac{1}{2}d(v,d)+\frac{1}{2(3-2)} \left [ \sum_{k=1}^3 d(v,k) - \sum_{k=1}^3 d(d,k) \right ] \quad =\frac{4}{2} + \frac{7-7}{2} = 2$$
 * $$\delta(w,d)=d(v,d)-\delta(v,w) = 4-2 = 2$$
 * $$\delta(w,e)=d(v,e)-\delta(v,w) = 3-2 = 1$$

The neighbor joining tree is now complete, as shown in the figure.
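As a cross-check of the last three branch lengths, the sketch below uses the standard three-point formula for a star tree with three leaves instead of the branch length formula applied above; for three remaining nodes the two give the same result. The function name is hypothetical, and $$d(d,e)=3$$ is inferred from the row sums in the final step rather than quoted directly.

```python
def star_branch_lengths(d_xy, d_xz, d_yz):
    """Branch lengths from leaves x, y, z to the central node of a
    three-leaf star tree, from the three pairwise distances."""
    x = (d_xy + d_xz - d_yz) / 2
    y = (d_xy + d_yz - d_xz) / 2
    z = (d_xz + d_yz - d_xy) / 2
    return x, y, z


# D_2 values: d(v,d) = 4, d(v,e) = 3; d(d,e) = 3 is inferred (assumption).
print(star_branch_lengths(4, 3, 3))  # (2.0, 2.0, 1.0)
```

The three values correspond to $$\delta(v,w)$$, $$\delta(w,d)$$ and $$\delta(w,e)$$ respectively.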

Conclusion: additive distances
This example represents an idealized case: if we move from any taxon to any other along the branches of the tree and sum the lengths of the branches traversed, the result equals the distance between those taxa in the input distance matrix. For example, going from $$d$$ to $$b$$ we have $$2+2+3+3=10$$, which is the value of $$d(b,d)$$ used in the first distance matrix update. A distance matrix whose distances agree in this way with some tree is said to be 'additive', a property which is rare in practice. Nonetheless, given an additive distance matrix as input, neighbor joining is guaranteed to find the tree whose distances between taxa agree with it.
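The additivity claim can be verified for every pair of taxa with a short sketch. The edge list uses the branch lengths computed in the steps above; the input distances are those quoted in the update formulas, with $$d(d,e)=3$$ inferred from the final-step row sums (an assumption). All names are illustrative.

```python
from itertools import combinations

# Branch lengths from the worked example (internal nodes u, v, w).
edges = [("a", "u", 2), ("b", "u", 3), ("u", "v", 3), ("c", "v", 4),
         ("v", "w", 2), ("d", "w", 2), ("e", "w", 1)]
# Input distances as quoted in the formulas; d-e = 3 is inferred (assumption).
D = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
     ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
     ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}

# Undirected adjacency list of the tree.
adj = {}
for x, y, w in edges:
    adj.setdefault(x, []).append((y, w))
    adj.setdefault(y, []).append((x, w))


def path_length(src, dst):
    """Sum of branch lengths on the unique tree path from src to dst."""
    stack = [(src, None, 0)]
    while stack:
        node, parent, total = stack.pop()
        if node == dst:
            return total
        for nxt, w in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, total + w))
    raise ValueError("nodes are not connected")


additive = all(path_length(i, j) == D[(i, j)]
               for i, j in combinations("abcde", 2))
print(path_length("d", "b"), additive)  # 10 True
```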