Talk:Neighbor joining

Comments
The last sentence of the 2nd paragraph needs to be referenced: "Even though it is sub-optimal in this sense, it has been extensively tested and usually finds a tree that is quite close to the optimal tree."

Who tested it?


 * Yeah, that's a really bad statement. The basic idea is that neighbor-joining assumes rates-across-sites. So if you generate your data under a rates-across-sites model, then you can be fairly sure that you'll get pretty good results. So neighbor-joining is only "suboptimal" in the sense that it requires the rates-across-sites assumption to do well. I'll look for a reference for this. --Wzhao553 20:10, 21 February 2006 (UTC)


 * Actually, there is a precise answer. A distance matrix is called additive if it can be specified by a tree with distances along the edges. NJ works if the given distance matrix d is 'close' to additive. The precise notion is given in one of the references of the article (why NJ works). --Jochgem (talk) 15:38, 18 April 2008 (UTC)
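To make the notion of additivity concrete: a distance matrix that is already a metric is additive exactly when it satisfies the four-point condition, i.e. for every four taxa the two largest of the three pairwise sums are equal. The following is only a minimal sketch of that check. The function name `is_additive`, the tolerance, and the 5×5 matrix are illustrative assumptions and are not taken from the article's references.

```python
from itertools import combinations

def is_additive(D, tol=1e-9):
    """Four-point condition: in every quartet the two largest pair sums are equal.

    Assumes D is already a metric (symmetric, zero diagonal, triangle inequality).
    """
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Illustrative matrix (an assumption, chosen so that it is realised by a tree).
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

# Same matrix with one entry perturbed, which breaks the condition.
D_bad = [row[:] for row in D]
D_bad[0][2] = D_bad[2][0] = 12

print(is_additive(D), is_additive(D_bad))
```

How far a matrix may deviate from additivity before NJ stops recovering the tree is the question the reference mentioned above addresses; this sketch only tests the exact case.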

In the algorithm description: "Find the pair of taxa in Q with the lowest value. Create a node on the tree that joins these two taxa (i.e. join the closest neighbors, as the algorithm name implies)." -- It is not clear to me how the joining of two taxa works. While the method is probably specific to the given data, at least the general criteria for the joined object should be mentioned. 91.89.65.97 (talk) 09:13, 29 July 2008 (UTC)

Time and space complexity of this algorithm? --TobiasBengtsson (talk) 20:47, 17 November 2008 (UTC)
 * I'm currently trying to improve the article, and I added what a naive implementation would result in. Probably the faster methods out there give more details in their papers, but I haven't looked at them yet 134.58.42.46 (talk) 08:58, 16 March 2011 (UTC)

The statement "In the example above, this formula would give a distance of 6 between A and the new node and a distance of 1 between B and the new node." appears to be wrong. For the distance between A and the new node, I get 7/2 + 17/4. Tedtoal (talk) 01:20, 27 August 2010 (UTC)
 * My implementation gets the "right" values (6 and 1), maybe I should finish the example originally posted by "???" 134.58.42.46 (talk) 08:58, 16 March 2011 (UTC)
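For anyone checking the 6 and 1 by hand, here is a small sketch. It assumes the article uses the standard branch-length formula, delta(f,u) = d(f,g)/2 + (sum_k d(f,k) - sum_k d(g,k)) / (2(r-2)), and takes d(A,B) = 7 and the row sums 32 (for A) and 22 (for B) that are quoted elsewhere on this page. The function name `branch_lengths` is hypothetical.

```python
def branch_lengths(d_fg, sum_f, sum_g, r):
    """Return (delta_f, delta_g): distances from joined taxa f and g to the new node."""
    delta_f = d_fg / 2 + (sum_f - sum_g) / (2 * (r - 2))
    delta_g = d_fg - delta_f
    return delta_f, delta_g

# d(A,B) = 7, row sums 32 (A) and 22 (B), r = 4 taxa.
print(branch_lengths(7, 32, 22, 4))  # (6.0, 1.0)
```

That is 7/2 + 10/4 = 6 for A, and 7 - 6 = 1 for B.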

I think there is another problem with the example. When I compute the value Q(A,B) with the given formula for the Q-matrix, I get -26 and not -40 as stated in the example's Q-matrix. The same is true for the other values in the Q-matrix. It appears that to compute the values in the example, the first term of the formula for Q(i,j), namely (r-2)d(i,j), is omitted. As I am just a layman in this field, I don't know which is correct (I guess it is the formula), but either the formula or the computed example is wrong. --85.91.175.222 (talk) 08:38, 4 October 2011 (UTC)


 * In the sums in the formula for Q(i,j) the index k runs over all values including i and j, so for Q(A,B), the sums are (7 + 11 + 14) = 32, and (7 + 6 + 9) = 22. These get subtracted from 2 d(A,B) = 14; Q(A,B) = 14 - (32 + 22) = -40.  — Preceding unsigned comment added by 24.59.127.24 (talk) 04:30, 9 February 2012 (UTC)


 * That is different from what it says in the description of the Q matrix. It says "k is any other node not i or j". Based on the description, I got Q(1,2) = -26 as well. So the description is either unclear or the example does not follow it. Captkrob (talk) 21:01, 3 January 2014 (UTC)


 * Ditto this. This is still a problem, but it is unclear what the solution should be. A reference text, such as Biological Sequence Analysis (Durbin, et al.) gives a slightly different formula altogether. — Preceding unsigned comment added by 131.111.184.92 (talk) 16:56, 25 January 2014 (UTC)


 * The solution is to remove "k is any other node not i or j respectively"; k should take on all values from 1 to r. I think the "respectively" was meant to mean that k=i is excluded from the first sum, and k=j is excluded from the second sum, which would make it right, but since d(i,i) = 0, it doesn't matter whether we exclude these or not, so it's simpler just to let the sum run over all k=1 to r. I have fixed it. tom fisher-york (talk) 17:43, 26 March 2014 (UTC)
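The two readings discussed in this thread can be compared directly. The sketch below uses only the distances quoted above (the rows for A and B, with r = 4); the names `dist`, `q` and the flag `exclude_pair` are illustrative, not anything from the article.

```python
# Distances quoted in this thread (rows for A and B only); r = 4 taxa.
d = {
    ("A", "B"): 7, ("A", "C"): 11, ("A", "D"): 14,
    ("B", "C"): 6, ("B", "D"): 9,
}
taxa = ["A", "B", "C", "D"]

def dist(x, y):
    """Symmetric lookup with d(x, x) = 0."""
    if x == y:
        return 0
    return d[(x, y)] if (x, y) in d else d[(y, x)]

def q(i, j, exclude_pair=False):
    """Q(i,j) = (r-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k).

    With exclude_pair=True, k skips both i and j (the misreading of the old wording).
    """
    r = len(taxa)
    ks = [k for k in taxa if not (exclude_pair and k in (i, j))]
    return (r - 2) * dist(i, j) - sum(dist(i, k) for k in ks) - sum(dist(j, k) for k in ks)

print(q("A", "B"))                     # -40, matches the example's Q-matrix
print(q("A", "B", exclude_pair=True))  # -26, the value obtained from the old wording
```

Letting k run over all r taxa reproduces the -40 in the example; dropping both i and j from each sum gives the -26 reported above.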

Is it correct to say NJ is for construction of "phenetic trees"? Phenetics seems to be grouping based on similarity, rather than on (inferred) evolutionary relationships. I suggest saying "phylogenetic trees" rather than phenetic. Objections? — Preceding unsigned comment added by Tomfy (talk • contribs) 16:03, 16 February 2012 (UTC)
 * Saitou and Nei's original NJ paper's title is "The Neighbor-joining method: A New Method for Reconstructing Phylogenetic Trees." So I think it is fair to replace 'phenetic' with 'phylogenetic' here (which I have done). tom fisher-york (talk) 16:53, 6 April 2014 (UTC)

Is the time complexity of the algorithm really O(n^3)? It takes n-3 iterations to complete a tree with n taxa, and there are n^2 elements in the Q matrix. But for each element in the Q matrix, one has to cycle through all elements k=1...n such that k is not equal to i or j. This adds another factor of n-1. So wouldn't the complexity be O(n^4)? Calculating the matrix takes n^3 steps, and this has to be done n times, roughly speaking. 71.166.127.224 (talk) 21:25, 21 March 2014 (UTC)


 * I believe the answer is that the sums from k=1 to r which appear in the formula for Q(i,j) do not have to be evaluated from scratch with each iteration. When two leaves are joined, two terms are lost from each of these sums, and are replaced by a new term (the distance to the newly created node). tom fisher-york (talk) 17:07, 26 March 2014 (UTC)
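The incremental update described in the reply above might look roughly like the following sketch. It assumes the usual NJ rule d(u,k) = (d(f,k) + d(g,k) - d(f,g)) / 2 for the distance from the new node u to each remaining node k. The function name `join` and the 5×5 matrix are illustrative assumptions, not data from this thread.

```python
def join(D, S, f, g):
    """Join nodes f and g into a new node u (placed last).

    D is a symmetric distance matrix (list of lists) and S its row sums.
    Returns (new_D, new_S) where new_S is updated incrementally.
    """
    n = len(D)
    keep = [k for k in range(n) if k not in (f, g)]
    # Distance from each remaining node k to the new node u.
    du = [(D[f][k] + D[g][k] - D[f][g]) / 2 for k in keep]
    # Each old row sum loses two terms and gains one: O(1) work per node.
    new_S = [S[k] - D[k][f] - D[k][g] + du_k for k, du_k in zip(keep, du)]
    new_S.append(sum(du))  # row sum of the new node: O(n), once per iteration
    # Reduced matrix: old rows restricted to kept nodes, plus the row/column for u.
    new_D = [[D[a][b] for b in keep] + [du[i]] for i, a in enumerate(keep)]
    new_D.append(du + [0.0])
    return new_D, new_S

# Illustrative matrix (an assumption, not taken from this thread).
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
S = [sum(row) for row in D]

new_D, new_S = join(D, S, 0, 1)
# The incrementally updated sums agree with sums recomputed from scratch.
print(new_S, [sum(row) for row in new_D])
```

With the row sums maintained this way, each entry of Q costs O(1), so one iteration costs O(n^2) and the n-3 iterations give O(n^3) overall rather than O(n^4).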

Weird, possibly racist figure
The figure at the right seems false, and quite unsettling to me. JPLeRouzic (talk) 10:58, 11 December 2016 (UTC)
 * It presents the "Caucasoid" in a central place, and "Negroid" at a larger distance from "Caucasoid" than other groups. I think that we now know that Homo sapiens has its origin in Africa. Genome analysis has shown that European and Asian groups are derived from African Homo sapiens, possibly with incorporation of genes from other human species such as Neanderthals. So "Caucasoid" (whatever it means) should not be in the center of the figure.
 * We also know that Bantu are genetically closer to Europeans than they are to San or Khoi (the correct name for the people that inhabited Sub-Saharan Africa before the Bantu migration).
 * I have some trouble understanding the distinction between Nigerian and Bantu, as Bantu is a human group with its origins in Southern Nigeria and Cameroon.
 * And finally, even if English is not my native tongue, I do know that words like "negroid" and "Bushmen" are derogatory.

Agreed regarding the racist implications of the figure. From the perspective of Saitou Naruya, this is a quite-possibly-innocent example of what scientists call "hypercontextualization", a term for when facts and figures alone have a misleading conclusion when divorced from their broader context. The Neighbor-joining method inherently creates an unrooted tree, and the figure conforms unremarkably to the usual default way that scientists represent an unrooted tree: by putting the genetic center of gravity in the center of the figure. The tree can't help but recapitulate the literal geography of human migrations, and since the trans-caucasian region between Europe, Arabia, the Pontic Steppe, and India is geographically in between East Asia, Malesia, Oceania, and the Americas, the human genes here are indeed shown to represent that. However, other scientific facts tell us that the genetic root of the tree is not located at the geographic center of gravity of the populations represented in that figure, and as encyclopedists, we must never let minor details give a misleading impression of a broader picture; thus, that figure is too hypercontextualized to be appropriate for this general encyclopedic context. I'm thus going to remove it, with a note in the edit log that anyone reverting or re-adding the figure ought to name their reasons for doing so here. Signed, TheMigratoryPlatypus 67.3.121.25 (talk) 22:02, 28 February 2018 (UTC)