Talk:Euclidean minimum spanning tree/GA1

GA Review
The edit link for this section can be used to add comments to the review.''

Reviewer: Ovinus (talk · contribs) 06:50, 24 August 2022 (UTC)

I'll grab this one. Looks excellent on first glance. A quick note: I'm moving soon and may not respond in a timely manner; sorry in advance. Ovinus (talk) 06:50, 24 August 2022 (UTC)
 * I'm traveling the next few days and so may also not respond in a timely manner. —David Eppstein (talk) 07:30, 24 August 2022 (UTC)

Good stuff. Since this is a long (and important) one I'll probably do another read through when my mind is fresh. Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Publications on this problem commonly abbreviate it as "EMST" What problem, though? Ovinus (talk) 07:04, 24 August 2022 (UTC)
 * Replaced "this problem" by "the Euclidean minimum spanning tree" —David Eppstein (talk) 07:27, 24 August 2022 (UTC)
 * must again start and end at the given points Probably "still must", since this is in contrast to the Steiner tree
 * Expanded on like MST and unlike Steiner —David Eppstein (talk) 07:27, 24 August 2022 (UTC)
 * No mention of the triangle inequality here. What about the generalized problem of finding the MST of a weighted complete graph satisfying the triangle inequality? The metric TSP is materially different in difficulty (at least to approximate) from the non-metric TSP, right? Or is this too broad a generalization to warrant a mention Ovinus (talk) 07:04, 24 August 2022 (UTC)
 * Triangle inequality doesn't make complete-graph MST any easier than the general problem, because you can obtain the triangle inequality without changing the tree by adding your favorite big number to all of the weights. Anyway, that's a topic for rather than this article. I think other geometries are interesting (there's less published about hyperbolic-geometry MST than there should be) but again that's somewhat off-topic. —David Eppstein (talk) 07:27, 24 August 2022 (UTC)
 * Aha, that's clever. Well, is there enough to warrant a sentence about finding MSTs in other metrics? You're already mentioning related problems/variations; this isn't that far out and probably can't be covered in a standalone article. Ovinus (talk) 07:34, 24 August 2022 (UTC)
 * The only thing I think that's relevant enough and sourceable is that the approach of computing a Delaunay triangulation to get $$O(n\log n)$$ time also works in the hyperbolic plane (actually, using the Euclidean Delaunay triangulation of a Poincaré model of the hyperbolic plane). Which is sort of interesting, but kind of folklore. The earliest reference I can find is some talk slides from a survey talk I gave in 2003, I'm uncomfortable using that as a reference for multiple reasons (it's not really a publication, it's by me, and I'm not certain I deserve credit for discovering it) but I'm even less comfortable with later sources (the clearest I can find is https://arxiv.org/abs/2112.02553 but it's also not a reliably published source and definitely doesn't deserve credit for this observation). —David Eppstein (talk) 03:06, 27 August 2022 (UTC)
 * How unfortunate. The current list is good, then. Cool, that this has been an area of research for you! Ovinus (talk) 07:23, 27 August 2022 (UTC)
 * I'd briefly mention, in Angles and vertex degrees, that the kissing number in dimension n is unknown for most n. An uninformed reader might think that this characteristic of EMSTs will be known for all dimensions. Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Ok, added a listing of the few dimensions for which it is known. —David Eppstein (talk) 03:06, 27 August 2022 (UTC)
 * "their precise values are not known" Have there been any Monte Carlo–type estimates? Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * The original reference has some empirical data, but without any error or convergence rate analysis that could be used to justify calling it an estimate. They are pretty cautious in how they interpret it. —David Eppstein (talk) 03:13, 27 August 2022 (UTC)
 * "This contradiction shows that w cannot exist. It follows that any other region of the plane within this empty lens must also be empty" This seems fairly evident from what was just explained, because you say "cannot" and eventually "contradicted". Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * I'm not sure what is being asked here. —David Eppstein (talk) 03:17, 27 August 2022 (UTC)
 * I'm just saying you can remove what I quoted.
 * Ok, rephrased to not be a proof by contradiction, so that it is no longer necessary to complete the proof by pointing to the contradiction and saying that its existence demonstrates the falsity of the thing that was assumed true at the start of the proof by contradiction. —David Eppstein (talk) 19:49, 27 August 2022 (UTC)
 * "defined from a system of points using empty regions" wdym by "using empty regions" Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * I mean "using empty regions in ways that are detailed individually in each of the bullet points below". —David Eppstein (talk) 03:17, 27 August 2022 (UTC)
 * Yes but it is confusing to the uninitiated. I'd like something like "using empty regions in their construction" Ovinus (talk) 07:20, 27 August 2022 (UTC)
 * My feeling is that "certain X defined from Y using Z must contain all A of all B" is already grammatically overcomplicated. Expanding it to "certain X defined from Y using Z in their definition must contain all A of all B" is both more complicated and redundant. On the other hand, I don't want the introductory sentence to go into detail about what it means for a geometric graph to be defined by empty regions because the definitions are different. For some (RNG or Gabriel) there is a single region for each pair of points that, when empty, causes an edge to exist between them. For Delaunay there are multiple regions any one of which, when empty, causes an edge to exist. For Urquhart, we have a combination of the Delaunay condition (on all pairs of points) and the RNG condition (limited to the third points of Delaunay triangles). —David Eppstein (talk) 19:49, 27 August 2022 (UTC)
 * Your rewording is a bit better, I think. Now the issue is that a set of points' "empty regions" is rather handwavy because it's not clear that they are rigorously constructed in each case. How about: "Certain geometric graphs defined from points, and empty regions constructed from those points, must contain all Euclidean minimum spanning trees as subgraphs." (emphasis mine) That avoids an extra quantifier. Ovinus (talk) 20:49, 27 August 2022 (UTC)
 * I don't understand your issue. We have an introductory sentence saying that certain geometric graphs related to EMSTs by having definitions based on empty regions. Why would that sentence cause one to worry that maybe these graphs' definitions are unrigorous? Would your objection still apply to a sentence that replaced "defined" by "rigorously defined" or "fully rigorously defined" or maybe "defined in full mathematical rigor"? What extra information would these adverbs convey to the reader? —David Eppstein (talk) 21:12, 27 August 2022 (UTC)
 * As fully rigorous, precise, exact, pedantic as possible! I think I see the problem now; when parsing the sentence it says “graphs defined from points and their empty regions” and it sounds like each POINT has some number of empty regions, which is absurd. It should be “graphs defined from a set of points and its empty regions”.
 * What you have now is a bit better, but I still think saying "all EMSTs are subgraphs" makes more sense than "all edges of all ...". Ovinus (talk) 03:46, 28 August 2022 (UTC)
 * We do say they are subgraphs after the bullet points. But these inclusions are proved one edge at a time, so I think it's important to start by focusing on the edges. Also, this makes it easier to make it clear that they do not merely contain one EMST as a subgraph; they contain all EMSTS, for point sets that have more than one. —David Eppstein (talk) 17:51, 29 August 2022 (UTC)
 * Hm. The sentence is rather complicated... let's agree to disagree. Ovinus (talk) 03:41, 31 August 2022 (UTC)
 * Reorder the graphs in Supergraphs to enjoy the inclusion chain order Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Not possible because the Urquhart graph cannot be described before the Delaunay triangulation. —David Eppstein (talk) 03:17, 27 August 2022 (UTC)
 * Sad! Ovinus (talk) 07:20, 27 August 2022 (UTC)
 * Is "EMST" too informal to be used throughout the article? Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Informal is the wrong word. It's not too informal to be used in many published papers, where the standard of writing is also very formal. What it is, that is more problematic for Wikipedia than elsewhere, is jargony. But the alternative is repeating "Euclidean minimum spanning tree Euclidean minimum spanning tree Euclidean minimum spanning tree" all over the article, and that's a long enough phrase to get kind of tedious. —David Eppstein (talk) 03:20, 27 August 2022 (UTC)
 * Indeed, which I why I thought you could use it more liberally in the article. It's never used besides its mention as an acronym. Ovinus (talk) 07:20, 27 August 2022 (UTC)
 * I'm not a big fan of the overuse of jargony acronyms in academic publication. I tried compromising be removing many instances of the word "Euclidean", and adding a note in the definitions section about removing that word when the context is obvious (sourced to several of the existing sources that do that). —David Eppstein (talk) 03:18, 28 August 2022 (UTC)
 * "or any other shape of constant diameter" Hey, throwback. But a square isn't such a shape, is it? Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * There is no notion of directional diameter. The diameter of a shape is just a number. "Constant diameter" means that the diameter is a fixed number, like 1, rather than something that grows larger as a function of the number of points. —David Eppstein (talk) 03:23, 27 August 2022 (UTC)
 * I've adjusted to "fixed shape". Ovinus (talk) 07:20, 27 August 2022 (UTC)
 * "Another way of interpreting these results is that the average edge length for points in a unit square is O(1/sqrt(n)), proportional to the spacing of points in a regular grid, and that for random points in a unit square the average length is proportional to 1/sqrt(n)." Isn't this saying the same thing twice Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Not quite. The worst case bound is only an upper bound. The random point bound is that the average length is of the form (c+o(1)/sqrt(n), giving both an upper and lower bound with the same constant factor. —David Eppstein (talk) 12:16, 27 August 2022 (UTC)
 * I am confused. Where does it say worst-case bound? Ovinus (talk) 17:23, 27 August 2022 (UTC)
 * The average length for points in a unit square is always O(1/sqrt n). There is no qualifier for the points. It applies universally to all point sets. In this sense, it is a worst-case bound. In contrast, the "proportional to" part is qualified: it is "for random points". —David Eppstein (talk) 17:43, 27 August 2022 (UTC)
 * Ohhhhhhhhhh okay. I tried to make that clear. Ovinus (talk) 17:50, 27 August 2022 (UTC)
 * "That is, with high probability, some edges of length proportional ..." Is this restatement necessary? I find it more difficult to parse than the original, mainly bc of the "required" part, making it sound like you're talking about graphs in general (that is, any graph spanning the points will have (with high probability) an edge of length O(sqrt(log n / n))). Or—if that is indeed the case—that should be made clear in the original statement. Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Removed restatement and copyedited the following sentence of clustering. —David Eppstein (talk) 19:54, 27 August 2022 (UTC)
 * "The constant of proportionality for this longest edge length, and its distribution around its expected value, are also known" Perhaps the distribution is too much info, but the constant seems like succulent info. Is it algebraic? No closed form? (If the latter I'd say "although there is no closed form") Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Ok, done. The constant is $$\sqrt\pi$$. —David Eppstein (talk) 20:10, 27 August 2022 (UTC)
 * 😍 Ovinus (talk) 20:49, 27 August 2022 (UTC)
 * Link "which can be done" to Delaunay triangulation? Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * I'm not happy with the state of that section. The only part where it clearly states that deterministic n log n is possible is the divide & conquer section, but then it immediately and confusingly talks about getting expected time n log n without hurting worst case performance, which makes no sense for an algorithm that is already deterministic n log n. It doesn't mention using Fortune's algorithm and then dualizing. It doesn't state that the incremental method can be derandomized. It doesn't mention Shewchuk's fast practical implementation of many of these methods. I'm not convinced that it's a useful enough resource for this topic to link in this way. —David Eppstein (talk) 20:18, 27 August 2022 (UTC)
 * Hm, sounds good; perhaps it can be linked once that section is fleshed out. Ovinus (talk) 20:49, 27 August 2022 (UTC)
 * "the version of the traveling salesman problem on a set of points in the plane with edges labelled by their length" Necessary clarification? If someone knows what TSP is they'll know what the Euclidean TSP is Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Ok, the gloss isn't enough different from the term being glossed to be useful. Changed to a different gloss involving shortest polygonalization. —David Eppstein (talk) 20:53, 27 August 2022 (UTC)
 * "some trees may require a bounding box of exponential size" This is cool. So also the edges may be exponential in size? I would note this Ovinus (talk) 18:15, 24 August 2022 (UTC)
 * Yes, that's more or less equivalent. —David Eppstein (talk) 20:53, 27 August 2022 (UTC)

Second pass
Continuing in a couple days. Ovinus (talk) 03:41, 31 August 2022 (UTC)
 * "total length of the selected segments" pleonasm
 * Removed. —David Eppstein (talk) 07:05, 31 August 2022 (UTC)
 * "is controlled by the kissing number" how about bounded?
 * Better, replaced. —David Eppstein (talk) 07:05, 31 August 2022 (UTC)
 * "For points in higher dimensions, finding an optimal algorithm remains an open problem." if it's open in higher dimensions, we ought to state that it's proven optimal in 2D.
 * Added a forward pointer in the 2d algorithm section and more explanation in the lower bounds section (where it was already stated, but maybe not clearly enough). —David Eppstein (talk) 07:05, 31 August 2022 (UTC)
 * Ah, I mean in the lead. For example, it could say "For points in dimensions greater than 2". Ovinus (talk) 07:23, 31 August 2022 (UTC)
 * Ok, no rush. —David Eppstein (talk) 07:05, 31 August 2022 (UTC)

To clarify what I meant in my last comment, I think the lead should say that O(n log n) is optimal for 2D, if it later says that the optimal algorithm for d >= 3 is unknown. Also noting that I haven't forgotten the review. Ovinus (talk) 04:05, 8 September 2022 (UTC)


 * Presumably "(if that degree is possible)" is unnecessary, since if the degree is impossible then it converges—quite rapidly—to the constant zero times the number of vertices. Ovinus (talk) 07:37, 11 September 2022 (UTC)
 * I suppose; removed. It's not $$\Theta(n)$$ but I guess with the less formal language we're using here it's accurate enough. —David Eppstein (talk) 00:37, 12 September 2022 (UTC)
 * Is the \, in the pi * n denominator on purpose? Looks a bit gappy to me. Ovinus (talk) 07:37, 11 September 2022 (UTC)
 * It was, but I've removed it. —David Eppstein (talk) 00:38, 12 September 2022 (UTC)
 * "than the average by a non-constant factor" Isn't "by a non-constant factor" redundant? Ovinus (talk) 07:37, 11 September 2022 (UTC)
 * No. Two times the average would not be a non-constant factor. More technically using little omega notation the intended meaning of this phrase is $$\omega(n)$$. —David Eppstein (talk) 00:39, 12 September 2022 (UTC)
 * OH I didn't see the comma after the equation. Ovinus (talk) 03:11, 12 September 2022 (UTC)
 * "Conversely, for any red–blue coloring of any subset of a given set of points, the bichromatic closest pair produces one edge of the minimum spanning tree" Clarify that it's the MST of the subset, not the full set Ovinus (talk) 07:37, 11 September 2022 (UTC)
 * Yes, done. —David Eppstein (talk) 07:59, 11 September 2022 (UTC)
 * "for random inputs or inputs whose distances and clustering resemble those of random data" seems a bit pedantic over "random inputs" or "unrealistic inputs". In this context when I hear "random inputs" I know it doesn't mean some very strong form of randomness Ovinus (talk) 07:37, 11 September 2022 (UTC)
 * I really did intend random (Poisson point process or a given number of uniformly random samples, they're both more or less the same). I'm pretty sure two of the three sources (the ones in experimental algorithms venues) were similarly careful, but the SIGKDD paper appears sloppier, justifying "resemble random". —David Eppstein (talk) 08:03, 11 September 2022 (UTC)
 * Okay. It's just a bit verbose and I don't see any benefit over just "resembling random", but no big deal. Ovinus (talk) 16:36, 11 September 2022 (UTC)
 * "also requires this much time" well, probably should say "at least this much time" Ovinus (talk) 07:37, 11 September 2022 (UTC)
 * Isn't this qualification at least as redundant as the "if that degree is possible" one above? If it requires even more time, it is also true that it requires that much time. —David Eppstein (talk) 00:40, 12 September 2022 (UTC)
 * Lol, fair enough. Ovinus (talk) 03:11, 12 September 2022 (UTC)

Quite happy with the article. Illustrations are pleasing (of course a 3D example would be cool—might try my hand at one). Will pass once we discuss the remaining points. Ovinus (talk) 07:37, 11 September 2022 (UTC)
 * Cool. Passing; nice work! Ovinus (talk) 03:11, 12 September 2022 (UTC)