Wikipedia:Reference desk/Archives/Miscellaneous/2020 January 16

= January 16 =

"Disambiguation chains" on Wikipedia
This is related to the "link chains" where people claim that any two WP articles are at most six links apart, or where you usually end up at the same article if you click the first link of an article, then the first link of that article, and so on. So it's an entertainment question, a humanities question, sort of a graph theory question, and a (very tangential) question about Wikipedia at the same time. Part of the question is where it actually belongs; posting it in all places at the same time would look spammy.

I'm not using any links, but the disambiguation syntax of WP article titles, which makes it a language question as well. I start with a title which comes with a disambiguation phrase, say, Hamlet (settlement), and the rule I follow is that the next article title must start with "Settlement (", i.e. it must be one of the articles about settlements with a disambiguator attached. For example, I could pick Settlement (finance). The next article could be Finance (Bhutanese football club), but it's unlikely that there's an article starting with "Bhutanese football club", much less one with a disambiguator. Finance (game) looks more promising.

My questions are whether this has been tried before, and how long such chains can get. Without a lot of searching (just picking phrases which looked promising), I got the following, which uses some redirects:

Hamlet (settlement) → Settlement (trust) → Trust (business) → Business (song) → Song (album) → Album (magazine) → Magazine (band) → Band (channel) → Channel (broadcasting) → Broadcasting (networking) → Networking (disambiguation) → Disambiguation (disambiguation),

all in all 12 articles, including the fixed point at the end. Exhaustive search looks too inefficient, but if one wanted to attack the problem more seriously, one could first cull all articles without disambiguators (like Finance) from a list, and then all articles whose disambiguators don't lead to new articles (like Finance (Bhutanese football club)). The resulting list should be much shorter... but is it short enough? 84.148.240.150 (talk) 13:27, 16 January 2020 (UTC)
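
The rule and the culling idea above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the title list is copied from this post (not checked against a live title dump), the helper names `split_title`, `build_graph`, `dead_ends` and `longest_chain` are made up for the example, and only the first letter of a title is treated as case-insensitive. The search is a plain depth-first walk over simple paths, which is exponential in the worst case, so it says nothing about whether the full title list is "short enough".

```python
import re
from collections import defaultdict

# "Base (disambiguator)"; the last parenthesised group is the disambiguator.
TITLE_RE = re.compile(r"^(?P<base>.+?) \((?P<tag>[^()]+)\)$")

def split_title(title):
    """Return (base, disambiguator), or None if there is no disambiguator."""
    m = TITLE_RE.match(title)
    return (m.group("base"), m.group("tag")) if m else None

def key(text):
    # Assumption: only the first character of a title is case-insensitive.
    return text[:1].upper() + text[1:]

def build_graph(titles):
    """Map each disambiguated title to the titles that may follow it.

    Titles without a disambiguator (like "Finance") are dropped here,
    which is the first culling step described above.
    """
    parsed, by_base = {}, defaultdict(list)
    for title in titles:
        parts = split_title(title)
        if parts:
            parsed[title] = parts
            by_base[key(parts[0])].append(title)
    # A title never follows itself, so a fixed point ends the chain.
    return {t: [n for n in by_base.get(key(tag), []) if n != t]
            for t, (_base, tag) in parsed.items()}

def dead_ends(graph):
    """Titles whose disambiguator leads to no further article."""
    return {t for t, nxt in graph.items() if not nxt}

def longest_chain(graph, start):
    """Longest chain from `start` that never repeats a title (brute force)."""
    best = []
    def walk(node, path, seen):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                path.append(nxt)
                walk(nxt, path, seen)
                path.pop()
                seen.remove(nxt)
    walk(start, [start], {start})
    return best

# Toy title list taken from this post, not from a real dump.
TITLES = [
    "Finance", "Finance (game)", "Finance (Bhutanese football club)",
    "Hamlet (settlement)", "Settlement (finance)", "Settlement (trust)",
    "Trust (business)", "Business (song)", "Song (album)",
    "Album (magazine)", "Magazine (band)", "Band (channel)",
    "Channel (broadcasting)", "Broadcasting (networking)",
    "Networking (disambiguation)", "Disambiguation (disambiguation)",
]

graph = build_graph(TITLES)
chain = longest_chain(graph, "Hamlet (settlement)")
print(len(chain), " → ".join(chain))
```

On the real title list one would presumably need something smarter than brute force, since popular disambiguators like (song) or (album) fan out into thousands of candidates.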


 * 1) I haven't heard of anyone attempting that; 2) I seriously doubt the resulting graph is connected--the "six links" claim is obviously false since there are articles with no outgoing links, so the article link graph is not even connected, much less with such a small diameter; 3) You can pull dumps of Wikipedia content and/or metadata from dumps.wikimedia.org if you want to play with this stuff offline. 2601:648:8202:96B0:0:0:0:DF95 (talk) 21:13, 18 January 2020 (UTC)


 * Re 2) IIRC, it's allowed to traverse links "backwards", e.g. by opening the "What links here" page of an article and using its links; the only restriction is that you only use the article proper. Clicking the Wikipedia logo on both endpoints would be a trivial way to get a 2-link connection. Another sensible restriction would be "null-space only", e.g. links to the "Wikipedia:" namespace like Reference desk → Reference desk would be outlawed.


 * At any rate, there could be "islands" of a few, possibly even single, articles. In that case, one would have to weaken the claim to the biggest "continent" of articles, which should include almost all "good" and featured articles. 84.148.240.150 (talk) 10:47, 20 January 2020 (UTC)


 * Re 3) Those look very useful, both to search for long chains and to set a "fixed" target (e.g. articles at a certain date). Good job using 7zip instead of zip, btw.
 * To have a truly fixed target, one would have to define some border cases, too, e.g. if it's possible to re-use popular disambiguators like (song). Thanks! 84.148.240.150 (talk) 10:47, 20 January 2020 (UTC)
 * 7z is only used for the meta history dumps that contain the complete content of every page in the wiki. Uncompressed it is some tens of terabytes so I'd only download it if you plan to do something with all that data.  For the other dumps, 7z isn't used because it is much slower than bz2 without beating the compression ratio by much.  7z is basically the same as xz fwiw.  I fooled around with zstd for these dumps a little, but it doesn't seem to be a significant win. 67.164.113.165 (talk) 01:07, 21 January 2020 (UTC)
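
The "islands" and "continent" picture from earlier in this thread can be made concrete once a link table has been extracted from a dump. The following is a minimal sketch, not a recipe for the real data: the `links` dictionary is an invented toy example, `weak_components` is a hypothetical helper name, and links are treated as traversable in both directions (as with "What links here"), so the groups found are weakly connected components.

```python
from collections import defaultdict, deque

def weak_components(links):
    """Group pages into components, ignoring link direction.

    `links` maps a page title to the list of pages it links to.
    """
    neighbours = defaultdict(set)
    for src, targets in links.items():
        neighbours[src]  # make sure pages without any links still appear
        for dst in targets:
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    neighbours = dict(neighbours)

    seen, components = set(), []
    for page in neighbours:
        if page in seen:
            continue
        # Breadth-first search collects everything reachable from `page`.
        seen.add(page)
        queue, component = deque([page]), {page}
        while queue:
            for other in neighbours[queue.popleft()]:
                if other not in seen:
                    seen.add(other)
                    component.add(other)
                    queue.append(other)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Invented toy data: one "continent" and two "islands".
links = {
    "Hamlet": ["Village", "Town"],
    "Village": ["Town"],
    "Town": [],
    "Islet A": ["Islet B"],
    "Orphan": [],
}

components = weak_components(links)
continent = components[0]
print(len(components), "components; largest has", len(continent), "pages")
```

The "six links" claim would then only be checked inside `continent`, and with one-way links only, a further step (strongly connected components) would be needed.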