User:Ilmari Karonen/First link

An internet meme, originally discovered by a Reddit.com user named pixelcrak on April 13, 2011 and further popularized by xkcd, says that if you go to a random article on Wikipedia and keep clicking the first non-parenthesized link in the body text of each article you reach, you'll always eventually end up at Philosophy.

But is this really true? Or, rather, what is the probability of that actually happening? To test this, I downloaded the English Wikipedia database dump from 26 May 2011 and wrote a Perl script to extract the first link from each page, skipping any templates, comments, image captions, categories, interlanguage links, self-links, hatnotes and links inside parentheses. (At least, I thought I skipped image captions... I'm not sure why my script still thinks United States Census links to United States Census Bureau. It does seem to have skipped most captions, at least.) I also decided to skip any links to disambiguation pages (more precisely, to any page whose title ends in "(disambiguation)") to avoid problems with unusually formatted hatnotes being mistaken for body text.
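The skipping rules above can be illustrated with a minimal sketch. It is written in Python and is not the Perl script actually used; the names `first_link`, `_link_end` and `SKIPPED_PREFIXES` are hypothetical. It assumes raw wikitext as input, handles only comments, templates, parenthesized text, file/image/category links, self-links and "(disambiguation)" targets, and makes no attempt at interlanguage links or hatnotes written as plain markup.

```python
import re

# Link prefixes treated as non-article links (assumption: lowercase match).
SKIPPED_PREFIXES = ("file:", "image:", "category:")


def _link_end(text, start):
    """Return the index just past the ]] closing the [[ at start, or None."""
    depth = 0
    i = start
    while i < len(text):
        two = text[i:i + 2]
        if two == "[[":
            depth += 1
            i += 2
        elif two == "]]":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return None  # unclosed link


def first_link(wikitext, title=None):
    """Return the target of the first qualifying [[link]], or None."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)
    brace_depth = 0  # nesting level of {{templates}}
    paren_depth = 0  # nesting level of (parentheses) in running text
    i = 0
    while i < len(text):
        two = text[i:i + 2]
        if two == "{{":
            brace_depth += 1
            i += 2
        elif two == "}}" and brace_depth > 0:
            brace_depth -= 1
            i += 2
        elif brace_depth > 0:
            i += 1  # everything inside a template is ignored
        elif two == "[[":
            end = _link_end(text, i)
            if end is None:
                i += 2
                continue
            inner = text[i + 2:end - 2]
            i = end  # jump past the whole link, nested captions included
            target = inner.split("|", 1)[0].split("#", 1)[0]
            target = target.replace("_", " ").strip()
            if paren_depth > 0 or not target:
                continue
            target = target[0].upper() + target[1:]
            lowered = target.lower()
            if lowered.startswith(SKIPPED_PREFIXES):
                continue
            if lowered.endswith("(disambiguation)"):
                continue
            if title is not None and target == title:
                continue  # plain self-link
            return target
        elif text[i] == "(":
            paren_depth += 1
            i += 1
        elif text[i] == ")" and paren_depth > 0:
            paren_depth -= 1
            i += 1
        else:
            i += 1
    return None
```

Because a whole `[[...]]` span is consumed at once, parentheses inside a link target such as `[[Mercury (planet)]]` do not affect the parenthesis counter. Templates that have been substituted into the page text look like ordinary body text to a scanner like this one.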

Having extracted the links (which took a while), I computed the limit cycles of the resulting iterated map and their basin sizes, i.e. the number of articles (excluding redirects) from which the iteration converged to each cycle. The results are below.
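The cycle and basin computation can be sketched as follows, in Python rather than the Perl actually used. The function name `basins`, the label tuples and the `redirects` parameter are hypothetical. The sketch assumes the extracted links fit in a dictionary mapping each page title to the title of its first link (or `None` when no valid link was found), and it lumps all red-link endings together and all no-link endings together.

```python
from collections import Counter


def basins(first_link, redirects=frozenset()):
    """Count, for each cycle or dead end, the pages whose path ends there.

    first_link: dict mapping page title -> first-link title, or None.
    redirects:  titles excluded from the counts (they may still lie on paths).
    """
    attractor = {}  # page -> label of the cycle or dead end it reaches
    for start in first_link:
        path = []     # pages visited on this walk, in order
        on_path = {}  # page -> its position in path
        node = start
        while (node in first_link and node not in attractor
               and node not in on_path):
            on_path[node] = len(path)
            path.append(node)
            node = first_link[node]
        if node in attractor:
            label = attractor[node]  # joined an already classified path
        elif node in on_path:
            # The walk ran into itself: the tail of the path is a new cycle.
            label = ("cycle", frozenset(path[on_path[node]:]))
        elif node is None:
            label = ("no link",)
        else:
            label = ("red link",)  # target title is not a known page
        for page in path:
            attractor[page] = label
    return Counter(label for page, label in attractor.items()
                   if page not in redirects)
```

Each page is labelled the first time a walk passes through it, so the total work is linear in the number of pages, and the explicit loop avoids deep recursion on long chains.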

To summarize, the meme is indeed quite accurate: starting from a random Wikipedia article, the probability of ending up at the cycle that includes Philosophy is almost 95%. About 2% of all paths ended up at a red link (or at some interwiki link that my code didn't recognize and skip), while a bit under 1% ended up at an article in which my code detected no valid outgoing links at all. The rest ended up in a variety of other cycles, listed below. (Redirects are shown in italics in the cycles.)

Most of the cycles at the end of the table (including all those with a basin size of 1) are basically indirect self-links: article "Foo" links to "Bar", but "Bar" redirects back to "Foo". Neither my code nor MediaWiki actually recognizes these as self-links, so they generally just frustrate any users who happen to click them. Among the rest, common patterns include an article about an author linking to their most famous work, which in turn links back to the author.
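Indirect self-links of this kind are easy to flag once redirect targets are known. The following Python sketch is illustrative only and is not part of the original script. The function name `indirect_self_links` and both dictionaries are hypothetical: `first_link` maps a page to its first link, and `redirect_target` maps a redirect title to the page it points at. Only single-hop redirects are considered.

```python
def indirect_self_links(first_link, redirect_target):
    """Return pages whose first link is a redirect pointing back at them."""
    return sorted(
        page
        for page, target in first_link.items()
        if target is not None and redirect_target.get(target) == page
    )
```

For example, with `first_link = {"Foo": "Bar"}` and `redirect_target = {"Bar": "Foo"}`, the function reports `"Foo"`.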

It's also interesting to note that, while the convergence to Philosophy seems a fairly robust phenomenon, the other members of that cycle can vary a lot as the articles get edited. For example, at the time of this writing, one week after the database dump was taken, the cycle contains only two members: the first link at Philosophy points to Metaphysics, which links back to Philosophy.

These results should not be considered completely accurate, since determining what is actually the "first link" from a Wikipedia page can be quite tricky, especially when working from the unparsed markup. For example, Israeli Jews links back to Ada Yonath from an infobox, but my script doesn't recognize it as such (because the infobox has apparently been substituted, or "substed", into the page) and thinks the link is part of the body text. The actual first link from Israeli Jews leads to Israelis, and from there eventually to (you guessed it) Philosophy.