Wikipedia:Reference desk/Archives/Mathematics/2019 July 13

= July 13 =

== Wikipedia stats question ==
This is a statistics question.

Question: What percentage of Wikipedia articles should contain at least one dead link?

Given the following assumptions:
 * 80% of Wikipedia articles contain one or more external URLs
 * URLs die on average after 7 years ie.
 * Articles with URLs contain an average of 4 URLs
 * Wikipedia has been live, on average, since 2007 (to account for other language wikis that started later)

This is not specific to Enwiki but applies across all languages. The numbers are assumptions and might not be accurate, but the important thing is how to determine the answer, so that the assumptions can be tweaked. The literal number of articles is not important, only percentages. -- Green  C  13:44, 13 July 2019 (UTC)


 * I'm not a statistician, but you're going to need a bit more to go on. Just knowing the average number of links in an article doesn't give you the full picture. You'll get a different answer if every article contains 4 links than you will if every article contains 1 link except for one article that contains millions (however many it takes to make the average come out to 4), even though both cases have an average of 4 per article. Trying to account for how long links live will probably make life more complicated than it needs to be, too; after all, dead links do get removed and updated. You probably want to separate this into estimating what fraction of current links are dead vs. live, then making some kind of guess about the distribution of the number of links per page. At that point, you could estimate what you're ultimately looking for. Maybe someone else can suggest realistic ways to go about it or some plausible first guesses, but I don't think you can really say much given only what you've stated above. –Deacon Vorbis (carbon • videos) 01:11, 14 July 2019 (UTC)
 * Links are distributed on a power-law curve: a few articles have a lot, and most articles, out on the long tail, have a few. The top-linked article might have a few thousand, and the counts drop off rapidly from there. Dead links being updated or removed is of no concern; treat it as one more assumption that this doesn't happen. Archived links are treated as dead links for this purpose as well, in case that matters. -- Green  C  14:19, 14 July 2019 (UTC)
 * Well, if you assume that the lifetime of a link is exponentially distributed (probably somewhat reasonable) with average $$\mu = 1/\lambda,$$ and that links are added at a constant rate (probably reasonable) over a total length of time $$T$$, then the probability of a particular link being dead is (assuming all links are independent, which may or may not be a reasonable approximation)
 * $$p = 1 - \frac{1 - e^{-\lambda T}}{\lambda T}.$$
 * Using your values of a 7-year mean and a 12-year total span (2007 to 2019), this comes out to be $$p \approx 0.5217.$$
 * Next, assuming a power law for the number of links per page (in other words, a zeta distribution, whose mean is $$\zeta(s-1)/\zeta(s)$$) with mean 4, Wolfram Alpha kindly tells me that the parameter for that is $$s \approx 2.185.$$ Then the probability that a particular page has at least one dead link is given by
 * $$1 - \frac{\operatorname{Li}_s(1-p)}{\zeta(s)},$$
 * where $$\operatorname{Li}_s$$ is the polylogarithm, and $$\zeta$$ is the Riemann zeta function. Plugging everything in, I get a final answer that about 0.639 of pages (that contain at least 1 link to begin with) contain at least 1 dead link. However, there were an awful lot of assumptions that went into this, several of which were pretty suspect, so I wouldn't put much stock into this value. I totally could have done some algebra or computation wrong in there too, just to warn you. –Deacon Vorbis (carbon • videos) 16:04, 14 July 2019 (UTC)
 * Thank you! I am unable to follow the math, but the end result seems reasonable, maybe a little high. Currently about 0.18 of pages contain an archive URL (based on searching for and counting archive links). Even if 0.639 is high, it's still a long way from 0.18, suggesting there are still many unfixed pages that contain at least one dead link not yet archived. --  Green  C  21:51, 14 July 2019 (UTC)
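
Since the question was about how to get the answer so the assumptions can be tweaked, the calculation in this thread can be sketched in a few lines of Python. This is only a rough check under the same assumptions as above (exponential link lifetimes, links added at a constant rate, a zeta distribution of links per page, independent links). All function and variable names are made up for illustration, the zeta function is approximated with a partial sum plus an Euler-Maclaurin tail rather than taken from a library, and the final multiplication by the 80% share of articles with links is an extra step that assumes the question wants a fraction of all articles.

```python
import math

def zeta(s, n=1000):
    """Riemann zeta for real s > 1: partial sum plus an Euler-Maclaurin tail."""
    head = sum(k ** -s for k in range(1, n))
    tail = n ** (1 - s) / (s - 1) + 0.5 * n ** -s + s * n ** (-s - 1) / 12
    return head + tail

def polylog(s, z, tol=1e-15):
    """Li_s(z) = sum over k >= 1 of z^k / k^s, summed directly (needs 0 <= z < 1)."""
    total, k, zk = 0.0, 1, z
    while zk > tol:
        total += zk / k ** s
        k += 1
        zk *= z
    return total

def dead_link_probability(mean_life, span):
    """p = 1 - (1 - exp(-lambda*T)) / (lambda*T), with lambda = 1 / mean_life."""
    lam_t = span / mean_life
    return 1 - (1 - math.exp(-lam_t)) / lam_t

def zeta_exponent_for_mean(mean, lo=2.001, hi=10.0, iters=100):
    """Solve zeta(s-1)/zeta(s) = mean by bisection (the mean decreases as s grows)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if zeta(mid - 1) / zeta(mid) > mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Assumptions taken from the question; change these to try other scenarios.
MEAN_LIFE_YEARS = 7      # average lifetime of a URL
SPAN_YEARS = 12          # 2007 to 2019
MEAN_LINKS = 4           # average URLs per article that has any
SHARE_WITH_LINKS = 0.80  # articles with at least one URL

p = dead_link_probability(MEAN_LIFE_YEARS, SPAN_YEARS)
s = zeta_exponent_for_mean(MEAN_LINKS)
frac_linked = 1 - polylog(s, 1 - p) / zeta(s)  # among articles with links
frac_all = SHARE_WITH_LINKS * frac_linked      # among all articles

print(f"p = {p:.4f}, s = {s:.3f}")
print(f"with links: {frac_linked:.3f}, all articles: {frac_all:.3f}")
```

With the numbers from the question this should reproduce roughly $$p \approx 0.52$$, $$s \approx 2.19$$ and about 0.64 of linked articles, which comes to about 0.51 of all articles once the 80% share is applied.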


 * I know the original question specifically asked for the math rather than discussion of the assumptions, but "URLs die on average after 7 years" is very dubious. First of all, I wonder where the stat comes from: if it is about the internet in general, URLs cited in Wikipedia are not a random sample of internet links; they are skewed towards WP:RS, which are likely to stay up longer. Second, there is a more subtle statistical effect (called "tempo effect" here): the life expectancy of links is estimated from previously observed rates of link rot, which might not be representative of future rates of link rot. Tigraan Click here to contact me 08:26, 15 July 2019 (UTC)
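
To see how much the doubtful 7-year figure matters, here is a small Python sketch that evaluates the per-link formula from earlier in the thread for a few mean lifetimes. The alternative lifetimes are arbitrary illustrative values, not measured data, and the function name is made up; the 12-year span and the exponential-lifetime model are the same assumptions as before.

```python
import math

def dead_link_probability(mean_life, span):
    """Chance that a link is dead, for exponential lifetimes with the given mean
    and links added at a constant rate over `span` years."""
    x = span / mean_life  # this is lambda * T
    return 1 - (1 - math.exp(-x)) / x

SPAN_YEARS = 12  # 2007 to 2019, as in the question

# Hypothetical mean lifetimes in years; only 7 comes from the question.
for mean_life in (5, 7, 10, 15, 20):
    p = dead_link_probability(mean_life, SPAN_YEARS)
    print(f"mean lifetime {mean_life:>2} years -> p = {p:.3f}")
```

Longer-lived links give a smaller $$p$$, so if links cited in Wikipedia really do outlast the average URL, the estimate for pages with a dead link drops accordingly.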