Wikipedia:Analysis of citation issues for date and year articles

It's been brought up that most of the year/date articles lack inline citations; as of now, referencing guidelines have specific exemptions for these articles. Some people want to change this, on the basis that all text in the encyclopedia should be referenced to something.

Other people say that, since these entries all point to articles which have citations for the information in question, this is a pointless waste of time.

My personal opinion is that, if it's even remotely feasible, we should try to go through and cite the content in these articles.

The question, then, is whether this is remotely feasible. Let's investigate!

Number of articles
I could not find any articles for specific days of specific years: for example, January 1, 2000 redirects to Year 2000 problem, July 20 1969 redirects to Apollo 11, and September 11, 2001 redirects to September 11 attacks. I will assume that these "aren't a thing". (NB: I was later to learn that they were all redirected following a couple of discussions many years ago!)

There are, however, articles for every day of the year (January 1 through December 31) and February 29. The three I just linked are going to be abnormally long, since they're quite unique days (the first of the year, the last of the year, and the anomalous day that happens once every four years). There are 366 dates.

As for years, we have articles for all of them between 708 BC and 2021. There are also articles for years going up to 2029 (they haven't happened yet, but including them makes virtually no impact on the projections, so why the hell not). Since 0 AD did not exist, this gives us 2,736 year articles.

There are also articles for decades and centuries, which are generally the same "deal" as years (they list significant events that happened, and when they happened). Consulting List of decades, centuries, and millennia, we can see that there are 13 millennium articles, 61 century articles, and 384 decade articles.

Many of the decade articles are autofilled by transcluding events from year articles (see the section below for more information); since the information in these ones is the exact same as in the respective year articles (which we already counted), we can ignore these ones, which leaves us with 154.

Adding the millennium, century, and all 384 decade articles to the day and year articles, we end up with 3,560 in total; of these, we can leave out the 230 decade pages that are produced almost entirely by transcluding other pages (which we'll talk about more in a little bit), giving us 3,330 pages. Wowie zowie!
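The arithmetic behind those totals is easy to double-check. Here is a minimal sketch using only the counts quoted in this essay (the 230 figure for transclusion-built decade pages comes from the decades section further down); the variable names are mine.

```python
# Article counts as stated in this essay.
counts = {
    "days": 366,
    "years": 2736,
    "millennia": 13,
    "centuries": 61,
    "decades": 384,
}

# Decade pages that are built almost entirely from transclusions
# (figure taken from the decades section of this essay).
TRANSCLUDED_DECADES = 230

total = sum(counts.values())
remaining = total - TRANSCLUDED_DECADES

print(f"All articles: {total}")
print(f"After dropping transcluded decade pages: {remaining}")
```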

Estimation of scope
My strat here is going to be looking at a few random dates, then shamelessly extrapolating this into broad generalizations. To do this, I will just look at the source of a few articles of each category and see how many bullet points it has (some entries start with ** instead of *, so double-asterisk and triple-asterisk strings will be find-replaced to a single asterisk before counting them).
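For what it's worth, the counting step could be scripted along these lines. This is only a rough sketch of the approach, not the process I actually followed: the function names and the sample wikitext are made up, and treating any entry containing `<ref` as "cited" is an assumption about how citations show up in the page source.

```python
import re

def count_entries(wikitext: str) -> int:
    """Count list entries, i.e. lines that begin with one or more asterisks."""
    # Collapse leading runs of asterisks (**, ***) to a single asterisk,
    # mirroring the find-and-replace step described above.
    normalized = re.sub(r"^\*+", "*", wikitext, flags=re.MULTILINE)
    return sum(1 for line in normalized.splitlines() if line.startswith("*"))

def count_cited(wikitext: str) -> int:
    """Count list entries that contain at least one <ref tag (an assumption)."""
    return sum(
        1
        for line in wikitext.splitlines()
        if line.startswith("*") and "<ref" in line
    )

# Hypothetical page source, for illustration only.
sample = """== Events ==
* 1066 - Something happened.<ref>Source A</ref>
** A sub-entry with no citation.
* 1492 - Something else happened.
This line is not a list entry.
"""

print(count_entries(sample), count_cited(sample))
```

Anchoring the pattern to the start of each line keeps stray asterisks in the middle of an entry from being counted.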

Days
For date articles, of which there are 366, we can probably just pick one from each month and assume that those twelve are representative of date articles in general. Now, we don't want to pick the first of each month, because more stuff will happen then, and we don't want to pick the ends of weeks, or round numbers, so they ought to be actually random. I found a website that claims to generate random dates, and it gave me a few.
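If you'd rather not trust a random website, picking one date per month is a few lines of Python. This is a sketch of an equivalent approach, not the tool I used; `sample_dates` is a hypothetical helper, and using 2020 as the reference year is just a way to make February 29 eligible.

```python
import calendar
import random

def sample_dates(seed=None):
    """Pick one uniformly random day from each of the twelve months."""
    rng = random.Random(seed)
    picks = []
    for month in range(1, 13):
        # 2020 is a leap year, so February has 29 days here.
        days_in_month = calendar.monthrange(2020, month)[1]
        day = rng.randint(1, days_in_month)
        picks.append(f"{calendar.month_name[month]} {day}")
    return picks

print(sample_dates(seed=1))
```

Note that a truly uniform pick can still land on the first of a month now and then; it just won't do so systematically.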

Note that January 4 has a bizarrely high amount of citations (about 90% of its entries are cited, while every other date sampled has around 10%). Since it seems like an anomaly to me, I've included a second projection based on an average excluding it (note how much it changes the averages).
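To show why one heavily cited article moves the projection so much, here is a small sketch of the extrapolation. The per-article numbers below are invented for illustration and are NOT the figures from my sample; only the 366 and the rough 90%/10% split come from the text. `project_uncited` is a hypothetical helper.

```python
N_DATE_ARTICLES = 366  # number of date articles, from the text

# HYPOTHETICAL (entries, cited) counts, made up for illustration only.
sample = {
    "January 4": (500, 450),  # roughly 90% cited, like the anomaly described
    "Date B": (500, 50),      # roughly 10% cited
    "Date C": (400, 40),      # roughly 10% cited
}

def project_uncited(sample, n_articles, exclude=()):
    """Scale the mean uncited count per sampled article up to all articles."""
    uncited = [
        entries - cited
        for name, (entries, cited) in sample.items()
        if name not in exclude
    ]
    return round(sum(uncited) * n_articles / len(uncited))

with_outlier = project_uncited(sample, N_DATE_ARTICLES)
without_outlier = project_uncited(sample, N_DATE_ARTICLES, exclude=("January 4",))

print(with_outlier, without_outlier)
```

With a sample this small, dropping a single well-cited page raises the projected number of uncited entries noticeably, which is why both projections are worth reporting.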

Years
Extrapolating these numbers out to the year/decade articles is problematic for a number of reasons; primarily, because most of them aren't that big. While the worst-case scenario is mind-blowingly bad (all year lists having the same number of entries as date lists would yield slightly over a million uncited entries), it isn't very likely.

While the size of date articles is largely random, the size of year articles is not; the further back a year is, the less likely records are to exist. Articles like 2020, for example, are extremely anomalous both in their number of entries and the percentage carrying citations. Moreover, the size of year articles is determined by a multitude of factors: the number of people living in the world and causing events to happen, the advancement of recording technologies that allow those events to be documented, and the willingness of Wikipedia editors to compile chronicles of any given year all have an impact.

So instead of doing that, we can concoct an estimate of what we're dealing with as far as years go, using a "random" sample of one year from each century all the way forward and back. Since this is getting boring, I will select a few years fairly arbitrarily.


 * NB: Whether I count only the 2,728 years up to 2021, or the 2,736 years up to 2029 makes basically no difference, so I might as well just future-proof this essay for another decade by including the latter figures.

Later years have an outsized effect on these numbers: 2020 brings up the average number of entries considerably (from 100 to 128), and if it is excluded, the estimate for uncited entries drops from ≈261k to ≈254k. If 2020 and 2008 are excluded, it drops to ≈221k (with 85 entries on average); if 1969 is removed as well, it drops to ≈110k (with only 44 entries on average).
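One way to read those projections is to divide them back by the 2,736 year articles, which gives the implied average number of uncited entries per year article. A quick sketch, using only the rounded totals quoted above (so the per-article figures are approximate, and the dictionary labels are mine):

```python
N_YEAR_ARTICLES = 2736  # from the text

# Projected uncited entries, as quoted (rounded) in the text.
projections = {
    "all sampled years": 261_000,
    "excluding 2020": 254_000,
    "excluding 2020 and 2008": 221_000,
    "excluding 2020, 2008 and 1969": 110_000,
}

# Implied mean uncited entries per year article for each scenario.
implied = {
    label: total / N_YEAR_ARTICLES for label, total in projections.items()
}

for label, per_article in implied.items():
    print(f"{label}: about {per_article:.1f} uncited entries per article")
```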

Because of the extreme influence of 20th- and 21st-century articles on averages, a previous version of this section, which made an estimate based on a far smaller sample of year articles, had projected a much higher number of entries (≈900k) as well as a much higher number of uncited entries (≈380k – 600k).

But anyway, let's move on to the other stuff.

Decades
The decade articles, of which there are 384, are more complicated: these tend to all contain an "Events", "Significant people", "Births" and "Deaths" section (at the minimum). On many articles (for example, 210s), three of these sections use templates to transclude content from individual year articles. By looking at transclusions† for Births and deaths by year for decade, Events by year for decade, and Events by year for decade BC, we can see that everything between 490s BC and 1790s contains these templates.

Below is a table of the three decade transclusion templates, and which mainspace articles they are on (according to WhatLinksHere and TransclusionCount, which seem to disagree):


 * Note 1: Births and deaths by year for decade cut off at 1350s for no apparent reason; I decided to see what was going on with subsequent articles. Many of them either had nothing for births and deaths, or contained information nearly identical to the stuff from individual year pages copy/pasted over, so I began checking them to see if they corresponded with the transcluded lists that are provided by that template. For the most part, they did, and a couple entries were missing from individual year lists (so I added some). Others, however, were quite wrong; the articles for Dafydd ap Gwilym, James Audley, and Nissim of Gerona place their deaths in totally different decades (let alone the specific years they were listed under in the 1380s article). At any rate, I've verified consistency between the births and deaths sections on the decade articles versus individual years, and added the template to every decade up to the 1790s (except the 1770s, where for some reason the template refused to transclude and I had to use transclude births and transclude deaths for each year instead).
 * Note 2: There is some kind of weird error I don't understand; while there are 230 articles in that decade range, neither of the measures seems accurate. For Births and deaths by year for decade, there should be 229 transclusions (it couldn't be used on 1770s), but WhatLinksHere gives 228 and TransclusionCount gives 230. Similarly, WhatLinksHere gives a sum of 229 transclusions for the "events by year" templates, and TC gives 232. I've got no idea what's up with that.

Out of 384 decade pages, 230 are like this, rendering them a "who cares" situation (of course, they still have "Significant people" and "World leaders" sections, but let's say we ignore those for the time being); only 154 remain. Of these, 130 occur prior to the 5th century BC, and 24 occur after the 18th century; it goes without saying that the older ones are much sparser in content.
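The 230 figure can be sanity-checked by simply enumerating the decades from 490s BC through 1790s. This sketch assumes that "0s BC" and "0s" are separate pages, which is what makes the count come out to 230; the remaining numbers are the ones quoted above.

```python
# Decades BC: 490s BC, 480s BC, ..., 10s BC, 0s BC.
bc_decades = len(range(490, -1, -10))
# Decades AD: 0s, 10s, ..., 1790s (assumes "0s" is its own page).
ad_decades = len(range(0, 1800, 10))

transcluded = bc_decades + ad_decades

TOTAL_DECADES = 384   # from the text
EARLY_DECADES = 130   # before the 490s BC, from the text
LATE_DECADES = 24     # 1800s onward, from the text

remaining = TOTAL_DECADES - transcluded

print(bc_decades, ad_decades, transcluded, remaining)
```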

By this point, you know the drill: here are some samples for the decade pages prior to 490s BC.

I really wasn't kidding when I said they were sparse: by my estimation, all of the 130 decade pages from this period put together have about eight hundred missing citations, which is fewer than you'd find in a single decade's worth of year pages (which have an average of 96 missing citations each)!

Now here are some articles representing the decades from 1800s onward:

You'll notice I have provided a second projection, which only gives the first 12 decades in this series (1800s – 1910s). This is because pages for modern decades present a special challenge: as we get closer to the present day, they stop being simple lists of events, and start becoming articles in their own right. For example, 2010s is 321,395 bytes long and has 556 citations, but only contains 34 bullet points. The majority of the article is written in prose, or contained in specially constructed tables; there's no way to apply the bird's-eye view of assessing citation density by tallying up list entries and counting numbered references. Consequently, I'm heavily tempted to say that every decade article from the mid-20th century onward is beyond the scope of this analysis and needs to be evaluated individually.

Centuries
Not a whole lot to say about these. There are 61 of them; sampling every tenth one manages to capture the earliest (40th century BC) and the latest (21st century). Unlike the decades, these do not turn into full prose articles; the 20th and 21st centuries are still basically lists of events.
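Taking every tenth century from a list of 61 conveniently lands on both endpoints. Here is a sketch of that selection; the `ordinal` helper is hypothetical, and this shows what an evenly spaced sample looks like rather than asserting exactly which pages I opened.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix (1st, 2nd, 3rd, 4th, ...)."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# 40th century BC down to 1st century BC, then 1st century up to 21st century.
centuries = [f"{ordinal(n)} century BC" for n in range(40, 0, -1)]
centuries += [f"{ordinal(n)} century" for n in range(1, 22)]

# Every tenth article, starting from the earliest.
sampled = centuries[::10]

print(len(centuries))
print(sampled)
```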

Millennia
There are only 13 millennium articles, so sampling isn't necessary; we can just look at each article individually. The six from the 10th millennium BC through the 5th millennium BC are prose articles, with all statements appropriately sourced. The ones that can be assessed as list articles, then, are the seven afterwards.

Summary
Based on the analysis above, which I'm sure could be refined further (but I don't think is off by an appreciable amount), we have somewhere around four hundred thousand uncited statements across all the day, year, decade, century and millennium articles.

For reference, as of the time of writing (February 2021), Citation needed's transclusion count indicates we have just over 455,000 instances of it.

Of course, there are some mitigating factors that make these a little less bad than cn transclusion (i.e. each date list entry should link to an article containing the relevant citation, which can simply be copied out); nonetheless, it does look like we are dealing with a problem roughly comparable in scope to the entirety of tagged uncited statements in the whole encyclopedia.

For comparison, the Guild of Copy Editors, over the course of over ten years, has succeeded in getting a backlog of 9,000 articles down to almost zero.

What is to be done?
Some random stuff that has popped into my head:


 * I'm not quite convinced anything really needs to be done. Year and date articles don't seem to be very commonly used; if someone is using them for something really mission-critical, they only link to people who have articles written about them anyway, and the information can easily be verified/debunked based on those articles. This seems like a fairly distinct situation, compared to other Wikipedia pages. A good comparison might be, say, List of people from Sacramento, California: sure, there are lots of citations here, but not every entry carries one, and why should it? Who cares? You can go to their articles and find out.


 * This whole situation seems like a legacy issue from the way things were a long time ago (inline citations being a luxury option for most articles). If these articles were all being created today, it wouldn't be that hard to just go through and add a citation for every entry; the main issue seems to be that they've gotten this way over the course of twenty years. If we were to go through and verify every single entry on all these articles, even if we completely stopped monitoring them all, it would probably take at least another twenty years for things to get this way again (by which time I'm sure we will have some way to reliably parse language and make large amounts of edits other than manually opening up browser windows, highlighting text and slapping ctrl+c and ctrl+v).


 * It seems like lots of this information should be possible to automatically extract from biography articles now; I don't know whether that's infoboxes, categories, or some kind of fancy language parsing. However, if we're talking about a half million missing citations, I think it'd be well worth the investment to spend time on some solution that had even a tiny impact. Let's say a program is able to fill in citations on date/year links... but only if the person's birth year in their article had a direct inline citation... to a machine-readable website that directly mentioned it as their birth year... from a small whitelist of reliable sources. Maybe this only happens on one out of every hundred entries. But that's still five thousand citations. How many days of work is that? Probably at least a few.
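To make that last idea a bit more concrete, here is a very rough sketch of the filter I have in mind. Everything in it is hypothetical: the `BirthYearClaim` structure, the `can_autocite` function and the whitelist contents are placeholders, and the hard parts (extracting the citation from a biography and confirming that the source actually states the year) are assumed to have been done elsewhere.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# HYPOTHETICAL whitelist; which sources belong here is an editorial decision.
WHITELIST = {"example.org"}

@dataclass
class BirthYearClaim:
    year: int
    # Inline citation attached to the birth year in the biography, if any.
    citation_url: Optional[str]
    # Whether the cited page was found to state this year (assumed precomputed).
    source_states_year: bool

def can_autocite(claim: BirthYearClaim) -> bool:
    """Only accept claims with a whitelisted source that states the year."""
    if claim.citation_url is None:
        return False
    host = urlparse(claim.citation_url).hostname or ""
    return host in WHITELIST and claim.source_states_year

# Back-of-the-envelope yield, using the figures from the text.
ESTIMATED_ENTRIES = 500_000
ONE_IN = 100
expected_citations = ESTIMATED_ENTRIES // ONE_IN

print(expected_citations)
```

The conditions are deliberately strict: a bot that fills in a wrong citation is worse than one that skips an entry.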

Conclusions
I guess my opinion of this is that, barring some technological solution that allows portions of this workload to be automated (which may emerge sooner rather than later), it would be unimaginably time-consuming to cite all the date/year/etc articles (potentially involving an effort comparable to the repair of all uncited statements in the whole of Wikipedia).

I would not recommend going through and fixing these by hand, when the effort involved could be used on any number of other tasks.

— jp×g, 2021