User:Mill 1/Project Missing Medieval Link

Motivation
As a history buff I've been visiting Wikipedia for years. Before long I developed an interest in the timeline articles, especially the Year pages (like 1492) and the Days of the year pages (DOY pages for short) like December 24. Days of the year pages all contain a Births and a Deaths section, listing links to persons' bio pages in chronological order. I noticed, however, that (pre-)medieval entries (before 1500 AD) were vastly underrepresented in DOY pages. It seemed like no one was born or died on a specific date before the 16th century:



I also noticed that, compared to the DOY pages, the Year pages listed far more persons in their Births and Deaths sections, many of them stating the exact date of birth or death. A quick scan confirmed this: many persons with a biography who are listed in a Year page are not present in the corresponding Days of the year page. The reason is probably that it is less obvious for biographers to add an entry to the corresponding DOY page than to the Year page. Once an author has omitted it, chances are small that someone else will add the link to a DOY page manually, since doing so is horrendously tedious.

Solution: the wiki-client app
Towards December 2016 I got an idea: shouldn't it be possible to (semi-)automate cross-referencing these timeline types? If a missing link was spotted, an added advantage would be that the text to insert into the DOY page could be generated from the one in the Year page (whose format differs only slightly from that of the DOY page). A potentially big issue, however, is that the text within the Births and Deaths sections is unstructured. Luckily I noticed that the level of standardization within the pages and sections is quite high; I ran an automated check on the years 500 BC - 1550 AD and only had to change a few dozen pages, fixing missing sections or re-applying template standardization. I was quite astonished to encounter so few text structure errors, given that editing wiki pages is open to anybody. Unfortunately this proved not to be the case for the actual content (see results and statistics).

In the weeks that followed I created and improved a VBA-powered MS Excel application that implemented the envisioned functionality. When the button 'Check year' in the Excel file is clicked, a specific Year page is analysed and used as a starting point to look for missing entries in the matching DOY pages. Per section (Births and Deaths) the general algorithm looks like this:
 * Get the raw response text of a Year page regarding a specific section
 * Per person encountered do the following:
   * Store information (Display-/link-name, date of birth/death, link text)
   * Check if a bio page exists based on the name of the entry
   * In case an exact date of birth/death is known: check existence of a link to the biography in the matching date of the year page
 * Write the results to the project-sheet of the Excel-application.
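
The steps above can be sketched in a few lines of Python. This is not the original VBA code, only a minimal illustration under stated assumptions: Year-page entries are assumed to look like `* [[March 27]] – [[Target|Display]], description`, the DOY line is assumed to differ only in that the date link is replaced by a year link, and page texts and the set of existing bio pages are passed in as plain values instead of being fetched. All function names and the "Example Person" sample entries are hypothetical.

```python
import re
from dataclasses import dataclass

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Assumed Year-page entry layout (not taken from the original VBA code):
#   * [[March 27]] – [[Target|Display]], description
ENTRY_RE = re.compile(
    r"^\*\s*\[\[(?P<date>(?:%s) \d{1,2})\]\]\s*[–-]\s*"
    r"\[\[(?P<target>[^\]|]+)(?:\|(?P<display>[^\]]+))?\]\]"
    r"(?P<rest>.*)$" % MONTHS
)


@dataclass
class Entry:
    date: str     # exact date, e.g. "March 27"
    target: str   # link target of the bio page
    display: str  # display name shown in the list
    rest: str     # remainder of the line (description, life dates)


def parse_section(wikitext):
    """Return the entries of a Births/Deaths section that carry an exact date."""
    entries = []
    for line in wikitext.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:  # lines without an exact date do not match and are skipped
            target = m.group("target").strip()
            display = (m.group("display") or target).strip()
            entries.append(Entry(m.group("date"), target, display,
                                 m.group("rest").rstrip()))
    return entries


def doy_has_link(doy_wikitext, target):
    """True if the DOY section text already links to the given bio page."""
    pattern = r"\[\[\s*" + re.escape(target) + r"\s*(?:\||\]\])"
    return re.search(pattern, doy_wikitext) is not None


def doy_text(year, entry):
    """Generate the line to paste into the DOY page from the Year-page entry."""
    if entry.display == entry.target:
        link = entry.target
    else:
        link = f"{entry.target}|{entry.display}"
    return f"* [[{year}]] – [[{link}]]{entry.rest}"


def find_missing(year, year_section, doy_sections, existing_bios):
    """Per dated entry with a bio: (name, date, name exists?, text to add)."""
    rows = []
    for e in parse_section(year_section):
        if e.target not in existing_bios:  # no bio page, nothing to link
            continue
        present = doy_has_link(doy_sections.get(e.date, ""), e.target)
        rows.append((e.display, e.date, present,
                     "-" if present else doy_text(year, e)))
    return rows


# Made-up sample input; only the Adam Ries date comes from the text.
year_section = """\
* [[March 27]] – [[Adam Ries]], German mathematician (d. [[1559]])
* [[May 8]] – [[Example Person (poet)|Example Person]], made-up sample entry
* [[Undated Example]], made-up entry without an exact date
"""
doy_sections = {
    "May 8": "* [[1492]] – [[Example Person (poet)|Example Person]], made-up",
}
existing_bios = {"Adam Ries", "Example Person (poet)"}

for row in find_missing("1492", year_section, doy_sections, existing_bios):
    print(row)
```

Each returned row mirrors one line of the result sheet: name, exact date, whether the name already exists in the DOY page, and the text to add ('-' when nothing needs to be added).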

Handling a single DOY
After a specific year is processed the results may look like this:

The results are shown in the sheet per section: Births on the left and Deaths on the right. Each section lists all persons for whom an exact date was stated in the Year page and for whom a bio exists. Per person the following information is displayed:
 * Name person: the wiki display name (hyperlink to the biography)
 * Exact date (of birth or death, hyperlink to the corresponding Day of the year page)
 * Name exists?: Does the entry exist in the corresponding DOY page?
 * Text to add to section: The generated link-text to insert into the DOY page. This text is manually copied and pasted.

When the application has determined that the date in the Year page and the date in the bio page are identical, the "Text to add to section" cell gets a green background color. When a person is already listed in the matching DOY page, 'Name exists?' is TRUE and the text to add is '-'. Otherwise the text to copy into the matching DOY page is generated from the one in the Year page. Quite a few notable links are missing: in the Births section of Year page 1492, 21 persons with exact birth dates are listed, but only 8 of them are present in the matching DOY pages. Another thing that stands out is the number of erroneous entries: not all "Text to add" cells are green. If a discrepancy is identified between the date stated in the Year page and the one in the matching biography, the cell is marked with a red background color instead. As it turned out, quite a few types of errors existed that needed to be fixed before I could add missing entries:
 * Mismatch between date in Year page and in bio page (either date or year).
 * In the bio page no exact date was stated
 * In the Year page or biography incorrect date-formatting was used
 * In the Year page the link to the bio page was incorrect
 * Etcetera...
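
The green/red marking described above boils down to comparing two date claims. The sketch below illustrates that comparison; it is not the original VBA logic. The names `DateClaim`, `classify` and `cell_color` are hypothetical, and treating a bio without an exact date as a red case is an assumption based on the error list.

```python
from typing import NamedTuple, Optional


class DateClaim(NamedTuple):
    day: Optional[str]   # e.g. "March 27", or None when no exact date is stated
    year: Optional[int]  # the year of birth/death as stated on the page


def classify(year_page: DateClaim, bio: DateClaim) -> str:
    """Compare the date in the Year page with the one in the bio page."""
    if bio.day is None:
        return "no exact date in bio"
    if bio.year != year_page.year:
        return "year mismatch"
    if bio.day != year_page.day:
        return "date mismatch"
    return "match"


def cell_color(status: str) -> str:
    """Green only when both pages agree (assumption: every other case is red)."""
    return "green" if status == "match" else "red"


# The Adam Ries case from the text: Year page says March 27, bio says January 17.
status = classify(DateClaim("March 27", 1492), DateClaim("January 17", 1492))
print(status, cell_color(status))
```

A red cell only says that the two pages disagree; which page is wrong still has to be found out by hand.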

If such an error is detected, further investigation is required: what is the source of the error, the Year page or the bio? Or do I need to tweak my VBA code? For instance, take a look at Births, person Adam Ries. The Year page states March 27 as the date of birth, whereas the matching bio states January 17. Further investigation has to make clear which correction must be made to which page. As a consequence I had to correct numerous Year and bio pages (which I don't mind, being a Wiki Gnome).

Adding the entries
After addressing the errors of a specific year I could finally do what was the initial goal of the project: insert links to bios that are missing in WP:DAYS. Per year the generated text of the entries is copied manually from Excel to the correct location within the section of the DOY page. So far (today is 26 June 2017) I have checked all the years between 500 BC and 1625 AD this way. I added a few thousand entries during the process, in some cases adding 10+ (pre-)medieval entries to a DOY page.

However, not all persons should be added to a DOY page, since an entry has to be 'sufficiently globally notable'. All my insertions are swiftly validated by the wiki community, especially by Rms125a@hotmail.com. Again, thanks for all your hard work!

Going through the centuries I noted a significant increase in data from the 14th century onwards. Until 1350 every missing link to a referenced bio was added. From 1350 onwards the missing entries detected by the Excel application were subject to more scrutiny. Apparently there's a lot of crap being added. Apart from notability I decided to also look at the size of the bio involved as an indication for possible insertion into a DOY page. I am fully aware of all the ongoing discussions around entry notability. However, because of the sheer number of missing entries I had to come up with some criterion to quickly sift through the missing links found. I found that article size is a good indication of notability. The next table shows, per century, the minimum number of characters (based on the raw HTTP response content) usually required for an article to be eligible for insertion:

Keep in mind that the table's limits are quite flexible. For instance: some articles on medieval poets quote entire poems, which inflates the character count and makes article size a poor measure of the article's relevance and notability. Other bios contain all kinds of invisible information, like an infobox with a lot of empty properties and/or numerous wiki categories, with the same result. On the other hand I learned that, based on article size, the minimum number of characters needed to compose a relevant wiki bio, regardless of its period in history, is around 8,000. Of course there are some other criteria for a bio to be notable or not:[ongoing]
 * European nobility is always notable.
 * Roman Catholics and German theologians are never reverted.
 * Renaissance painters, no matter how obscure, are always accepted.
 * Anyone who had to do anything with the Mayflower is acceptable.
 * If > 15th century: a picture of a painting/drawing/bust of the subject is required.

In the end bio size doesn't seem to matter in order to be admitted to a DOY-page. Just take a look at this table (more summaries can be found in the archive at the top).

Milestones
Agreed, 'milestone' is a terrible term concocted by management ;). Anyway:
 * By now I've added between 5 and 20 (pre-)medieval entries to every Day of the Year page (to January 1 chk), average: ± 10.
 * Regarding these pages I've pushed back the latest year of the first item significantly. Births: 1495 (21 November), Deaths: 997 (23 July).
 * Both the Births and Deaths section in all DOY pages now show at least one link from the 16th and 17th century.
 * Every Deaths section in a DOY page now contains at least 5 (pre-)medieval entries, stating at least 4 different centuries (yes, I went a bit overboard on that one...)


 * The earliest date of death I added was regarding Leonidas I: 480 BC (11 August)

Statistics
The charts below, one per section, clearly show the progress that's been made since the start of this endeavour. Also note that many existing erroneous links were removed from the DOY pages during the period December 2016 – June 2017.



During this project I also compiled some other statistics. The following excerpt of the output is self-explanatory. Pay special attention to the number/fraction of discrepancies per processed year. Luckily the fraction of erroneous entries dropped sharply after 1600. I never found out why.