User:Mill 1/Project All Who Are Born Must Die

Motivation
During I noticed something odd. Until the 15th century, the entries in the Deaths section of a Year page generally outnumbered those in the Births section. The reason seemed evident: during their lifetime a person would become important enough to have their date of death archived for posterity, although in many cases no record existed of their date of birth. But from 1420 onwards the quality of administration regarding clergy, nobility and other privileged groups apparently reached a point where the birth dates started to overtake the death dates on the Year pages. Within a century, the entries in the Births section greatly outnumber those in the Deaths section, sometimes by a factor of 5 or more. That is strange, since if the date of birth is known, in most cases the date of death would be known as well. I suspect that Wikipedians often neglect to add the corresponding date of death after stating a person's date of birth on a Year page. The following chart shows the stated links per DOY page aggregated per (pre-)medieval century (shown numbers are outdated):

Solution
I realised that it should be possible to check for these omitted entries via the WIKI-client application. This turned out to be true. The general algorithm looks like this:
 * Get the raw response text of the Births section of the Year page.
 * Per person encountered do the following:
 * Store information (link-name, date of birth, link text)
 * If an exact date of birth is stated, get the raw text of the person's biography, based on the name of the entry.
 * In the bio page look for the date of death, if present.
 * Write the results to the project sheet of the Excel-application.
 * If the person is not present in the Deaths section of the corresponding Year page, generate the link text.
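
The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the actual wikiclient code: the entry format handled by ENTRY_RE is a simplified assumption, the names BirthEntry, parse_births, is_linked and find_missing_deaths are hypothetical, and the page-fetching and date-finding steps are passed in as functions so the sketch runs without network access. Writing the results to the Excel sheet is left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed (simplified) entry format: "* [[March 6]] – [[Link name|Link text]], description"
ENTRY_RE = re.compile(
    r"^\*\s*(?:\[\[(?P<date>[A-Z][a-z]+ \d{1,2})\]\]\s*[–-]\s*)?"
    r"\[\[(?P<name>[^\]|]+)(?:\|(?P<text>[^\]]+))?\]\]"
)


@dataclass
class BirthEntry:
    link_name: str
    link_text: str
    birth_date: Optional[str]  # e.g. "March 6"; None when no exact date is stated


def parse_births(section_text: str) -> List[BirthEntry]:
    """Store link name, link text and date of birth for every entry found."""
    entries = []
    for line in section_text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:
            name = m.group("name")
            entries.append(BirthEntry(name, m.group("text") or name, m.group("date")))
    return entries


def is_linked(section_text: str, link_name: str) -> bool:
    """True if the section already contains a wiki link to link_name."""
    return f"[[{link_name}]]" in section_text or f"[[{link_name}|" in section_text


def find_missing_deaths(
    births_text: str,
    fetch_bio: Callable[[str], str],                         # link name -> raw bio text
    find_death: Callable[[str], Optional[Tuple[int, str]]],  # bio text -> (year, "Month day")
    fetch_deaths_section: Callable[[int], str],              # year -> raw Deaths section
) -> List[Tuple[int, str]]:
    """Return (death year, link text) for every death entry that is missing."""
    missing = []
    for entry in parse_births(births_text):
        if entry.birth_date is None:
            continue  # only entries with an exact date of birth are followed up
        death = find_death(fetch_bio(entry.link_name))
        if death is None:
            continue  # no date of death found in the biography
        year, day = death
        if not is_linked(fetch_deaths_section(year), entry.link_name):
            missing.append((year, f"* [[{day}]] – [[{entry.link_name}]]"))
    return missing
```

Passing the fetch functions in as parameters keeps the matching logic testable with plain dictionaries in place of live pages.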

Since the groundwork of the wikiclient-application was already done, I suspected it would take me a couple of hours of programming to implement the new functionality. This proved not to be the case: it cost me an extra 30 hours to realize the solution. Retrieving the date of death from the article proved especially complex, because many different ways and formats exist to state a date. I had to resort to this kind of code a lot: The next picture shows the fruits of my labor.
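
To give an idea of why the date formats are troublesome, here is a small Python sketch that tries several patterns in turn on a text fragment. It is an illustration only, not the code used in the application: the three formats covered (a death date template, "2 June 1540" and "June 2, 1540") are an assumed subset, and the names PATTERNS and parse_date are hypothetical.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}
_MONTH = "|".join(MONTHS)

# Patterns are tried in order; each one covers a single assumed notation.
PATTERNS = [
    # {{death date|1540|6|2}} or {{death date and age|df=yes|1540|6|2|1475|3|6}}
    re.compile(
        r"\{\{\s*death date(?: and age)?\s*\|\s*(?:[dm]f=\w+\s*\|\s*)?"
        r"(?P<y>\d{3,4})\s*\|\s*(?P<m>\d{1,2})\s*\|\s*(?P<d>\d{1,2})",
        re.IGNORECASE,
    ),
    # 2 June 1540
    re.compile(rf"(?P<d>\d{{1,2}})\s+(?P<mon>{_MONTH})\s+(?P<y>\d{{3,4}})"),
    # June 2, 1540
    re.compile(rf"(?P<mon>{_MONTH})\s+(?P<d>\d{{1,2}}),?\s+(?P<y>\d{{3,4}})"),
]


def parse_date(fragment: str) -> Optional[date]:
    """Return the first date recognised in fragment, or None if nothing parses."""
    for pattern in PATTERNS:
        m = pattern.search(fragment)
        if not m:
            continue
        groups = m.groupdict()
        month = MONTHS[groups["mon"]] if "mon" in groups else int(groups["m"])
        try:
            return date(int(groups["y"]), month, int(groups["d"]))
        except ValueError:
            continue  # impossible calendar date, e.g. 30 February
    return None
```

The function returns the first date it recognises, so a caller has to hand it the fragment that follows the birth date; otherwise the birth date itself would be returned.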

When a year is processed, the text in column J is ready to be copied into the Deaths section of the Year page stated in column I. I also ran some checks to spot errors within and between pages. The most notable are:
 * Many biographies contain an infobox. If present, it is the first place the programme looks for the death date. After that the programme looks for the matching date in the opening sentence of the article. If the two dates differ, a warning is displayed after the entry text.
 * If no infobox is present in the bio, the date of birth should be stated in the opening sentence. It is the starting point of the search for the death date. A warning is generated in the link text if the birth date is not encountered.
 * In the source Year pages, section Births, per entry the year of death is often stated. If this year is different from the year of death found in the matching article the application also shows a warning.
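
The three checks above could be combined along these lines. This Python sketch assumes the dates have already been extracted from the infobox and the opening sentence; the function name collect_warnings, its parameters and the warning texts are hypothetical and do not reflect the actual application.

```python
from datetime import date
from typing import List, Optional


def collect_warnings(
    infobox_death: Optional[date],     # death date from the infobox, if any
    sentence_death: Optional[date],    # death date from the opening sentence, if any
    has_infobox: bool,
    birth_date_in_sentence: bool,      # was the birth date found in the opening sentence?
    listed_death_year: Optional[int],  # year of death stated in the Births entry, if any
) -> List[str]:
    """Return the warnings to display after the generated entry text."""
    warnings = []
    # Check 1: infobox and opening sentence must agree on the death date.
    if infobox_death and sentence_death and infobox_death != sentence_death:
        warnings.append("infobox and opening sentence disagree on death date")
    # Check 2: without an infobox, the birth date anchors the search.
    if not has_infobox and not birth_date_in_sentence:
        warnings.append("birth date not found in opening sentence")
    # Check 3: the Year page entry and the article must agree on the death year.
    found = infobox_death or sentence_death
    if found and listed_death_year is not None and found.year != listed_death_year:
        warnings.append("death year differs from Year page entry")
    return warnings
```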



Results
Since this project is a work in progress, I cannot state anything conclusive yet. What I can say is that this additional function turns up a lot of missing entries. For instance, consider the year 1475, whose results are displayed in the image above:
 * Section Births of the year 1475 shows 25 persons.
 * Of these 25 persons, only 14 entries state the exact date of birth.
 * In the matching bios, two dates of death are unclear. Two others show discrepancies (marked with an orange background color) that need correcting.
 * Of the remaining ten persons, four are already mentioned in the Deaths sections of their death year.
 * Six death-links are missing and will be added to the corresponding year-page.

Update: By 1623 I got so fed up with adding obscure and badly sourced death-links of bios that I defined the following two conditions for insertion:
 * 1) Minimum number of characters of the linked article: 3000
 * 2) Article should contain at least one valid inline citation or general reference
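
A minimal Python sketch of these two conditions, applied to the raw text of the linked article. How a valid citation is detected here is an assumption: an inline citation is taken to be a ref tag or an sfn template, and a general reference a bulleted item directly under a References, Sources or Bibliography heading. The name qualifies_for_insertion is hypothetical.

```python
import re

MIN_ARTICLE_CHARS = 3000  # condition 1

# Assumption: an inline citation shows up as a <ref> tag or an {{sfn|...}} template.
INLINE_CITATION = re.compile(r"<ref[\s>]|\{\{\s*sfn\s*\|", re.IGNORECASE)

# Assumption: a general reference is a bullet directly under one of these headings.
GENERAL_REFERENCE = re.compile(
    r"^==+\s*(?:References|Sources|Bibliography)\s*==+\s*\n(?:\s*\n)*\*",
    re.IGNORECASE | re.MULTILINE,
)


def qualifies_for_insertion(wikitext: str) -> bool:
    """Apply both conditions to the raw text of the linked article."""
    if len(wikitext) < MIN_ARTICLE_CHARS:
        return False
    has_inline = INLINE_CITATION.search(wikitext) is not None
    has_general = GENERAL_REFERENCE.search(wikitext) is not None
    return has_inline or has_general  # condition 2
```

This only checks that a citation is present, not that it is valid; judging the quality of a source still needs a human eye.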