User:Mill 1/Project Chaining back the Years

"This is going to take me ten years." I thought. In the end it took only six.

Preface
The articles that list recent deaths consistently rank among the most popular on Wikipedia. It must have been in the summer of 2018, however, that I first got interested in the older versions of them. At the time the dead were listed per month ('dpm's') and per year ('dpy's'). I noticed wild differences between them in formatting, guidelines, coverage and sourcing. One explanation is that the dpm of the current month is edited intensively as the month progresses, and a lot of watchers make sure the guidelines are followed during and after the running month. This was not the case for pages listing the deceased of the pre-Wikipedia era; they were added to dpy's afterwards. Annoyed by the discrepancies between these types of pages, I set out to standardize the formatting of the dpy's first.

During that time I noticed something else which would become the main motivation for the initial phase of this endeavour: days were missing! There seemed to be days on which nobody had died. This could not be, and my OCD tendencies immediately kicked in. An idea formed in my head: why not create dpm's for all months going back to 1995? It would solve the issue of the dpy's becoming very long and I could add missing days when processing a month. I would take the year 2005 as a starting point because I discovered that from 2006 onwards at least one deceased is listed for every day until the present. I remember flinching at the idea when I realised I had to process more than 4000 days. "This is going to take me ten years." I thought. In the end it took only six.

This article tries to give an overview of the activities involved in the project that I dubbed Chaining back the Years. It also lists some interesting milestones and statistics.

Three rounds
In hindsight, improving the deaths lists fell into three separate rounds of activities, during each of which every existing dpm and dpy was processed.


 * Round 1: Breaking up the Deaths in Years (September 2018 – October 2020)
 * Round 2: Adding NYTimes references (November 2020 – October 2021)
 * Round 3: New rules: let's process every day (again) (November 2021 – October 2024)

You can find information on the initial versions of the dpm's per round here.

Round 1: Breaking up the Deaths in Years
Period: September 2018 – October 2020
Articles: Deaths in January 1997 – Deaths in December 2005

The first phase started by making the dpy's even longer before forking them into twelve separate dpm's. Regarding every month I needed to perform checks, find missing notable deceased for the list ('entries') and compile the wikitext that I could paste in a dpm. Obviously this was way too much work to accomplish by hand.

So before beginning I extended the functionality of the Excel application that I had already used for several other projects. It would prove to be indispensable when processing a month:

1. Dpm checks
Before entries would be added/updated, the month at hand would be checked. Existing entries in the list would be cross-referenced with their corresponding bio's to look for discrepancies:
 * Are the existing entries in the correct day sub section?
 * Do the existing entries link to a valid biography article?
 * Do the corresponding bio's contain the correct "[YEAR] deaths" and "[YEAR] births" categories?
 * Does every day sub section contain at least two (later three) entries?
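The checks above were performed by the Excel tool, whose code is not shown here. As a rough illustration only, two of them (entry in the wrong day sub section, too few entries per day) could be sketched in Python as follows. The `Entry` class, its fields and `check_month` are hypothetical names invented for this sketch.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical model of one list entry in a dpm."""
    name: str                   # title of the linked biography
    listed_day: int             # day sub section the entry sits under
    bio_death: Optional[date]   # date of death found in the bio, None if no valid bio


def check_month(year: int, month: int, entries: List[Entry], min_per_day: int = 2) -> List[str]:
    """Return a list of human-readable issues found in one dpm."""
    issues = []
    days_in_month = calendar.monthrange(year, month)[1]
    per_day = {day: 0 for day in range(1, days_in_month + 1)}

    for entry in entries:
        if entry.listed_day in per_day:
            per_day[entry.listed_day] += 1
        if entry.bio_death is None:
            issues.append(f"{entry.name}: no valid biography found")
        elif entry.bio_death != date(year, month, entry.listed_day):
            issues.append(
                f"{entry.name}: listed under day {entry.listed_day}, "
                f"bio says {entry.bio_death.isoformat()}"
            )

    # Every day sub section needs a minimum number of entries.
    for day, count in per_day.items():
        if count < min_per_day:
            issues.append(f"day {day}: only {count} entries (minimum {min_per_day})")
    return issues
```

Raising `min_per_day` to 3 would mirror the later, stricter rule.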

2. Process specific days of the dpm
After the initial checks the actual work on the article could commence. At first, I focused on filling the gaps in the days of death but soon I decided every day should contain at least two entries. Processing a specific date started by clicking the 'Chk'-button in the 'Death per date' worksheet. The following tasks would then be executed:
 * Resolve the list of bio's whose subject had died on a specific date. More info can be found here.
 * Show the list alphabetized and, per bio, display whether it is a stub or has any 'problem flags'.
 * Apply custom filtering to the list. More info on that here
 * Per bio, try to resolve the following parts of an entry by analyzing the bio's wikitext:
 ** The (linked) name of the article
 ** The date of birth and death to determine the subject's age
 ** The nationality and reason for notability (by analyzing the opening sentence of the bio)
 ** The cause of death
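To illustrate how the resolved parts come together, here is a minimal Python sketch that derives the age from the dates of birth and death and assembles a list line. The function names and the exact output layout (`*[[Name]], age, description, cause.`) are assumptions made for this example, not taken from the Excel tool.

```python
from datetime import date
from typing import Optional


def age_at_death(born: date, died: date) -> int:
    """Age in completed years on the day of death."""
    years = died.year - born.year
    # Subtract one year if the birthday had not yet been reached.
    if (died.month, died.day) < (born.month, born.day):
        years -= 1
    return years


def format_entry(name: str, born: date, died: date,
                 description: str, cause: Optional[str] = None) -> str:
    """Compile one wikitext list line from the resolved parts."""
    parts = [f"[[{name}]]", str(age_at_death(born, died)), description]
    if cause:
        parts.append(cause)
    return "*" + ", ".join(parts) + "."
```

For example, `format_entry("Jane Doe", date(1930, 9, 12), date(2001, 8, 25), "American make-up artist")` yields a line without a cause of death; the person and dates are made up.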

Result filtering
From the start it was also clear to me that some inclusion filtering needed to be applied to the found new (and existing) entries. On some days more than 50 persons with a bio died. Stating them all would make the dpm's unwieldy and error prone. And lesser figures (often stubs and virtual orphans) distract from more notable entries.

So I experimented with conditions like not being a stub or having problem flags. Did not work. However, the tool looked for the date of death (DoD) only in the infobox of the person's bio. As a consequence, a biography having an infobox acted as a first filter. I also made the application look at the bio's text size (excluding the text in the infobox and stated categories, the 'net size'). This was the second filter. I settled for 4000 characters as the minimum 'net size' of a biography. This first attempt at grading WP:N worked, but it never sat well with me (and others). It was one of the reasons to initiate round 3.
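The two filters can be sketched in Python as below. This is an illustration under assumptions: the way the infobox and categories are stripped here (brace counting and a simple regular expression) is my own simplification, and `net_size` and `passes_filter` are hypothetical names; only the 4000-character threshold comes from the text.

```python
import re

MIN_NET_SIZE = 4000  # minimum 'net size' in characters, as stated in the text

INFOBOX_START = re.compile(r"\{\{\s*infobox", re.IGNORECASE)
CATEGORY = re.compile(r"\[\[\s*Category:[^\]]*\]\]", re.IGNORECASE)


def remove_infobox(wikitext: str):
    """Return (text without the first infobox, whether an infobox was found)."""
    match = INFOBOX_START.search(wikitext)
    if match is None:
        return wikitext, False
    start, i, depth = match.start(), match.start(), 0
    while i < len(wikitext):
        if wikitext.startswith("{{", i):      # nested templates increase depth
            depth += 1
            i += 2
        elif wikitext.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:                     # matching close of the infobox
                return wikitext[:start] + wikitext[i:], True
        else:
            i += 1
    return wikitext[:start], True              # unbalanced braces: drop the rest


def net_size(wikitext: str):
    """Return (character count excluding infobox and categories, has_infobox)."""
    text, has_infobox = remove_infobox(wikitext)
    text = CATEGORY.sub("", text)
    return len(text.strip()), has_infobox


def passes_filter(wikitext: str) -> bool:
    size, has_infobox = net_size(wikitext)
    return has_infobox and size >= MIN_NET_SIZE
```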

3. Concluding processing a month
Processing a month would be concluded by two manual activities:
 * Search for additional causes of death regarding the entries.
 * Trim any 'reason for notability' descriptions in the entries that are too long.

Chronology of activities
Work started on 1 September 2018 by applying the same format to sections and list entries in all the dpy's. The first couple of months I worked on the 24 existing dpm's of 2004 and 2005, processing each month assisted by the Excel tool as described. At the same time I was also in the process of finishing another project. On 2 February 2019 I standardized the guidelines and day sub sections of all dpm's between 2004 and 2015. Applying those changes finalized the first round of improvements regarding the dpm's of 2004 and 2005.

I could now focus on the Deaths in Year-pages. The following list shows when an entire year was completed, after which it could be split up into 12 dpm's, finalizing their first round of improvements:
 * 2003: 10 February 2019
 * 2002: 17 February 2019
 * 2001: 17 February 2019
 * 2000: 23 February 2019
 * 1999: 19 May 2019
 * 1998: 9 November 2019
 * 1997: 4 October 2020

Regarding 1998 and 1997 (and 1996) a new dpm was created right after a month had been processed. The 12 dpm's were not created simultaneously anymore as is explained here. Processing of the years 1993-1998 was done in this processing page which would be initialized every time after a dpm was completed.

User:Mill 1/References/The New York Times
Round 1 saw one final improvement. From the beginning I had noticed that the dpm's lacked references citing the deceased's date (and cause) of death. I had started adding some citations to entries but it seemed to be a drop in the bucket. That's why I introduced a new, feasible requirement: at least one reference per day sub section. Around June 2020 I first started thinking about automating citations. The archive API of The New York Times especially offered great possibilities. So I wrote some code to experiment with the NYTimes API, retrieving obituary data and creating citations from it. I pasted the output in another processing user page: /References/The New York Times. The results were spectacular. I could now use this list of generated references as a source. So after processing a day I would also manually add citations of matching entries to the day sub section of the dpm. The first month I processed this way was September 1997. I worked my way back to January 1997, improving and bugfixing the code.
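As an illustration of the idea (not the actual code of the tool), the Python sketch below turns one article record from the NYTimes Archive API into a wikitext citation. The field names `type_of_material`, `headline.main`, `web_url` and `pub_date` are how I understand the API's JSON response; the sample record, the function names and the chosen `cite news` layout are assumptions made for this example.

```python
def archive_url(year: int, month: int, api_key: str) -> str:
    """URL of the Archive API endpoint that returns all articles of one month."""
    return f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={api_key}"


def is_obituary(doc: dict) -> bool:
    """Assumption: obituaries are recognisable by their 'type_of_material' value."""
    return "obituary" in (doc.get("type_of_material") or "").lower()


def to_citation(doc: dict) -> str:
    """Build a <ref> with a cite news template from one article record."""
    published = doc["pub_date"][:10]                        # keep YYYY-MM-DD only
    title = doc["headline"]["main"].replace("|", "{{!}}")   # a pipe would break the template
    return (
        "<ref>{{cite news |title=" + title
        + " |url=" + doc["web_url"]
        + " |work=The New York Times |date=" + published + "}}</ref>"
    )


# Made-up record in the assumed response shape.
sample = {
    "type_of_material": "Obituary; Biography",
    "headline": {"main": "Jane Doe, 84, Painter"},
    "web_url": "https://www.nytimes.com/1997/09/06/arts/example.html",
    "pub_date": "1997-09-06T05:00:00+0000",
}
```

Calling `to_citation(sample)` after checking `is_obituary(sample)` gives a reference that can be pasted behind the matching entry.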

Eventually the software evolved into the WikipediaReferences application. You can read more about it here. On November 14, 2020 (as I learned from the GitHub commit) the application was finally able to add NYTimes references to the corresponding entries of an entire dpm automatically. I decided to reprocess all the existing dpm's (1997 – 2005) so that their number of stated references would increase considerably. Work started with January 1997 on the same day, heralding the start of the next round.

Milestones

 * 1 September 2018: the first edit is made
 * 10 February 2019: the first dpm is created
 * 10 February 2019: the first dpy is nuked
 * 2 March 2019: all dates since the start of the millennium to date have been accounted for
 * 11 July 2020: the first day sub section is processed including generated NYTimes references

After 25 months Round 1 was concluded by creating the last dpm of 1997. By this time I must already have decided to extend the 'chaining back' period to January 1990.

Round 2: Adding NYTimes references
Period: November 2020 – October 2021
Articles: Deaths in January 1995 – Deaths in December 2005

As already described at the end of the previous section, the success of the WikipediaReferences application prompted me to re-process all dpm's that existed at that time (November 2020). Automatically adding NYTimes references using the tool would also become the additional third activity when wrapping up a month (see 3. Concluding processing a month in Round 1 for the other two activities).

Processing a month using the WikipediaReferences application
Processing a particular dpm usually consisted of these steps:
 * After the regular processing of a dpm was concluded and the last entries were added/updated I would run the software to evaluate a dpm. See screenshot: I would select 'p', followed by some input to tell the application which month and which Wikipedia source page to process.
 * First the app would perform initial checks like looking for duplicate entries. The process is aborted if any issues are encountered.
 * If the initial issues are resolved, the month in question is evaluated by comparing the NYTimes obit data with the entries in the dpm. After that the app offers to generate the wikitext, including the added/updated references. However, it seldom got that far on the first run. In most cases other actions were required first, after which the evaluation was run again. Two types of actions exist:
 ** If NYTimes obituary data exists for a listed entry, then the resolved death date in the obituary is compared with the date of death in the entry's corresponding bio. Very often discrepancies would exist. One reason is that the death date stated in the bio is wrong. These discrepancies had to be corrected first.
 ** The software would also spot potential entries: regarding the particular month NYTimes obit data would exist for bio's that were not present in the dpm. In fact, so many potential entries were suggested that I applied a notability filter to them. I would add most of the suggested entries manually to the dpm source page.
 * After the correction/additions step I would re-run option 'Print month of death', sometimes several times, until no more issues were encountered by the application.
 * After successful evaluation of the dpm I would instruct the app to generate the wikicode in a text file.
 * Processing a specific dpm is concluded by pasting the contents of the text file in the source page of the dpm and checking the result.
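The evaluation step described above boils down to comparing two sets of data. A minimal Python sketch of that comparison, assuming both sides have already been reduced to a mapping from biography title to date of death (the function name and data shapes are hypothetical, not those of the WikipediaReferences application):

```python
from datetime import date
from typing import Dict, List, Tuple


def evaluate_month(
    dpm_entries: Dict[str, date],   # bio title -> date of death stated in the bio
    obituaries: Dict[str, date],    # bio title -> date of death resolved from the obituary
) -> Tuple[Dict[str, Tuple[date, date]], List[str]]:
    """Return (date discrepancies, potential new entries)."""
    # Listed entries whose bio and obituary disagree on the date of death.
    mismatches = {
        name: (dpm_entries[name], obituaries[name])
        for name in dpm_entries
        if name in obituaries and dpm_entries[name] != obituaries[name]
    }
    # Obituaries for bio's that are not (yet) listed in the dpm.
    candidates = sorted(name for name in obituaries if name not in dpm_entries)
    return mismatches, candidates
```

Only when `mismatches` is empty would it make sense to generate the wikitext; the `candidates` list corresponds to the suggested entries.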

Chronology of activities
Right after I uploaded the last code changes I started using the software on the existing dpm's. I really hit the ground running processing the years 1997 - 2000 within 6 weeks, adding and updating over a thousand citations (as well as adding quite a few entries suggested by the application).

By September 2021 I had processed all existing dpm's, increasing the number of references on a page considerably. I could now resume my efforts in the processing page where I prepared brand new dpm's, starting with Deaths in December 1996. By now the software was firmly embedded in the way of working.

1995
However, work was interrupted by another job. An editor had forked Deaths in 1995 into 12 dpm's without any regard for the different style and format, after which he added many entries. It took me a sh*tload of time bringing the new dpm's up to par. The task involved a lot of corrections by hand as well, adding causes of death and shortening entry descriptions, meanwhile battling this lunatic. When cleaning up 1995 I also identified many non-notable entries, many of whom didn't even have an enwiki bio. And by this time I had already decided to reprocess all the days of existing dpm's, partly to apply the new notability algorithm to entries. This would mean that many 1995 entries would be cleansed from the lists. That's why I decided it would be a huge waste of time applying the WikipediaReferences tool to the 1995 entries; it would take a lot of effort correcting entries that would be removed at a later stage anyway. This is the reason why (although chronologically incorrect) this was actually a Round 1 job.

Milestones

 * 14 November 2020: the first dpm is processed using the fully functional WikipediaReferences application
 * 12 September 2021: processing the first new dpm using the tool

Still using the wiki_client Excel tool, Round 2 came to an end on 31 October 2021 with the creation of Deaths in September 1996.

More details on the progress regarding Round 2 can be found here. In the table, click on the title 'Round 2' to sort on the date when the processing of a dpm was finished.

Round 3: New rules: let's process every day (again)
Period: November 2021 – November 2024
Articles: Deaths in November 1989 – Deaths in December 2005

So by now I've been at it for a couple of years and during that period two issues started bugging me more and more:
 * 1) The notability algorithm is faulty; I'm adding entries whose bio's are semi orphans. At the same time I miss notable entries because their bio's don't have infoboxes.
 * 2) Most entries do not have citations. After completing Round 2 this was improved somewhat but many dpm's now contain references that almost exclusively point to The New York Times as a source.

Wikidata
During my activities I had come across Wikidata when inspecting bio's. At some point I must have noticed that the data stored in a human Wikidata item could serve my purposes, especially these data properties:
 * Item's description (= reason for notability regarding humans)
 * Date of death (DoD) statement
 * Date of birth (DoB) statement (needed to resolve an entry's age)
 * Cause of death statement
 * Number of wiki's in which the human is present

Investigating the Wikidata query capabilities made me realise that using Wikidata as a source offered huge advantages over using an entry's corresponding Wikipedia page. It would help me regarding the two issues, resolve the cause of death automatically and offer an alternative way to generate the description part of an entry. There was also one final perk of using Wikidata as a source: the death date statement of many items contained references supporting the claim. This information could be used to generate references for entries automatically when processing a dpm. These were all great improvements. I realised that I had to re-process every day between 1990 and 2005 AGAIN. But since it was clear that it would hugely increase the quality and reliability of the dpm's I decided in a heartbeat I would do it. I still had to create the software though, which ultimately would become the WikipediaDeathsPages web application.

At the heart of the app would be the query that would fetch the Wikidata data regarding a specific date of death. Unfortunately I am unfamiliar with the SPARQL query language. Luckily, Request a query exists. With the help of volunteers over the course of a [https://www.wikidata.org/wiki/Wikidata:Request_a_query/Archive/2021/10#How_to_get_a_list_of_humans,_including_some_of_their_properties,_who_died_on_a_particular_date? couple of months] I was finally able to define the query. As input it would only require the date of death. The output is shown below as a table. As you can see it contains the basic data (alphabetized by article name!) I needed to generate the entries for a specific day (in this case 25 August 2001):
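The actual query is the one worked out in the linked Request a query discussion. Purely as an illustration of the kind of query involved, the Python sketch below builds a much simpler SPARQL string for a given date. It uses the Wikidata properties P31 (instance of) with Q5 (human), P570 (date of death), P569 (date of birth), P509 (cause of death) and the `wikibase:sitelinks` count; the function name is hypothetical, and the sketch ignores date precision, so dates stored with only year or month precision can produce false matches.

```python
from datetime import date


def deaths_query(day: date) -> str:
    """Build a simplified SPARQL query for humans who died on the given day."""
    stamp = day.strftime("%Y-%m-%dT00:00:00Z")
    return f"""
SELECT ?item ?itemLabel ?itemDescription ?dob ?causeLabel ?sl WHERE {{
  ?item wdt:P31 wd:Q5 ;
        wdt:P570 "{stamp}"^^xsd:dateTime ;
        wikibase:sitelinks ?sl .
  OPTIONAL {{ ?item wdt:P569 ?dob . }}
  OPTIONAL {{ ?item wdt:P509 ?cause . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
ORDER BY ?itemLabel
""".strip()
```

The string returned by `deaths_query(date(2001, 8, 25))` could be pasted into the Wikidata Query Service to inspect the result for that day.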

Rethinking notability
As already explained, the algorithm that decided whether a deceased should be listed was flawed. I had already noticed that more relevant people appear on more wiki's (winner). I also came to believe that more links to a bio suggest greater notability. The Wikidata query returned the number of site links per entry. The Wikipedia link count API could resolve the number of incoming links. At some point I came up with the concept of the "notability score" of a potential entry. This score is expressed as the product of the two aforementioned data points. Take, for instance, John Chambers (make-up artist):

 * Number of site links: 12 (see column 'sl' in above table)
 * Number of pages linking to the bio: 237 (Link Count tool result, API result)

Hence John's notability score would be 12 * 237 = 2,844.

After much experimenting I settled for a minimum score of 48 for an entry to be listed. Although still not perfect it worked way better than the previous algorithm, with this as the end result.
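Expressed in Python, the score and the threshold look like this. The function names are hypothetical, and treating a score of exactly 48 as sufficient is my reading of "minimum score".

```python
MIN_SCORE = 48  # minimum notability score for an entry to be listed


def notability_score(site_links: int, incoming_links: int) -> int:
    """Product of the number of wiki's and the number of pages linking to the bio."""
    return site_links * incoming_links


def is_listed(site_links: int, incoming_links: int, minimum: int = MIN_SCORE) -> bool:
    return notability_score(site_links, incoming_links) >= minimum
```

With the numbers from the example, `notability_score(12, 237)` is 2844, far above the threshold, whereas a bio present on 4 wiki's with 11 incoming links (score 44) would not be listed.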

Wikidata references
When I was building the Wikidata query I had noticed that some online sources were stated quite often as references for death date statements for humans. Because of the structured way this information was stored I could use it to generate citations for my entries. Obviously the online source is checked for existence and its contents searched for the date of death (DoD) before the information is used to create a reference.

The following sources are evaluated, in this specific order:
 * 1) Encyclopædia Britannica
 * 2) The Guardian
 * 3) The Independent
 * 4) Internet Broadway Database
 * 5) DB~e
 * 6) Biografisch Portaal
 * 7) FemBio
 * 8) filmportal.de
 * 9) Fichier des personnes décédées

This is an example of a generated reference based on the Wikidata DoD statement claims of José Craveirinha:

Sports sites references
While implementing this I discovered an alternative way of automatically utilizing online sources. Websites use specific url patterns to identify resources on the host. Some of the websites use name-based patterns. For instance, the site Cycling statistics uses the following url to identify rider Jacques Anquetil:

https://www.procyclingstats.com/rider/jacques-anquetil

Knowing the specific pattern I could 'guess' url's using the label name of an entry. When processing DoD November 2, 2004, for instance, rider Gerrie Knetemann would be one of the deceased returned by the Wikidata query.

The software would send https://www.procyclingstats.com/rider/gerrie-knetemann as a request. If the web page exists, its html is searched for the DoD. If it is found, the web page can act as a citation and the following web reference is generated:

This way of looking for citation sources is used when no Wikidata DoD references were encountered. The mechanism was applied to the following (sports) web sites, in this order:
 * baseball-reference.com
 * pro-football-reference.com
 * basketball-reference.com
 * hockey-reference.com
 * olympedia.org
 * worldfootball.net
 * procyclingstats.com
 * where2golf.com

Note: To decrease the number of http requests per entry I first looked in the entry's bio to determine if the person was known for any of the sports being evaluated. Only then the url would be compiled and called.
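A minimal Python sketch of the url guessing and the DoD search, using the procyclingstats.com pattern shown above. The slug rule (lower case, accents stripped, hyphens between name parts) is inferred from the two example url's, and the date spellings that are searched for are my own assumption; other sites will need other patterns.

```python
import re
import unicodedata
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def slugify(label: str) -> str:
    """'Gerrie Knetemann' -> 'gerrie-knetemann' (accents are dropped)."""
    ascii_label = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_label.lower()).strip("-")


def guess_rider_url(label: str) -> str:
    return "https://www.procyclingstats.com/rider/" + slugify(label)


def page_confirms_death(html: str, died: date) -> bool:
    """True if the page text contains the date of death in one of a few spellings."""
    month = MONTHS[died.month - 1]
    variants = [
        died.isoformat(),                       # 2004-11-02
        f"{died.day} {month} {died.year}",      # 2 November 2004
        f"{month} {died.day}, {died.year}",     # November 2, 2004
    ]
    # The digit guards prevent '2 November 2004' from matching '12 November 2004'.
    return any(
        re.search(r"(?<!\d)" + re.escape(variant) + r"(?!\d)", html)
        for variant in variants
    )
```

Fetching the guessed url is left out on purpose; any HTTP client can supply the `html` argument.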

Second tier Wikidata references
If no sports site reference could be resolved, the following Wikidata reference sources are evaluated (in this order):
 * Library of Congress
 * SNAC
 * Bibliothèque nationale de France

Since these sources are stated very often as Wikidata DoD claims they now appear in abundance as references in the dpm's:

I had finally established an acceptable way of resolving notability and generating citations. Now I only had to cast it into a user-friendly (the user being me) solution.

Wikipedia Deaths Pages
From the start it was clear the solution was to be a web application. Because of the amount of text a console app would not be suitable, and by then I had enough experience using the web application framework Angular that I felt comfortable creating a single-page application to meet my front end needs. I cannot determine when I started developing the web site. Fact is that the new software was first used on 16 November 2021 (see Milestones). A lot of tweaking of the code followed in the weeks after. I remember expanding the citations functionality and bugfixing the Wikidata query.

When the first version was released the site contained all the functionality to process a dpm the way the Excel tool did, but with the implemented improvements.

To achieve this, functionality present in the Excel tool had to be programmed again, for instance:
 * Initial dpm checks
 * Resolving data in the entry's bio, for instance the entry's description
 * Numerous text manipulation functions

More in-depth information on the app can be found here. But how was the web site used when processing a dpm?

Processing a month using the Web application
A dpm article would be updated by the following steps:
 * 1) Perform the initial dpm checks. Consult 1. Dpm checks in Round 1 for specifics. Additional checks were looking for article redirects and named references. See the screenshot for an example of the check results.
 * 2) Any issues found have to be solved first, e.g. moving an entry to the correct day sub section in the dpm, fixing redirects, removing nowiki-entries, adding categories or correcting the DoD in the entry's bio.
 * 3) If all issues are solved, processing the days in the dpm can commence.

Code excerpt
Example of the C# code handling a piece of the challenge to determine the description part of the entry (which denotes the reason for a person being notable).

public string ResolveDescription(string wikiText)
{
    wikiText = RemoveReferences(wikiText);

    string description = GetInitialDescription(wikiText);

    if (description == null)
        return null;

    description = description.Replace("U.S. ", "American ", StringComparison.OrdinalIgnoreCase); // because of the end candidate '.'
    description = description.Replace("United States ", "American ", StringComparison.OrdinalIgnoreCase);

    // Truncate string; [,] [perhaps/probably] [best] known [mostly] for  .. etc.
    string[] endCandidates = new string[] { "Infobox", "infobox", "{|", "{{", " who ", " whose ", " notable ", " noted ", " known ", " better ", " spanning ", " originally ", " widely ", " responsible ", " remembered ", " best ", " most ", " perhaps ", " reputed ", " born ", " considered ", " particularly ", "." };

    int posEnd = GetPositionDescriptionEnd(description, endCandidates);

    // No end candidate found: the opening sentence of the article needs fixing first.
    if (posEnd == InitialPosEnd)
        throw new InvalidWikipediaPageException($"None of the {endCandidates.Length} 'description end' candidates found (including '.') within {InitialMaxLengthDescription} chars from 'description start'. Change the opening sentence of the article. Description: \r\n{description}");

    description = description.Substring(0, posEnd);
    return RemoveWikiLinks(description);
}

private string GetInitialDescription(string wikiText)
{
    string[] descriptionStarts = new string[] { " was a ", " was an ", " was the ", " was one of ", " was " }; // " was " LAST!

    int pos = GetPositioninWikiText(wikiText, descriptionStarts);

    if (pos == -1)
        return null;

    return wikiText.Substring(pos, Math.Min(InitialMaxLengthDescription, wikiText.Length - pos));
}

Temp (under construction)
Additional dependency: Wikidata deviates..

algorithm based on Wikidata
 * New tooling: web application
 * New notability rules
 * Add cause of death automatically

Wikidata editor

Chronology of activities
From Oct 2023:

Number of pages linking to the bio, new method: 46 instead of 237 (result, API result)

Milestones

 * The first day generated using the new software was 1 August 1996 on 16 November 2021

Side effects (under construction)
During the entire process I would find many errors in the analyzed bio's. I must have corrected thousands of bio's during the course of this project. The most common fixes to bio's:
 * Adding the nationality of a person in the opening sentence
 * Correcting the date of death of a person
 * Correcting categories regarding the year of death (and birth)

Also:
 * Wikidata: 2000 edits
 * Seven repositories on GitHub containing Wikipedia-related software
 * Created articles, e.g. Lesley Cunliffe and Kambara Tai
 * Created new dpy's for Deaths in 1980 – Deaths in 1989 because on [date] they were removed from the Year-pages

Statistics (under construction)

 * Number of created dpm's: 170
 * Number of added entries (approx.) TODO
 * Number of edits (approx.) TODO
 * TODO: see WikipediaChecks, tables showing per year (separate article?):
 * number of entries per day
 * number of references per day
 * Create charts showing the progress per round

Epilogue
One question remains: Why? Why would anyone spend that much time on these trivial lists? Sure, I stumbled across a mess when I was looking for a challenge to help me become a better programmer. And in a way I became a slave to the applications I created; the custom and personal software worked so well that I felt the responsibility of seeing it through. No one else would be able to pull this off. Perhaps I just wanted to leave something behind, albeit insignificant. Or maybe, as Tony Stark put it: "Everybody needs a hobby."