User:Graham87/SHA-1

This page describes a method I developed in 2015 to find edits that exist in old Wikipedia databases but were mistakenly deleted.

The problem
Since 2008, I have been trying to find edits that exist in older Wikipedia databases but were mistakenly deleted (see my page history observations subpage). A problem with this endeavour is that most of the fields in the relevant parts of the Wikipedia database, the page and revision tables, are (or have been) liable to change during the course of editing. A page's title changes whenever it is moved. Even the page_id and rev_id fields, which are primary keys, are not guaranteed to be stable; the page ID field was reset whenever a page was deleted until the release of MediaWiki 1.27, and the same was true of the revision ID field until Wikipedia was upgraded to MediaWiki 1.5 in June 2005. Neither of these fields is useful when dealing with edits that were cleared from the Wikipedia database in 2004!

My solution
The revision table has two fields that, in combination, will almost always have unique and reliably constant values: the rev_timestamp and rev_sha1 fields. The rev_timestamp field contains the timestamp of each edit (i.e. the time that it was made, to the nearest second), and the rev_sha1 field contains the SHA-1 hash of the text of each edit. While there have been incidents (especially in 2002) when timestamps have been corrupted, by and large they are consistent across time, and it is extremely unlikely that two edits will have the same SHA-1 value unless they have the same text.

I used MySQL queries on copies of the January and May 2003 database dumps, which I had upgraded to work on MediaWiki 1.25.1 (a version recent enough to include the rev_sha1 field), like this:
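The original query is not preserved here, so the following is a minimal reconstruction based on the surrounding description. The output filename "sha1.txt" is inferred from the mention of "sha2.txt" below, and the cut-off timestamp and the direction of the comparison are placeholders, not the values actually used:

 SELECT rev_timestamp, rev_sha1
 FROM revision
 WHERE rev_timestamp > '20010115000000'  -- placeholder cut-off timestamp
 ORDER BY rev_timestamp
 INTO OUTFILE 'sha1.txt';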

This query was used for the January 2003 database dump. For the May 2003 dump, I used a similar query, but replaced the timestamp with 20030111070503 (the final relevant timestamp in the January 2003 dump) and changed the output filename to "sha2.txt".

The above query does not work on more recent versions of the Wikipedia database, such as the one at Wikimedia Labs, because most of their edits were made after 2003. I therefore used the following query in that case and edited out the surplus edits manually:
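Again, this is a sketch rather than the exact query: on Wikimedia Labs the grid engine writes query results to a file automatically (as noted below), so no INTO OUTFILE clause is needed, and the LIMIT value here is a placeholder chosen to return somewhat more rows than the 2003 data covers:

 SELECT rev_timestamp, rev_sha1
 FROM revision
 ORDER BY rev_timestamp
 LIMIT 1000000;  -- placeholder; surplus rows were removed by hand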

Here is a query similar to one I used on the Wikimedia Labs archive table, with the limit adjusted to roughly correspond with the May 2003 database dump:
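This, too, is a hedged sketch: the archive table stores deleted revisions, whose field names carry the ar_ prefix rather than rev_, and the LIMIT shown is a placeholder standing in for the adjusted value:

 SELECT ar_timestamp, ar_sha1
 FROM archive
 ORDER BY ar_timestamp
 LIMIT 100000;  -- placeholder; the real limit roughly matched the May 2003 dump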

In the case of Wikimedia Labs, the results of a query run from the command line using the grid engine are automatically written to a file.

Here is some Python code that I wrote to process the resulting files.
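The script itself is not reproduced on this page, so the following is a minimal sketch of the processing described, assuming each query produced a file with one tab-separated timestamp/SHA-1 pair per line and that the aim is to find pairs present in an old dump but absent from the current database. The filenames are assumptions:

 # Minimal sketch: compare timestamp/SHA-1 pairs from an old dump against
 # those from the current database, reporting pairs that have vanished.
 # Filenames are assumptions; each file holds one tab-separated pair per line.
 
 def load_pairs(filename):
     """Read (timestamp, sha1) pairs from a tab-separated file into a set."""
     pairs = set()
     with open(filename) as f:
         for line in f:
             fields = line.strip().split('\t')
             if len(fields) == 2:
                 pairs.add(tuple(fields))
     return pairs
 
 old_dump = load_pairs('sha1.txt')  # pairs from the 2003 dump query
 current = load_pairs('labs.txt')   # pairs from the Wikimedia Labs queries
 
 # Pairs in the old dump with no matching timestamp/SHA-1 in the current
 # database are candidates for mistakenly deleted edits.
 for timestamp, sha1 in sorted(old_dump - current):
     print(timestamp, sha1)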