User:Codeofdusk/ee

This user subpage contains a slightly modified version of my extended essay for the IB Diploma Programme. For more about me, see my main user page here.

Inspired by conversations with graham87, I wrote my extended essay on Wikipedia page histories. The essay, "Analysis of Wikipedia talk pages created before their corresponding articles", explains why some Wikipedia articles have their first visible edits to their talk pages occurring before those of the articles themselves. The essay received 21 out of 35 marks, or a B grade on an A (maximum) to E (minimum) scale, from the IB.

You can read the essay below. Supplementary files, including the essay in other formats, are available on GitHub.

Introduction
Wikipedia is a free online encyclopedia that anyone can edit. Founded in 2001 by Larry Sanger and Jimmy (Jimbo) Wales, the site now consists of over forty million articles in more than 250 languages, making it the largest and most popular online general reference work. As of 2015, the site was ranked by Alexa as the 5th most visited website overall.

Wikipedia allows anyone to edit, without requiring user registration. The site permanently stores histories of edits made to its pages. Each page's history consists of a chronological list of changes (each with a timestamp in Coordinated Universal Time [UTC]), differences between revisions, the username or IP address of the user making each edit, and an "edit summary" written by each editor explaining their changes to the page. Anyone can view a page's history on its corresponding history page, by clicking the "history" tab at the top of the page.

Sometimes, Wikipedia page histories are incomplete. Instead of using the move function to rename a page (which transfers history to the new title), inexperienced editors occasionally move the text of the page by cut-and-paste. Additionally, users who are not logged in, or users who do not have the autoconfirmed right (which requires an account that is at least four days old and has made ten edits or more) are unable to use the page move function, and sometimes attempt to move pages by cut-and-paste. When pages are moved in this way, history is split, with some at the old title (before the cut-and-paste) and some at the new title (after the cut-and-paste). To fix this split history, a Wikipedia administrator must merge the histories of the two pages by moving revisions from the old title to the new one.

For legal reasons, text on Wikipedia pages that violates copyright and is not subject to fair use must be deleted. In the past, entire pages with edits violating copyright would be deleted to suppress copyrighted text from the page history. However, deleting the entire page had the consequence of deleting the page's entire history, not just the copyrighted text. In many of these cases, this led to page history fragmentation. To mitigate this, Wikipedia administrators now tend to delete only revisions violating copyright using the revision deletion feature, unless there are no revisions in the page's history that do not violate copyright.

Originally, Wikipedia did not store full page histories. The site used a wiki engine called UseModWiki. UseModWiki has a feature called KeptPages, which periodically deletes old page history to save disk space and "forgive and forget" mistakes made by new or inexperienced users. Due to this feature, some old page history was deleted by the UseModWiki software, so it has been lost.

In February 2002, an incident known on Wikipedia as the "Great Oops" caused the timestamps of many old edits to be reset to 25 February 2002, 15:43 or 15:51 UTC. Wikipedia had recently transitioned to the Phase II software, the precursor to MediaWiki (its current engine) and the replacement for UseModWiki. The Phase II software's new database schema had an extra column not present in the UseModWiki database. This extra column was filled in with a default value, which inadvertently caused the timestamp reset.

Each Wikipedia page also has a corresponding talk page. Talk pages allow Wikipedia editors to discuss page improvements, such as controversial edits, splits of large pages into several smaller pages, merges of related smaller pages into a larger page, page moves (renames), and page deletions. Since talk pages are just Wikipedia pages with a special purpose, they have page history like any other Wikipedia page, and all the aforementioned page history inconsistencies.

An indicator of page history inconsistency is the creation time of a Wikipedia page relative to its talk page. Logically, a Wikipedia page should be created before its talk page, not after; Wikipedians can't discuss pages before their creation! The aim of this extended essay is to find out why some Wikipedia articles have edits to their talk pages appearing before the articles themselves.

Data collection
To determine which articles have edits to their talk pages occurring before the articles themselves, I wrote and ran a database query on Wikimedia Tool Labs, an OpenStack-powered cloud that hosts Wikimedia-related projects and provides access to replica databases: copies of Wikimedia wiki databases, sans personally identifying information, for analytics and research purposes. The Wikipedia database contains a page table, with a page_title column representing the title of the page. Since there are often multiple (related) Wikipedia pages with the same name, Wikipedia uses namespaces to prevent naming conflicts and to separate content intended for readers from content intended for editors. In the page title and URL, namespaces are denoted by a prefix to the page's title; articles have no prefix, and article talk pages have a prefix of talk:. However, in the database, the prefix system is not used; the page_title column contains the page's title without the prefix, and the page_namespace column contains a numerical representation of a page's namespace. Wikipedia articles have a page_namespace of 0, and article talk pages have a page_namespace of 1. The page_id field is a primary key uniquely identifying a Wikipedia page in the database.

The revision table of the Wikipedia database contains a record of all revisions to all pages. The rev_timestamp column contains the timestamp, in SQL timestamp form, of a revision in the database. The rev_page column contains the page_id of a revision. The rev_id column contains a unique identifier for each revision of a page. The rev_parent_id column contains the rev_id of the previous revision, or 0 for new pages.

The database query retrieved a list of all Wikipedia pages in namespace 0 (articles) and namespace 1 (talk pages of articles). For each page, the title, timestamp of the first revision (the first revision to have a rev_parent_id of 0), and namespace were collected. My SQL query is below:

select page_title, rev_timestamp, page_namespace from page, revision where rev_parent_id=0 and rev_page = page_id and (page_namespace=0 or page_namespace=1);

Due to the size of the Wikipedia database, I could not run the entire query at once; the connection to the database server timed out or the server threw a "query execution was interrupted" error. To avoid the error, I segmented the query, partitioning on the page_id field. During the query, I adjusted the size of each collected "chunk" to maximize the number of records collected at once; the sizes ranged from one million to ten million. To partition the query, I added conditions on page_id to the where clause as follows:

select page_title, rev_timestamp, page_namespace from page, revision where page_id>1000000 and page_id<=2000000 and rev_parent_id=0 and rev_page = page_id and (page_namespace=0 or page_namespace=1);

I wrapped each database query in a shell script which I submitted to the Wikimedia Labs Grid, a cluster of servers that perform tasks on Wikimedia projects. Each wrapper script invoked an alias on Wikimedia Labs for accessing the database of the English Wikipedia, passing it the SQL query. The Wikimedia Labs Grid writes standard output and standard error to separate files named after the script. My wrapper scripts were numbered, one script containing each line of the SQL query (see appendix 1 for the wrapper scripts submitted to the Wikimedia Tool Labs). I then concatenated the various "chunks" of output into one file for post-processing.
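The partitioning described above can be sketched in Python. This is only an illustration, not the wrapper scripts from appendix 1: the function name chunked_queries and the fixed chunk size are assumptions made for the example, whereas the chunk sizes in the actual run were adjusted by hand.

```python
# Hypothetical helper: builds one SQL statement per page_id range.
# The query text mirrors the partitioned query shown in the essay.
BASE_QUERY = (
    "select page_title, rev_timestamp, page_namespace from page, revision "
    "where page_id>{low} and page_id<={high} and rev_parent_id=0 "
    "and rev_page = page_id and (page_namespace=0 or page_namespace=1);"
)


def chunked_queries(max_page_id, chunk_size):
    """Yield queries covering the page_id ranges (low, high] up to max_page_id."""
    low = 0
    while low < max_page_id:
        high = min(low + chunk_size, max_page_id)
        yield BASE_QUERY.format(low=low, high=high)
        low = high


if __name__ == "__main__":
    # The maximum page_id here is a made-up value for demonstration.
    for query in chunked_queries(max_page_id=3_000_000, chunk_size=1_000_000):
        print(query)
```

Because each range is half-open, no page_id is collected twice and none is skipped between chunks.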

Post-processing
The database query retrieved a list of all articles and talk pages in the Wikipedia database, along with the timestamps of their first revisions. This list contained tens of millions of items; it was necessary to filter it to generate a list of articles where the talk page appeared to be created before the article. To do this, I wrote a Python program, eeprocess.py (see appendix 2 for source code), that read the list, compared the timestamps of the articles to those of their talk pages, and generated a .csv file of articles whose talk pages have visible edits before those of the articles themselves. The .csv file contained the names of all articles found, along with the timestamps of the first revisions to the article itself and to the article's talk page. After downloading the concatenated output file from Wikimedia Labs, I ran my post-processor against it.

The first run of the post-processor found a list of 49,256 articles where the talk page was created before the article itself. Further investigation showed that many of these articles had talk pages created within seconds of the article, which are not useful for my purposes; they are not indicative of missing history.

In hopes of reducing the list, I added a command-line option to the post-processor that requires an article's talk page to be a specified number of seconds older than the article for inclusion in the list. In other words, an article's talk page must be at least that many seconds older than the article itself to be included. I then ran the post-processor with several values of this option, saved a copy of the output of each run, and counted the number of articles found in each .csv file. To count the number of rows in an output file, I piped the file to the standard input of a line-counting utility. I then subtracted 1 from each result to avoid counting the header row. Table 1 contains the number of articles in the output of the post-processor given various values of this option.
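The row count can also be reproduced in Python. The sketch below is an alternative to the shell pipeline, not what was actually run; the function name count_data_rows and the sample data are made up for illustration.

```python
import csv
import io


def count_data_rows(csv_file):
    """Count the rows in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for _ in reader)


if __name__ == "__main__":
    # Made-up sample standing in for one post-processor output file.
    sample = io.StringIO("title,main,talk\nA,1,2\nB,3,4\n")
    print(count_data_rows(sample))
```

Unlike a plain line count, a CSV reader treats a quoted field containing a newline as part of one row, so the two approaches can differ on such input.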

eeprocess.py reads the SQL query output in a linear fashion. Since the program must read one row at a time from the file, it runs in $\mathcal{O}(n)$ time. In other words, the running time of the program is directly proportional to the number of rows in the input file. While this linear algorithm is slow for large SQL query outputs, it is necessary for accurate results; the program must read each page name and timestamp into a corresponding dictionary for the page's namespace.
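A minimal sketch of this single pass follows. It is not the code from appendix 2: the input format (tab-separated title, timestamp and namespace) and both function names are assumptions made for the example.

```python
def earliest_by_namespace(lines):
    """Read (title, timestamp, namespace) rows in one pass.

    Returns two dictionaries mapping each title to the earliest timestamp
    seen: one for articles (namespace 0), one for talk pages (namespace 1).
    Assumes tab-separated rows; rows with any other namespace value,
    such as a header row, are skipped.
    """
    articles, talk = {}, {}
    for line in lines:
        title, timestamp, namespace = line.rstrip("\n").split("\t")
        if namespace not in ("0", "1"):
            continue
        target = articles if namespace == "0" else talk
        # Timestamps are fixed-width digit strings (YYYYMMDDHHMMSS), so
        # string comparison agrees with chronological order.
        if title not in target or timestamp < target[title]:
            target[title] = timestamp
    return articles, talk


def talk_first(articles, talk):
    """Return the titles whose talk page predates the article."""
    return sorted(t for t in articles if t in talk and talk[t] < articles[t])
```

Each row is touched once and each dictionary lookup takes constant time on average, which is where the linear running time comes from.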

To check if an article's talk page is older than the article itself, eeprocess.py used the datetime module in the Python standard library to convert the SQL timestamp of the first revision of each article and talk page into a Python datetime object, a datatype in the standard library for representing dates and times. These objects are then converted to Unix time. The difference of these timestamps is taken and checked against the specified number of seconds; if the difference is greater than or equal to that number, the article is included in the list. Instead of comparing Unix timestamps, I could have treated the SQL timestamps as integers, taken their difference and checked if it was greater than or equal to a predetermined value; this would have been more efficient, but an accurate option for specifying the difference in seconds would have been near impossible to implement.
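The comparison can be sketched as follows. This is an illustration under assumptions rather than the code from appendix 2: the names to_unix, talk_is_older and window are hypothetical, and the conversion shown (attaching UTC and calling datetime.timestamp()) is one possible way to obtain Unix time.

```python
from datetime import datetime, timezone


def to_unix(mw_timestamp):
    """Convert a 14-digit timestamp (YYYYMMDDHHMMSS, UTC) to Unix time."""
    parsed = datetime.strptime(mw_timestamp, "%Y%m%d%H%M%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def talk_is_older(article_ts, talk_ts, window):
    """True if the talk page predates the article by at least `window` seconds."""
    return to_unix(article_ts) - to_unix(talk_ts) >= window


if __name__ == "__main__":
    article, talk = "20020101000000", "20011231235959"
    # Elapsed time in seconds between the two revisions.
    print(to_unix(article) - to_unix(talk))
    # Subtracting the raw digits is not a duration: the result depends on
    # where month and year boundaries fall.
    print(int(article) - int(talk))
```

The second printed value shows why integer subtraction cannot support a threshold expressed in seconds, even though the two timestamps are only one second apart.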

Automatic analysis
After filtering the list to find articles whose talk pages were created at least one day before the articles themselves (using the eeprocess.py option described above), I wrote another Python program (see appendix 3 for source code) to compare the list against a database dump of the Wikipedia deletion and move logs, taken on 20 April 2017. The program writes an analysis of this comparison to a .csv file.

My program, eeanalyze.py, scanned for two possible reasons why an article's talk page would appear to have edits before the article itself. The first is deletion for copyright violation: if an article was deleted for this reason, the comment field of its deletion log entry will contain "copyright" or "copyvio" (an on-wiki abbreviation of "copyright violation").

Normally, article deletions must be discussed by the community before they take place. However, in some cases, articles may be speedily deleted (deleted without discussion) by a Wikipedia administrator. Criterion G12 (unambiguous copyright infringement) and historical criterion A8 (blatant copyright infringement) apply to copyright violations. If an article is speedily deleted under one of these criteria, a speedy deletion code for copyright violation ("A8" or "G12") will appear in the comment field of the deletion log. If a matching string is found in an article's deletion log comments, eeanalyze.py flags the article as being deleted for copyright violation.
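The string matching described above could look like the following sketch. It assumes a simple case-insensitive substring test; the names COPYRIGHT_MARKERS and mentions_copyright are hypothetical, and the actual matching in eeanalyze.py (appendix 3) may differ.

```python
# Markers named in the essay, lowercased for case-insensitive matching.
COPYRIGHT_MARKERS = ("copyright", "copyvio", "g12", "a8")


def mentions_copyright(comment):
    """Return True if a deletion log comment contains any marker."""
    if not comment:
        return False  # some log entries have no comment at all
    lowered = comment.lower()
    return any(marker in lowered for marker in COPYRIGHT_MARKERS)


if __name__ == "__main__":
    print(mentions_copyright("CSD G12: unambiguous copyright infringement"))
    print(mentions_copyright("Vandalism"))
```

A plain substring test is deliberately permissive: short codes such as "a8" can also occur inside unrelated text, so matches found this way are candidates rather than confirmed copyright deletions.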

Another possible cause is an incorrect article move; in some cases, an article is moved by cut-and-paste, but its talk page is moved correctly. When this happens, the article's history is split, but the talk page's history is complete. To fix this, the article's history needs to be merged by a Wikipedia administrator. eeanalyze.py searches the page move logs for instances where a talk page is moved (the destination of a page move is the current article title), but no move log entry is present for the article itself.
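The move-log check can be sketched with two sets. The representation of a move as a (namespace, destination title) pair and the function name talk_moved_without_article are assumptions for this example; the real program reads these fields from the log dump.

```python
def talk_moved_without_article(move_targets, article_titles):
    """Find articles whose talk page was a move destination but the article was not.

    move_targets: iterable of (namespace, title) move destinations,
                  where namespace 0 is an article and 1 is its talk page.
    article_titles: titles produced by the post-processor.
    """
    moved_articles, moved_talk = set(), set()
    for namespace, title in move_targets:
        if namespace == 0:
            moved_articles.add(title)
        elif namespace == 1:
            moved_talk.add(title)
    return sorted(
        title for title in article_titles
        if title in moved_talk and title not in moved_articles
    )


if __name__ == "__main__":
    # Made-up move log: "Foo" had only its talk page moved.
    moves = [(1, "Foo"), (0, "Bar"), (1, "Bar")]
    print(talk_moved_without_article(moves, ["Foo", "Bar", "Baz"]))
```

Collecting the destinations into sets first means the log is read once and each article is then checked in constant time on average.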

eeanalyze.py also generates eemoves.csv, a file containing a list of "move candidates", page moves where the destination appears in the list of articles generated by eeprocess.py. While I ultimately did not use this list during my analysis, it may yield additional insight into the page history inconsistencies.

eeanalyze.py uses the mwxml Python library to efficiently process XML database dumps from MediaWiki wikis, like Wikipedia. For a MediaWiki XML database dump, the library provides a generator of objects containing log metadata from the dump. Initially, the library only supported dumps containing article revisions, not logs. I contacted the developer requesting the latter functionality. Basic support for log dumps was added in version 0.3.0 of the library; I tested this new support through my program and reported library bugs to the developer.

eeanalyze.py reads the database dump in a linear fashion. Since linear search runs in $\mathcal{O}(n)$ time, its running time is directly proportional to the number of items to be searched. While linear search is slow for a dataset of this size, it is necessary for accurate results; there is no other accurate way to check the destination (not source) of a page move.

In theory, I could have iterated over just the articles found by eeprocess.py, binary searching the dump for each one and checking it against the conditions. While the number of articles to search ($n$) would have been reduced, the streaming XML interface provided by mwxml does not support Python's binary search algorithms. Additionally, if it were possible to implement this change, it would have slowed the algorithm to $\mathcal{O}(n\log{n})$ because I would need to sort the log items by name first.

Classification of results
Once the automatic analysis was generated, I wrote a Python program, eeclassify.py (see appendix 4 for source code). This program compared the output of eeprocess.py and eeanalyze.py and performed final analysis. The program also created a .csv file, eefinal.csv, which contained a list of such articles, the timestamp of their first main and talk edits, the result (if any) of automatic analysis, and (when applicable) log comments.

A bug in an early version of eeprocess.py led to incorrect handling of articles with multiple revisions having a rev_parent_id of 0. The bug caused the timestamps of the first visible edits to some pages to be miscalculated, leading to false positives. The bug also caused the output to incorrectly include pages that had some edits deleted by an administrator using the revision deletion feature. When I discovered the bug, I patched eeprocess.py and reran eeprocess.py and eeanalyze.py to correct the data. While I am fairly confident that eeprocess.py no longer incorrectly flags pages with revision deletions, eeclassify.py attempts to filter out any pages that have been mistakenly included as an additional precaution.

In some cases, Wikipedia articles violating copyright are overwritten with new material as opposed to being simply deleted. In these cases, the revisions violating copyright are deleted from the page history, and a new page is moved over the violating material. eeclassify.py searches for cases in which a page move was detected by eeanalyze.py, but the comment field of the log indicates that the page was a copyright violation ("copyright", "copyvio", "g12", or "a8" appears in the log comments). In these cases, eeclassify.py updates the automatic analysis of the page to show both the page move and the copyright violation.

eeclassify.py found a list of articles whose talk pages appeared to be created before the articles themselves due to the Great Oops and UseMod KeptPages. It did this by checking if the timestamps of the first visible main and talk edits to a page were before 15:52 UTC on 25 February 2002.
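This check reduces to a comparison against a fixed cutoff, sketched below. The names GREAT_OOPS_CUTOFF and predates_great_oops are hypothetical, and the sketch assumes timestamps in the 14-digit YYYYMMDDHHMMSS form, which can be compared as strings.

```python
# 15:52 UTC on 25 February 2002, in YYYYMMDDHHMMSS form.
GREAT_OOPS_CUTOFF = "20020225155200"


def predates_great_oops(article_ts, talk_ts):
    """True if both first visible edits fall before the cutoff."""
    return article_ts < GREAT_OOPS_CUTOFF and talk_ts < GREAT_OOPS_CUTOFF


if __name__ == "__main__":
    # An article stamped 15:51 UTC whose talk page is stamped 15:43 UTC.
    print(predates_great_oops("20020225155100", "20020225154300"))
```

The cutoff is one minute after the later of the two reset timestamps (15:51 UTC), so any edit stamped by the reset falls before it.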

Before the English Wikipedia upgraded to MediaWiki 1.5 in June 2005, all article titles and contents were encoded in ISO 8859-1 (nominally Windows-1252). This meant that many special characters, such as some accented letters, could not be used. After the upgrade, many pages were moved to new titles with the correct diacritics. However, not all pages were correctly moved, leading to history fragmentation in several cases. eeclassify.py scans for this case and flags affected articles.
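To illustrate the encoding limitation (not the detection logic of eeclassify.py, which is in appendix 4), the sketch below tests whether a title could be represented in ISO 8859-1 at all. The function name and sample titles are chosen for the example.

```python
def representable_in_latin1(title):
    """Return True if every character of the title exists in ISO 8859-1."""
    try:
        title.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for title in ("Café", "Łódź"):
        print(title, representable_in_latin1(title))
```

Letters such as "é" are part of ISO 8859-1, while letters such as "Ł" and "ź" are not, which is why some titles could only receive their correct diacritics after the upgrade.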

The program generated statistics showing the reasons why the talk pages of certain articles appear to be created before the articles themselves, which it wrote to standard output. Table 2 shows the statistics generated by eeclassify.py: the number of automatically analyzed articles with their corresponding reasons.

Analysis of results
Out of the 25,941 articles with the first visible edits to their talk pages appearing at least one day before those of the articles themselves, only 1,880 articles could be automatically analyzed. The reason that so few articles could be automatically analyzed is that there is a large number of unusual cases of page history inconsistency.

For example, in the case of "Paul Tseng", the creator of the article began writing it on their user page, a Wikipedia page that each user can create to describe themselves or their Wikipedia-related activities. Users can also create sandboxes in the user namespace, areas where they can experiment or write drafts of their articles. Typically, these sandboxes are subpages of the user page. Users also have talk pages, which can be used for communication between users on the wiki. However, in this case, the creator of the "Paul tseng" article did not create a separate sandbox for the article, instead writing it directly on their main user page. When they completed the article, they moved both their user page, which contained the article text, and their personal talk page to "Paul tseng". The user had received messages from other users on the wiki before this move, so the talk page of "Paul tseng" contained personal messages addressed to the creator of the "Paul tseng" article. Upon discovering this, I reported the situation to a Wikipedia administrator, who split the talk page history, placing the user talk messages back in their appropriate namespace.

On the English Wikipedia, it is good practice to place a signature at the end of messages and comments, by typing four tildes (~~~~). Signatures can contain the username of the commenter, links to their user or talk pages, and the timestamp of the comment in Coordinated Universal Time (UTC). The talk page was created by SineBot, a bot that adds such signatures in case a user fails to do so. If a user fails to sign three messages in a 24-hour period, SineBot leaves a message on their talk page informing them about signatures, creating the user talk page if it does not already exist. To make sure that no other similar cases have occurred, I checked if SineBot has created any other pages in the talk namespace. It has not, so this seems to be a unique occurrence.

Firefox has a built-in Wikipedia search feature. In old versions, entering "wp" (the Wikipedia search keyword) without a search term would redirect users to https://en.wikipedia.org/wiki/%25s. As a temporary workaround, a redirect was created to send these users to the Wikipedia main page. The associated talk page was used to discuss both the redirect and %s as a format string used in various programming languages. The redirect has since been replaced with a disambiguation page, a navigation aid to help users locate pages with similar names. The talk page has been preserved for historical reasons. Clearly, it contains edits older than those to the disambiguation page.

In the case of the "Arithmetic" article, the talk page was intentionally created before the article itself, so it does not indicate missing history. A user moved some discussion about the article from the "Multiplication" talk page to a new page, which would later serve as the talk page for the "Arithmetic" article. While it is definitely an unusual case, it all seems to add up in the end!