User:Teratornis/Mechanical turk

06:31, 15 June 2008 (UTC): this page contains my notes about the possibility of building an efficient human-bot hybrid for identifying and allocating repetitive tasks on Wikipedia that are too difficult for current bot programs to do alone. I invite comments on the talk page.

Inspiration
I happened to read about the Amazon Mechanical Turk, after a user on the Richard Dawkins Forum called my attention to it.

While I was answering this question on the Help desk: Help_desk/Archives/2008 June 14 (permanent link), I got an idea about how to fix the enormous number of bare-URL footnotes on Wikipedia.

One of the featured article criteria is:
 * (c) consistent citations—where required by Criterion 1c, consistently formatted inline citations using either footnotes or Harvard referencing (Smith 2007, p. 1) (see citing sources for suggestions on formatting references; for articles with footnotes or endnotes, the meta:cite format is recommended).

WP:FAIL says the number of featured articles on Wikipedia is increasing far too slowly compared to the total number of articles. This is not surprising, considering the outlandish difficulty of working with footnotes and citation templates on Wikipedia. The fraction of users who understand how to edit citations up to featured article standards is very small, and not growing very fast. The overwhelming majority of Wikipedia's registered users (and a probably comparable number of unregistered users) are unlikely to invest many hours to learn how to edit citations with the current system. The intellectual overhead in the current system is very high, but the actual work that needs doing is usually no worse than tedious once the user finally identifies where to do it.

In other words, fixing citations doesn't require much creativity, but the amount of knowledge an editor needs to be able to do it is all out of proportion to the creativity requirement.

It would be wonderful to write a bot program which could analyze the citations in a Wikipedia article and bring them all up to a consistent standard. However, this would probably require a bot program capable of passing the Turing test. The Amazon Mechanical Turk system, by contrast, passes the Turing test quite easily, because humans do the work.

It may be possible to construct a similar system for use with Wikipedia: a human-bot hybrid, or a hum-bot (pronounced: hyoom-bot). (To-do: figure out how to say that in IPA.)

The bot program component of the hum-bot could identify and collect citations needing repair, and present them to human editors with as much supporting material as the bot can supply (citation template text filled out as far as is automatically possible, and instructions with links to examples explaining to the human exactly what to do).
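
As a rough illustration of the "partially filled out" template text, here is a minimal Python sketch. The function name `prefill_cite_web`, the field order, and the output layout are my own assumptions for illustration; the field names are just the ones mentioned in these notes (title, author, date, publisher) plus the URL.

```python
def prefill_cite_web(url, known_fields=None):
    """Return Cite web template text with known fields filled in.

    Fields the bot could not determine are left empty so that a human
    editor can complete them. Hypothetical helper, not an existing tool.
    """
    # Assumed field order; only these fields are emitted.
    field_order = ["url", "title", "author", "date", "publisher"]
    values = {"url": url}
    values.update(known_fields or {})
    parts = ["{{cite web"]
    for name in field_order:
        # Unknown fields get an empty value after the equals sign.
        parts.append(f" |{name}={values.get(name, '')}")
    parts.append(" }}")
    return "".join(parts)


if __name__ == "__main__":
    print(prefill_cite_web("http://example.org/a", {"title": "Example"}))
```

A real bot would still have to fetch the page behind the URL to guess the title; this sketch only shows the template text the human would start from.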

Candidate articles include recent newsworthy events that have lots of sources, and lots of edits from subject enthusiasts who don't have lots of Wikipedia experience. Examples: Peak oil, Oil price increases since 2003, and Treaty of Lisbon. For the bot to get a foothold, it needs the article to contain <ref> tags. Articles with completely unwiki references will most likely be too hard for a bot to analyze.

Outline of a citation repair hum-bot
A hum-bot that repairs inconsistent citations in an article could work like this:


 * Initial bot pass:
 * Bot program scans articles, picking out the <ref> tags.
 * Bot classifies each ref tag according to what it is, for example:
 * a citation template (the bot can check for duplicates or possible duplicates, too)
 * a bare URL
 * a bare URL followed by some text
 * an external link containing a URL and some link text
 * a Harvard reference (since I don't know anything about Harvard referencing yet, I'll ignore articles containing it for now)
 * Bot program stores the ref tags that don't contain templates.
 * Bot program marks each ref tag in the original article with an HTML comment, so it can insert the human-corrected citations later.
 * Bot program presents ref tags for humans to analyze and repair, on some sort of a Web page. It could be a wiki page, or it could be on a dedicated tool site such as the one the Universal reference formatter runs on. I'm not sure how best to do this, so the following will change:
 * Bot program displays two windows, radio buttons, explanatory text, and links:
 * The original ref tag, with a few sentences of context from the article. This is not editable.
 * An edit window which the bot pre-fills with a Cite web template, similar to what citation tools such as WPCITE and Universal reference formatter provide. The bot fills in the template fields it thinks it can obtain from the page behind the URL, if any, and provides additional field names with empty values for the user to optionally fill in.
 * Radio buttons the user can use to select a different citation template (Cite news, Cite journal, Cite book, etc.). When the user selects a different template, the edit window updates the template name, and migrates as many fields as possible from the previous template to the new template.
 * Explanatory text tells the user how to fill out the template. For example, the user will probably need to get some or all of: the article title, author, date, and publisher.
 * Links to more explanatory text, links to examples showing how the user should edit the citation template, a link to the URL in the original ref tag, and links to search sites which can search for the article behind the URL if the link is broken.
 * Human users enter the system, and select tasks to work on. The task granularity could be one reference, or all the references in one article. One reference per task might be simplest.
 * The user tries to repair the citation, either marking it as repaired, or abandoning it as not yet repaired, for another user to repair.
 * Optionally we may require a second user to check each repaired citation before the system confirms it as repaired.
 * Bot program inserts repaired citations back into the article.
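
The initial bot pass in the outline above could be sketched roughly as follows, in Python. This is only a sketch under stated assumptions: the regular expressions, the category names, and the helper names (`extract_refs`, `classify_ref`, `refs_for_humans`) are all hypothetical, and real wikitext has many cases this ignores (nested tags, references inside templates, Harvard references, which simply fall into "other").

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>, but not
# self-closing <ref name="x" /> tags or the <references /> tag.
REF_RE = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)

# A deliberately simple URL pattern (assumption, not a full URL grammar).
URL = r"https?://[^\s\]\[|<>]+"


def extract_refs(wikitext):
    """Return the text inside each ref tag, in document order."""
    return [m.group(1).strip() for m in REF_RE.finditer(wikitext)]


def classify_ref(body):
    """Classify one ref body into the categories listed in the outline."""
    body = body.strip()
    if "{{" in body:
        return "template"
    if re.fullmatch(r"\[?" + URL + r"\]?", body):
        return "bare_url"
    if re.fullmatch(r"\[" + URL + r"\s+[^\]]+\]", body):
        return "external_link"
    if re.match(r"\[?" + URL + r"\]?\s+\S", body):
        return "bare_url_with_text"
    return "other"


def refs_for_humans(wikitext):
    """Return (body, category) pairs for refs that contain no template."""
    pairs = [(body, classify_ref(body)) for body in extract_refs(wikitext)]
    return [(body, cat) for body, cat in pairs if cat != "template"]


if __name__ == "__main__":
    sample = (
        "A.<ref>http://example.org/a</ref> "
        'B.<ref name="b">[http://example.org/b Example page]</ref> '
        "C.<ref>http://example.org/c An article</ref> "
        "D.<ref>{{cite web |url=http://example.org/d |title=D}}</ref>"
    )
    for body, category in refs_for_humans(sample):
        print(category, "|", body)
```

Treating any ref body that contains "{{" as a citation template is a shortcut; a more careful bot would check the template name against the known citation templates.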

Justification, prior art
Before actually attempting to build such a system, or persuade someone else to try building it, we might first try to measure the scale of the problem. E.g., use a bot to scan a large number of articles to determine how many have inconsistent reference styles. See what others have done or are doing:


 * WikiProject Citation cleanup
 * Category:Citation and verifiability maintenance templates
 * Citation style
 * WP:EIW
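
The measurement step suggested above (scanning articles to see how many have inconsistent reference styles) might look something like this Python sketch. It assumes the article text is already available as a mapping from title to wikitext, and it defines "inconsistent" narrowly as an article that mixes ref tags containing templates with ref tags that contain none. Both the definition and the names `ref_style_counts` and `survey` are assumptions for illustration.

```python
import re

# Matches paired ref tags only; self-closing <ref ... /> tags are skipped.
REF_RE = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def ref_style_counts(wikitext):
    """Return (templated, plain) counts of ref tags in one article."""
    templated = plain = 0
    for match in REF_RE.finditer(wikitext):
        if "{{" in match.group(1):
            templated += 1
        else:
            plain += 1
    return templated, plain


def survey(articles):
    """Summarize reference styles over a dict of title -> wikitext."""
    summary = {"scanned": 0, "mixed": 0, "plain_refs": 0}
    for text in articles.values():
        templated, plain = ref_style_counts(text)
        summary["scanned"] += 1
        summary["plain_refs"] += plain
        # "Mixed" means both styles appear in the same article.
        if templated and plain:
            summary["mixed"] += 1
    return summary


if __name__ == "__main__":
    demo = {
        "A": "x<ref>http://example.org/a</ref> y<ref>{{cite web |url=http://example.org/b}}</ref>",
        "B": "z<ref>{{cite news |title=T}}</ref>",
    }
    print(survey(demo))
```

The "plain_refs" total gives a first estimate of how many tasks the human side of the hum-bot would face.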

Query
22:36, 15 June 2008 (UTC): I requested comments from the members of WikiProject Citation cleanup:
 * Wikipedia talk:WikiProject Citation cleanup

I also asked User:Smith609 (author of the Universal reference formatter which Google scholar cite wraps around) to review this page and comment:
 * User talk:Smith609