User:Wikid77/Autofixing cites

This essay, wp:Autofixing cites, describes methods by which Wikipedia templates or bot programs can auto-correct (or "autofix") many of the citation references in pages.

Bot updates
For years, automated bot programs have been editing pages to add missing titles or to check for dead-link URLs. In particular, User:DASHBot has been checking for offline URLs and editing pages to insert the related "archiveurl=" and "archivedate=" parameters.

Lua wp:CS1 cite templates
Because templates based on Lua script (modules) can run very quickly, they have more potential to perform autofixes than older templates which use only markup. However, some newer, fast Lua utility templates allow even markup-based templates to run fast string substitutions when processing cite parameters.

The Lua-based wp:CS1 cite templates currently handle:
 * A single page number in "pages=" is displayed as "p. n", not "pp. n".
 * Page ranges in "pages=" are autofixed to change a hyphen ('-') to an en dash ('–').
 * Many aliases are supported, so that variant parameter names are treated as if autofixed to the standard spelling.
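
The page-number handling listed above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Lua code of the CS1 module; the function name `format_pages` and the exact regular expression are assumptions made for this example.

```python
import re

EN_DASH = "\u2013"

def format_pages(pages: str) -> str:
    """Return a display string for a "pages=" value (hypothetical helper)."""
    value = pages.strip()
    # Replace one or more hyphens between two digits with a single en dash.
    value = re.sub(r"(?<=\d)\s*-+\s*(?=\d)", EN_DASH, value)
    # A lone page number gets the singular "p." prefix; anything else gets "pp."
    if re.fullmatch(r"\d+", value):
        return f"p. {value}"
    return f"pp. {value}"
```

Under these assumptions, `format_pages("12")` gives the singular form, while `format_pages("8--9")` gives the plural form with an en dash between the numbers.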

In March 2014, new autofixes were developed for future use:
 * Bare URLs ("|http_") will be autofixed as "url=" if that parameter is empty.
 * Many uppercase parameter names will be accepted, such as "Date=" for "date=".
 * Common abbreviations will be autofixed, such as "vol=" as "volume=" or "pg=" as "pages=".
 * Common substitutes will be autofixed, such as "published=" as "publisher=".
 * Misspellings will be autofixed, such as "locaiton=" to "location=" or "dtae=" to "date=".
 * Unknown parameters, which are not obvious misspelled keywords, will be echoed as literal text with a colon (':'), such as "note=xxx" to show "note: xxx".
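
As a rough illustration of the name-based autofixes in this list, the sketch below lowercases a parameter name, looks it up in a small alias table, and echoes unknown names as literal "name: value" text. The tables `KNOWN` and `ALIASES` and both function names are hypothetical; the essay does not give the real module's lists.

```python
from typing import Optional

# Hypothetical tables for illustration only; not the real CS1 lists.
KNOWN = {"author", "date", "location", "pages", "publisher", "title", "url", "volume"}
ALIASES = {
    "vol": "volume",         # common abbreviation
    "pg": "pages",           # common abbreviation
    "published": "publisher",  # common substitute
    "locaiton": "location",  # misspelling
    "dtae": "date",          # misspelling
}

def autofix_name(name: str) -> Optional[str]:
    """Return the standard parameter name, or None if it is unknown."""
    key = name.strip().lower()   # accepts "Date=" for "date="
    if key in KNOWN:
        return key
    return ALIASES.get(key)

def render_param(name: str, value: str) -> str:
    """Show a fixed parameter as name=value, or echo an unknown one as "name: value"."""
    fixed = autofix_name(name)
    if fixed is None:
        return f"{name.strip()}: {value}"
    return f"{fixed}={value}"
```

With these tables, `render_param("vol", "IV")` is treated as "volume=", while `render_param("note", "xxx")` falls through to the literal "note: xxx" display.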

Benefits of autofixing
The use of autofixed cites provides many benefits:
 * New users who are unsure of the exact spelling can get close-enough results.
 * Busy users, who make typos when rushed, will have a safety net to autofix those typos, whether they have time to proofread or not.
 * Previously mangled cites become instantly usable/linkable on first preview, such as many URLs autofixed to link on first view.
 * Older versions of pages, perhaps with deprecated parameters, will display cleanly due to autofixing of old parameters.
 * New parameters, not yet officially supported, can be used: the literal "name: value" display provides useful information even before the release of related software to specially format the new parameters.
 * Cascaded errors are deterred so that serious problems are easier to spot, such as invalid "accessdate=" displayed only when a bare URL is autofixed to show dates.
 * Autofixing provides "error triage" where simple typos can be rapidly "nursed to health" by autofixing, while the other mangled or garbled cites will be in smaller category lists, and extremely botched cites can be diagnosed due to the "process of elimination" of autofixed cases.
 * Average visual quality of pages improves, with typos smoothed away when autofixed, leaving a page free of glaring red-error messages.
 * The backlog of 10,000 pages, with various cite-parameter errors, can be reduced to a few hundred pages which contain the serious garbled cite parameters.
 * The red-error mindset, strongly associated with CS1 cites during most of 2013, will fade as very few cites will show the glaring error messages in the future.

Positive synergy
The overall process of autofixing invalid cites provides a positive synergy: auto-supplying the trivial corrections frees more time for active editors to fix the harder problems, and shrinking each typo error into a tiny superscript note ("[fix cite" or "[fix url" or similar) gives those harder problems more visibility. In fact, in hundreds of example pages, there were no unfixed errors among dozens of cites, which leaves essentially no major concerns about the cite-parameter data and lets the writer focus almost totally on the content of the page and the more serious issues of wp:NPOV balance. Time permitting, editors (or wiki-gnomes) with more free time could later re-edit the tens of thousands of pages to turn the simple autofixed typos into saved, clean text (hopefully while also fixing larger problems on those same pages).

Example of autofixed cites
The following examples show some cases where invalid parameters are autofixed.


 * {| style="border: 1px #aaa solid"


 * autofix:


 * current:

Note that in the autofixed example above, the missing "url=" parameter is set with the text "http:..." from the 5th parameter and linked to the title "News Report", while the double hyphen in pages "8--9" is converted to a single en dash, 8–9. Next, the 'Guardian' is shown, followed by "office: London" as extra text. By comparison, the current cite is awash in a sea of alarming red-error messages, which can overpower the text yet demand attention to the simple details that were quietly autofixed in the first case.
 * }
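
The bare-URL case in the note above can be sketched as follows: if "url=" is empty, the first unnamed parameter that looks like a URL is moved into it. This is a minimal Python sketch; the function name `autofix_bare_url`, the data layout (a dict of named parameters plus a list of unnamed ones) and the "http://" / "https://" test are all assumptions for illustration, not the module's actual logic.

```python
def autofix_bare_url(named: dict, positional: list):
    """Move the first URL-like unnamed value into an empty "url" parameter."""
    fixed = dict(named)          # work on a copy; do not mutate the input
    remaining = []
    for value in positional:
        text = value.strip()
        # Assumption: a bare URL is recognized by its scheme prefix.
        looks_like_url = text.lower().startswith(("http://", "https://"))
        if looks_like_url and not fixed.get("url", "").strip():
            fixed["url"] = text
        else:
            remaining.append(value)
    return fixed, remaining
```

An existing non-empty "url=" is never overwritten in this sketch, which matches the "if empty" condition in the list of new autofixes.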

Many respelled keywords can be detected by checking the prefix/suffix letters of each parameter name, where "author=" can be detected by checking prefix/suffix combinations of "au__or" or "a__hor" to match invalid names: "autor=" or "arthor=" or "auhtor=" etc. For some parameters, there is a common respelled form, such as "published=" often used for "publisher=", along with rare misspellings like "publlisher=" or "pulbisher=" or "pubsher=" etc. See the example below:
 * {| style="border: 1px #aaa solid"


 * Example: {{cite web/auto |last=Doe|titolo=Title|Url=//x.com|dtae=May 2011 |pubsher=BBC |vol=IV|pg=9 |otters=Fred Smith|translator=Mary Dohh |locaiton=London |First=Tom |Editors=Smith, Dee|BBC News}}


 * autofixed:

Although few pages have contained so many invalid parameters (some have), the above example shows many of the common typos, such as "vol=" for "volume=" (and even the rare "dtae=" for "date="), which have occurred in more than 1,000 pages. The autofix algorithms will correct hundreds of potential problems in over 10,000 live pages, including hundreds of draft pages in user-space. Here, the autofix detects the invalid "pubsher=" as the "publisher=" parameter, while autofixing more than 13 red-error messages, to allow live typesetting of the page as if nothing much were a problem for readers.
 * currently:
 * }
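
The prefix/suffix check described before this example can be sketched as below. The pairs for "author" come from the "au__or" and "a__hor" patterns named in the text; the pairs for "publisher", the length limit `max_len_diff`, and the function name `guess_param` are assumptions added for illustration.

```python
# Hypothetical pattern table: (prefix, suffix) pairs for each standard name.
PATTERNS = {
    "author": [("au", "or"), ("a", "hor")],          # from "au__or" and "a__hor"
    "publisher": [("pu", "sher"), ("pub", "er")],    # assumed for this sketch
}

def guess_param(name: str, max_len_diff: int = 2):
    """Guess the standard parameter name from prefix/suffix letters, or return None."""
    key = name.strip().lower()
    for canonical, pairs in PATTERNS.items():
        # Skip names whose length is too far from the standard spelling.
        if abs(len(key) - len(canonical)) > max_len_diff:
            continue
        for prefix, suffix in pairs:
            long_enough = len(key) >= len(prefix) + len(suffix)
            if long_enough and key.startswith(prefix) and key.endswith(suffix):
                return canonical
    return None
```

A check this loose can produce false matches (any short name beginning with "a" and ending in "hor" would be read as "author"), so a real implementation would presumably test valid names first and apply such patterns only to names that are otherwise unknown.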

For a split URL, with pipes (bars) between segments, the URL is salvaged to retrieve as much data as possible, and the quiet warning "[join url]" is shown, which links to a help page explaining the split URL text. See the example below:


 * {| style="border: 1px #aaa solid"


 * Example:


 * autofixed:

Among the types of autofixes, a split URL can be the most difficult to autofix, because the parts of the URL might be rejoined in a new order, with different results than the original URL.
 * currently:
 * }
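
One way to picture the salvage of a split URL is sketched below. It assumes the template parser has already cut the URL at each "|" and passed the later pieces as unnamed parameters in their original order; the function name `rejoin_split_url`, the rule for what counts as a continuation segment, and the use of "%7C" (the percent-encoding of "|") as the joiner are all assumptions, not the documented behaviour.

```python
def rejoin_split_url(url: str, positional: list):
    """Rejoin URL pieces that were split at "|" (hypothetical helper)."""
    parts = [url.strip()]
    leftovers = []
    for segment in positional:
        text = segment.strip()
        # Assumption: a continuation has no spaces, is not a new URL,
        # and directly follows the URL (no other value in between).
        is_continuation = (
            bool(text)
            and " " not in text
            and not text.lower().startswith("http")
            and not leftovers
        )
        if is_continuation:
            parts.append(text)
        else:
            leftovers.append(segment)
    was_split = len(parts) > 1
    joined = "%7C".join(parts)
    warning = "[join url]" if was_split else ""
    return joined, warning, leftovers
```

The sketch also shows why this autofix is fragile: a one-word unnamed value right after the URL would be absorbed into it, and any piece containing "=" would have been parsed as a named parameter and lost its position, so the rejoined URL can differ from the original.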