User talk:Wikid77/Autofixing cites

Created
The essay "wp:Autofixing cites" was created by long-term user Wikid77 (me), on 12 March 2014, based on issues raised when helping to develop the Lua-based wp:CS1 cites templates with Lua script modules during October 2012 to April 2013. The purpose of the essay is to describe how autofixing works, and discuss, on this page, the various tactics to improve the results of autofixing cites. -Wikid77 (talk) 21:09, 12 March 2014 (UTC)

Autofixing URLs
The autofixing of web address links (Uniform Resource Locators, URLs) has been a crucial problem because the title becomes separated from the external weblink when a "url=" parameter is invalid. The current tactic is to autofix (auto-link) a URL from the numbered parameters {1}, {2} or {3} if the "url=" parameter is missing, or to rejoin a URL which contained an equals sign ("=") that caused the "http" portion (or the "//" protocol-relative prefix) to appear as a parameter name, with the remainder of the URL as the parameter value. Hence, the tactics to autofix a URL include:
 * If "Url=" is specified, but "url=" is blank, then "url" is autofixed.
 * If {1} has "http" or "//" and "url=" is still blank, it is autofixed.
 * If {2} has "http" or "//" and "url=" is still blank, it is autofixed.
 * If {3} has "http" or "//" and "url=" is still blank, it is autofixed.
 * If a parameter begins "http:" and "url=" is still blank, then autofix.
 * If a parameter begins "https:" and "url=" is still blank, then autofix.
 * If a parameter begins "//" and "url=" is still blank, then autofix.
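
The checks listed above could be sketched roughly as follows. This is only an illustration in Python, not the Lua code of the CS1 module: the function names, the dict layout for parameters (numbered parameters under the keys "1", "2", "3"), and the reading of "has" as "begins with" are all assumptions made for the example.

```python
def looks_like_url(text):
    """True if the text begins like a web address ("http:", "https:" or "//")."""
    return text.strip().startswith(("http:", "https:", "//"))


def autofix_url(params):
    """Return (url, was_autofixed) for a dict of cite parameters.

    Hypothetical helper: `params` maps parameter names to values, with
    numbered parameters stored under the keys "1", "2" and "3".
    """
    url = params.get("url", "").strip()
    if url:
        return url, False  # "url=" is already set, nothing to fix

    # Capitalized "Url=" was given while "url=" is blank.
    capitalized = params.get("Url", "").strip()
    if capitalized:
        return capitalized, True

    # A bare URL was passed as numbered parameter {1}, {2} or {3}.
    for key in ("1", "2", "3"):
        value = params.get(key, "").strip()
        if looks_like_url(value):
            return value, True

    # A URL containing "=" was split: the part before the first "="
    # became the parameter name and the rest became the value.
    for name, value in params.items():
        if looks_like_url(name):
            return name.strip() + "=" + value, True

    return "", False
```

For example, a cite whose URL was split at its equals sign, stored as `{"title": "Example", "http://example.com/page?id": "5"}`, would yield `http://example.com/page?id=5` with the autofixed flag set, so the "[fix cite]" note and tracking category could still be applied.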

Similar logic could be used to autofix "chapterurl=" or "archiveurl=" but those would depend on checking the related usage of "chapter=" or "archivedate=" (etc.) to avoid connecting a URL to the wrong title parameter. -Wikid77 21:09, 12 March 2014 (UTC)
 * The logic above looks reasonable. I think that this task would best be performed by a bot that continuously scanned the relevant CS1 error categories, which are the only places where I have seen this error occur. (Articles with this error sometimes appear in other CS1 error categories as well, but fixing the URL often removes them from those categories too, with a single edit.)


 * Using a bot instead of auto-fixing the URL in place would alert editors to their error (if they watch the page's history) in a non-invasive way. We'll need to fix the error eventually, and a bot is the best way to do that, since these errors crop up at a rate of 10 or 20 per day. – Jonesey95 (talk) 23:50, 12 March 2014 (UTC)


 * I think autofixing these within a template is a bad idea. Primarily as the errors aren't then fixed: they're still in the markup and will still need fixing at some point, by e.g. a bot or user. This gets much harder if a template is hiding them in an attempt to fix them, as presumably in doing so it makes erroneous markup look like valid markup. Many if not most errors are fixed by users, especially since the template improvements that flag errors visibly.
 * The second reason, though it's potentially an even worse long-term problem, is that fixing errors in a template will allow editors to get away with errors. It won't matter how this is documented: editors will do whatever works, copying their own and other editors' examples rather than referring to the documentation, which is difficult to read while editing a page. Over time the quality of markup will degrade significantly. In the long term, as editors find more ways to introduce errors, the temptation will be to 'fix' them by improving the autofix template, leading to ever worse markup being accepted and so becoming the norm.
 * So it should be done with a bot. This means errors are fixed, and a record is made of the fix. Any logic possible in a template should be not only possible but easier to achieve with a bot, as code can be as complex as needed when runtime/page rendering performance is not a concern. Problems such as the error not being an error but a deliberate attempt to do something can be spotted and caught usually by a page watcher, which is much harder if a template is silently fixing problems.-- JohnBlackburne wordsdeeds 11:11, 13 March 2014 (UTC)

Autofixing shows "[fix cite]" warning: Although the red-error messages will be bypassed in many cases, each cite will still show a superscript note as "[fix cite]" or similar, to alert users, plus a link to a tracking category. Also, not all red-error messages will be suppressed, and invalid-date messages will still be viewable for users who keep the CSS message class visible. Because bots detect invalid parameters from the COinS metadata, errors could be left there to signal a bot which updates the page. -Wikid77 12:00, 17 March 2014 (UTC)

Count of unsupported parameters
I would be very interested in seeing a count of the actual unsupported parameters that exist in the category. For example, how many times is fist (a typo for first) used, and how many times is translator used?

Seeing a list showing how many times each unsupported parameter is used would help us understand how useful this proposed auto-fixing would be, whether the auto-fixing is done by a bot or by the Lua module. It would also verify or invalidate the assertion in the essay that "The backlog of 10,000 pages, with various cite-parameter errors, can be reduced to a few hundred pages which contain the serious garbled cite parameters."

Does someone here have the ability to generate such a list from the 8,000 pages in the category? – Jonesey95 (talk) 23:44, 12 March 2014 (UTC)


 * Search counts of unknown-parameter names: I have been using Google Search with the "site:" option to count pages with the phrase "unknown parameter", but WP's new CirrusSearch option could also be used by appending "&srbackend=CirrusSearch" to a wp:wikisearch URL line. To use Google for counting "xxx", I search with: "unknown parameter" "xxx ignored" site:en.wikipedia.org For example:
 * Count unknown "fist": [//www.google.com/search?q=%22unknown%20parameter%22%20%22fist%20ignored%22%20site:en.wikipedia.org ...search?q="unknown parameter" "fist ignored"...]
 * Count unknown "translator": [//www.google.com/search?q=%22unknown%20parameter%22%20%22translator%20ignored%22%20site:en.wikipedia.org ...search?q="unknown parameter" "translator ignored"...]
 * Count unknown "published": [//www.google.com/search?q=%22unknown%20parameter%22%20%22published%20ignored%22%20site:en.wikipedia.org ...search?q="unknown parameter" "published ignored"...]
 * Recent counts have been: fist=8 (frist=9), translator=156, published=221. Using those counts, I noticed that over 410 pages were logged for either unknown "published=" or "Publisher=", which should be lowercase "publisher=". Google also counts the user-space drafts, so those could be skipped with the Google option "-User" to exclude pages containing the word "user". After running multiple searches, I got a CAPTCHA warning from Google, which suspected I was an automated search program. Using WP's CirrusSearch feature:
 * Wikisearch unknown "translator": ..."unknown parameter" translator ignored&title=Special:Search&srbackend=CirrusSearch
 * Because CirrusSearch has trouble matching text with "=", it cannot match the phrase "xxx= ignored" as joined, so "ignored" must be a separate word in the wikisearch (which can match anywhere on a page, and so matches more pages). Also, because CirrusSearch inspects the live pages, the count is live, and when all are fixed, no pages will match the wikisearch. More later. -Wikid77 (talk) 10:55, 13 March 2014 (UTC)
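
For anyone repeating these counts, the search links above could be generated with a small script. The sketch below is in Python and is only an illustration: the function names are made up for the example, and the "search=" query parameter in the CirrusSearch helper is an assumption, since only the tail of that wikisearch URL is shown above. The CirrusSearch helper returns just the query-string portion, not a full URL.

```python
from urllib.parse import quote


def google_count_url(param_name):
    """Build the protocol-relative Google link used to count one unknown parameter."""
    query = f'"unknown parameter" "{param_name} ignored" site:en.wikipedia.org'
    # Keep ":" unescaped so "site:en.wikipedia.org" stays readable.
    return "//www.google.com/search?q=" + quote(query, safe=":")


def cirrus_query_string(param_name):
    """Build only the query-string part of a CirrusSearch wikisearch.

    "ignored" is kept as a separate word because the phrase "xxx= ignored"
    cannot be matched as joined text. The "search=" name is an assumption.
    """
    phrase = f'"unknown parameter" {param_name} ignored'
    return ("search=" + quote(phrase, safe="")
            + "&title=Special:Search&srbackend=CirrusSearch")


if __name__ == "__main__":
    for name in ("fist", "translator", "published"):
        print(google_count_url(name))
    print(cirrus_query_string("translator"))
```

Running many Google queries from a script would likely trigger the same CAPTCHA warning mentioned above, so this is better suited to producing a handful of links to click by hand.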


 * That's a good start. I was hoping for some database dump cleverness that showed a long list of all of the unsupported parameters and how many times they are used; I don't know how to create and analyze a database dump yet.


 * Someone can use this list of usages of translator to go through and change them to others pretty easily.


 * One note about published: some editors have used it not as a substitute for publisher, but for date, as in  from Leonard F. Jarrett. An automatic change of published to publisher is not valid, but since there are only about 221 pages, they should be easy to go through and fix manually. – Jonesey95 (talk) 16:22, 13 March 2014 (UTC)


 * Well, while our back was turned, a single editor using AWB has fixed over 2,000 of the 8,000 articles in the category in about four days. There are 5,852 articles in the category at this writing.


 * A few more helpful, well-armed gnomes like that, along with the bots we have (BattyBot 25, Monkbot 1, ReferenceBot) and a couple of bots in development (Monkbot 2, a newly-rewritten Citation Bot), could put a nice dent in the 300,000-article backlog. – Jonesey95 (talk) 03:57, 17 March 2014 (UTC)
 * Also focus on fixing major articles first: While it is helpful to have 2,000 more articles fixed for red-error messages, we should keep checking random groups of tagged pages and try to hand-fix the major, high-view pages in each group. I have also noticed how some AWB edits have left other cite problems in pages because not all cite messages are fixed in an AWB edit focused on one kind of cite error. -Wikid77 (talk) 14:04, 17 March 2014 (UTC)
 * If you supply such an article list, I will work on fixing those articles. I do not have a way to generate such a list.


 * As for fixes leaving behind problems that were already there, that happens, as with all edits. Fixing errors so that citations display correctly is the goal. I have found that in pursuit of that goal, it is easiest and fastest to focus on one kind of error at a time in multiple articles, then to move on to another kind of error. Other people may have different methods. – Jonesey95 (talk) 16:42, 17 March 2014 (UTC)


 * There is a suggestion to have some kind of translator parameter at Template talk:Citation. —PC-XT+ 05:17, 25 March 2014 (UTC)

Autofixing URLs with bar/pipe
To fix URLs which contain a bar/pipe character "|", the split text is re-appended with "|" + keyword + "=" and the remainder of the URL. If a URL contains 2 bar/pipes, then Lua tends to keep them in sequential order. However, by focusing on known cases of bar/pipes in URLs, the autofixing can target those cases to double-check for correct rejoining of URL lines. -Wikid77 13:13, 17 March 2014 (UTC)
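
The rejoining step might look something like the sketch below. It is written in Python for illustration and is not the module's Lua code: the function name, the `known_fragments` list of keywords already known to come from piped URLs, and the choice to take the rejoin order from that list are all assumptions for the example.

```python
def rejoin_piped_url(url, params, known_fragments):
    """Re-append URL pieces that were split off at "|" characters.

    Hypothetical helper: `params` maps parameter names to values, and
    `known_fragments` lists, in order, the keywords known to be parts of
    piped URLs rather than real cite parameters.
    """
    for keyword in known_fragments:
        if keyword in params:
            # Restore the lost "|" and "=" around the split-off piece.
            url += "|" + keyword + "=" + params[keyword]
    return url


# A URL "http://example.com/q?x=1|y=2|z=3" that arrived split into three parts:
split_params = {"title": "Example", "y": "2", "z": "3"}
print(rejoin_piped_url("http://example.com/q?x=1", split_params, ["y", "z"]))
```

Limiting the rejoin to a list of known keywords keeps a genuine cite parameter such as "title=" from being glued onto the URL by mistake.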