Wikipedia:WikiProject Citation cleanup/Repairing algorithmically generated citations

A number of tools can be used to generate citation templates algorithmically. The functionality is built into the VisualEditor and RefToolbar. Citation bot can be run on a page, perhaps summoned by the Citation expander gadget. The user scripts reFill (formerly reflinks), ProveIt, and ReferenceExpander are able to generate citations from a URL.

This can be a great time saver, but the citations are not always correct or complete. Cautious editors will always double check the output of these tools to verify the correctness and completeness of the generated citations, but even very experienced editors may place too much trust in the tools, and create citations that must be repaired or cleaned up afterwards.

How citations are generated
Most of the tools listed above rely on the MediaWiki library Citoid. The remainder, and Citoid itself, rely upon code snippets from the Zotero community called "translators", which scrape metadata from individual domains; Citoid then converts that metadata into cite templates. When the Zotero translators work accurately – as they do for large academic publishers or the New York Times – we get great citations. When an appropriate translator cannot be found (either because it doesn't exist or because it isn't in the suite of translators Citoid uses), Citoid will use Zotero's generic translator or, rarely, its own fallback functionality.

These general-purpose translators will sometimes introduce errors, omit parameters, or process dead links, creating citations that require repair or cleanup. Even the domain-specific translators can fail if the linked page is populated on the client side with JavaScript instead of on the server side.

The tools relying on Citoid then insert the reference somewhere in the article being edited, without examining Citoid's output. Certain values in certain parameters (like numeric data in ) will cause an article to be placed in one or more applicable maintenance categories, which is done at the time of edit by Module:Citation/CS1.

Failing to locate applicable parameters
Authorship is commonly not determined by the functions that generate citations. Publication dates are the second most frequently missed parameter, typically when the date is not in a meta tag in the page header but appears somewhere within the HTML body element. Sometimes titles are missed, which will generate an error from Cite web at the time of edit, and a visit from User:Qwerfjkl (bot) to the editor's user talk page.

For book citations: volume, edition, and page number are never generated, even if the URL processed by the citation generator links directly to a single page in a book. If a citation is generated to a book chapter that has been digitised or transcribed into HTML anywhere other than Google Books or Internet Archive, the citation will be created using Cite web instead of Cite book. Usually the title of the book, the title of the chapter, and the name of the website will be mixed up or partially missing. Chapter or entry authorship will always be lost if the book is a compiled volume with many contributors.

Incorrect parameter values
Sometimes, a citation generator will find what it thinks to be one piece of information, which is actually something different. Common failure modes include putting the website or publisher into an author field, not removing the website or author from the end of a title, putting contact information into an author field, and replacing the title with the name of a website. Almost any kind of confusion is possible.
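One of the failure modes above, a website name left on the end of a title, can often be repaired mechanically once the site name is known. The following is a minimal sketch, not part of any existing tool; the function name `strip_site_suffix` and the list of separator characters are assumptions made for illustration.

```python
import re


def strip_site_suffix(title: str, site_name: str) -> str:
    """Remove a trailing ' | Site', ' - Site', ': Site' or similar from a title.

    The separator set (pipe, hyphen, en dash, em dash, colon) is an assumed
    list of common conventions, not an exhaustive one.
    """
    pattern = r"\s*[|\-\u2013\u2014:]\s*" + re.escape(site_name) + r"\s*$"
    return re.sub(pattern, "", title, flags=re.IGNORECASE)


print(strip_site_suffix("Storm hits coast | Example News", "Example News"))
```

A title that does not end with the site name is returned unchanged, so the result still needs a human check against the source page.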

There are known special cases for book citations, depending on the domain. For Google Books, if a volume has editors, they will be misattributed as authors. Chapter or entry contributors are not usually possible to determine algorithmically, so these will always have to be added manually (if the information is even available), and the parameter names of the editors manually changed from last1=, first1=, etc., to editor1-last=, editor1-first= and so on. The misattribution of editorial contribution is a Zotero translator error.
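The parameter renaming described above is mechanical and can be sketched with a regular expression. This is an illustrative helper only (the name `authors_to_editors` is hypothetical), and it assumes every last/first pair in the template really belongs to an editor, which a human must confirm before using it.

```python
import re

# Matches |last1=, |first2=, | last = and similar, keeping surrounding whitespace.
_AUTHOR_PARAM = re.compile(r"\|(\s*)(last|first)(\d*)(\s*)=")


def authors_to_editors(template: str) -> str:
    """Rename lastN=/firstN= to editorN-last=/editorN-first= in a template string."""

    def rename(match: re.Match) -> str:
        lead, kind, num, trail = match.groups()
        # An unnumbered last=/first= is treated as editor 1 (an assumption).
        return f"|{lead}editor{num or '1'}-{kind}{trail}="

    return _AUTHOR_PARAM.sub(rename, template)


before = "{{cite book |last1=Smith |first1=Jane |last2=Doe |first2=John |title=Essays}}"
print(authors_to_editors(before))
```

Chapter or entry contributors still have to be added by hand afterwards, since the sketch only relabels names that are already present.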

For books hosted at Internet Archive, the version information may not match the actual digitised work, place of publication will be forced into the publisher= parameter, and the others= field will be filled with the name of a library, or "Internet Archive", or "unknown library". The title of the book will be changed from title case to sentence case. Direct links to pages will be converted into links to the general volume information. Periodicals hosted at Internet Archive will be misidentified as books, and their authorship will be attributed to their publisher. Sometimes a direct page link will require "borrowing" the book from Internet Archive like a library. If restoring a direct page link to a work that requires "borrowing", the parameter should be added to the citation template. Most of the Internet Archive related issues are on their end, as a result of how their information is stored and presented.

Processing dead links
Sometimes, a link will no longer point to its original content, but will still serve a page. The site may have restructured, the content may be paywalled, or the domain may have been usurped. The citation generating tools will treat any webpage as a working link, even when obviously and explicitly labeled as a 404 error (page not found). In these cases the automatic reference should be removed entirely, and the bare URL dead link retained and tagged with Template:Dead link, or an archived version located, per the guidance at Link rot.
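A page that announces itself as an error can sometimes be caught by looking at its title before a citation is generated. The sketch below is a rough heuristic under stated assumptions: the hint phrases are an invented list, the function name `looks_like_error_page` is hypothetical, and it will miss error pages with ordinary titles and may flag legitimate pages whose titles mention "404".

```python
import re

# Assumed phrases that suggest an error page; not drawn from any existing tool.
_ERROR_HINTS = ("404", "page not found", "not found", "page unavailable")


def looks_like_error_page(html: str) -> bool:
    """Return True if the page's <title> contains a common error phrase."""
    match = re.search(
        r"<title[^>]*>(.*?)</title>", html, flags=re.IGNORECASE | re.DOTALL
    )
    if not match:
        return False
    title = match.group(1).strip().lower()
    return any(hint in title for hint in _ERROR_HINTS)


page = "<html><head><title>404 - Page Not Found</title></head><body></body></html>"
print(looks_like_error_page(page))
```

Restructured sites, paywalls, and usurped domains usually serve pages with plausible titles, so a negative result from a check like this says nothing about whether the link is still good.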

Incorrect template calls
In almost all circumstances, citation generating tools will format their citations using Cite journal, Cite book, or Cite web. Citations to journals have very high overall accuracy. Rarely, Template:Cite magazine, Template:Cite patent, or Template:Cite news will be used, almost invariably correctly. Cite book is very frequently correct, except for periodicals hosted at Internet Archive, journals compiled into volumes, conference proceedings, and occasionally others.

Template:Cite web is frequently problematic. It should often be one of the following:
 * Template:Cite book
 * Template:Cite journal
 * Template:Cite AV media
 * Template:Cite news
 * Template:Cite conference
 * Template:Cite press release
 * Template:Cite encyclopedia
 * Template:Cite magazine
 * Template:Cite court

The source will always have to be inspected and the appropriate template substituted for Cite web, which may require renaming parameters.

Suboptimal website values
Wikipedia citation templates are used to create COinS metadata, and the website= parameter is supposed to hold the human readable name of the website, except in cases where it is known primarily by its domain name. Citation generation tools will fill this parameter with the domain name very frequently. This information is not wrong, but it's also not helpful, since it can be trivially determined by inspecting the URL, and it goes against the intended function of the parameter for metadata reuse.
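Citations where website= merely repeats the domain can be found by comparing the parameter with the host of the URL. This is a minimal sketch using Python's standard `urllib.parse`; the function name `website_is_just_domain` is hypothetical, and the decision to ignore a leading "www." is an assumption.

```python
from urllib.parse import urlparse


def _strip_www(host: str) -> str:
    """Drop a leading 'www.' so 'www.example.com' and 'example.com' compare equal."""
    return host[4:] if host.startswith("www.") else host


def website_is_just_domain(url: str, website: str) -> bool:
    """Return True if the website= value is only the domain name of the URL."""
    host = (urlparse(url).hostname or "").lower()
    value = website.strip().lower()
    return bool(host) and _strip_www(host) == _strip_www(value)


print(website_is_just_domain("https://www.example.com/news/1", "example.com"))
print(website_is_just_domain("https://www.example.com/news/1", "Example News"))
```

A match is only a prompt for review: sites that are known primarily by their domain name are the exception noted above, and the human-readable name still has to be looked up on the site itself.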

Removal of internal anchors from URLs
An internal anchor is the part of a URL that can appear after the # character at the end. For very long webpages, anchors can be essential for easy navigation, and act similarly to chapters in a book. They are removed from URLs in the generated citations, so they need to be re-added manually afterwards and the title= parameter updated appropriately. Cite web does not support the chapter= parameter, and the work= parameter (the name of the larger work in which the title appears) is aliased to website=, so the two cannot coexist. In some cases it would be more appropriate to convert the template to Cite book.
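When the original bare URL is still available, for example in the page history, the anchor can be copied back onto the generated URL. The following sketch uses the standard `urllib.parse` module; the function name `restore_fragment` is hypothetical, and it assumes both URLs point at the same page.

```python
from urllib.parse import urlsplit, urlunsplit


def restore_fragment(original_url: str, generated_url: str) -> str:
    """Copy the #fragment from the original URL onto the generated one.

    The generated URL is returned unchanged if the original had no fragment
    or the generated URL already carries one.
    """
    orig = urlsplit(original_url)
    gen = urlsplit(generated_url)
    if orig.fragment and not gen.fragment:
        return urlunsplit(gen._replace(fragment=orig.fragment))
    return generated_url


print(restore_fragment(
    "https://example.org/long-page#Section_4",
    "https://example.org/long-page",
))
```

Updating title= to name the section the anchor points to remains a manual step.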

Current repair projects
The ReferenceExpander script originally (and currently, ) processed the first URL it located between a pair of ref tags, and replaced the entire content of whatever was between the ref tags with the citation generated by processing the URL. In addition to the issues listed above, this sometimes had the effect of deleting a second reference bundled with the first, deleting quotes or explanatory notes, removing correctly cited authorship and editorial attribution, removing page numbers, and other associated damage to manually formatted references. It also unnecessarily escaped the % character in URLs, which can cause them to break if they contain non-ASCII characters.
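The percent-escaping damage has a recognisable shape: an already-encoded sequence such as %C3%A9 becomes %25C3%25A9. A minimal sketch of reversing it is below; the function name `undo_double_escape` is hypothetical, and the approach assumes the URL never legitimately contained an encoded percent sign followed by two hexadecimal digits.

```python
import re

# "%25" is an escaped "%"; when two hex digits follow, it is probably the
# result of escaping an existing percent-encoded byte a second time.
_DOUBLE_ESCAPED = re.compile(r"%25(?=[0-9A-Fa-f]{2})")


def undo_double_escape(url: str) -> str:
    """Turn %25XX back into %XX, undoing one layer of unnecessary escaping."""
    return _DOUBLE_ESCAPED.sub("%", url)


print(undo_double_escape("https://example.org/wiki/Caf%25C3%25A9"))
```

The repaired URL should be opened to confirm it resolves, since the heuristic cannot tell a damaged URL from one that intentionally contains %25.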

Around 2,500 edits using this script between January and April 2023 have been identified, and a cleanup and repair effort is underway at User:XOR'easter/sandbox/ReferenceExpander. A further 800 or so mainspace edits using the script in the second half of 2022 are identified at query 72745, and have been processed into worksheet format at WikiProject Citation cleanup/ReferenceExpander batch 2.

Current progress:
 * Task 1: checked:; remaining:.
 * Task 2: checked:; remaining:.

Potential damage tracking and future repair efforts
The final phase of cleaning up edits made using the ReferenceExpander script is covered by queries 75209 and 75208, totaling approximately 2,100 diffs. Edits of this age are more likely to have been corrected in the course of normal editing, and early users of the script were more conscious of its output.

ReFill / Reflinks seems to be a more popular tool with a lower associated degree of damage due to not overwriting existing references apart from bare URLs. It does identify itself in edit summaries, and could be approached similarly to ReferenceExpander edits, but that may not be worth the effort.

Module:Citation/CS1 maintains tracking categories of possibly erroneous citation information, the most promising of which include:
 * (low false positive rate)
 * (very low false positive rate)
 * (high false positive rate due to corporate authors)
 * (high false positive rate due to corporate authors)

Citoid bugs tracked at Phabricator

The overall scope of the problem is not yet clear.

Suggestions for improvement
At its root, to obtain more accurate algorithmically generated citations, the Zotero translators must be improved, which cannot be handled on Wikipedia, and requires a high level of technical skill (proficiency in JavaScript, regular expressions, CSS, and HTML).

Citoid can be improved by writing translators for its fork of the Zotero library, which can be done in-house but requires the exact same technical skills as coding for Zotero. Right now Citoid does not surface which translator it uses; exposing this could aid in tracking problem cases, and has been requested at mw:Talk:Citoid. Citation templates would have to be updated to hold a parameter for the translator used, and tracking categories created.

Citoid could generate warning messages for inaccurate translators, or general reminders to double check its work. This has been requested at mw:Talk:Citoid.

The citation templates could be updated to hold a hidden parameter that is set to a value whenever they are generated algorithmically instead of manually. This is like tracking the translator used, but less useful and very slightly less work. It could be removed or given an opposite value after citations are double checked by humans. Alternatively, citation generators could add a hidden comment to the wikicode like <!-- tool-generated citation -->.

Category:CS1 maint: numeric names: authors list could be elevated to error status, alerting editors that their author fields contain bad information, and a bot could place a notice on the editor's user talk page after publishing such an edit.
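A bot of the kind suggested here would need some way to spot suspicious author fields in the wikitext of an edit. The sketch below is a rough approximation, not the test that Module:Citation/CS1 applies: it flags any digit in an author, last, or first parameter, the function name `numeric_author_fields` is hypothetical, and it assumes parameter values contain no pipes or nested templates.

```python
import re

# Captures author-name parameters and their values up to the next | or }.
_NAME_PARAM = re.compile(r"\|\s*(author\d*|last\d*|first\d*)\s*=\s*([^|}]*)")


def numeric_author_fields(template: str) -> list[tuple[str, str]]:
    """Return (parameter, value) pairs whose value contains a digit."""
    hits = []
    for name, value in _NAME_PARAM.findall(template):
        if re.search(r"\d", value):
            hits.append((name, value.strip()))
    return hits


cite = "{{cite web |last1=Staff |first1=Newsroom 555-0100 |title=Report 2023 |url=https://example.com}}"
print(numeric_author_fields(cite))
```

Digits in title= or url= are ignored on purpose, since only the name parameters are of interest here.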

Whenever repairing algorithmically generated citations, make a practice of leaving an edit summary that links this page, or a specific cleanup project, or a note along the lines of "please always double check automated references."