User:SMcCandlish/TidyRefs

TidyRefs is a user script (JavaScript) for your common.js page. It adds two options to the "Tools" menu (on the left in most skins), "  " and "  (vertically)". These appear in the menu only when in editing mode.


 * "" normalizes all horizontal ref citation code throughout the article to have consistent spacing within, quoted attribute values, and lowercased tag and attribute names. This includes paired (<ref>...</ref>) and self-closing (<ref ... />) instances, and also includes fixing visually disruptive vertical ones to be horizontal.
 * "" – When developed, this will format paired ref tags vertically in some sane manner, with consistent spacing plus the quote-marks fixes, and should only be used in a page-bottom citations section that is using vertical citations in list-defined references (LDR) style.
 * In an article using LDR, the article body will contain horizontal citations, and the LDR references at the bottom may be vertical (though this is not required). In such a case of mixed citation formatting, the way to use these scripts is to copy–paste the vertical LDR references into a user sandbox, run "  " on the entire article (don't save it yet), run "  (vertically)", when available, on the vertical references in the sandbox, and copy the vertical references back out from the sandbox and paste them over the undesirably horizontalized ones at the bottom of the article.

Usually, neither function should be used without also making a more substantive change (at least fix a typo or something) in the same edit, per the human-editor rules at WP:COSMETICBOT.

This script does not do anything with CS1/CS2 citation templates (,, , etc.) that are the content between the paired ref tags (i.e., it does not clean up   to  ). The script for doing that, to run along with TidyRefs, is User:SMcCandlish/TidyCitations.

Installation instructions
Put the line:

in either your common.js or the skin.js of your current skin, save the page, and bypass your browser cache.

The function was deprecated in the July 2017 release of MediaWiki 1.29, and mw.loader is preferred. But  is not obsolete and still works, in case you prefer the old method of manually installing with  or using ScriptInstaller.
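
For orientation, loading a user script through mw.loader generally looks something like the sketch below. This is only an illustration of the general shape, not the exact install line for TidyRefs: the page title User:Example/SomeScript.js is a placeholder, and rawScriptUrl is a hypothetical helper written for this example.

```javascript
// Placeholder title (an assumption for illustration only); substitute the
// real script page named in the installation line above.
const SCRIPT_TITLE = "User:Example/SomeScript.js";

// Hypothetical helper: builds the raw-JavaScript URL for a wiki page title.
function rawScriptUrl(title) {
  return "https://en.wikipedia.org/w/index.php?title=" +
    encodeURIComponent(title) +
    "&action=raw&ctype=text/javascript";
}

// `mw` exists only on wiki pages, so guard the call to keep this snippet
// harmless when run anywhere else.
if (typeof mw !== "undefined") {
  mw.loader.load(rawScriptUrl(SCRIPT_TITLE));
}
```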

Usage
TidyRefs will add two menu items to the  when in edit mode:

Clicking them will harmonize citations in a mess as bad as this (with several instances of invalid markup):



to one of the following:

 < Tidy > :

 < Tidy > (vertically): [Nothing will be done yet! Format forthcoming, after more study of what vertical citations are doing.]

Features

 * Put double quotes around attribute values (leaves them alone if already quoted):  →
 * Change single quotes around attribute values to the required double:  →
 * Fix invalid nested double quotes:  or   →
 * Enforce spacing between  pairs:   →
 * Remove extraneous spacing around  pairs:   →
 * Remove extraneous spacing between,  , and  :   →
 * Remove extraneous spacing around the citation content between the ref tags:  →
 * In the horizontal version of the script, this cleanup also applies to line-breaks, not just whitespace on the same line. I.e., makes short work of vertically formatted citation code in mid-article – however, the, etc., templates inside the p have to be cleaned up with a separate script, User:SMcCandlish/TidyCitations (which can be done in the same edit).
 * Remove extraneous spacing inside attribute values:   →   (but not desired ones between words/names:   is untouched).
 * Enforce a space in front of  (the spaced version is understood by more parsers):   →
 * Bonus cleanup: does the same thing with, , broken " ", etc → ; and same with the uncommon , etc. (again, handled properly by more parsers; see also here and here).
 * Remove a space in front of  by itself:   →
 * Fix broken tags by removing extraneous spacing at the start of o:  →   and   →
 * Fix broken tags by removing extraneous spacing inside c:  →
 * Can handle any sane attribute value, even one containing &gt;.
 * Detects and fixes multi-word attribute values that aren't quoted:  →
 * Reduces all-caps and camelcase tag and attribute names to lowercase:  →
 * Applies all these fixes at once, in a single pass.
 * Detects all of the attributes of , in any order, even the new ones not supported on en.Wikipedia yet. The two that work here already are  and  . The two that do not yet are   and.

It has been tested against articles as long and complex as Tartan and Donald Trump without producing any unexpected or undesired results, and works quite quickly despite the complexity of the regular expressions and the number of JavaScript operations, across input that (for Wikipedia) is very large.
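
To give a feel for a few of the fixes in the feature list, here is a deliberately simplified sketch. It is not the script's actual code or regex: tidyRefTags is a hypothetical function, it assumes attribute values never contain a literal < or >, and it leaves a tag untouched whenever it finds anything it does not recognize (such as an unquoted multi-word value), whereas the real script repairs many of those cases. Nested quotes are not handled here either.

```javascript
// Simplified illustration only; the real script handles far more cases.
function tidyRefTags(wikitext) {
  // One attribute: a known name, "=", then a double-quoted, single-quoted,
  // or unquoted single-word value.
  const attrPattern =
    /\b(name|group|follow|extends)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;

  // Opening and self-closing tags: <ref ...> and <ref ... />
  let out = wikitext.replace(
    /<\s*ref\b([^<>]*?)(\/?)\s*>/gi,
    (whole, attrText, slash) => {
      // Anything left over after removing recognized attributes is unknown
      // content, so play safe and leave the tag exactly as it was.
      if (attrText.replace(attrPattern, "").trim()) return whole;

      const attrs = [];
      for (const m of attrText.matchAll(attrPattern)) {
        const value = (m[2] ?? m[3] ?? m[4]).trim();
        attrs.push(`${m[1].toLowerCase()}="${value}"`);
      }
      const head = ["<ref", ...attrs].join(" ");
      return slash ? `${head} />` : `${head}>`;
    }
  );

  // Closing tags: fix case and stray spaces, e.g. "</ REF >".
  out = out.replace(/<\s*\/\s*ref\s*>/gi, "</ref>");

  // Trim whitespace (including line breaks) just inside the tag pair.
  out = out.replace(/(<ref(?: [^<>]*[^<>\/])?>)\s+/g, "$1");
  out = out.replace(/\s+<\/ref>/g, "</ref>");
  return out;
}

console.log(tidyRefTags("<REF Name = 'Smith 2020' > {{cite web}} </ REF >"));
```

Refusing to touch a tag with unrecognized leftovers is a design choice of this sketch, made so that a toy example cannot silently drop content.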

Forthcoming features

 * More detection and repair of invalid ref markup, especially of sorts that MW doesn't throw an error message about, especially empty attributes like, or a bare   or   followed by nothing.
 * Detect and convert curly-quoted values as in  (it turns out that various editors do this, either on mobile or from editing in a word processor and pasting into our edit window).
 * Vertical version for refs formatted that way at page-bottom (WP:LDR).
 * Remove spacing between ref tags. Will probably do this by injecting a temporary token after each tag, then parsing for that and the start of a new one, instead of doing more complex regex to read the entire preceding tag again.
 * Remove spacing between ref and other citation templates like and . [It may not actually be feasible to do with such templates that precede the , only those that follow.]
 * Remove spacing between ref and the non-citation content that precedes it (with some exceptions like tables).
 * Remove linebreak after c if more content in the same paragraph is present.
 * Maybe another "bonus" cleaner-upper to fix invalid  to required
 * Maybe un-"hiding" bare URLs (see here):  →   (CitationBot will try to do something useful with the latter but will not touch the former, and they verge on useless to readers since they just show up as something like "[39]" instead of the URL.)
 * Need to detect cases of empty, by people who don't understand the syntax, and correct it to
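
The temporary-token idea mentioned in the list above (for removing spacing between adjacent ref tags) could look roughly like this. It is a sketch of the general technique, not planned code: the token string and the function name removeSpacingBetweenRefs are made up for illustration, and it assumes the tags have already been normalized to lowercase with no stray internal spacing.

```javascript
// Hypothetical marker built from private-use characters, which are very
// unlikely to occur in article text.
const TOKEN = "\uE000REFEND\uE000";

function removeSpacingBetweenRefs(wikitext) {
  // 1. Inject the token after every completed citation: a closing </ref>
  //    or a self-closing <ref ... />.
  const marked = wikitext.replace(
    /<\/ref>|<ref\b[^<>]*\/>/g,
    (tag) => tag + TOKEN
  );

  // 2. Wherever a token is followed only by whitespace and then a new
  //    <ref, drop that whitespace.
  const joined = marked.replace(
    new RegExp(TOKEN + "\\s+(?=<ref\\b)", "g"),
    TOKEN
  );

  // 3. Remove all tokens again.
  return joined.split(TOKEN).join("");
}

console.log(
  removeSpacingBetweenRefs('A.<ref>x</ref> <ref name="b" />\n<ref>y</ref> B')
);
```

The point of the token is that step 2 only needs a tiny pattern, instead of a regex that re-reads the whole preceding tag.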


 * User:SMcCandlish/sandbox/ref name js testpage – Feel free to add more, but don't save the page after running the script; the whole point is to edit the page, tweak the script and run it, re-edit the page, tweak the script and run it again, etc., using the same test data, until issues go away.

Known limitations
This script is very close to magic, but is not magic. A few unusual circumstances may confuse it.
 * Not really bugs but perhaps "failures to be as maximally forgiving of garbage input as possible", when fed markup that is technically invalid but which MW doesn't presently treat as an error (which means someone somewhere might actually do it):
 * If fed the incomplete markup  with empty but present attributes, it will misparse this and output
 * Similarly, the incomplete markup  will be misunderstood and result in
 * Incomplete markup of the form  will not be misparsed but will be skipped entirely.
 * Several other bits of weirdness like that. The only practical solutions for them are pre-pass filters that detect and fix them before moving on to the main operations of the script; instituting this will be really tedious to do.
 * The script may fail and produce incorrect results if run on blatantly invalid input of kinds that it is not already written to handle – kinds that MW itself cannot handle, and which will show up as visibly broken citations in the rendered page (either cites that don't render at all, or code garbage showing up in the content). Known examples:
 * Extraneous junk inside the ref such as stray characters, or unrecognized attributes . Presently, en.Wikipedia only recognizes  and  ; it is not clear when we are getting   (deployed at WikiSource and a few other projects) or   (in beta since 2019, somehow).
 * An attribute value that starts quoted but the quote never ends, e.g.
 * An attribute value with mismatched quotation marks in the form:  (MW simply sees this as another case of the quote never ending).
 * The invalid null markup  or   with no attribute and value.
 * The exact string  being used as content inside the value of an attribute:   (an attribute value with   with a space between them is fine, though).
 * Boneheaded (outside the context of template code) attempts to use MW xtags in the start or end tag of the ref element, as in   or  . That's just too broken to contemplate. Same goes for attempts to put an HTML comment inside a tag, like
 * An attribute value with mismatched quotation marks in the different form:  (MW will actually render a citation that simple, but if it has a second attribute like , then that attribute will fail).
 * The script detects the specific ref attributes,  ,  , and   (with or without a space before  ). If you use one of those strings as content inside the value of another attribute, then the material will be misparsed. The odds of anything like   existing in the wild are extremely low, but this would definitely break the script. MW actually does parse that markup, so it's infinitesimally possible for this to happen.
 * If a citation has the same attribute twice, this will not be detected by the script as a valid citation:  (MW presently just accepts the second and ignores the first, and does not throw an error, for some reason). This may be a common enough error to have the script check for it and do something about it. We'll see. It would be a lot of work.
 * The script does not detect the presence of, syntaxhighlight, nowiki, or wikimarkup blocks equivalent to created by code being put on lines indented by one or more space characters, or any other means of laying out code blocks to present wikimarkup examples. If such an example contains   or   code that is subject to cleanup by this script, it will be cleaned up. If this is not desired, try replacing   with   in the example code.
 * This script is for parsing textual content with citations in it. If you run it against template code, JS code, CSS code, Lua module code, interface pages, and other weird stuff, you are entirely on your own. In theory, it will actually work if it encounters a string of citation code in such a page, but supporting such use is outside the scope of this script.
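
The script does not do this, but for anyone curious what a pre-pass to shield example code might look like, here is one possible sketch. Everything in it is an assumption for illustration: the function name withProtectedBlocks, the choice of nowiki, pre, and syntaxhighlight as the protected elements, and the placeholder format. It does not handle space-indented code lines.

```javascript
// Hypothetical pre-pass: hide code-example blocks from a cleanup function,
// run the cleanup, then put the blocks back unchanged.
function withProtectedBlocks(wikitext, transform) {
  const saved = [];

  // Replace each protected block with a numbered placeholder made of
  // private-use characters.
  const masked = wikitext.replace(
    /<(nowiki|pre|syntaxhighlight)\b[^>]*>[\s\S]*?<\/\1\s*>/gi,
    (block) => {
      saved.push(block);
      return `\uE000${saved.length - 1}\uE001`;
    }
  );

  const transformed = transform(masked);

  // Restore the original blocks in place of the placeholders.
  return transformed.replace(
    /\uE000(\d+)\uE001/g,
    (_, index) => saved[Number(index)]
  );
}

// Usage with a stand-in cleanup that only lowercases "<REF>".
const cleaned = withProtectedBlocks(
  "<REF>a</ref> <nowiki><REF>b</ref></nowiki>",
  (text) => text.replace(/<REF>/g, "<ref>")
);
console.log(cleaned);
```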

Additional technical notes


 * Invalid input of  should not be parsed by MW as valid, but presently is, and produces a usable citation, so this script detects it and repairs it by putting quotes around the attribute value.
 * Invalid input of  or, even worse,   should not be parsed by MW as valid, but in a case this simple it is, and produces a usable citation. However, if the inner-quoted material contains a space, then the citation will break, so MW's (probably accidental) handling of this circumstance is faulty. The script detects such a mess and repairs it to   in both cases.
 * The and  "bonus cleaner" does not detect rare attribute-bearing instances, like ,   and so on.

wikEd compatibility
wikEd (an advanced editor you can install via "Gadgets" in the Wikipedia "Preferences" menu) is generally incompatible with scripts, add-ons, or extensions that rely on or change the standard text edit box, and TidyRefs is one of those scripts. The workaround is to temporarily turn off wikEd by pressing the button, making the changes with TidyRefs, then re-enabling wikEd.

There may be a way to fix this, but I would have to install it and figure out what it's doing in detail.

Credits
Kudos for inspiring this work goes to Nick (user:9473764) at Stack Overflow, who first produced a "basic" (actually very complicated) regex that could handle the gist of a ref citation with a  attribute under most circumstances. The material has gotten much more complicated since then to account for numerous legitimate and erroneous use cases, including the four attributes that the tag supports.

The "framing" code around the meat of this script is based on User:SMcCandlish/TidyCitations.js; see credits inside it and on its documentation page.

Regex101 was immensely helpful in the development of this. If you want to examine what the main regex is doing (the ones for other attributes than  are just variants of it), see https://regex101.com/r/xubdCt/20 which has an "Explanation" panel that walks through it step-by-step.

ChatGPT helped work through a few things, though by the time the regex got much more complicated than Nick's original, the "AI" was no longer able to accurately help much (it was not really able to correctly predict most of the changes it suggested, and kept causing massive regressions). The LLM did help a little with mostly-serviceable JS code snippets for a few tasks, including treating regex capture groups as JS vars, and that saved some time and annoyance.

Change log

 * 1 January 2024 – Development began (after several days of regular-expression testing at https://regex101.com).
 * 21 January 2024 – First version safe to use in mainspace for horizontal-citation cleanup; documentation started.
 * 22 January 2024 – Minor bug fix: had to account for strings "group", "name", "follow", or "extends" within an attribute's value (not being an attribute itself followed by ).
 * 23 January 2024 – Now fixes broken ref and c tags (e.g., etc.), and removes extraneous whitespace in the citation data immediately after ref or   and before c
 * 24 January 2024 – Now detects and fixes invalid  markup. Detects and is not fooled by, but does not repair, invalid   markup without a value. Lowercases any ALL-CAPS or CamelCase ref tag and attribute names (MW treats them as valid, rare though it may be). Bonus cleanup added, of, , and various invalid variants like  ,  , etc. to ; plus  versions.

Infrequently asked questions

 * Isn't there some rule against this?
 * No. See RfC here: there is definitely not a consensus against using citation-formatting tools, and a large discussion to affirm the acceptability of using citation tools is not needed. It would be possible to do something disruptive with one, like mass-changing a bunch of articles in a bot-like fashion and not checking the output and letting a bunch of errors get through. But who is doing that? Close continues: questions of editor behavior should be addressed as needed at noticeboards. See also other RfC here: changes to visual output for the reader generally require consensus, as do systematic changes across an entire article changing from one consistent citation style to another consistent citation style, but changes of coding that occur while updating the content of a citation and/or adding citations do not require consensus. Even reader-facing changes are permissible when making the visual output of citations consistent within an article where there has been no history of consistency. Also from the closer: An editor hopping from article to article converting everything to a template would be a 'no' without consensus. Next see third RfC here: There is a clear consensus that the usage of vertical and horizontal templates does not fall within the purview of WP:CITEVAR. ... the inclusion of wikitext formatting within a style guideline is a form of WP:CREEP as the coded structure of the citation does not visually alter the article and provides no difference to the reader ... The existence of established policies such as WP:BRD, WP:EW, WP:OWN, and WP:BUREAU eliminates the need to codify something as specific as this. ... the code structure does not require consensus to change ..., though edit-warring over such trivia is prohibited about this as it is about everything. 
In summary, forcing everything to be CS1 templates in articles using another citation style consistently is not okay (change of major citation style), but cleaning up the wikicode without changing to a different major citation style is fine. In short, the efforts of certain editors to get every aspect of internal formatting of citations deemed to be part of a "citation style" that was "protected" by WP:CITEVAR have been repeatedly rejected by consensus (despite strenuous efforts in that direction by various parties).
 * Why didn't you do this with a MediaWiki parser?
 * None of them that are fully functional are in JavaScript, so they can't be installed and used as WP user scripts. Using some other parser for this in some way would require using an external tool hosted at Toolforge, and pretty much no one is going to do that. Maybe there's some way to hook into such a tool through an internal JavaScript here, but I'm unaware of how to do so. Also, I just liked the challenge of writing a (multi-step) regex quasi-parser that can mostly handle a simple four-attribute element, in the face of various people on StackExchange saying it's not possible. There is a JavaScript parser called wtf_wikipedia, but it can't convert output back into markup (it's aimed at extracting text from it for reuse elsewhere); another called wikiapi looks promising but requires node.js and seems really to be a way to remotely edit a WP page through some other application, not a means of using JS while on WP to futz with the content.
 * Why does this put quotes around attribute values?
 * Because it produces robust, guaranteed-working, and future-proof markup. It stops citations from being broken later, fixes broken ones now, is better for reuse of Wikipedia content, and trying to go the other direction is a totally lost cause. The quotation marks are required any time the attribute value has a space, has punctuation, or has any character that is not part of the original ASCII character set (which means characters from any non-Latin-based writing system like Greek, Cyrillic, CJK, etc., and the majority of the Latin-alphabet characters with diacritics). Most editors do not realize this (or realize only the space part). Consequently, many existing citations are in invalid markup (most commonly hyphens, underscores, dots, slashes, and other punctuation, e.g.  ). Right this moment, MW seems to parse most of them properly most of the time anyway, though this becomes more iffy when there are two attributes, like also a  . This kinda-sorta support for the bad markup cannot be depended upon indefinitely, as it is not officially supported and is against the documented requirements of MW's own ref extension. The more complex an unquoted attribute, the more likely the MW parser is to fail with it, and the behavior could change with any future MW version. Worse, any unhelpful ref name like   or   is very likely to be improved by later editors to make more sense, e.g.   and   and will break citations if they do not remember to add the quotation marks that should have been there to begin with. As for reuse, any given third party is likely to attempt to parse our material with whatever they have, and most XML parsers are not going to handle bad markup of this sort. 
Even if someone uses a purpose-built MW markup parser, the odds of it perfectly replicating MW's quirks with regard to invalid markup in its "XML-like syntax" for ref tags are very low (doing it would require quite an effort on the part of the parser writer, and exactly what quirks MW will parse is something that's going to change from version to version). Finally, various tools, including WMF's own highly ... discussed VisualEditor enforce the quotation marks anyway, so trying to resist them is a quixotic waste of time and just annoying to everyone who understands what the quotation marks are there for and how often they are needed. PS: On the accessibility claim that quotation marks are harder for certain users to type due to mobility issues, doing   is not actually going to (immediately) break anything for very simple ref names, and no one will yell at you for doing it (we hope!). The only issue would be removing quotes just because you don't like them or revert-warring against other editors doing later cleanup to add them.
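
The rule of thumb in this answer (quotes are needed whenever a value contains a space, punctuation, or any non-ASCII character) can be written down as a one-line check. This is a sketch of the rule as stated here, not of what MW's parser actually tolerates; needsQuotes is a hypothetical name.

```javascript
// Returns true when an attribute value should be quoted under the rule
// described above: only plain ASCII letters and digits are treated as safe.
function needsQuotes(value) {
  return !/^[A-Za-z0-9]+$/.test(value);
}

console.log(needsQuotes("Smith2020"));  // false
console.log(needsQuotes("Smith 2020")); // true (space)
console.log(needsQuotes("Smith-2020")); // true (punctuation)
console.log(needsQuotes("Müller"));     // true (non-ASCII letter)
```

Since TidyRefs quotes every value anyway, a check like this is mainly useful for explaining why a given unquoted name is at risk.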