Wikipedia:Bots/Requests for approval/CitationCleanerBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

CitationCleanerBot
Operator: Headbomb

Time filed: 01:09, Monday September 5, 2011 (UTC)

Automatic or Manual: Automatic

Programming language(s): AWB

Source code available: On request (will evolve over time)

Function overview: Cleaning citation templates.

Links to relevant discussions (where appropriate): N/A, kind of a "duh" thing.

Edit period(s): Will usually edit after dumps on articles likely to contain fixes.

Estimated number of pages affected: 10-20K?

Exclusion compliant (Y/N): Yes (AWB)

Already has a bot flag (Y/N): No

Function details: This mostly concerns fixes like the following
 * http://www.jstor.org/stable/123456798 → 123456789
 * cite news/cite web → cite journal when appropriate
 * Removing accessdates when no URL is present
 * Convert certain bare URLs to cite journal templates (similar to Citation bot 8) so Citation bot can expand them. (let's leave this for another BRFA)
 * AWB genfixes
 * Removal of rarely used empty parameters (on a per-template basis). For example, all empty laysummary in cite journal are leftover clutter from a copy-paste of the template documentation. oclc is likely to have a use for a book, but for journal articles they are leftover clutter from a copy-paste. issn can be useful for a journal, but is just silly for a book.
 * Some other minor fixes, like ISSN hyphenation, would be bundled with it over time. Note: disabled this explicitly; a new BRFA or WT:BRFA note is required for other tasks, as approvals cannot be given for ambiguous further "minor fixes". —  HELL KNOWZ  ▎TALK 20:32, 10 September 2011 (UTC)
 * The bot does not add or remove information, it just cleans up the existing stuff
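As a rough illustration of three of the fixes listed above (JSTOR URL to identifier, dropping accessdate when no URL is present, and removing rarely used empty parameters), here is a minimal Python sketch. It is not the bot's actual AWB rule set: the function names, the `RARELY_USED` table and the `jstor` output parameter are assumptions made for the example, and the parser only handles flat templates with no nested templates or piped links.

```python
import re

# Hypothetical per-template table of parameters treated as clutter when empty.
RARELY_USED = {
    "cite journal": {"laysummary", "oclc"},
    "cite book": {"issn"},
}

# Assumed URL shape for a JSTOR stable link.
JSTOR_URL = re.compile(r"^https?://www\.jstor\.org/stable/(\d+)$")


def parse_template(text):
    """Split a flat {{name |key=value ...}} template into (name, [(key, value)])."""
    inner = text.strip()[2:-2]
    name, *parts = inner.split("|")
    params = []
    for part in parts:
        key, _, value = part.partition("=")
        params.append((key.strip(), value.strip()))
    return name.strip(), params


def render_template(name, params):
    """Rebuild the template text from its name and parameter list."""
    return "{{" + name + "".join(f" |{k}={v}" for k, v in params) + "}}"


def clean_citation(text):
    """Apply the three illustrative cleanups to one flat citation template."""
    name, params = parse_template(text)
    clutter = RARELY_USED.get(name.lower(), set())
    cleaned = []
    for key, value in params:
        match = JSTOR_URL.match(value) if key == "url" else None
        if match:
            # JSTOR stable URL becomes a bare identifier parameter.
            cleaned.append(("jstor", match.group(1)))
        elif value == "" and key in clutter:
            # Empty, rarely used parameter: drop it.
            continue
        else:
            cleaned.append((key, value))
    # An accessdate is only meaningful alongside a non-empty URL.
    if not any(k == "url" and v for k, v in cleaned):
        cleaned = [(k, v) for k, v in cleaned if k != "accessdate"]
    return render_template(name, cleaned)


print(clean_citation(
    "{{cite journal |title=Foo |url=http://www.jstor.org/stable/123456789 "
    "|laysummary= |accessdate=2011-09-05}}"
))
```

In this sketch, converting the JSTOR URL leaves the template without a `url`, so the accessdate is dropped in the same pass. Whether the bot orders the fixes that way is not stated in the request.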

Discussion

 * "Convert certain bare url to cite journals." What about WP:CITEVAR? Citation bot is manually initiated by editors, so there is no automated style change; this BRFA however is for an automated bot.
 * Regarding ISSN hyphenation, it will probably draw the same opposition/comments as ISBN hyphenation (see Wikipedia_talk:ISBN and Wikipedia_talk:Bots/Requests for approval/RjwilmsiBot 6). You need to at least notify some noticeboards.
 * You need to be more specific about "other minor fixes". If they are really minor, they can be approved without BRFAs at WT:BRFA if you add them later. However you put ISSN hyphenation with "minor", so I have to ask about this. — HELL KNOWZ  ▎TALK 12:03, 5 September 2011 (UTC)


 * For the bare url conversions, see Bots/Requests for approval/Citation bot 8. I would get a list of articles with such conversions done at the end of each run, and then run citation bot on them myself, which would bring them in line with the majority use of citation vs cite xxx. If another style is used, filled out citations are easier to convert than bare urls, and will be better than bare urls in the meantime.
 * ISSNs are always hyphenated as XXXX-XXXX (unlike ISBNs, which can be unhyphenated, although this is not recommended officially), and after cleaning up several thousand articles, I've yet to come across an unhyphenated ISSN that was inserted manually (rather than by bots/scripts) or which was not at odds with the rest of the article. The bot could do ISBN hyphenation, but I specifically left it out since that's a legitimate stylistic alternative actually found in the outside world (e.g. Google Books does not hyphenate ISBNs, many books have unhyphenated ISBNs on their information page, but all journal databases do hyphenate ISSNs, and no journal features an unhyphenated ISSN on its cover/information page).
 * Other minor fixes would include whitespace stripping in citation parameters (something like How the  West \n Was     Won → How the West Was Won, once I figure out how to implement it), or journal disambiguation Nature → Nature, etc...
 * Headbomb {talk / contribs / physics / books} 16:28, 5 September 2011 (UTC)
 * I'm OK with ISSNs then, though I'm not an expert, so leaving this up to someone who knows better. Fine on minor fixes, as long as you notify about these on WT:BRFA or something.
 * Citation Bot is a manual tool, this is an automated one. Citation Bot was approved as a manual tool, and that doesn't imply consensus for automating that. It's one thing to have an edit initiated manually, it is another to have a batch of pages edited automatically. While I'm all for citation templates and couldn't agree more they are better than bare urls, that's not everyone's opinion. You are bound to stomp on someone's garden eventually and we'll be on our merry way to AN (again) :) So I can only suggest advertising this broader, imposing thresholds, or asking other BAGgers. — HELL KNOWZ  ▎TALK 17:35, 5 September 2011 (UTC)
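The ISSN hyphenation and whitespace-stripping fixes discussed in this thread can be sketched in Python as follows. This is only an illustration of the idea, not the bot's AWB implementation: the function names are made up for the example, and the ISSN pattern assumes an unhyphenated value is exactly eight characters, the last of which may be X.

```python
import re

# Eight characters with no hyphen: four digits, three digits, then a digit or X.
UNHYPHENATED_ISSN = re.compile(r"^(\d{4})(\d{3}[\dXx])$")


def hyphenate_issn(value):
    """Return XXXX-XXXX for an unhyphenated ISSN; leave anything else untouched."""
    match = UNHYPHENATED_ISSN.match(value.strip())
    if match is None:
        return value
    return f"{match.group(1)}-{match.group(2).upper()}"


def collapse_whitespace(value):
    """Collapse runs of spaces, tabs and newlines in a parameter value to single spaces."""
    return " ".join(value.split())


print(hyphenate_issn("03784371"))
print(collapse_whitespace("How the  West \n Was     Won"))
```

Values that are already hyphenated, or that do not look like an ISSN at all, pass through unchanged, which keeps the fix from touching anything it does not recognize.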

Seeing as part of this is mostly uncontroversial, approved for a 100-edit trial of the following tasks, with genfixes enabled. Try to balance the different tasks throughout the 100 edits. I'm leaving the "convert bare url to cite journals" task open to discussion/advertising. &mdash; The Earwig   (talk)  19:47, 5 September 2011 (UTC)
 * http://www.jstor.org/stable/123456798 → 123456789
 * cite news/cite web → cite journal
 * Removing accessdates when no URL is present
 * Removal of rarely used empty parameters
 * ISSN hyphenation, but no other "minor fixes" (yet)

I'm currently reviewing the edits, see if there's anything wrong with any of them. Headbomb {talk / contribs / physics / books} 21:21, 5 September 2011 (UTC)
 * In the first edit, ISBN-x is converted to -x. This is GIGO stuff (the article isn't made any worse by the bot).
 * In the second, there was bad handling of a bad use of doi. That's GIGO, but avoidable GIGO. This has been fixed.
 * In the third, it unlinked two PDFs. It shouldn't have done that, and I tweaked the logic accordingly.
 * And that's pretty much that. Headbomb {talk / contribs / physics / books} 21:35, 5 September 2011 (UTC)
 * For 1, it would be useful if the bot could tell that it is garbage, and post about it somewhere, add a cleanup category, or whatever. I think ISBNs that do not match /((\d-?){9}|(\d-?){12})[\dX]/ are invalid; perhaps there's a better regex around. Ucucha (talk) 00:05, 6 September 2011 (UTC)
 * Would probably be a better job for a database/toolserver report than for this bot. I'm not against incorporating some ISBN checker with the bot, but I would rather have an established solution / cleanup template for this before doing so. Headbomb {talk / contribs / physics / books} 00:18, 6 September 2011 (UTC)
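For reference, the shape test proposed above can be written as a small Python sketch. The pattern is the one quoted in the comment, applied here to the whole string. The check-digit step is an addition not proposed in the discussion: it uses the standard ISBN-10 (weights 10 down to 1, modulo 11) and ISBN-13 (alternating weights 1 and 3, modulo 10) rules. The function names are hypothetical.

```python
import re

# Pattern quoted in the discussion; fullmatch anchors it to the whole value.
ISBN_SHAPE = re.compile(r"((\d-?){9}|(\d-?){12})[\dX]")


def has_isbn_shape(value):
    """True if the value has 10 or 13 digits (optionally hyphenated, X allowed last)."""
    return ISBN_SHAPE.fullmatch(value.strip()) is not None


def has_valid_check_digit(value):
    """True if the final character is the correct ISBN-10 or ISBN-13 check digit."""
    chars = value.strip().replace("-", "")
    if len(chars) == 10 and chars[:9].isdigit() and (chars[9].isdigit() or chars[9] == "X"):
        digits = [10 if c == "X" else int(c) for c in chars]
        return sum((10 - i) * d for i, d in enumerate(digits)) % 11 == 0
    if len(chars) == 13 and chars.isdigit():
        return sum((1 if i % 2 == 0 else 3) * int(c) for i, c in enumerate(chars)) % 10 == 0
    return False


def is_plausible_isbn(value):
    """Combine the shape test with the check-digit test."""
    return has_isbn_shape(value) and has_valid_check_digit(value)


for candidate in ["0-306-40615-2", "978-0-306-40615-7", "0-306-40615-3", "12345"]:
    print(candidate, is_plausible_isbn(candidate))
```

A check like this could feed a report or a cleanup category rather than an automatic fix, in line with the preference expressed above for an established cleanup process.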

I made some other tweaks to the bot (added a few more urls to recognize and clean, fixed a few regexes, and made it skip articles it was likely to mess up after discovering an issue). I've tested them semi-automatically on a variety of articles, but it would probably be a good idea to trial them. So could I get another trial? Headbomb {talk / contribs / physics / books} 16:02, 6 September 2011 (UTC)
 * Given that you chopped out that one buggy part, I'd like to see another/extended trial just to make sure everything works okay now. =) -- slakr \ talk / 04:42, 7 September 2011 (UTC)
 * I still note a few bugs that I would like to see resolved. For example, this should not happen; it is duplicating pmc with the same exact data. Yes, it's relatively minor, and supposedly a problem with AWB and not the bot itself, but it should be looked into. Trial until you think everything's been sufficiently tested. &mdash;  The Earwig   (talk)  04:59, 7 September 2011 (UTC)


 * Gonna do roughly ~500 edits
 * Screwed up here, this was fixed.
 * Improper removal of accessdate. Not yet fixed (I thought it was, but apparently not). Investigating.
 * Alright, I stopped at 400 for now, no other problems to report. I'll keep the last 100 for when I solve that stupid bug. Happened 3/400 times; it's pretty rare but it's avoidable, so it should be fixed before unleashing the fury. Headbomb {talk / contribs / physics / books} 07:56, 7 September 2011 (UTC)
 * BAGAssistanceNeeded Well turns out this is a bug in AWB (Wikipedia talk:AutoWikiBrowser/Bugs). There's not much I can do to solve this ATM, but I can blacklist articles likely to be affected by the bug and do these manually. The false positive rate would probably be under 1/10,000. Would that be acceptable? Headbomb {talk / contribs / physics / books} 17:47, 10 September 2011 (UTC)
 * BAN template is for neglected BRFAs, week or more; you're on BAG, you should know that ^^ — HELL KNOWZ  ▎TALK 20:32, 10 September 2011 (UTC)

Trials seem fine. Looked through 100 or so edits, clarified all that I wasn't sure about. Don't see any major issues. There are rare AWB genfixes that could be seen as minor/cosmetic, but these cannot really be avoided due to AWB logic. Older issues are resolved; a bug report has been filed for the AWB template issue, which is currently handled with a blacklist work-around. Task is being monitored by botop and lists are compiled from dumps, so shouldn't cause any problems with weird cases. — HELL KNOWZ  ▎TALK 20:32, 10 September 2011 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.