Wikipedia:Bots/Requests for approval/Fluxbot 6


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Fluxbot 6
Operator:

Time filed: 03:39, Friday, July 22, 2016 (UTC)

Automatic, Supervised, or Manual: Supervised

Programming language(s): n/a

Source code available: AWB

Function overview: Fixes for HTML errors that cause pages to be listed in Category:Pages using invalid self-closed HTML tags.

Links to relevant discussions (where appropriate): VPT#New maintenance category

Edit period(s): Ad-hoc batch runs

Estimated number of pages affected: Open-ended: thousands of pages, with hundreds edited per run

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: I've been working on cleaning up Category:Pages using invalid self-closed HTML tags in advance of the upcoming code changes, mostly running from my own account. I would like to use my bot account primarily so that edits to User talk: can be made quietly using. This is primarily fixing the most common html errors:
 * 1) Self-closing div tags
 * 2) Self-closing span tags (<span id="a" />)
 * 3) Syntax errors with s,small,big,center tags (e.g. .. )


 * This task seems prone to false positives in 1-5% of edits, so will need to be run supervised. I am open to running or not running AWB genfixes on articles if anyone has a preference.  Thank you, —  xaosflux  Talk 03:39, 22 July 2016 (UTC)
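
For readers unfamiliar with the kind of fix described above, the following is a minimal Python sketch of items 1 and 2 (self-closed div and span tags). It is not the operator's AWB configuration, which is not reproduced in this discussion; the names `FIXABLE_TAGS`, `SELF_CLOSED` and `close_self_closed_tags` are hypothetical and chosen only for illustration. The sketch assumes that closing the tag as-is is the desired repair and that any other self-closed tag should be left untouched.

```python
import re

# Hypothetical allow-list: only plain HTML container tags are repaired.
FIXABLE_TAGS = ("div", "span")

# <tag ...attributes... /> where the attributes cannot contain < or >,
# so a match never runs past the end of the tag it started in.
SELF_CLOSED = re.compile(
    r"<(?P<tag>" + "|".join(FIXABLE_TAGS) + r")\b(?P<attrs>[^<>]*?)\s*/>",
    re.IGNORECASE,
)


def close_self_closed_tags(wikitext):
    """Rewrite <div ... /> and <span ... /> as explicit open/close pairs."""
    def repl(match):
        tag, attrs = match.group("tag"), match.group("attrs")
        return f"<{tag}{attrs}></{tag}>"
    return SELF_CLOSED.sub(repl, wikitext)


if __name__ == "__main__":
    sample = 'See <span id="a" /> and <div style="clear:both"/>.'
    print(close_self_closed_tags(sample))
```

Because the pattern is purely textual, it cannot tell whether a tag sits inside nowiki markup, a comment, or a template parameter, which is one plausible source of the false positives mentioned above and a reason to review each edit by hand.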

Discussion
Comment (and Support): I have fixed a few hundred of these and have found a similar percentage of false positives in editing with an AutoEd script, no matter how well I write my regexes. I agree that a supervised run, done carefully, should work well. Would you be willing to share your proposed regexes?

Just for clarity, I would like to see this task approved for all namespaces, not just User Talk. There is a lot of work to be done in Talk, Wikipedia Talk, and Wikipedia.

Feel free to crib from my AutoEd script at User:Jonesey95/AutoEd/month.js.

Also, you might look at CHECKWIKI/WPC 002 dump for examples of pathological patterns that might be seen as problems or opportunities, e.g. and. – Jonesey95 (talk) 05:31, 22 July 2016 (UTC)
 * Pinging, who has been doing this work quite effectively. – Jonesey95 (talk) 05:38, 22 July 2016 (UTC)
 * These definitely need to be corrected. I have been running some fixes for these and also have noticed the false positives, so supervision is necessary. I don't think running with genfixes should be a problem. I assume span will be corrected using anchor and/or {{subst:anchor}} (or equivalent) and similarly for div. How will you handle unquoted and unbalanced quotes for id= in span and div? —&thinsp;JJMC89&thinsp; (T·C) 06:10, 22 July 2016 (UTC)
 * so far I have not been making that assumption, and simply closing the tag as is (e.g. becomes   ). —  xaosflux  Talk 11:59, 22 July 2016 (UTC)
 * Regexes such as:
 * (  *to*    $1>
 * ( *to*  $1>
 * — xaosflux  Talk 12:04, 22 July 2016 (UTC)
 * That is equivalent. ( gives .) —&thinsp;JJMC89&thinsp; (T·C) 14:36, 22 July 2016 (UTC)
 * Some of my find/replaces aren't even regexes, just literal string replacements, as this is being run supervised; e.g. (changing to ). — xaosflux  Talk 12:05, 22 July 2016 (UTC)


 * I'm fine running in any namespace, I've been doing cleanups anyway - the bot request is so I can basically do the same work I've been doing with my normal account without triggering the new messages warning. — xaosflux  Talk 11:57, 22 July 2016 (UTC)

Support: I'm glad someone is taking on the talk space portion of the error category, which makes up ~1167/2185 entries. ~623/1167 are archived talk pages though (Will touching those raise concerns? If so, can the MediaWiki software be made to ignore archives?). Non-archived User talk: only comprise 228/2185, so I definitely support expanding to more/all talk space. ~ Tom.Reding (talk ⋅dgaf) 12:23, 22 July 2016 (UTC)
 * MediaWiki doesn't really have a concept of "archive pages", they are just pages. That being said, editing user_talk/subpages does not trigger the new messages indicator, so isn't really the worry. —  xaosflux  Talk 14:29, 22 July 2016 (UTC)
 * Archived pages should be fixed. These deprecated tags, if they are not fixed, will presumably cause pages to display improperly at some point, and we don't want archived pages to suddenly appear different (and broken). – Jonesey95 (talk) 16:22, 22 July 2016 (UTC)
 * Oh I agree, I mean they are not a worry for "new message indicator" - they can be fixed at any time without needing the  flag that only bots have. —  xaosflux  Talk 16:27, 22 July 2016 (UTC)


 * BAGAssistanceNeeded What would BAG like to see to move forward? — xaosflux  Talk 12:24, 23 July 2016 (UTC)


 * Safe to test this, I think, on all types of pages. —  Earwig   talk  20:54, 23 July 2016 (UTC)


 * The trial went about as expected: of the first 50 pages where edits would have been needed, 4 needed to be skipped as they were complex. The 25 user talk edits went off very well. The 25 article edits all appear to be good edits, but in many cases they did not solve the overall page problem, leaving some bad HTML tags behind. Some additional regexes may help reduce the number of passes a page needs before it is fully fixed; however, some of the pages are just really messy.  I've got a feeling many of the "easy" ones have been cleaned up by hand already.  Anything else you would like to see?  Thanks, —  xaosflux  Talk 22:26, 23 July 2016 (UTC)
 * Please don't change  to   like in [//en.wikipedia.org/w/index.php?title=Kill_Keith&diff=prev&oldid=731220826] and many other edits. The former is both allowed and recommended to invoke a reference which is defined elsewhere. See for example Citing sources. It's an undocumented (as far as I know) feature that an empty   has the same effect. It will probably confuse most editors and I recommend reverting those changes. ref isn't even an HTML tag; it is defined by mw:Extension:Cite. Can you post a complete list of the self-closed tags the bot is coded to change? PrimeHunter (talk) 23:27, 23 July 2016 (UTC)
 * Another thing, [//en.wikipedia.org/w/index.php?title=User_talk:Hello2rmn&diff=prev&oldid=731220213] does not have a helpful edit summary: " replaced: <div style="margin: 0; font-family: sans-serif; font-weight: normal; font-size: 100%; border-top: 1px solid #a3b0bf; text-align: center; color: #000; margin-top: 2em; margin-bot... "
 * Could it be something showing the important part and using "..." for unimportant details like: " replaced: by "? PrimeHunter (talk) 23:39, 23 July 2016 (UTC)
 * Thank you for the feedback, I'm taking  out of my lists, and will revert any of those.  —  xaosflux  Talk 00:22, 24 July 2016 (UTC)
 * User talk run had 1 "ref", reverted. — xaosflux  Talk 00:27, 24 July 2016 (UTC)
 * Article run rolled back as well - the "ref" problem was mostly here, let me know if I can re-trial with the corrections. — xaosflux  Talk 00:45, 24 July 2016 (UTC)
 * Thanks. I noticed an error in an earlier AWB edit [//en.wikipedia.org/w/index.php?diff=731162330]. A span tag in the last change was correctly closed before reaching a self-closed ref tag later on the same line. If we wanted a closing tag for the ref it shouldn't be a span. Is the current code designed to avoid such errors? PrimeHunter (talk) 00:51, 24 July 2016 (UTC)
 * Yes, the spans should only strictly match the spans now, I've removed everything about ref's completely. — xaosflux  Talk 00:58, 24 July 2016 (UTC)
 * Except for → , the edits look good. You may want to adjust ( to (]+?") ?\/> so that the regex isn't too greedy. Also, consider not adding the replacements to the edit summary; the long example PrimeHunter pointed out isn't really helpful. —&thinsp;JJMC89&thinsp; (T·C) 01:18, 24 July 2016 (UTC)
 * Agree. — xaosflux  Talk 01:55, 24 July 2016 (UTC)
 * I don't object to a new trial run. I suspect the article skip rate will be large based on the number of edits that only changed ref tags. Could you post a list of skipped pages so we can examine whether their problem is transclusions, something missed by your regex, or maybe something else? PrimeHunter (talk) 01:30, 24 July 2016 (UTC)
 * I never expect these to be 100% solving the problem - most of the "big wins" I've gotten so far were all in templates. The primary use of this bot for the runs is for user_talk: so that the new message flag doesn't get set. — xaosflux  Talk 01:55, 24 July 2016 (UTC)
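
The two points raised in this thread (a span fix spilling over onto a later self-closed ref on the same line, and the overly long edit summary) can be illustrated with a small Python sketch. The patterns the bot actually used are not legible in this archive, so `LOOSE` below is a hypothetical example of a too-permissive pattern rather than the operator's regex, and `STRICT`, `fix_page` and the summary wording are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical loose pattern: "." is allowed to cross tag boundaries, so the
# match can start in one tag and end at the "/>" of a later, unrelated tag.
LOOSE = re.compile(r"<span(.*?) ?/>")

# Tighter pattern: attributes may not contain < or >, and only div/span are
# listed, so ref and other extension tags are never touched.
STRICT = re.compile(r"<(div|span)\b([^<>]*?)\s*/>", re.IGNORECASE)


def fix_page(wikitext):
    """Return (new_text, summary); summary is None when the page is skipped."""
    counts = Counter()

    def repl(match):
        tag = match.group(1)
        counts[tag.lower()] += 1
        return f"<{tag}{match.group(2)}></{tag}>"

    fixed = STRICT.sub(repl, wikitext)
    if not counts:
        return wikitext, None  # nothing matched: leave the page for a human
    # Short summary naming only the tags fixed, not the full replaced text.
    detail = ", ".join(f"{n} x <{t} />" for t, n in sorted(counts.items()))
    return fixed, f"Fixing invalid self-closed HTML tags ({detail})"


if __name__ == "__main__":
    line = '<span id="a">x</span> <ref name="r" />'
    print(LOOSE.sub(r"<span\1></span>", line))  # closes the ref with </span>
    print(fix_page(line))                       # unchanged, summary is None
    print(fix_page('<div style="clear:both" />'))
```

Run on the sample line, the loose pattern rewrites the ref as if it were a span, while the strict pattern leaves the line alone and reports nothing to fix. A page that yields no summary would simply be skipped, consistent with the supervised workflow described above.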


 * BAGAssistanceNeeded BAG'ers, let me know when it is OK to run another trial to validate that the issues above are resolved, please. (50 edits should be fine). — xaosflux  Talk 01:57, 24 July 2016 (UTC)

2nd trial

 * All right, same as before. — Earwig   talk  18:00, 24 July 2016 (UTC)


 * User_talk: 25 User talk trial - appears to be successful; for 25 edits, 3 selected pages had to be skipped due to (likely invalid) tag ordering that the regexes didn't like. Of pages that I had no regex for and were automatically skipped, the most common invalid tag is   ; it seems to be about 60%/40% a typo that should be   or a pointless tag that can be deleted. —  xaosflux  Talk 03:19, 25 July 2016 (UTC)
 * Other namespaces:
 * 9 (main) edits - with all the "ref" stuff removed from trial one these look OK now. No manual skips were needed, many automated skips - many pages have one-off errors such as "/tag/" or "tag//" that I didn't attempt to script.  Note, there are many issues of self-closing   tags - used almost the same way as the bad "refs" in trial one - I didn't attempt to repair these as I'm not exactly sure what our "best practice" for this tag is. —  xaosflux  Talk 03:32, 25 July 2016 (UTC)
 * 16 User: to round out the 50 edits. Only 1 manual skip to get to 16 - complex page layout bad for regex; lots of automated skips - User: pages have lots of odd issues such as tags out of order that also include bad tags; fixing the one bad tag won't really fix the page, so they will need more attention. — xaosflux  Talk 03:39, 25 July 2016 (UTC)
 * Please let me know if you see any errors. — xaosflux  Talk 03:39, 25 July 2016 (UTC)
 * This looks like a very good result. Any conservative script built to deal with these errors should skip 10 to 30% of pages with errors, depending on the namespace, since some of the errors require human eyes to scan them and others require more elaborate fixes than a simple script can provide. If the script-assisted edits can clear 70% or more articles from the error category, humans will be able to work on the more interesting cases. I recommend approval. – Jonesey95 (talk) 04:57, 25 July 2016 (UTC)
 * Thanks Jonesey95, even when running this, there are no plans to run "automatically" - the FP rate is too high. — xaosflux  Talk 13:50, 25 July 2016 (UTC)

<cite> should be handled manually. In HTML 4 it is used for citations, but in HTML5 it is used to indicate the title of a work. In some cases it is being used like  and should be converted to. Articles using it to indicate a reference should have the citation style converted to use  (list-defined references) or a shortened footnote style depending on the use. —&thinsp;JJMC89&thinsp; (T·C) 05:20, 25 July 2016 (UTC)
 * Trial edits look good.
 * That's what I was thinking, these need to be manually evaluated and repaired depending on the usage. — xaosflux  Talk 13:50, 25 July 2016 (UTC)
 * Agree with both of the above re cite tags. I have also seen, where an editor typed "cite" instead of "ref". – Jonesey95 (talk) 15:09, 26 July 2016 (UTC)


 * BAGAssistanceNeeded Hi BAG, anything else you want to see? — xaosflux  Talk 00:46, 27 July 2016 (UTC)


 * Looks good to me. —  Earwig   talk 19:00, 29 July 2016 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.