Wikipedia:Bots/Requests for approval/SheepLinterBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard. The result of the discussion was

SheepLinterBot
Operator:

Time filed: 16:08, Friday, October 21, 2022 (UTC)

Function overview: Fix any font tag Linter error

Automatic, Supervised, or Manual: automatic, may be supervised to reduce mistakes

Programming language(s): AWB

Source code available: User:SheepLinterBot/1 for regexes, User:SheepLinterBot/1/Signature submissions per the table

Links to relevant discussions (where appropriate): 1 (especially this) 2 3 4

Edit period(s): varies

Estimated number of pages affected: varies, usually a few hundred to a few thousand for one sig, millions in total

Namespace(s): any applicable namespace that has obsolete font tag Linter errors (updated 26 November 2022)

Exclusion compliant (Yes/No): no

Function details: (This BRFA was originally made to fix TWA-related linter errors but was withdrawn (and postponed) because I kinda changed my mind back then.) Fixes any signature with font tag linter errors that I queue for the bot, so the estimated number of pages may vary per sig. The linter errors it fixes vary depending on what I put in the queue, although I may use regexes to try to clear all the other font tag linter errors at once.

Originally it replaces MalnadachBot 12 due to issues involving many edits in a single page to fix linter errors; you can see here why the bot makes many edits to a single page to fix linter errors. Some of the regexes come from here to start and then I came up with more to minimize the number of font tags being left over after an edit.

Edit as of 22 December 2022: Originally it was planned that I fix signatures that have other linter errors as well, but because doing such on base user talk pages triggers the "you have new messages" notification even when the edit is minor for my main account, I will also request approval to fix signatures that have other Linter errors as well, i.e. missing end tags. Actually, nevermind; that is gonna be left for MalnadachBot 12. This bot task aims to take over MalnadachBot fixing font tags.

Discussion
I have checked all of the 100 latest edits. There are some errors: If you have made more such edits from your main account, please fix them. IMO it is okay if the bot skips more pages, the important thing should be that the bot does not replace one error with another. ಮಲ್ನಾಡಾಚ್ ಕೊಂಕ್ಣೊ (talk) 15:00, 3 December 2022 (UTC)
 * Comment: The proposed bot operator should probably read all of this long discussion and explain how they intend to address the concerns raised there. – Jonesey95 (talk) 16:31, 21 October 2022 (UTC)
 * Agreed with the above; are you reducing the number of times the bot will edit the page, or will the bot only make a single edit to each page? Primefac (talk) 08:07, 25 October 2022 (UTC)
 * Edit periods will vary because it's at my discretion to run the bot; these regexes will (hopefully) reduce the number of times the bot will edit the page. Most often it'll edit only one time, although if there are at least 100 obsolete HTML tags or so, or there is a mix of color, face and size attributes, then it may take a few edits (hopefully not peaking at MalnadachBot's 10+). Also note that it may take a few days to build those regexes because in reality I have a lot of work to get done, so I may have less time to build those regexes and put them in all at once. Sheep  (talk) 11:54, 25 October 2022 (UTC)
 * Update: (it's been a while) OK, so I've done a hundred edits using AWB with these regexes. Note that these mostly catch font tags with one attribute only. This may hopefully reduce the number of outstanding font tags to fix. Also if there are non-Latin characters, or the font tags are outside the link, the regexes may not catch those. Edit: I did read that discussion and I'm now coding in some regexes to minimize the number of font tags being left over after each edit. Sheep  (talk) 23:10, 21 November 2022 (UTC)
 * This is a strange (non-)response to the comment above. Has the proposed bot operator read that discussion?
 * The edit summary should link to a page that explains the Linter errors that are being fixed.
 * Notice at this history page that MalnadachBot visited the page twice, Sheep8144402 visited once, and there are still font errors on the page. While I think that it is good that the number of errors was reduced with each edit, editors at the discussion linked above objected to the multiple visits. Maybe the bot could examine the page for font tags after it makes its proposed edits but before saving; if there are font tags remaining, the bot could abandon the edit. – Jonesey95 (talk) 04:53, 22 November 2022 (UTC)
 * Also that was before I edited the regexes because before, if there were any other tag inside the font tag, my regexes wouldn't catch those, so now I've edited them so they can catch them now. Sheep  (talk) 13:07, 22 November 2022 (UTC)
 * I checked my edits and found that 383/1968 font tags were left over, so about 19.5% of the font tags were left over. These edits were made using these regexes. When I use these regexes to make another hundred edits, the number of font tags left over is 171/1254, which is about 13.6%. I aim to get as close to zero as possible so I can maximize the number of edits the bot will make (assuming the bot skips edits when it can't get all font tags with the regexes at once). Sheep  (talk) 00:21, 23 November 2022 (UTC)
 * That sounds like good progress. I looked through the most recent 30 edits, and I see some easy additions that you could make to your regexes. I think that you can get to about 90 or even 95% with a few more iterations. If the bot abandons edits with font tags remaining on the page, that should result in a pretty successful bot that avoids most of the complaints that plagued the hard-working MalnadachBot. – Jonesey95 (talk) 03:06, 23 November 2022 (UTC)
 * I've now coded in regexes so they can catch font tags with at most two attributes; however they can't catch any font tag inside a font tag, and using the = character between the regex font tags will not work even when escaped. Again I've made a hundred edits with these regexes and found 224/1521 (about 14.7%) font tags left over. The unusually high % is because 72 of the missed font tags do not have quotation marks, whereas my regexes were designed to catch font tags with quotation marks. I've edited these regexes and caught these tags, which reduces the % to about 10.0%.
 * Note that sometimes, my regexes create more Linter errors (example). So it seems my regexes are not 100% accurate. However I will make the bot skip any edits if the page still has font tags to reduce the number of extra edits needed. Sheep  (talk) 01:01, 24 November 2022 (UTC)
 * That percentage looks good to me. I recommend to BAG that this bot be allowed to go to trial, with the understanding that if there are font tags remaining on a given page after the bot has applied its regexes, the bot will skip editing those pages entirely. The bot should be fed a selection of pages that result in some skips and some edits. To help with selection of pages: any VPT archive page is likely to result in a skip, since they are full of unusual signatures, and many XFD and User Talk pages should result in successful edits, since they tend to be short. – Jonesey95 (talk) 01:57, 24 November 2022 (UTC)
 * This is on the recommendations of Jonesey95, who I know is on the front lines in this effort. Primefac (talk) 11:16, 28 November 2022 (UTC)
 * Also since this is going to be an AWB bot, mind adding it to the checkpage so it can run? Sheep  (talk) 13:20, 28 November 2022 (UTC)
 * (contribs) Please note that some of the replacements weren't correct, so I had to stop the bot during the trial to fix whatever errors it has caused. It's because some replacements aren't updated when I updated regexes to include whether there's a double quotation mark or not, so these errors show up. It made 20 edits at the time I had to stop. These replacements have been fixed, so everything should be AOK.
 * The bot has checked 78 pages and made 50 edits, so the edit-to-page ratio is about 1:1.56 using this set of regexes; the bot edits ~64% of all pages it checks. It is lower when there are many font tags on the page, but higher when most of them have just a few. Sheep  (talk) 21:16, 28 November 2022 (UTC)
 * Please modify the edit summary so that it links to this BRFA. I examined the last 30 edits in the set (after the regex fix) and did not find any errors. Nice work. I recommend an extended trial. – Jonesey95 (talk) 21:46, 28 November 2022 (UTC)
 * I've now coded in regexes to catch font tags with 4 or 5 digit color hexadecimal codes and font style tags with color, face or size, and to include every character possible except the equals sign (since including it would break regexes). Also before they could catch font tags with either double or no quotation marks; now they can catch font tags with single quotation marks (apostrophes). I've done another hundred edits with these regexes, but rather than visiting exactly 100 pages (which ends up with a small % of font tags left over), I decided to skip pages when there are still font tags remaining. This bot would have this behavior.
 * The edit-to-page ratio (which shows the number of edits made after checking that many pages) is 1: (~% pages edited). So I'm pretty sure that when this gets approved, my bot should be able to fix about 4-5m (including font tags not counted by Linter) obsolete HTML tags linter errors. Sheep  (talk) 22:35, 2 December 2022 (UTC)
 * Special:Diff/1125231675 – span tag with color attribute outside a wikilink does not color text inside it, but font-family works fine. In this situation, the color should be put in separate spans inside the link as shown in Special:Diff/1125340258.
 * Special:Diff/1125231788 – when a font tag is outside more than one wikilink, separate spans should be put inside each wikilink as shown in Special:Diff/1125315624. In this particular example, the font outside the links was wrapping a nbsp for which coloring does not apply, so there is no need to keep the outer span. However the outer span should be kept if it wraps any text along with wikilinks.
 * Special:Diff/1125231764 – similar to above, but the font also wraps text. So the outer span should be retained and one more span used inside the link like Special:Diff/1125342008.
 * Special:Diff/1125231819 – same as above; fix Special:Diff/1125343022.
 * Welp, just doing the search  is not ideal then, so maybe we come up with a plan; I have this page where you can submit signatures that (1) the regexes can't catch, or (2) have font tags that wrap wikilinks so therefore the color doesn't render  have just updated 3 January 2023 font tags. And rather than fixing those particular signature(s) which will cause something similar to this to happen, the regexes will be applied to fix other font tags, and if there are still font tags remaining, the bot will skip that page.  Sheep  (talk) 15:51, 3 December 2022 (UTC)
 * If possible, adjust your regexes to be more selective about what they choose to modify. It is better to fix 1,000 pages with no errors than to fix 10,000 pages with 400 errors, IMHO. – Jonesey95 (talk) 21:15, 3 December 2022 (UTC)
 * I have now added regexes to catch font tags wrapping links; some of the ideas were from here to start and then the rest were from my regexes above. Regexes have been adjusted to exclude [ and ] so there would be fewer errors like these. Unfortunately these cause the bot to skip such instances like these, but it should resolve most of the issues from above. Sheep  (talk) 01:05, 4 December 2022 (UTC)
 * Update: OK, it's been a while, but I've now done three hundred edits (instead of a hundred) using these regexes. The reason I do three hundred is to allow more edits to be checked before this bot task actually runs. Unfortunately if font tags wrap wikilinks, the page may be skipped. However to ensure that some of such pages can still be edited, I have adjusted regexes to catch font tags that wrap a wikilink and text around it.
 * The edit-to-page ratio in this case is 1:1.653 (~60.5% pages edited). It seems my bot will average about 1:1.5 based on these tests. Sheep  (talk) 00:47, 15 December 2022 (UTC)
 * Also those edits were made when regexes are made to ignore hidden comments, images, internal/external wikilinks, math and nowiki. Sheep  (talk) 01:46, 18 December 2022 (UTC)
 * Update: An extended trial may also be used for this since I also requested approval to fix signatures with other Linter errors as well. Originally it was made so that I fix signatures with other Linter errors, but many of such signatures appear on User talk pages, which when I edit them, trigger a notification to the affected users. Some examples for this:
 * Flyguy649's signature, appears on 694 pages, of which 681 are user talk, 1 or 2 misnested tags
 * BlueCaper's signature, appears on 342 pages, of which 293 are user talk, 2 missing end tags
 * All of these result in 964 user talk pages edited, which will trigger a notification to most users, and which is not a good thing for me. The bot will still skip pages with font tags still remaining. Edit: The true number is far higher than this since User talk namespace has the most obsolete tags of all namespaces. Sheep (talk &bull; he/him) 21:04, 22 December 2022 (UTC)
 * Actually, that's gonna be left for MalnadachBot 12. I don't feel like taking over the entire MalnadachBot task 12 since there is already approval for that task. BAG assistance needed: I do feel an extended trial is needed to ensure there are no errors; I also may develop regexes to maximize the number of pages edited while avoiding errors. However, there has been no response from BAG in the past few weeks since the trial was completed. Just so you know, the whole purpose of my test edits (usually a hundred, this is an example) is to ensure the bot operates correctly when using regular expressions to fix these tags. Sheep  (talk • he/him) 21:42, 25 December 2022 (UTC)
 * I'm currently listing a sample of signatures on this page. Basically whatever signatures get submitted are processed and listed in a page similar to this. Mean page size is calculated by adding up the kilobytes of these pages and then dividing by the number of pages with the signature. Either an extended trial through the sample or a bunch of random pages is fine for me. Sheep  (talk &bull; he/him) 04:24, 2 January 2023 (UTC)
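
The skip behaviour agreed on in this discussion (abandon the edit whenever font tags remain after the regexes have run) and the edit-to-page ratio can be sketched in Python as follows. This is only an illustration, not the bot's AWB configuration: the single FONT_COLOR rule, the function names and the sample pages are assumptions made for the example, and Python's re module stands in for AWB's .NET regex engine.

```python
import re

# Hypothetical rule: a font tag with only a color attribute, quoted with
# ", ' or nothing, whose content has no nested tags and no link brackets.
FONT_COLOR = re.compile(
    r'''<\s*font\s+color\s*=\s*["']?\s*([#\w]+)\s*["']?\s*>'''
    r'''([^<>\[\]]+?)<\s*/\s*font\s*>''',
    re.IGNORECASE,
)
# Any opening or closing font tag that is still present.
ANY_FONT = re.compile(r"<\s*/?\s*font\b", re.IGNORECASE)


def fix_page(wikitext):
    """Return the fixed wikitext, or None if the page should be skipped."""
    fixed = FONT_COLOR.sub(r'<span style="color:\1;">\2</span>', wikitext)
    if ANY_FONT.search(fixed):
        return None  # font tags remain, so abandon the edit entirely
    if fixed == wikitext:
        return None  # nothing to change
    return fixed


def edit_to_page_ratio(checked, edited):
    """Return (pages checked per edit, percentage of checked pages edited)."""
    return checked / edited, 100.0 * edited / checked


pages = [
    '<font color="red">Sheep</font>',
    "<font color=blue>a</font> <font face='Arial' size=2>b</font>",
]
results = [fix_page(p) for p in pages]
# The first page is fixed; the second is skipped because a font tag remains.
```

With the trial figures quoted above, edit_to_page_ratio(78, 50) gives a ratio of 1:1.56, that is, about 64% of checked pages edited. Unlike the tests described in this discussion, this sketch does not discount pages that had no font tags to begin with.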

Trial 2
Primefac (talk) 11:24, 11 January 2023 (UTC)
 * (contribs) No errors this time. I used the sample so to demonstrate what I mean by this task. While I was going across the sample, I used the ten signatures' replacements along with the regex replacements so that they can all be replaced at the same time. Some points I want to consider while doing this trial:
 * I skipped base user talk pages since the bot account is currently unflagged and editing base user talk pages would trigger a notification to them, which I don't want.
 * The "font style/class" regex is designed to catch every character, but if it has color, the wikilink color wouldn't render. Because of this, I excluded square brackets from the set of characters the regex would check, and put something and the bot will fix the signature correctly ( something  ). updated 14:10, 12 January 2023 (UTC)
 * The first set of regexes that fix font tags is considered twice, before and after processing the page, and the equal sign is now considered in the set of characters to check between font tags, which is why some consecutive font tags were fixed.
 * can be caught by regexes and turns into, but I don't know if that is the correct replacement for the signature since font style="verdana" isn't correct. In earlier times, whenever I stumbled with that signature, I would replace it with   since "verdana" is a font face and I thought it would be acceptable to use "font-family" CSS as the replacement. And in later times I thought it was not valid so I just replaced it with  . I do not know what I should go with.
 * Otherwise, everything should be AOK. Sheep  (talk &bull; he/him) 01:17, 12 January 2023 (UTC)
 * For things like User:Diez's signature, I would replace it with what it looks like currently with valid css. Most users just move markup around in their signature till they get something they like. This seems like they were experimenting with adding font family before settling on something that doesn't render it. I would replace it with . The trial looks good otherwise. ಮಲ್ನಾಡಾಚ್ ಕೊಂಕ್ಣೊ (talk) 02:18, 12 January 2023 (UTC)
 * As noted as to why the regexes that fix font tags with color attribute only are used multiple times, that is because each time they are used, they fix only one instance of consecutive font tags. This is an instance of me using the first four regexes seven times to fix font tags.
 * Regexes work fine most of the time, but there are edge cases where they sometimes don't work properly with the equal sign to check in the character set. They are useful for fixing some consecutive font tags; however when the equal sign is used, not all font tags get replaced. Using the regex \< *font +size *\= *(\"|\'|) *(0|1|1px|-[2-5]) *(\"|\'|) *\>(.+)\<\/ *font *\>, foo barbaz gets replaced with foo barbaz . The regex was supposed to catch the first closing font tag but it instead went for the second. I do not know why.  Sheep  (talk &bull; he/him) 01:09, 13 January 2023 (UTC)
 * Just a note that it's because I did not make the quantifier lazy. \< *font +size *\= *(\"|\'|) *(0|1|1px|-[2-5]) *(\"|\'|) *\>(.+)\<\/ *font *\> turns foo barbaz into foo barbaz, but \< *font +size *\= *(\"|\'|) *(0|1|1px|-[2-5]) *(\"|\'|) *\>(.+?)\<\/ *font *\> turns foo barbaz into foo bar baz  . However, that regex is now edited to include foo and similar; it is now \< *font +size *\= *(\"|\'|) *([0-1]|1px|[0-1]\.[0-9]*|-[2-5]) *(\"|\'|) *\>(.+?)\<\/ *font *\>. Problem solved. However that means it can't fix font tags inside and outside wikilink(s) properly unless I reorder the first set of regexes, which I did; font tag regexes with two attributes go first.
 * There should be no instance of this happening since the bot would skip pages that still have font tags linter errors. In case you don't know, the order for fixing font tags goes as follows: signature replacements, to regex replacements, to the first four regexes six more times. Sheep  (talk &bull; he/him) 17:05, 13 January 2023 (UTC)
 * Also noting that I have now coded regexes to fix font tags with color face and size. Hopefully it can further increase the edits % on average. Sheep  (talk &bull; he/him) 17:32, 13 January 2023 (UTC)
 * I've done three hundred edits once again and I will now compare the two tests with the edit-to-page ratio, which measures the % of checked pages edited while checking pages.
 * 15 December 2022: Using these regexes, the ratio is 1:1.653. (60.5%)
 * 15 January 2023: Using these regexes, the ratio is 1:. (%)
 * Notice from the comparison that the ratio has moved closer to 1:1. It is not possible to get it to exactly 1:1 (achievable only with a perfect set of regexes), although I will still code regexes to catch more font tags. I did three hundred rather than one hundred so the ratios could be more accurate. Unfortunately, this will be the last test before the bot task is approved. (Update as of 14:51, 17 January 2023 (UTC): I am going to do the very last test of 300 edits in the upcoming hours since as of this post I am in high school right now, so I cannot use AWB during my school hours.)
 * I would like to point out one thing that would make the page harder to read. When using the second set of regexes to fix font tags, for some reason when there's already the same tag in the wikilink, another exact tag would be added. For example, ( bar ) would be replaced with ( bar  )  . I had skipped the page containing it despite it counting through the ratio, though that was done for accuracy reasons.  Sheep  (talk &bull; he/him) 20:20, 15 January 2023 (UTC)
 * While this BRFA is open I will still continue to develop regexes to fix more font tags and skip fewer pages while keeping the error rate as low as possible. Before implementing them to AWB, I would test the regexes by using a fake signature in another website. In the meantime, since there are no errors in the extended trial, this can be approved, and then the process of fixing font tags can begin. Or, you can approve this for the last extended trial, with a mix of random pages and the sample. updated 14:51, 17 January 2023 (UTC) Sheep  (talk &bull; he/him) 02:26, 13 January 2023 (UTC)
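
The greedy-versus-lazy quantifier problem described in this trial can be reproduced with a small Python sketch. The pattern is the font size regex quoted above with the unnecessary backslash escapes dropped; the sample input and the span replacement are assumptions chosen for illustration (the original examples did not survive in this archive), and Python's re module is used in place of AWB's .NET engine, which treats these quantifiers the same way.

```python
import re

# Opening font tag with size 0, 1, 1px or -2..-5, quoted with ", ' or nothing.
PREFIX = r'''< *font +size *= *("|'|) *(0|1|1px|-[2-5]) *("|'|) *>'''
SUFFIX = r'''</ *font *>'''

GREEDY = re.compile(PREFIX + r"(.+)" + SUFFIX)   # runs to the LAST closing tag
LAZY = re.compile(PREFIX + r"(.+?)" + SUFFIX)    # stops at the FIRST closing tag

# Hypothetical replacement; group 4 is the text between the font tags.
REPLACEMENT = r'<span style="font-size:x-small;">\4</span>'

# Hypothetical input with two consecutive font tags on one line.
SAMPLE = '<font size="1">foo</font> bar <font size="1">baz</font>'

greedy_result = GREEDY.sub(REPLACEMENT, SAMPLE)
lazy_result = LAZY.sub(REPLACEMENT, SAMPLE)

print(greedy_result)
print(lazy_result)
```

The greedy version swallows the first closing tag and the second opening tag into group 4, leaving a stray font pair inside the new span; the lazy version converts each font tag separately.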

Comparison of three tests of 300 edits
The very last test of 300 edits before this bot task is approved is now complete. Here are the results: 1 page was skipped due to characters in the Unicode Private Use Area, and 1 page was skipped due to not having font tags (there was a false positive when trying to get pages with font tags), so they had to be discounted in the ratio. Also, I had to manually skip one page due to two consecutive span tags in a wikilink when trying to fix font tags wrapping one wikilink and text around it. Apparently with the font style/class regex, a signature ended up getting replaced with span tags outside a wikilink. So either I have to make it two separate regexes, or you can submit the signature to my submission page so the bot can get the fix correct.
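
For the case raised earlier in the discussion, where a color set outside a wikilink does not color the link text, the fix is to move the span inside the link. A minimal Python sketch of that idea follows; the pattern, the function name and the sample signatures are assumptions for illustration and are not the bot's actual regexes. It deliberately handles only a font tag that wraps exactly one wikilink and nothing else.

```python
import re

# Hypothetical pattern: a color-only font tag wrapping exactly one wikilink,
# either piped ([[target|label]]) or unpiped ([[target]]).
FONT_AROUND_LINK = re.compile(
    r'''<font\s+color\s*=\s*["']?([#\w]+)["']?\s*>'''
    r'''\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]</font>''',
    re.IGNORECASE,
)


def move_color_inside_link(wikitext):
    """Rewrite <font color=X>[[T|L]]</font> as [[T|<span ...>L</span>]]."""
    def repl(match):
        color, target, label = match.group(1), match.group(2), match.group(3)
        if label is None:
            label = target  # an unpiped link displays its target
        return '[[{}|<span style="color:{};">{}</span>]]'.format(
            target, color, label
        )
    return FONT_AROUND_LINK.sub(repl, wikitext)


piped = move_color_inside_link(
    '<font color="green">[[User:Example|Example]]</font>'
)
unpiped = move_color_inside_link("<font color=green>[[User:Example]]</font>")
mixed = move_color_inside_link('<font color="red">hi [[A]]</font>')
```

A font tag that wraps text as well as a link (the last sample) is left untouched here, so a page containing one would still have a font tag and would be skipped under the bot's rule.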

Regexes are made to ignore external/interwiki links, images, nowiki, math and hidden comments. To know how strong my regexes are, I use one of two measures. If the bot did not skip pages that still have font tags, I would use the font tag percentage. If the bot skips pages that still have font tags, I would use the edit-to-page ratio. Currently I use the ratio because I will make the bot skip pages with font tags remaining: that addresses editors' complaints about MalnadachBot making many edits to a single page to fix font tags, it creates fewer errors when editing pages, and the ratio is also easier to calculate. After the last test of 300 edits, either this BRFA can move on to the last extended trial, with half of the edits made with random pages and the other half from the sample, or it can go straight to approval. Sheep (talk • he/him) 13:29, 18 January 2023 (UTC)
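
One way to make replacements ignore protected regions is to split the page around them and run the fix only on the text in between. The Python sketch below shows this for hidden comments, nowiki and math only; AWB has its own options for this, so the PROTECTED pattern, the function names and the sample text here are assumptions for illustration rather than a description of how the bot is configured.

```python
import re

# Regions that must be left byte-for-byte unchanged (a subset, for brevity).
PROTECTED = re.compile(
    r"<!--.*?-->|<nowiki>.*?</nowiki>|<math>.*?</math>",
    re.IGNORECASE | re.DOTALL,
)


def apply_outside_protected(wikitext, fix):
    """Apply fix() to every stretch of text outside protected regions."""
    parts = []
    last = 0
    for match in PROTECTED.finditer(wikitext):
        parts.append(fix(wikitext[last:match.start()]))  # editable text
        parts.append(match.group(0))                     # protected, kept as is
        last = match.end()
    parts.append(fix(wikitext[last:]))
    return "".join(parts)


def fix_red_font(text):
    """Hypothetical single-rule fix used only for this demonstration."""
    return re.sub(
        r'<font color="red">(.*?)</font>',
        r'<span style="color:red;">\1</span>',
        text,
    )


sample = (
    'a <font color="red">x</font> '
    '<nowiki><font color="red">y</font></nowiki> '
    '<!-- <font color="red">z</font> -->'
)
result = apply_outside_protected(sample, fix_red_font)
```

Only the first font tag is converted; the ones inside nowiki and the hidden comment stay as they were. A font tag whose content spans a protected region would be split across segments and missed by this approach.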

BAG assistance needed No edit by BAG in seven and a half days. Sheep (talk • he/him) 00:12, 20 January 2023 (UTC)
 * For what it's worth, 7.5 days isn't really that long from a BAG perspective, though I do suppose 20 is pushing it...
 * Primefac (talk) 11:32, 31 January 2023 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard.