Wikipedia:Bots/Requests for approval/DrilBot 3


 * The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

DrilBot 3
Operator: Drilnoth

Automatic or Manually assisted: Fully automatic, unsupervised

Programming language(s): AutoWikiBrowser

Function overview: Go through Category:License migration candidates and add |redundant=yes to images which need it.

Edit period(s): While I'm around to check on it. Basically continuous during the day in my local time.

Estimated number of pages affected: Most of those tagged with Cc-by-sa-3.0 or Cc-by-3.0 alongside GFDL or another of the templates in Category:Templates using the license migration system. Unsure of exact total.

Already has a bot flag (Y/N): Y

Function details: With the new Image license migration, bots are needed to help out with the enormous task of updating license documentation. I believe that the "redundant" parameter is most easily done by bot, since an image already tagged with both applicable licenses doesn't need any notice at all.

DrilBot would add  to the GFDL-type tag of any image which: A) Does not include the RegEx , since that indicates that migration information has already been added. B) Has a Cc-by-sa-3.0 or Cc-by-3.0 tag (or a redirect to one of them) in addition to one of the tags in Category:Templates using the license migration system. C) Does not include the RegEx  (see below).

If a file looks like it may be tagged as being both free and non-free at the same time, the bot will instead add a "needs-review" parameter, since that combination needs human eyes.
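As a rough illustration of the page-level decision described above, here is a minimal Python sketch. It is not the bot's actual code (the bot uses AWB find-and-replace rules); the function name `classify` and its return values are hypothetical. The patterns are the "contains" and "not contains" expressions from the Current coding section, and the sketch assumes the page is already known to carry a GFDL-type tag.

```python
import re

# Patterns copied from the rules in the "Current coding" section; case-insensitive.
CC_3 = re.compile(r"Cc-by-3\.0|Cc-by-sa-3\.0", re.IGNORECASE)
MIGRATED = re.compile(r"migration", re.IGNORECASE)
NON_FREE = re.compile(r"fair ?use|non ?-? ?free", re.IGNORECASE)


def classify(wikitext: str) -> str:
    """Return 'skip', 'needs-review' or 'redundant' (hypothetical helper).

    Assumes the page already carries a GFDL-type tag.
    """
    if MIGRATED.search(wikitext):
        return "skip"          # migration information is already present
    if NON_FREE.search(wikitext):
        return "needs-review"  # possibly free and non-free at once: leave to a human
    if CC_3.search(wikitext):
        return "redundant"     # a CC 3.0 tag sits next to the GFDL tag
    return "skip"
```

Checking for existing migration information first mirrors the "If not contains: migration" condition that both rules share, so an already processed page is never touched twice.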

Additionally, DrilBot may fix common template redirects (I know about WP:R2D, but if it's editing the page anyway I think that standardization is good in the case of template transclusions) and perform other common cleanup similar to its CHECKWIKI fixes (from its first task), but excluding changes not wanted on most image files.

I haven't actually coded this bot yet, but I believe that it should be fairly simple with AWB's built-in features. If you want to see exactly how I'll execute this, I can write it during this BRFA. –Drilnoth (T • C • L) 17:24, 20 June 2009 (UTC)

Discussion
After running some simple tests manually, I'm now confident that this can be coded without too much difficulty. Also, it would probably make more sense to limit how I get the lists not only to the category mentioned above, but also transclusions of the various templates (there's a number of reasons for this); that shouldn't make much of any difference on actual functioning, however. –Drilnoth (T • C • L) 20:47, 20 June 2009 (UTC)


 * Probably asking too much, but is there some way it could evaluate if an image can be tagged as Copy to Wikimedia Commons? If it is only a GFDL/CC image, I believe it can be tagged in this fashion, and for 192,000 edits I would like to get as much as possible out of each edit.  MBisanz  talk 21:19, 21 June 2009 (UTC)
 * That shouldn't be too hard either, although I question the value of it... since most all articles in categories like Category:Creative Commons Attribution-ShareAlike 3.0 images need to be moved anyway, why bloat the category populated by files which were manually added? That said, I'm pretty neutral on this and will implement it if it makes sense. (also, it won't be all 192,000 edits, since it will only tag those which are redundant or indicated as also being non-free). –Drilnoth (T • C • L) 16:45, 22 June 2009 (UTC)
 * Also, I might be able to figure out a way to pick up images which might have certain problems... e.g., lacking a source. AWB can't, to my knowledge, be effectively used to tag an image with something like di-no source and also notify its uploader, but it could add a category like Category:Wikipedia files with possible problems, so that a human can easily go through them. Things like information templates which contain no content are tell-tale signs of images lacking sources, as are pages that lack content other than section headers and license templates. There would be some false positives in this, which is why it would be put in a category for human review, but it might be useful (from those two examples, I know that I can do the first one; the second might not be real doable with AWB). Just another way to get the most out of the edits, if it seems reasonable to you. –Drilnoth (T • C • L) 17:12, 22 June 2009 (UTC)

I think this is a great idea. I'm not warm to the idea of adding Copy to Wikimedia Commons, though, because there are too many opportunities for false positives. (For instance, a GFDL photo of the Eiffel Tower has no problems in the U.S., and so is allowable on en.wiki, but is considered a derivative work in France, so cannot be uploaded to Commons.) – Quadell (talk) 22:31, 22 June 2009 (UTC)
 * Good point, that can be removed from the code quite easily. I could actually expand this to add the relicense parameter if wanted, too... the only sticking point is criterion #4 of Image license migration... a bot can't really tell, to my knowledge, if an image was published elsewhere and, if so, when it was uploaded. Although if the templates are being used properly, I could probably apply it to GFDL-self and similar templates pretty safely, if that is wanted. –Drilnoth (T • C • L) 22:39, 22 June 2009 (UTC)


 * Per #4, anything uploaded into Wikipedia under the GFDL before November 1st, 2008 should be fair game to relicense. It's a little tricky, but a bot that checks whether the date of the most recent upload was earlier than Nov. 1st should be able to handle those.  The other half, images uploaded after Nov. 1st, will probably need to be looked at by hand.  Dragons flight (talk) 23:36, 22 June 2009 (UTC)
 * Sorry; I meant to say AWB bot. PY could probably do it, but AWB would need some internal coding changes I think, which seems kind of pointless to do for this specific task. Would you say that GFDL-self and the similar templates are safe, or should I err on the side of caution and just not do anything with that? –Drilnoth (T • C • L) 00:32, 23 June 2009 (UTC)
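For reference, the date test suggested above is simple to express outside AWB. The following is only a sketch in Python: the helper name `uploaded_before_cutoff` is hypothetical, the caller is assumed to supply the timestamp of the most recent upload, and treating the cutoff as midnight UTC is an assumption.

```python
from datetime import datetime, timezone

# Cutoff from criterion #4 as described above: uploads before 1 November 2008.
# Interpreting the boundary as midnight UTC is an assumption.
CUTOFF = datetime(2008, 11, 1, tzinfo=timezone.utc)


def uploaded_before_cutoff(latest_upload: datetime) -> bool:
    """True if the most recent upload predates the cutoff (hypothetical helper)."""
    return latest_upload < CUTOFF
```

Files for which this returns False would be the ones left for review by hand.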

Let me say, in general, that this is exactly the kind of work that we want bots to be doing with respect to the license migration. I'm glad someone has stepped up to work on it. :-) Dragons flight (talk) 23:38, 22 June 2009 (UTC)

Also, Cc-by-sa-3.0,2.5,2.0,1.0 should be included on the redundant making list. Dragons flight (talk) 23:42, 22 June 2009 (UTC)
 * Actually, the code to find Cc-by-sa-3.0 should pick up that cc-by-sa über-license as well. –Drilnoth (T • C • L) 23:44, 22 June 2009 (UTC)

Current coding
Just adding in this section so that you can all see what code I'd like to use if this is approved (this is a work in progress, so is not complete in any way). Note that it uses AWB's find-and-replace options, so it is really just a set of regular expressions. I'll also be testing this supervised before letting it run automatic to make sure that there aren't any bugs.

I know that my RegEx is kind of messy because I haven't quite figured out how to make it most efficient... if anyone has any suggestions on how to simplify it, please let me know. The "DrilBotFairUseProblem" is needed for the subrules to work, because then AWB notices that the main rule was applied successfully; it's removed in the last subrule so will have no visual effect. All matches are case-insensitive and applied only once unless otherwise noted. –Drilnoth (T • C • L) 18:21, 22 June 2009 (UTC)


 * It should now cover: Adding the migration and needs-review parameters when appropriate, tagging with Copy to Wikimedia Commons when appropriate, and adding the as-yet-uncreated Category:Wikipedia files with possible problems to any page which contains an information template which has no information in it. –Drilnoth (T • C • L) 20:10, 22 June 2009 (UTC)
 * Removed Commons tagging per Quadell's comment above. –Drilnoth (T • C • L) 22:41, 22 June 2009 (UTC)
 * Removed information-type tagging, as one of those empty templates already adds the page to three maintenance categories. –Drilnoth (T • C • L) 03:38, 23 June 2009 (UTC)

Rule
 * If contains: Cc-by-3\.0|Cc-by-sa-3\.0
 * If not contains: migration|fair ?use|non ?-? ?free
 * Find: (\}\})
 * Replace with: $1\nDrilBotOKtoTagRedundant
 * Subrule
   * Find: \{\{self2?\|GFDL\|cc-by-sa-3\.0\}\}|\{\{self2?\|cc-by-sa-3\.0\|GFDL\}\}
   * Replace with:
 * Subrule
   * Find: \{\{self2?\|GFDL\|cc-by-3\.0\}\}|\{\{self2?\|cc-by-3\.0\|GFDL\}\}
   * Replace with:
 * Subrule
   * Find: \{\{self\|GFDL\|cc-by-sa-3\.0\,2\.5\,2\.0\,1\.0\}\}|\{\{self\|cc-by-sa-3\.0\,2\.5\,2\.0\,1\.0\|GFDL\}\}
   * Replace with:
 * Subrule
   * Find: \{\{GFDL(-author|-retouched|-self|-self-with-disclaimers|-self-en|-user|-with-disclaimers|-en)?\}\}
   * Replace with:
 * Subrule
   * Find: \{\{Multilicense replacing placeholder( ?\| ?class ?\= ?people)?\}\}
   * Replace with:
 * Subrule
   * Find: \n?DrilBotOKtoTagRedundant
   * Replace with empty string

Rule
 * If contains: fair ?use|non ?-? ?free
 * If not contains: migration
 * Find: (\}\})
 * Replace with: $1\nDrilBotFairUseProblem
 * Subrule
   * Find: \{\{self2?\|GFDL\|cc-by-sa-3\.0\}\}|\{\{self2?\|cc-by-sa-3\.0\|GFDL\}\}
   * Replace with:
 * Subrule
   * Find: \{\{self2?\|GFDL\|cc-by-3\.0\}\}|\{\{self2?\|cc-by-3\.0\|GFDL\}\}
   * Replace with:
 * Subrule
   * Find: \{\{self\|GFDL\|cc-by-sa-3\.0\,2\.5\,2\.0\,1\.0\}\}|\{\{self\|cc-by-sa-3\.0\,2\.5\,2\.0\,1\.0\|GFDL\}\}
   * Replace with:
 * Subrule
   * Find: \{\{GFDL(-author|-retouched|-self|-self-with-disclaimers|-self-en|-user|-with-disclaimers|-en)?\}\}
   * Replace with:
 * Subrule
   * Find: \{\{Multilicense replacing placeholder( ?\| ?class ?\= ?people)?\}\}
   * Replace with:
 * Subrule
   * Find: \n?DrilBotFairUseProblem
   * Replace with empty string
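To make the marker technique easier to follow, here is a hedged Python sketch of the same three steps: insert a temporary marker after the first "}}", rewrite the GFDL-type tag, then strip the marker. It is an illustration only, not the AWB configuration itself. The function name `tag_page` is hypothetical, the replacement text of the subrules is not shown above and is therefore an assumption, and only the plain {{GFDL...}} subrule is reproduced. The caller passes the parameter to append, for example the |redundant=yes from the function overview.

```python
import re

MARKER = "DrilBotOKtoTagRedundant"

# Plain GFDL-type tags, taken from the subrule above; the {{self|...}} and
# placeholder variants are left out to keep the sketch short.
GFDL_TAG = re.compile(
    r"\{\{GFDL(-author|-retouched|-self|-self-with-disclaimers|-self-en"
    r"|-user|-with-disclaimers|-en)?\}\}",
    re.IGNORECASE,
)


def tag_page(wikitext: str, param: str) -> str:
    """Append `param` to the first plain GFDL-type tag via a temporary marker."""
    # Main rule: put a marker after the first "}}" so the rule counts as applied.
    marked = re.sub(r"(\}\})", r"\1\n" + MARKER, wikitext, count=1)
    # Subrule: turn {{GFDL...}} into {{GFDL...<param>}}.
    # The real replacement text is not shown in the rules; this form is assumed.
    marked = GFDL_TAG.sub(lambda m: m.group(0)[:-2] + param + "}}", marked, count=1)
    # Last subrule: remove the marker so it never appears in the saved page.
    return re.sub(r"\n?" + MARKER, "", marked, count=1)
```

For example, `tag_page("{{GFDL-self}}\n{{Cc-by-sa-3.0}}", "|redundant=yes")` leaves the Cc-by-sa-3.0 tag alone and returns the page with `{{GFDL-self|redundant=yes}}` in place of the original GFDL tag.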

Gentle-bot: start your engines. – Quadell (talk) 12:50, 23 June 2009 (UTC)
 * Can do; I'll be running it fully supervised at least during the trial to make sure that the coding is good. –Drilnoth (T • C • L) 14:09, 23 June 2009 (UTC)
 * So should I also add  to pages using GFDL-self, GFDL-self-with-disclaimers, etc.? If the template is being used properly then it wasn't from another source and criterion #4 wouldn't apply (there's a lot of these, so I think that it would be very beneficial, although there may be a small handful of false positives caused solely where the wrong template was being used). –Drilnoth (T • C • L) 14:35, 23 June 2009 (UTC)
 * Edits are shown at . I didn't see any problems other than that the bot missed some things that it should have changed, but that was due to typos and bugs in the code which I have now fixed (further bugs in the code shouldn't usually cause actual false positives, but cause the bot to skip the page because it doesn't "find" the proper text). The typos in the edit summaries were also mine, obviously, and have been fixed.
 * In addition to the GFDL-self migration tagging which I mentioned above which I could add (which would probably double the number of edits that the bot makes, if not more), should I maybe treat  the same way as fair-use tagging? If something is indicated as being both PD and GFDL, one of the tags is incorrect. Thoughts? (I'd be happy to do a second trial with these additions, if that would be needed). –Drilnoth (T • C • L) 15:00, 23 June 2009 (UTC)


 * With respect to GFDL-self, I worry about the weird edge cases. For example, a self-licensed work still could have been published elsewhere first.  Would it be possible for you to check the wikitext for external links and exclude those?  Dragons flight (talk) 15:22, 23 June 2009 (UTC)


 * With respect to PD and GFDL tags in the same image, it's not true that one must be incorrect. Consider a photo of an old statue; the photo may be tagged PD in regard to the statue (saying it's not a derivative work), but GFDL in respect to the photo. Similarly for composite images. In fact, a photo may be validly tagged GFDL and non-free: if I take a photo of a book or Transformers toy, I may tag it non-free in regard to the underlying content, but GFDL in regard to the photo itself (any new content). – Quadell (talk) 15:37, 23 June 2009 (UTC)
 * (@ Dragons flight): Good point. I can certainly run a search on the page for  in order to weed out any pages with external links on them (only for the "relicense" tagging; X-links shouldn't matter on "redundant" tagging). There may still be a handful of false positives then, but that should catch most of them... if wanted, I could actually tag such pages as "needs-review" rather than just skipping them.
 * (@ Quadell): Agreed. There are instances where both tags are appropriate; I misworded my comment. My feeling is that if something is tagged with a GFDL tag and either a fair-use or PD tag, then a human should look at it and make sure that there aren't any problems... therefore, "needs-review". I have no intention of tagging such images as "ineligible" for the relicensing, as that would produce too many false positives, but human eyes should probably look at them because, although many are valid, there are quite a number where having both tags is unnecessary, one of the tags is incorrect, or something is weird (e.g., using both GFDL and non-free use rationale).
 * (I can write up my proposed RegEx with all of this in mind, if you'd like to take a look at that). –Drilnoth (T • C • L) 15:59, 23 June 2009 (UTC)
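The external-link check discussed above could look roughly like the Python sketch below. The expression actually proposed is not shown in this discussion, so the pattern here is an assumption, as are the names `EXTERNAL_LINK` and `relicense_action`. It follows the suggestion of tagging such pages "needs-review" rather than skipping them.

```python
import re

# Assumed pattern: the expression actually proposed is not shown in the discussion.
EXTERNAL_LINK = re.compile(r"(?:https?|ftp)://", re.IGNORECASE)


def relicense_action(wikitext: str) -> str:
    """Return 'needs-review' if the page has an external link, else 'relicense'."""
    if EXTERNAL_LINK.search(wikitext):
        return "needs-review"  # may have been published elsewhere first
    return "relicense"
```

This applies only to the "relicense" tagging; as noted above, external links should not matter for "redundant" tagging.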

–Drilnoth (T • C • L) 14:34, 27 June 2009 (UTC)

Godspeed, trusty bot. – Quadell (talk) 00:57, 28 June 2009 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.