Wikipedia:Bots/Requests for approval/Theo's Little Bot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Theo's Little Bot
Operator:

Time filed: 01:31, Thursday March 28, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python + custom build of mwclient

Source code available: No

Function overview: Goes through all images in Category:Wikipedia non-free file size reduction requests, resizes them to 0.1 megapixels, reuploads them, removes the non-free reduce tag, and tags them with non-free reduced.

Links to relevant discussions (where appropriate): n/a

Edit period(s): Daily

Estimated number of pages affected: Currently, the category has over 2,000 files in it. I imagine that after an initial run, the pages affected each day will be noticeably smaller.

Exclusion compliant (Yes/No): No; pages will have already been explicitly tagged for this.

Already has a bot flag (Yes/No): No

Function details: DASHBot previously handled this (see request), but the bot was blocked a month or so ago due to an unrelated issue. This is a new "from-scratch" bot (in other words, not running DASHBot's code) based on [//toolserver.org/wiki/~theo/resizer a web tool] that I created a few days ago per a toolserver-l mailing list request. It uses the generally accepted formula given at Non-free_content to determine the correct dimensions for a 0.1 megapixel file, then resizes the image using the Python Imaging Library. After performing another check to make sure that the file is indeed still tagged for resizing, the program reuploads the file and replaces the "needs resize" tag with non-free reduced for an admin to then review before the old noncompliant versions are deleted.
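
For illustration, the dimension calculation described above might look roughly like the following sketch. It assumes the formula is new width = sqrt(target pixels × width / height), with height scaled to keep the aspect ratio. The function name `target_dimensions` and the choice to round down are assumptions for this example, not details taken from the bot's (unpublished) source.

```python
import math

TARGET_PIXELS = 100_000  # 0.1 megapixels


def target_dimensions(width, height, target_pixels=TARGET_PIXELS):
    """Return (new_width, new_height) holding at most target_pixels pixels.

    The aspect ratio is preserved. Files already at or below the target
    are returned unchanged. Rounding down is an assumption made here so
    the result never exceeds the target.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= target_pixels:
        return width, height
    # new_width = sqrt(target * width / height), rounded down
    new_width = max(1, int(math.sqrt(target_pixels * width / height)))
    # scale the height by the same factor, rounded down
    new_height = max(1, int(new_width * height / width))
    return new_width, new_height
```

Under these assumptions a 2000 × 1000 file would come out at 447 × 223, just under 100,000 pixels.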

Discussion
 MBisanz  talk 03:48, 28 March 2013 (UTC)
 * [//en.wikipedia.org/w/index.php?limit=50&tagfilter=&title=Special%3AContributions&contribs=user&target=Theo%27s+Little+Bot&namespace=6&tagfilter=&year=&month=-1] Made a few less than 50 because computer fell asleep; I plan to run it on the Toolserver, though, so we shouldn't have this problem. ;) Cheers, — Theopolisme   ( talk )  11:07, 28 March 2013 (UTC)


 * I became aware of this bot request because I saw that the bot edited a few files on my watchlist. I don't know why DASHBot was blocked, but I think that a bot is necessary for this task as human reducers don't seem to be able to cope with the speed people are tagging files with non-free reduce. I think that it generally looks good, but I have some questions:
 * How does the bot reduce images? Does the bot download thumbnail images from MediaWiki, or does the bot use a different tool such as ImageMagick for this task? I'm asking because of a bug I noticed with DASHBot: MediaWiki produces PNG thumbnails of TIF images. Because the thumbnails were in the wrong file format, DASHBot refused to upload reduced copies of TIF files, but until the operator was made aware of this problem, the bot would still tag TIF files with {{subst:furd}}. What will your bot do if it detects a TIF file? Will it reduce the file, will it ignore the file or will it do something unexpected? Note that MediaWiki also returns PNG thumbnails for SVG and XCF files, so the question is also valid for those files.
 * How does the bot choose whether a tagged file should be reduced or not? I think that DASHBot refused to reduce files with less than 160,000 pixels (or something similar), leaving the files unaltered.
 * Does the bot confirm that the file indeed is unfree before reducing it?
 * What does the bot do if it detects a non-image file? For example, the category contains this video file, and here I think it should be reduced by shortening the running time instead of reducing the file resolution. I think that the bot should ignore anything which is not an image since I don't see how a bot would know what to remove from sound and video files. --Stefan2 (talk) 12:53, 28 March 2013 (UTC)
 * Thanks for your questions! To reply in order:
 * Unlike DASHBot, this bot uses the Python Imaging Library -- first, it downloads the raw image file from the Wikipedia servers, then resizes it to 0.1 megapixels. So, no, it does not rely on the MediaWiki algorithms. The bot silently skips SVGs and some other less common file formats, since currently the ability to reduce those isn't built into PIL.
 * The bot reduces all files that are larger than 0.1 megapixels.
 * All it does is check for the presence of the non-free reduce template; this is obviously somewhat prone to vandalism, but the damage is not at all "irreversible" -- and tagging-files-for-reduction-vandalism doesn't seem to be very common.
 * Currently, the bot silently skips all files that it cannot handle.
 * Thanks again for the questions, and let me know if you need further clarification. — Theopolisme   ( talk )  13:37, 28 March 2013 (UTC)
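
As a rough sketch of the checks listed in this reply (skip formats the bot cannot handle, skip files at or below 0.1 megapixels, and require the reduce tag to still be present), the decision could be written as below. The name `should_reduce`, the set of supported extensions, and the tag-matching pattern are all hypothetical and only illustrate the idea; the bot's actual list of formats and its template detection are not given in this discussion.

```python
import os
import re

TARGET_PIXELS = 100_000  # 0.1 megapixels

# Hypothetical set of formats the resizer can handle; SVG is deliberately absent.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}

# Hypothetical pattern for the reduce tag. It matches "{{non-free reduce}}"
# or "{{non-free reduce|...}}" but not "{{non-free reduced}}".
REDUCE_TAG = re.compile(r"\{\{\s*non-free[ _]reduce\s*(\||\}\})", re.IGNORECASE)


def should_reduce(filename, width, height, wikitext):
    """Return True only if the file passes every check described above."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return False  # silently skip files the resizer cannot handle
    if width * height <= TARGET_PIXELS:
        return False  # already at or below 0.1 megapixels
    # Only act while the file is still tagged for reduction.
    return REDUCE_TAG.search(wikitext) is not None
```

A real implementation would probably also need to account for template redirects, which this pattern ignores.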
 * I've noticed an odd thing with JPEG images: your bot removes all EXIF metadata in addition to reducing the image. This may be unwanted, especially if the image is a photo of a copyrighted statue and the EXIF contains information about the camera. Also, if you remove the EXIF metadata from some images, MediaWiki will change the physical orientation of the image. For example, if you remove the EXIF metadata from File:Kaifi Azmi in Annual Mushaira.JPG, then MediaWiki will rotate the image by 270 degrees clockwise.
 * DASHBot used slightly different parameters, but I'm not sure if there was any particular reason for that. I think that it makes very little sense to reduce 100,001 pixels to 99,999 pixels. If a free image is mistagged with non-free reduce, then I would assume that the deleting administrator notices this and reverts instead of deleting revisions. --Stefan2 (talk) 01:17, 29 March 2013 (UTC)
 * DASHBot was prone to the same "2 pixel" issue, as you put it—but then again, since it's so easy for the reviewing admin to simply press 'revert', I don't think this is a cause for concern. As far as EXIF metadata goes...✅ using pyexiv2. Any other thoughts or concerns? — Theopolisme   ( talk )  04:11, 29 March 2013 (UTC)
 * I wasn't active in file matters at that time, but I think that DASHBot's BRFA was preceded by one or more community discussions which eventually led to some kind of consensus on what settings to use. If you use other settings, you might be editing against that consensus.
 * The idea behind WP:NFCC is that files shouldn't be larger than needed. In part, the rule is that you shouldn't use an entire work, so images such as File:Microsoft Word 2011 Icon.png are shown at a reduced resolution although the full size of the icon is probably a lot less than 100,000 pixels. 100,000 is typically enough, so DASHBot was set to reduce files to approximately 100,000 pixels. I think that the target wasn't exactly 100,000 pixels but something very close to it, so it is probably irrelevant if you use exactly 100,000 pixels instead of DASHBot's value. However, if the file is only insignificantly larger than 100,000 pixels, then it can often be argued that the extra pixels may be needed, and it looks stupid to reduce a file if you remove almost no pixels at all, so DASHBot was told to ignore some files with more than 100,000 pixels. I'm not sure why this was done, but if you ignore it, you might be disregarding consensus in some previous discussion about the topic. See for example User talk:Diannaa/Archive 25 where three users are discussing the ideal size of an image. --Stefan2 (talk) 17:04, 29 March 2013 (UTC)
 * Actually, DASHBot didn't take pixel density into account; rather, all it checked was file dimensions. In response to your second concern, I've added a check to see if the percentage difference between the original file's pixel count and the new file's pixel count is greater than 5%; otherwise, the image in question is skipped. — Theopolisme   ( talk )  19:01, 29 March 2013 (UTC)
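
The 5% check mentioned here could be sketched as follows. This assumes the percentage is measured relative to the original pixel count, which the comment does not state explicitly; the function name `reduction_is_significant` is likewise hypothetical.

```python
MIN_REDUCTION = 0.05  # skip the file unless more than 5% of pixels are removed


def reduction_is_significant(original_pixels, new_pixels, min_reduction=MIN_REDUCTION):
    """Return True if resizing removes more than min_reduction of the pixels.

    The fraction is computed relative to the original pixel count
    (an assumption; the discussion only says "percentage difference").
    """
    if original_pixels <= 0:
        raise ValueError("original_pixels must be positive")
    reduction = (original_pixels - new_pixels) / original_pixels
    return reduction > min_reduction
```

With this check, the case raised earlier of reducing 100,001 pixels to 99,999 pixels would be skipped, while a 200,000-pixel file reduced to 100,000 pixels would still be processed.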

The backlog is only growing; can I get approval on this task? Not to put any pressure on anyone, but...hey. — Theopolisme ( talk )  02:54, 31 March 2013 (UTC)
 * I recommend a second trial of 50 edits.— cyberpower ChatOffline 14:41, 29 March 2013 (UTC)
 * I think that it would be nice to see an example of the EXIF fix above. If the operator is allowed an extra trial of 50 edits, there will most likely be an example of that somewhere in the trial. --Stefan2 (talk) 17:04, 29 March 2013 (UTC)
 * I think I still have a few edits left over from my first 50 edit trial, so I'll test a JPEG that I know has EXIF data in just a minute. — Theopolisme   ( talk )  19:01, 29 March 2013 (UTC)
 * ✅; see example file File:By_All_Means_(album)_images.jpg—click on "Show extended details" and you'll see that the EXIF metadata remained in place. — Theopolisme   ( talk )  19:18, 29 March 2013 (UTC)


 * Do try to be around though if you can during the first run, in case problems come up.  MBisanz  talk 06:48, 31 March 2013 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.