Wikipedia:Bots/Requests for approval/Xcbot 2


 * The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Denied.

Xcbot
Operator: Xiong Chiamiov   ::contact::  help!

Automatic or Manually Assisted: automatic

Programming Language(s): bash (and uses part of pywikipedia)

Function Summary: goes through images from Special:Imagelist (offset by one week, discussed below), attempts to optimize them with optipng or jpegoptim, and re-uploads them if they are optimized beyond a certain cutoff percentage

Edit period(s) (e.g. Continuous, daily, one time run): daily/weekly

Edit rate requested: 1 edit per minute (discussed below in function details)

Already has a bot flag (Y/N): no

Function Details: The summary covers the basics. As a side note, both the upload and the download have wait times between retries (so they don't hammer the server) and a set number of attempts (so the script doesn't hang when the server is down or slow); downloading uses wget, and uploading uses upload.py from pywikipedia. There are several variables, however, on which it would be nice to have some input:
 * Currently, the script is written to take images that are one week old, to avoid wasting resources on images that will soon be speedied. I wasn't quite sure what a good amount of time would be, so I arbitrarily chose one week because I needed something to code.
 * This project was prompted by the sheer repetitiveness of optimizing images manually. I don't want to re-upload every image I've optimized, as for some the difference is not very great, and I don't feel the extra disk space (from storing another copy) is justified by the reduced bandwidth. Currently, the script is set to upload only when optimization is greater than 8% / 25 kB, but that, of course, could be given a much better value by someone who understands a bit more about the nuts and bolts of the Wikipedia servers.
 * It is trivial to implement a delay in a bash script so that it waits a set number of seconds before continuing. I have no idea what I should set this to.
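The per-image flow described above might be sketched roughly like this. The thresholds, retry counts, and tool flags here are illustrative assumptions, not the bot's actual code; the real upload step (pywikipedia's upload.py) is left as a comment.

```shell
#!/bin/bash
# Hypothetical sketch of the bot's per-image workflow. MIN_SAVED_BYTES,
# RETRIES, and WAIT_SECS are placeholder values, not the bot's real settings.

MIN_SAVED_BYTES=25600   # assumed cutoff: ~25 kB must be saved before re-uploading
RETRIES=3
WAIT_SECS=30            # pause between retries so we don't hammer the server

# Pure decision helper: re-upload only if enough bytes were saved.
worth_uploading() {      # worth_uploading <orig_bytes> <optimized_bytes>
    local before=$1 after=$2
    [ $(( before - after )) -ge "$MIN_SAVED_BYTES" ]
}

# Download with a retry limit so the script doesn't hang on a slow server.
fetch() {                # fetch <url> <outfile>
    wget --tries="$RETRIES" --waitretry="$WAIT_SECS" -O "$2" "$1"
}

# Optimize in place; jpegoptim is lossless unless told otherwise.
optimize() {             # optimize <file>
    case "$1" in
        *.png)        optipng -quiet "$1" ;;
        *.jpg|*.jpeg) jpegoptim --quiet "$1" ;;
    esac
}

# If worth_uploading passes, the file would then be pushed back up with
# pywikipedia's upload.py (invocation omitted here).
```

For example, `worth_uploading 100000 70000` succeeds (30,000 bytes saved) while `worth_uploading 100000 80000` fails under the assumed 25,600-byte cutoff.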

Also, please note that this account was created some time ago for another bot idea I had (which happened to be a much inferior version of welcome.py from pywikipedia); I have also used it for human-assisted repetitive editing and, recently, for testing this script. As a result, some of the edits on this account are not from this script. Just so you know.

I'd love to have your input, since there are quite a few things I would never think of, both in the technical aspects of the bot and in the reasons for (not) approving it. So in that spirit, feel free to say what you think! Xiong Chiamiov  ::contact::  help! 07:10, 2 March 2008 (UTC)

Oh yeah, the code in all its ugliness is available here! Xiong Chiamiov  ::contact::  help! 07:30, 2 March 2008 (UTC)

Discussion
Is there a minimum size cutoff as well? 8% for a <100KB file isn't all that much, and IMO not really worth saving that 8KB with a new upload. Q T C 00:51, 3 March 2008 (UTC)
 * That's actually something I was thinking about this morning during my shower. I'm trying to remember, was there a way to get the file size from the API...? *digs around* ah hah, I knew I saw it somewhere!  Then again, from a little Googling, it looks like I can do it directly from bash.  I really should be doing my discrete structures homework tonight (ugh!), but I'll hack in a solution to that problem.  Thank you for bringing that to my attention!  On that subject, what do you think the cutoff size should be?  Or should it be much more elegant, with higher optimization percentages for lower file sizes?  Hmm.  Xiong Chiamiov   ::contact::  help! 05:49, 3 March 2008 (UTC)
 * Oh yes, and this was essentially an exercise in learning bash for me (the only thing before that was a simple batch file to recompile my sound driver), so if anyone with more knowledge of the language has suggestions that seem really stupid to you, I am more than open to them! And don't worry, you won't offend me!  Xiong Chiamiov   ::contact::  help! 05:53, 3 March 2008 (UTC)
 * *shrug* Dunno. You could have also just gotten the file size from Special:Imagelist, which you're using already.  Q  T C 05:51, 3 March 2008 (UTC)
 * Oh dear, see what I mean? Yes, since I have that page already, I'll probably just pull the info from that.  Xiong Chiamiov   ::contact::  help! 05:55, 3 March 2008 (UTC)
 * Changes have been made. Along that idea, though, I'm thinking of just getting the file size before and after, and only uploading if the difference between them is beyond a certain amount. The numbers used are purely for the purpose of having numbers.  Xiong Chiamiov   ::contact::  help! 06:51, 3 March 2008 (UTC)
 * The code now uploads only if a certain number of bytes has been saved. Xiong Chiamiov   ::contact::  help! 20:26, 5 March 2008 (UTC)
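The before-and-after size comparison described in the last two comments can be done directly in bash with GNU stat. A small illustration with scratch files of known size (the 25,600-byte threshold is a placeholder):

```shell
#!/bin/bash
# Compare a file's size before and after optimization; keep the optimized
# copy only if the savings pass a byte cutoff. The threshold is illustrative.
MIN_SAVED=25600

bytes_saved() {          # bytes_saved <original> <optimized>
    echo $(( $(stat -c %s "$1") - $(stat -c %s "$2") ))
}

# Demonstration with two scratch files of known size:
head -c 100000 /dev/zero > /tmp/orig.bin
head -c 60000  /dev/zero > /tmp/opt.bin
saved=$(bytes_saved /tmp/orig.bin /tmp/opt.bin)
echo "$saved"            # prints 40000
```

`stat -c %s` is GNU-specific; BSD systems would use `stat -f %z` instead.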
 * The only issue I could see is that most images get dynamically resized on the fly, so all you would be optimizing is the original. Thus, this might create an issue of copy-of-a-copy degradation, since, for example, jpeg compression is lossy.  Though, I have no idea if the optimization will translate to the child images once they're generated from our image servers.  Perhaps you can run a couple of manual sample tests using your normal account and a resized image?  E.g.:
 * Upload SomePicture.jpg, add [[Image:SomePicture.jpg|thumb|250px]] to a sandbox and save.
 * Right click the generated image and check file size.
 * Reupload a manually-optimized Image:SomePicture.jpg.
 * Purge the cache on the sandbox to which you added the original.
 * Right click the generated image thumbnail again and check the new file size.
 * Report both sizes to see if optimization makes any real difference. This will demonstrate whether optimization actually has a significant effect on the automatically generated thumbnails (i.e., where most of the bandwidth is probably being used), or whether it's simply having an effect (if any) on the image page.
 * Also, you might check out the API to make parsing uploaded images easier (e.g., you can use something like this request to see a list of uploaded images instead of screen scraping). Check out the documentation for more advanced things you can request. -- slakr  \ talk / 23:12, 6 March 2008 (UTC)
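The manual thumbnail comparison slakr outlines above could also be scripted: fetch only the HTTP headers of each generated thumbnail and compare their Content-Length values. A rough sketch (the thumbnail URLs are placeholders, and the header parsing is split into a network-free helper):

```shell
#!/bin/bash
# Sketch: does an optimized re-upload change the rendered thumbnail's size?
# Fetch only the headers of each thumbnail URL and compare Content-Length.

# Pure helper so the parsing works on any header text, without network access.
parse_content_length() {
    tr -d '\r' | awk 'tolower($1) == "content-length:" { print $2 }'
}

content_length() {       # content_length <url>  (requires network)
    curl -sI "$1" | parse_content_length
}

# Hypothetical usage with placeholder thumbnail URLs:
# a=$(content_length "https://upload.wikimedia.org/.../250px-XcbotA.jpg")
# b=$(content_length "https://upload.wikimedia.org/.../250px-XcbotB.jpg")
# [ "$a" = "$b" ] && echo "optimization did not reach the thumbnails"
```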
 * By the way, you can check image size by using prop=imageinfo with size added to iiprop (e.g., iiprop=size|timestamp|user|comment|url|sha1|metadata); this query will give you everything you could ever need about the latest revision of a given image in easy-to-parse form. -- slakr \ talk / 23:18, 6 March 2008 (UTC)
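The imageinfo query slakr describes can be assembled in bash like this. The endpoint is shown as it exists today, and the example title is just an illustration; the iiprop list is the one from the comment above.

```shell
#!/bin/bash
# Build a MediaWiki imageinfo API query for a given file title.
# Endpoint and title are illustrative; titles with spaces would need
# URL-encoding first.

api_imageinfo_url() {    # api_imageinfo_url <Image:Name.jpg>
    local title=$1
    echo "https://en.wikipedia.org/w/api.php?action=query&prop=imageinfo&iiprop=size|timestamp|user|comment|url|sha1|metadata&format=xml&titles=${title}"
}

url=$(api_imageinfo_url "Image:Toothbrush.jpg")
echo "$url"
# The XML result could then be fetched with, e.g.:  wget -qO- "$url"
```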
 * I've been in the process of adding a significant amount of code and rethinking how the list of images is retrieved. I just have to fix some regexes, and then I'll throw the code up here and revamp the summary to make it clearer. In fact, that API query is just along the lines of what I'm rewriting it to use.
 * FYI, jpegoptim (which I'm using to optimize the jpegs) claims to optimize losslessly, which is the only reason I considered using it.
 * I hadn't considered the thumbnails, so I'll look into that now and get back to you. Xiong Chiamiov   ::contact::  help! 11:16, 7 March 2008 (UTC)
 * I took the file from Toothbrush.jpg, saved it as a.jpg and b.jpg, then optimized the latter. I uploaded those two as XcbotA.jpg and XcbotB.jpg, then included them both on my sandbox. Although the savings on the original file were almost 50%, the thumbnails both appear to have the same number of bytes. Per this (thank you, Slakr), I will hold off on development until it is decided whether this will benefit the project.
 * BTW, the code I was implementing would start at the beginning of the file-upload logs (sometime in 2002, I think?) and slowly work its way toward the present, always keeping a two-week buffer between its work and the current date. If anyone wants to see it before I finish fixing some things (which will wait upon confirmation of a positive effect), I can put it up.  Xiong Chiamiov   ::contact::  help! 11:38, 7 March 2008 (UTC)
 * Oh yes, and I've made some other minor optimization improvements, like not downloading files that aren't JPEGs or PNGs (since those are the only formats we can optimize) or files that are smaller than the set minimum savings limit. Xiong Chiamiov   ::contact::  help! 11:41, 7 March 2008 (UTC)

Per the findings above, I'd like to ask whether optimizing images on WP (or Commons, for that matter) really helps enough to offset the cost of re-uploading them. I'd just like a simple "no" to confirm my thoughts before I put this out of my mind. Xiong Chiamiov  ::contact::  help! 05:32, 11 March 2008 (UTC)

I'd just like an answer. Xiong Chiamiov  ::contact::  help! 18:44, 20 March 2008 (UTC)

It is my opinion that this is best implemented (and is, as a part of our resizing routines) as a part of the MediaWiki software (imagemagick). — Werdna talk 08:09, 21 March 2008 (UTC)


 * The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made on the appropriate discussion page, such as the current discussion page. No further edits should be made to this discussion.