Wikipedia:Bots/Requests for approval/Orphaned image deletion bot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Orphaned image deletion bot
Operator: Chris G

Automatic or Manually assisted: Auto

Programming language(s): PHP, using my classes

Source code available: Yes, Once it is written here

Function overview: Deletes orphaned fairuse images per WP:CSD

Edit period(s): Daily

Estimated number of pages affected: between about 20 - 100 deletions a day depending on backlogs etc

Exclusion compliant (Y/N): No, although it will ignore images with hangon

Already has a bot flag (Y/N): N

Function details:
 * gets all the images in the relevant sub cat of Category:Orphaned non-free use Wikipedia files
 * runs the following checks on each image:
 * no hangon tag
 * image is orphaned
 * has a fair use licensing tag (the bot uses Category:Non-free Wikipedia file copyright tags to work out which tags are for non-free images)
 * has been tagged for deletion for at least 7 days
 * if the image passes all the checks the image is deleted, otherwise it is skipped
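The checks listed above can be sketched as a single predicate. This is a minimal Python illustration, not the bot's PHP code: the `ImageState` record, the `should_delete` name, and the sample tag names are assumptions made for the example, and the real bot derives its tag list from Category:Non-free Wikipedia file copyright tags.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ImageState:
    """Hypothetical snapshot of what the bot knows about one file page."""
    has_hangon: bool          # a hangon tag is present on the file page
    usage_count: int          # number of pages still using the image
    license_tags: set         # template names found on the file page
    tagged_since: datetime    # when the deletion tag was added


# Illustrative stand-in for the tag list taken from the copyright-tag category.
NON_FREE_TAGS = {"Non-free logo", "Non-free album cover"}


def should_delete(img, now, non_free_tags=NON_FREE_TAGS,
                  min_age=timedelta(days=7)):
    """Return True only if every listed check passes; otherwise skip."""
    if img.has_hangon:
        return False                       # contested: leave for a human
    if img.usage_count > 0:
        return False                       # not orphaned
    if not (img.license_tags & non_free_tags):
        return False                       # no recognised non-free tag
    return now - img.tagged_since >= min_age
```

Any failed check results in a skip rather than a deletion, which matches the "otherwise it is skipped" rule in the list.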

Discussion
My understanding is that images that have moved & are only linked through a redirect will look like orphans. (I could be wrong about that, but that is my understanding of the current situation.) Is the bot aware of this possibility? --ThaddeusB (talk) 15:16, 26 September 2009 (UTC)
 * Special:WhatLinksHere shows pages linking to a redirect to the page in question. -- Chris 01:28, 27 September 2009 (UTC)
 * Please see https://bugzilla.wikimedia.org/show_bug.cgi?id=18017 for further info. Depending on how this bot is designed, it could be an issue. --ThaddeusB (talk) 03:29, 27 September 2009 (UTC)

since adminbots are always controversial. I suggest at least WP:VPR (for the general community), WP:AN (for admins), WT:NFCC (for people interested in non-free images), and WT:CSD (for people interested in CSD). Template:Cent might not hurt either. Anomie⚔ 15:23, 26 September 2009 (UTC)
 * Adminbots aren't always controversial :) However I have spammed the pages you mentioned. -- Chris 01:28, 27 September 2009 (UTC)

How do you plan to deal with images that are double-tagged with both a free license and a non-free license? --Carnildo (talk) 19:02, 26 September 2009 (UTC)
 * Those will be listed on a userpage for manual human review. -- Chris 01:28, 27 September 2009 (UTC)

Can you please explain what the bot will do in the following situations: 1)The current revision of an image is categorized as non-free, but there is at least one old revision of the image that was tagged as free. 2)An image license or permission template is being actively edit warred over, but the image is at least sometimes orphaned and tagged as non-free, and at least sometimes not either of these. 3)An image categorized as non-free has an active edit war over its use in mainspace, added and removed from several articles, but orphaned at least part of the time, and linked from mainspace at least part of the time. Listing these for human review is a good answer, but can you guarantee the bot will detect such situations in order to list them? — Gavia immer (talk) 01:47, 27 September 2009 (UTC) -- Chris 02:57, 28 September 2009 (UTC)
 * 1) Delete
 * 2) I highly doubt that an edit war over a license template would result in the template being changed from non-free to free (or vice-versa) however in the event this did happen, if the image was tagged as non-free (with no free image tags on the page) it would be deleted - otherwise the bot would skip the image
 * 3) If the image was being added and removed from a page in an edit war, one would hope that someone would have the sense to put hangon on the image page until the edit war was over. There is no easy way for the bot (or even a human admin) to tell which articles have linked to an image in the past but don't link to it anymore.
 * Does the code check that it has been CONTINUOUSLY tagged for deletion for at least 7 days? Does the code check that the uploader has been notified of the pending deletion for at least 7 days?  I've seen an edit war over a template's free status.  A vandal can do a lot of damage with a few strategic edits with this bot's assistance.  We have lots of blanking vandals, so we must anticipate this.  How 'bout having the bot log each day a count of actions taken and expected the next day?  --Elvey (talk) 21:04, 16 October 2009 (UTC)
 * Yes. No it seems unnecessary to me IMO. -- Chris 06:58, 18 October 2009 (UTC)

Since this is dealing with matters that can be caused by vandalism, such as pages being edited to exclude images, what time period will be waited before an image is considered for deletion from the encyclopedia? When will this bot be written? Personally, I would want this written and reviewed by at least one other bot creator before being approved, since this is a situation where major issues can occur (especially with it having the admin flag). Peachey88 (Talk Page · Contribs) 04:56, 27 September 2009 (UTC)
 * Read the function details - the image will be deleted if it "has been tagged for deletion for at least 7 days". The code will be written when I write it - I can't guarantee that someone will review it, but I will make the code available so if someone does want to review it they can. -- Chris 02:57, 28 September 2009 (UTC)
 * No need to worry about finding someone to review it, I was already planning on doing so once the code is posted. Anomie⚔ 04:14, 28 September 2009 (UTC)

Will this impact images in other than main space? I assume that non-free images can only be used in article space, but I don't know enough about images on wikipedia to know what other issues may be. --69.225.5.4 (talk) 07:16, 27 September 2009 (UTC)
 * Non-free images are only supposed to be used in the main space, however since people like to make drafts etc in their userspace and then move it to the article space the bot will ignore namespace (i.e. if the image is being used on someone's userpage the bot won't delete it). -- Chris 02:57, 28 September 2009 (UTC)
 * Okay, this leaves only fair use images within the scope of this particular bot to be dealt with. Thanks. --69.225.5.4 (talk) 06:18, 28 September 2009 (UTC)

I would like to see an effort to adopt the image to the page it claims fair use for. There are a fair proportion of these that get uploaded and a correct FUR, but not correctly inserted into their page. This often applies to album covers and company logos, which may be uploaded by inexperienced editors. So the extra test I propose is that
 * the image has to have been used in the article it is fair use for as specified in a FUR, at some point in its history. Graeme Bartlett (talk) 04:00, 28 September 2009 (UTC)
 * I don't quite understand how this will help. A large amount (in my experience anyway) of the images tagged for deletion under F5, have no FUR and have never been used in an article. Your check would mean these images would stay undeleted. -- Chris 04:10, 28 September 2009 (UTC)
 * Isn't that more a matter for whoever tags the image as orphaned in the first place? Anomie⚔ 04:14, 28 September 2009 (UTC)


 * Well if there is no FUR you won't be able to tell what the article is, so this test won't apply. So I suppose no FUR for 7 days is another reason to delete. Graeme Bartlett (talk) 05:15, 28 September 2009 (UTC)
 * I think when dealing with fair use images it's okay to err on the side of deletion, particularly as these are orphaned images. I assume whoever uploaded them, would be watch-listing them, but, again, the images of concern are fair use images not in an article. I don't think extra care is needed. That's my opinion. --69.225.5.4 (talk) 06:18, 28 September 2009 (UTC)

You say that the bot would check to see if the image was in use- I assume this means in use in the article space? A non-free image in use in the user, portal, talk or whatever space still counts as orphaned, for the purposes of CSD. J Milburn (talk) 11:10, 28 September 2009 (UTC)
 * I've already answered this question above - "Non-free images are only supposed to be used in the main space, however since people like to make drafts etc in their userspace and then move it to the article space the bot will ignore namespace (i.e. if the image is being used on someone's userpage the bot won't delete it)." -- Chris 11:29, 28 September 2009 (UTC)
 * I agree that that's the best way to handle it - fair use images outside of article-space are disallowed, but in many cases there are good-faith reasons such as drafts that make it less clear-cut. Those kind of issues are best handled manually and hence I think it's wise to keep them out of the purview of this bot. In all other respects I'm also quite happy with this bot - I have no disagreements in principle with it. ~ mazca  talk 12:03, 28 September 2009 (UTC)
 * Yes, this makes sense, have the bot do only the clear cut task. --69.225.5.4 (talk) 05:39, 29 September 2009 (UTC)

Be careful of bug 7304. If "foo" redirects to "bar", but after the redirect text, "foo" contains a link to "bas", a WhatLinksHere on "bas" will say that "foo" redirects to "bas", which is not true. For a real life example, see the text of my sandbox, a list of supposed redirects to the "chocolate" page, and. The bot should therefore ignore image redirects outside of the file namespace, to avoid false positives. In the "long cane" example in my sandbox, I put a colon in front of the image to make it a link, as happens on user talk pages and occasionally in articles. If an image is actually used in a redirect page, it will come up as a redirect and an image link at the same time like ; these cases should also be ignored by the bot. Hope this makes some sense; I'm just worried about the bot trying to follow false redirects. Graham 87 13:30, 28 September 2009 (UTC)

Sounds good to me. The daily F5 categories can contain upwards of 2000 images if an orphan tagging bot makes a run after being inactive for a period (granted not very often, usually no more than ~100). Clearing them out manually is mind-numbing drone work, and once the backlog has been there for a couple of days someone will inevitably execute a batch delete on the offending category with Twinkle or something anyway, resulting in the de facto automated deletion of all the remaining images without any checking. Much better to have a bot that actually does some sanity checking, erring on the side of skipping, do the bulk of the work and then have humans clear out the few leftovers one by one. --Sherool (talk) 21:19, 28 September 2009 (UTC)

Without wishing to create more complexity, is it worth treating any edits to the image page as re-starting the (a?) clock? In other words, maybe someone is trying to fix up the license, leave some kind of human-related message or whatnot. Rich Farmbrough, 21:49, 28 September 2009 (UTC).

I oppose this at present. I am sympathetic to the backlogs problem, but historically (Betacommand bots et al.) handling fair use images through bots like these has not been popular. While it does check for a fair use licensing tag, not all image pages use them. Some users prefer writing bulletpoints or prose 'by hand'.

One difference between human intervention and bots handling orphaned fair-use images is the increased overhead involved for users wishing to have an image restored. With the former, a user wishing to have an image restored in appropriate cases, perhaps after it was orphaned when all it needed was a re-written rationale, merely needs to ask the deleting admin. In a case where someone removes an image from an article, and where the need for its removal is debatable, the erasure of the image from 'ordinary public view' is, as well as being swifter, so much more "silent" with a bot. –Whitehorse1 01:52, 1 October 2009 (UTC)
 * All images are required by policy to have a valid license tag and can be deleted under WP:CSD if they are missing a tag. I also don't see your point about increased overhead, if someone wants an image that the bot deleted restored all they have to do is ask me on my talkpage. Also the bot doesn't remove any images from articles. -- Chris 02:10, 1 October 2009 (UTC)
 * Oh, you mean a copyright tag rather than a fair-use rationale. *nod* We're on the same page now. The terminology chosen above was a bit ambiguous. I was more talking about newer users with the extra steps thing. I realize it doesn't remove images, though sometimes that's more a technical distinction: $someuser removes or even comments out image syntax in an article, orphaning it, perhaps in midst of other changes to the article; untouched by human hands during the 7 days that follow, that image is deleted. –Whitehorse1 02:32, 1 October 2009 (UTC)

Code
Source code is here -- Chris 01:17, 3 October 2009 (UTC)
 * Comments:
 * getCategoryMembers will loop forever if there is a loop in the category hierarchy. While that certainly shouldn't happen for the image deletion cats, it's worth noting; at the least, I might throw in a "max depth" option that dies if the bot ends up digging too deeply.
 * Not sure whether preg would be more efficient with  than.
 * Isn't  equivalent to  ? Or did you mean  ? In either case,   would be a bit more robust.
 * If no human keeps up with the "Human review" page (or the OIFD cat), the bot will happily add the same image to the list each day until it gets too huge to edit. True, that's unlikely, but maybe worth considering. Also, consider if it's possible to edit that page once per run instead of for every image found.
 * The "check to see that the image is indeed orphaned" check is b0rken, you want to use $page inside the loop instead of $image.
 * Your logic for "check to see if the image has been tagged for the full seven days" is faulty. As written, it will do "delete if any revision from creation of the page until 7 days ago (or only as many of the first 500 will fit in the api query result size limit) contains the deletion template". That would be easy to game for incorrect deletions: Tag the image and revert yourself, then 8 days later (just before the bot is to run) tag it again. The logic should probably be "delete if all revisions within the past 7 days contain the deletion template" instead; don't forget to explicitly check the first revision in case for some reason it is older than 7 days. That would be easy to game to prevent automated deletion, but a human should easily catch it then (and at any rate that's a safe failure mode).
 * Does $wiki->query throw an exception (or just die) if there is any sort of API error? If not, the "check if orphaned" (and most likely the fixed "check to see if the image has been tagged for the full seven days") are dangerous.
 * Is it worth also deleting the redirects (if any) to the newly-deleted image?
 * HTH. Anomie⚔ 19:24, 5 October 2009 (UTC)
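The "all revisions within the past 7 days contain the deletion template" rule suggested in the comments above could look roughly like the following. It is a Python sketch under stated assumptions, not the bot's PHP: `tagged_for_full_period` is a hypothetical name, revisions are assumed to arrive as `(timestamp, wikitext)` pairs, and `marker` stands in for whatever text identifies the deletion template. The sketch also checks the revision that was current at the cutoff, which covers the case where the newest revision is itself older than 7 days.

```python
from datetime import datetime, timedelta, timezone


def tagged_for_full_period(revisions, now, marker, period=timedelta(days=7)):
    """Return True if the tag was present continuously for `period`.

    revisions: iterable of (timestamp, wikitext) pairs, in any order.
    marker:    text identifying the deletion template (an assumption).
    """
    cutoff = now - period
    ordered = sorted(revisions, key=lambda rev: rev[0])

    # The revision that was live at the cutoff defines the starting state.
    at_or_before = [rev for rev in ordered if rev[0] <= cutoff]
    if not at_or_before:
        return False  # page is younger than the period: safe failure

    relevant = [at_or_before[-1]] + [rev for rev in ordered if rev[0] > cutoff]
    return all(marker.lower() in text.lower() for _, text in relevant)
```

With this shape, the tag-and-revert trick described above fails: the revision live at the cutoff is the untagged one, so the image is skipped. Removing the tag briefly also blocks deletion, which is the safe failure mode noted in the comment.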

Thanks, I've updated the source. -- Chris 04:40, 9 October 2009 (UTC)
 * getCategoryMembers now checks for possible loops
 * I did some (admittedly very quick/basic) testing, ( |_) is in fact faster - although we are talking microseconds here
 * Fixed
 * The bot now checks if it's adding duplicates, it also checks to see if images listed on the page have been deleted. It only edits the page once per run now
 * Fixed
 * Rewritten
 * I've added some more error checking
 * It now deletes any redirects as well
 * A few comments yet:
 * Loop avoidance looks good. If you really want to, you could do repeated subtree avoidance also (i.e. if Category A contains B and C, B contains 1000 subcats, and C contains B, you'd process all of B's 1000 subcats twice); it should be as simple as making $avoidloop a reference parameter. Again, there's no reason that should ever happen for this bot's task, but if that function is reused in a different task it might bite then.
 * The error checking on the imageusage check is insufficient: if the API returns an error of some sort or an HTTP error occurs, the test will decide (possibly erroneously) that the image is unused. The test on the old revisions is sufficient, since if an error occurs it will skip the image (and presumably retry it the next time around).
 * In the interest of efficiency, it might be better to add "&rvend=" for seven days ago to the revisions query (and check the current revision if zero are returned); that will avoid loading 500 old revisions when only the first 3 are relevant.
 * If you do the above, actually following any rvcontinue would probably be a good idea too, Just In Case™. Better to leave no holes than an extremely unlikely one.
 * It may be worth keeping a persistent store of the images for human review, and clearing it out only if the page edit succeeds. You'd also want to consider error checking on the getpage call, if that returns an error it seems it would wipe out the existing list.
 * Anomie⚔ 12:42, 9 October 2009 (UTC)
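The loop avoidance and repeated-subtree avoidance discussed above can be illustrated with one visited set shared across the whole traversal, plus a depth limit. This is a hedged Python sketch rather than the bot's `getCategoryMembers`: the function name, the dictionary-based category data, and the `max_depth` default are assumptions for the example.

```python
def category_files(root, subcats, files, max_depth=10):
    """Collect files under `root`, visiting each category at most once.

    subcats: dict mapping a category name to its subcategory names.
    files:   dict mapping a category name to the files it contains.
    Both stand in for API lookups and are assumptions of this sketch.
    """
    seen = set()    # shared by every branch, so subtrees are never repeated
    found = []

    def walk(cat, depth):
        if cat in seen:
            return                      # loop or repeated subtree: skip
        if depth > max_depth:
            raise ValueError("category tree deeper than %d levels" % max_depth)
        seen.add(cat)
        found.extend(files.get(cat, []))
        for sub in subcats.get(cat, []):
            walk(sub, depth + 1)

    walk(root, 0)
    # A file may sit in several categories; keep the first occurrence only.
    return list(dict.fromkeys(found))
```

Sharing `seen` across branches is the Python counterpart of passing `$avoidloop` by reference: if A contains B and C, and C also contains B, then B's subtree is processed once.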

re point 2, shouldn't  catch the error? If there was a http error there would be no array set and thus the if block would execute. -- Chris 13:50, 9 October 2009 (UTC)
 * You're right, somehow I was backwards on that one. Anomie⚔ 17:54, 9 October 2009 (UTC)
 * In theory (?: |_) should be faster still since it avoids an unnecessary variable creation and assign. But since I had this computer I have never had any compute bound problems that aren't stuff like finding big primes. Which makes me wonder how they manage to make Windows compute bound.... Rich Farmbrough, 01:40, 13 October 2009 (UTC).
 * But if you are matching Wikimedia page names you should use a + too since Fred_ ___ __ __ __ __ _Smith is valid. Rich Farmbrough, 01:43, 13 October 2009 (UTC).
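The point about runs of spaces and underscores can be shown with a small Python sketch. It is not the bot's regex, which is not preserved in this discussion. `title_pattern` is a hypothetical helper that treats any run of spaces or underscores in a page name as one separator.

```python
import re


def title_pattern(title):
    """Build a regex matching `title` with any mix of spaces/underscores."""
    # Split on existing separator runs, escape each word, then rejoin
    # with a pattern that accepts one or more spaces or underscores.
    words = re.split(r"[ _]+", title.strip(" _"))
    return re.compile(r"[ _]+".join(re.escape(word) for word in words))


pattern = title_pattern("Fred Smith")
print(bool(pattern.fullmatch("Fred_Smith")))              # single underscore
print(bool(pattern.fullmatch("Fred_ ___ __ __ _Smith")))  # mixed run
print(bool(pattern.fullmatch("FredSmith")))               # no separator
```

A character class with `+` avoids the capture group that `( |_)` creates, in the same spirit as the non-capturing `(?: |_)` suggestion.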

Ok, I've updated the loop avoidance. It now uses both &rvend= and rvcontinue in the api query. The bot also keeps a local copy of the human review page to prevent problems if the Wikipedia page is vandalized or the api returns a blank page. And the regex is now. -- Chris 07:01, 18 October 2009 (UTC)

BAGAssistanceNeeded -- Chris 11:21, 25 October 2009 (UTC)
 * Having done some of these deletions myself, I always thought a bot should be doing it, so if I say, I'm sure you can work out the details of how to implement it. - Jarry1250 [Humorous? Discuss.] 11:51, 8 November 2009 (UTC)
 * done -- Chris 09:29, 21 November 2009 (UTC)

BAGAssistanceNeeded Needs an admin BAGger to check the deletions. Anomie⚔ 03:12, 2 December 2009 (UTC)
 * I had a look at the first 6, there was one where a logo had, I think, been superseded (K-dog) where I was slightly concerned that sourcing the old logo might not be easy. Then there was a Centrino logo - there had been some edit wars over a logo gallery on the article. So the evidence is that the bot is doing what it says on the tin, but you may wish to have a FUR admin examine them to see if there's anything that should give us pause here. As far as I checked the 6 were listed for 7 days, orphaned at the time, and had had the creator notified. Rich Farmbrough, 07:25, 23 December 2009 (UTC).


 * Comment Looks good so far, I support this adminbot getting the flags. --Coffee //  have a cup  //  ark  // 10:35, 25 December 2009 (UTC)

VPT discussion
This thread at VPT might be worth considering before approving this bot further. Unless the bot has a very sophisticated error checking, it would allow vandals a whole new way to cause disruption on a large scale by removing images from articles. Unless the bot can somehow address this, I don't think it's a wise idea to let the bot run. Regards  So Why  18:01, 13 November 2009 (UTC)


 * Personally I'm not too worried. Vandals have a myriad ways to be annoying if they want. Ultimately, I feel that the bot should do all that is feasible, and in line with what a human administrator would. - Jarry1250 [Humorous? Discuss.] 19:00, 13 November 2009 (UTC)
 * This is not just for vandals. The removal of the file may not have been justified for a variety of reasons, or a mistake (example at VPR). I'm also not sure from the above discussion that it's going to handle file moves well. Cenarium (forever) 23:25, 13 November 2009 (UTC)
 * I'm personally not too concerned about images where no one noticed they were removed from their article for an entire week. Especially since it's easy to ask any admin to undelete. Moves should be handled appropriately, as the bot queries all redirects to the image in question and checks imageusage for every one in addition to the original filename itself. Anomie⚔ 00:00, 14 November 2009 (UTC)
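The redirect-aware orphan check described above (query every redirect to the image and check usage for each name) can be sketched as follows. This is an illustrative Python version, not the bot's code: `is_orphaned`, `redirects_to`, and `usage_of` are hypothetical names, and the convention that a lookup returns `None` on an API or HTTP error is an assumption of the sketch.

```python
def is_orphaned(title, redirects_to, usage_of):
    """Return True only if neither the file nor any redirect to it is used.

    redirects_to(title) -> list of redirect titles, or None on error.
    usage_of(title)     -> list of pages using that title, or None on error.
    Both callables stand in for API queries and are assumptions here.
    """
    redirects = redirects_to(title)
    if redirects is None:
        return False                    # lookup failed: treat as in use

    for name in [title, *redirects]:
        users = usage_of(name)
        if users is None or users:
            return False                # error or real usage: do not delete
    return True
```

Treating a failed lookup as "in use" means an error leads to a skipped image rather than a wrongful deletion.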

Approved
Bot appears to have consensus, per Rich, Coffee, and my own review the trial appears to be successful. Therefore, I am approving.  MBisanz  talk 02:53, 26 December 2009 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.