Wikipedia:Bots/Requests for approval/Ramaksoud2000Bot 2


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Ramaksoud2000Bot 2
Operator:

Time filed: 05:28, Friday, December 30, 2016 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Java

Source code available: User:Ramaksoud2000Bot/ShadowsCommons source

Function overview: Tags Wikipedia files that shadow a Commons file or redirect with ShadowsCommons

Links to relevant discussions (where appropriate): Bots/Requests for approval/Stefan2bot, WP:FNC, WP:G6

Edit period(s): Manually started. Occasionally run.

Estimated number of pages affected: originally estimated at ~1000-1500 on the first run, extrapolating from the counts for the letters A and B of free media at User:Ramaksoud2000Bot/ShadowsCommons; revised to 225 files on the first run after excluding files that only shadow a Commons redirect. Unknown but small number on subsequent runs.

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details: Bots/Requests for approval/Stefan2bot was a one-time run to tag files with ShadowsCommons if they shadowed a Commons file. That bot worked off an old database report. This bot goes through every file on Wikipedia and determines whether it is eligible for ShadowsCommons. Specifically, it tags every file that has a file or redirect on Commons with the same name, does not have ShadowsCommons or keeplocal, does not have any duplicates, and does not exclude this bot. This bot uses a list of files that shadow Commons files from query/15152. That query ignores files up for deletion, files with keep local high-risk, and protected files. Tagging the files populates Category:Wikipedia files that shadow a file on Wikimedia Commons. There, file movers and others go through the category and usually rename the files in accordance with WP:FNC. They then tag the Wikipedia page with db-redircom in accordance with WP:G6. Ramaksoud2000 (Talk to me) 05:28, 30 December 2016 (UTC)
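The eligibility test described above can be sketched as a simple predicate. This is a minimal illustration with hypothetical class and field names, not the bot's actual code (which is linked under "Source code available"):

```java
// Sketch of the eligibility check described in the function details.
// Field and method names are hypothetical; the real implementation is
// at User:Ramaksoud2000Bot/ShadowsCommons source.
public class ShadowCheck {

    public static class LocalFile {
        boolean shadowsCommonsTitle;   // a file or redirect with the same name exists on Commons
        boolean hasShadowsCommonsTag;  // already tagged with ShadowsCommons
        boolean hasKeepLocalTag;       // tagged with keep local
        boolean hasDuplicates;         // MediaWiki reports duplicate files
        boolean excludesBot;           // page opts out via exclusion compliance

        LocalFile(boolean shadows, boolean tagged, boolean keepLocal,
                  boolean dupes, boolean excludes) {
            this.shadowsCommonsTitle = shadows;
            this.hasShadowsCommonsTag = tagged;
            this.hasKeepLocalTag = keepLocal;
            this.hasDuplicates = dupes;
            this.excludesBot = excludes;
        }
    }

    /** Returns true if the file should be tagged with ShadowsCommons. */
    public static boolean shouldTag(LocalFile f) {
        return f.shadowsCommonsTitle
                && !f.hasShadowsCommonsTag
                && !f.hasKeepLocalTag
                && !f.hasDuplicates
                && !f.excludesBot;
    }
}
```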

Discussion

 * Are you working off of database dumps as well? — xaosflux  Talk 06:06, 30 December 2016 (UTC)
 * No. I could not find one. I am checking every file in Category:All free media and every file in Category:All non-free media anew to see if it currently shadows a file on Commons. I'd also like to amend my request to exclude files that are only shadowing a Commons redirect. I don't see much point in adding ShadowsCommons there, even though it's allowed by policy, because the Commons files are still usable. There also may not be consensus for renaming files that only shadow a Commons redirect. The wording exists at WP:FNC, but the talk page discussion that led to the addition of FNC#9 never mentioned redirects. That should reduce the pages affected by about 75%. Ramaksoud2000 (Talk to me) 06:47, 30 December 2016 (UTC)
 * Why not use the replica databases on Tool Labs? Otherwise, database dumps for enwiki and commons are both available. Anomie⚔ 18:18, 30 December 2016 (UTC)
 * That is a much better idea than my original plan. I will use the enwiki and commons title dumps, and will only send a read request when identical titles exist in the file namespace on enwiki and commons. Thanks! Ramaksoud2000 (Talk to me) 22:06, 30 December 2016 (UTC)
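The dump-based plan amounts to intersecting the File-namespace title lists from the two wikis, so that only shared titles ever trigger a live read. A minimal sketch (loading the actual dump files is omitted, and the class name is hypothetical):

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the dump-based candidate selection: intersect the File-namespace
// titles from the enwiki and commonswiki dumps. Only titles present on both
// wikis can be shadows, so only they need a live read request afterwards.
public class ShadowCandidates {

    /** Returns the titles present on both wikis, i.e. possible shadows. */
    public static Set<String> candidates(Set<String> enwikiTitles,
                                         Set<String> commonsTitles) {
        Set<String> shared = new TreeSet<>(enwikiTitles);
        shared.retainAll(commonsTitles);
        return shared;
    }
}
```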
 * In short, there will be a read request sent to Commons for each file on the English Wikipedia. Ramaksoud2000 (Talk to me) 07:09, 30 December 2016 (UTC)
 * That's a lot of read requests. Per the database dump page, it's heavily preferred to use a database dump over what's functionally a web crawler when possible, and that definitely should be possible here with appropriate programming. If we do go the read request route, it should be heavily throttled for a task like this, with a ~5 second hold between read requests to both enwiki and Commons (i.e. handle one file every 5 seconds). We shouldn't just have a hold after edits, in other words. Further, you may need bot approval on Commons for this many reads. ~ Rob 13 Talk 11:13, 30 December 2016 (UTC)
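The suggested ~5-second hold between reads can be sketched as a tiny throttle helper. This is an illustration of the suggestion above, not the bot's actual code; names are hypothetical:

```java
// Sketch of the ~5-second read throttle suggested above. Given the time of
// the last request, computes how long the caller should sleep before the
// next request may be sent.
public class ReadThrottle {
    public static final long MIN_INTERVAL_MS = 5000;

    /** Milliseconds to wait before the next request is allowed. */
    public static long waitMs(long lastRequestMs, long nowMs) {
        long elapsed = nowMs - lastRequestMs;
        return elapsed >= MIN_INTERVAL_MS ? 0 : MIN_INTERVAL_MS - elapsed;
    }
}
```

In the bot's loop, the caller would record the timestamp of each read and `Thread.sleep(waitMs(...))` before the next one, so reads to enwiki and Commons are both held to roughly one file every 5 seconds.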
 * Agree, making ~1.8 MILLION reads for EACH run is insanely excessive - what read rate were you planning on running these at (reads/min)? —  xaosflux  Talk 22:05, 30 December 2016 (UTC)
 * Honestly, I only saw a restriction on edit rate in the bot policy, and thought that read rates didn't have a restriction. Obviously, I was mistaken. I thought that if needed, a small delay could be implemented, but it doesn't matter now. Ramaksoud2000 (Talk to me) 10:39, 1 January 2017 (UTC)
 * Just noting I have no objection to appropriately-throttled read requests only on the pages identified as likely shadowed based on the most recent database dump. When you code, be sure to account for the likely edge case of files that have been deleted on enwiki as F8 since the last database dump. ~ Rob 13 Talk 23:58, 1 January 2017 (UTC)
 * That's a good point. I'll run a new query before each run of the bot, so it's unlikely that a file would be deleted in that short time period, but I've updated the code to check. There should only be about 225 read requests on the first run. There will also be at least 10 seconds between read requests, because there are 10 seconds between edit requests, and the program can't send a read request until it edits the previous page. Ramaksoud2000 (Talk to me) 00:14, 2 January 2017 (UTC)
 * A bunch of thoughts:
 * Keep local tagged files should not be skipped. The tag merely says that someone wants to keep the local file; it does not by default entail that the local file should be under that title or bury the Commons file. ShadowsCommons has a parameter that would be worth applying under such circumstances, though.
 * Keep local high-risk, however, would be worth skipping, unless we want to attach a parameter that the bot sets to "yes" if there is shadowing going on (instead of tagging with ShadowsCommons).
 * The bot may want to request edits on the talk page if the file is protected, though this is probably only worth doing if the abovementioned concept is implemented (most files are protected for high use).
 * As for the read requests issue, I remember there being a Quarry query that can find "shadowed" images, and something that User:Topbanana/Eclipsed Files is created from. Perhaps one of those could be used for this bot.
 * Jo-Jo Eumerus (talk, contributions) 10:38, 31 December 2016 (UTC)
 * Jo-Jo Eumerus, thanks. The Quarry query skips protected files and those with keep local high-risk. The reason I don't want to make edit requests on talk pages is that I think there would be too many false positives. There are too many images like File:Information.svg that are protected but without the appropriate template. Also, since ShadowsCommons with the keeplocal parameter just transcludes keeplocal, I think we can leave keeplocal templates already on files in place, and just add ShadowsCommons on top. Thanks! Ramaksoud2000 (Talk to me) 10:39, 1 January 2017 (UTC)
 * query/950. -- Edgars2007  (talk/contribs) 09:31, 1 January 2017 (UTC)
 * Thank you so much! That's much easier than what I was trying to do. I have run a new query, and 225 files will be affected on the first run. I have updated the (now much smaller) source code, and the bot is ready to run. Thanks! Ramaksoud2000 (Talk to me) 10:39, 1 January 2017 (UTC)
 * Technically, the protected files should be tagged with Keep local high-risk or unprotected, not merely ignored. Jo-Jo Eumerus (talk, contributions) 11:01, 1 January 2017 (UTC)
 * That is true. However, it's rare for files that aren't high-risk to be protected, so I don't think the bot will miss many (if any) files that need ShadowsCommons. Maybe another bot task at another time could be making those edit requests. Cheers, Ramaksoud2000 (Talk to me) 11:09, 1 January 2017 (UTC)


 * Approved for trial. — xaosflux  Talk 01:20, 7 January 2017 (UTC)
 * Thank you. See contribs. It worked as intended. It only tagged files that are different from the Commons file and thus don't allow the use of the Commons file. It also didn't tag any files that it isn't supposed to tag. Note that files like these can be tagged with Now Commons upon review, but are otherwise not detectable duplicates:
 * File:Jean-Beguin-1615.jpg looks identical to the one on Commons, but is actually slightly wider, and not a duplicate.
 * File:The British Empire Anachronous.png is 169KB smaller.
 * File:Eurostar.svg is 1KB larger than the one on Commons, and MediaWiki does not consider it a duplicate.
 * File:EnterpriseDB corporate logo.png is a small fair use file, but is considered PD and is on Commons in large size.
 * File:Jack Kilby.jpg: I'm not quite sure why MediaWiki does not consider this a duplicate of commons:File:Jack Kilby.jpg, but it can be tagged with Now Commons if the license is verified (unlikely).

Ramaksoud2000 (Talk to me) 03:09, 7 January 2017 (UTC)
 * Are the trial tagged files still needed? I've processed a few. Jo-Jo Eumerus (talk, contributions) 11:19, 16 January 2017 (UTC)
 * Ramaksoud2000, as you have gotten rid of the huge read requirement, I don't think this will need a "bot flag" - if it is flagged, not sure if these edits should actually be tagged with it - thoughts? — xaosflux  Talk 03:44, 21 January 2017 (UTC)
 * I'm unfamiliar with the custom regarding the bot flag, but if you think it doesn't need it, then that's fine. It reads once per edit, so the bot flag can't be of use there due to the edit throttle. I thought that all bot edits were flagged, but if only minor edits are flagged, then these edits probably shouldn't be. Users watching the file probably know the best name to rename a file to, and some users may be hiding bot edits. Ramaksoud2000 (Talk to me) 04:16, 21 January 2017 (UTC)
 * OK, it is slow and all of its edits seem useful for watchlists, so we will approve without the flag at this time. — xaosflux  Talk 04:47, 21 January 2017 (UTC)
 * Flag not required. — xaosflux  Talk 04:47, 21 January 2017 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.