User talk:The Earwig/Archive 15

Hitting the quota on Google searches
Hey Earwig, I noticed we've hit the 10,000 query daily quota on 4 days in the past month and most other days are coming close to the limit now. Usage last year was about half as much. I'm worried people are going to start getting errors regularly if the usage continues to increase. Any thoughts on how to keep the usage from increasing any more? Kaldari (talk) 00:51, 6 March 2019 (UTC)
 * No chance we can get Google to raise the limit, right? Scanning through recent logs, of the past 12,000 invocations of the tool that used Google, 5000 were automated and 7000 were manual; of both types, 7500 were on enwiki, 3000 on dewiki, low hundreds on a few other wikis; of just automated requests, 2100 enwiki and 2900 dewiki. Basically, what I'm seeing is that there isn't one clear offender that we can point to regarding a 2x spike. Luke081515's bot, assuming that's the entirety of dewiki automated traffic, seems to account for about 25% of search engine usage, which is high but I'm not necessarily sure I would consider it excessive (<15 reqs/hr)? Luke081515, could we reasonably tone down your bot slightly? I would hate to cause dewiki to miss things because of this, but I'm not sure what else to suggest or who else to ask. — Earwig   talk  03:26, 6 March 2019 (UTC)
 * Is there any way you could set up a different instance of this for non-en projects? That would clear up a little room for en and also give the other projects a lot more opportunity for use before hitting the cap. Best, Barkeep49 (talk) 03:37, 6 March 2019 (UTC)
 * Unfortunately 10,000 is a hard-coded limit set by Google. We could set up an additional API key to get around that limit, as suggested by User:Barkeep49, but it would require some extra work as we would also need to modify the proxy service to handle both keys and distribute the requests. Additionally, our credit arrangement with Google, which allows us to provide the service without paying the usual fees, is based on the usage rate from a few months ago with some padding added. So using more than 10,000 queries per day would probably eat through those credits faster than they are replenished. In the long-term we'll probably have to come up with a better solution, but 10,000 per day is all we can manage at the moment. Kaldari (talk) 03:46, 6 March 2019 (UTC)
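For what it's worth, distributing requests across two keys at the proxy would not need much logic; a rough sketch, with hypothetical key names and a naive in-memory counter (not the actual proxy code):

```python
import datetime

# Each Google API key allows at most 10,000 queries per day, so the
# proxy could pick whichever key still has quota left.
DAILY_LIMIT = 10_000

class KeyPool:
    def __init__(self, keys):
        self.keys = keys
        self.counts = {key: 0 for key in keys}
        self.day = datetime.date.today()

    def acquire(self):
        today = datetime.date.today()
        if today != self.day:  # reset the counters at midnight
            self.day = today
            self.counts = {key: 0 for key in self.keys}
        for key in self.keys:
            if self.counts[key] < DAILY_LIMIT:
                self.counts[key] += 1
                return key
        return None  # both keys exhausted; reject the request

pool = KeyPool(["key-A", "key-B"])
```

A real deployment would need persistent counters (the proxy may restart mid-day) and separate billing per key, as noted above.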
 * This seems like one of those "appeal to Jimbo" things....I've got a good feeling Google queries us more than 10,000x/day.... — xaosflux  Talk 04:47, 6 March 2019 (UTC)
 * I can modify my requests this evening so that they don't use the search engine every time. Another possibility: I asked WMDE some time ago whether they could put money toward extending the quota, if that's needed to raise the limits. They basically did not decline, but it would first require knowing how much the additional costs would be. Is there already a known value or a guess? Best regards, Luke081515 06:58, 6 March 2019 (UTC)
 * Last month we used about $1200 worth (and that was with the Tool Labs outages). Hopefully we can just stay under the quota for a while. Let me know when the change is made and I'll try to monitor it to see how much it makes a difference. Kaldari (talk) 07:14, 6 March 2019 (UTC)
 * I've set it now in my program and the change is active, so the number of requests should not decrease, but the requests will no longer use the search engine. The bot currently scans every new article on dewiki once. My plan for the future was to see if we can extend the limit, so that I can also scan and check big additions to existing articles. If you can provide an estimate (the $1200?) of how much that would normally cost per month, I can ask WMDE if they would be willing to support this, or maybe a combination of WMF and WMDE? Best regards, Luke081515 09:16, 6 March 2019 (UTC)
 * Strangely there was a significant increase in queries (rather than decrease) at around 9am UTC (when you made the change). The increase remained until about 2pm when we hit the 10,000 quota and then all queries were denied after that. Maybe Earwig could look at the logs to see what happened. Kaldari (talk) 23:18, 6 March 2019 (UTC)
 * Hm, I guess it's not my script then; I took a look just now, and the change I made was correct. Maybe the logs can give us some useful information? My bot uses a unique user agent, so it should be easy to find. Note: I did not throttle my requests; I'm making the same number of requests as before, but now asking the tool not to use the engine. Best regards, Luke081515 23:26, 6 March 2019 (UTC)

I don't log user agents, but I only see about 300 requests that could have come from you, and they are indeed not using the engine, so that's good. Since 09:00 UTC, I see about 1650 requests using the engine, 1550 of those not using the API, so this is not an issue of automated queries. Of those, 1300 are enwiki. However, a large number of these around 10am appear to be vulnerability testing spam (things like page = etc/password) that all fail early and never hit Google. Of the remaining traffic, which seems to number about 1150 real requests, the majority looks like perfectly normal manual traffic. I don't really have much to go off of here. — Earwig   talk  03:58, 7 March 2019 (UTC)
 * Hm, according to Earwig it looks like an extension of the limit would be needed. Do we have a Phabricator task for this already? And, if additional money is required, is it possible to get a rough estimate? Then I could, for example, ask WMDE if they would be willing to help. Best regards, Luke081515 22:07, 7 March 2019 (UTC)
 * Well, whatever happened yesterday, it's gone back to normal today. Here's the pricing info for the API: "Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day." Normally, I would suggest that we just re-negotiate with Google, but the current agreement we have took over a year to negotiate and finalize. Maybe getting WMDE to cover a new API key would be easier. Kaldari (talk) 02:02, 8 March 2019 (UTC)
 * I wrote them a mail, and asked them if that would be possible. Best regards, Luke081515 23:10, 10 March 2019 (UTC)
 * OK, keep us posted. — Earwig [alt]   talk  23:49, 10 March 2019 (UTC)
 * Note that our agreement with Google expires on Jan 10, 2022, so that would be a good time to renegotiate it. Kaldari (talk) 16:36, 12 March 2019 (UTC)
 * Since I have not received an answer yet, I will mail them again and ask for the status when I'm at home. Best regards, Luke081515 11:34, 18 March 2019 (UTC)

They currently have an internal discussion about this. Best regards, Luke081515 23:05, 19 March 2019 (UTC)
 * Update: It's still in progress there :/. Best regards, Luke081515 16:16, 7 May 2019 (UTC)
 * By the way: can you take a look at how high the usage currently is? If there is enough free capacity at the moment, I could, for example, add a random switch so that the bot processes 50% of requests with the search API. Best regards, Luke081515 16:18, 7 May 2019 (UTC)
 * Kaldari has easier access to this data than I, but I can take a look tomorrow if necessary and see how we've been doing over the past couple weeks on average. — Earwig   talk  03:32, 8 May 2019 (UTC)
 * We're currently using about half the quota per day. It normally fluctuates between about 4000 and 6500 queries per day. However, we hit the 10,000 maximum twice last month. Kaldari (talk) 17:06, 8 May 2019 (UTC)
 * This is good to hear. It sounds like we've returned to mostly-reasonable levels. — Earwig   talk  05:32, 9 May 2019 (UTC)
 * I will try to make ~40% of my requests with the search engine. This should be within the limit. If it's too high anyway, please ping me. Best regards, Luke081515 18:53, 10 May 2019 (UTC)
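The random switch described above could be as simple as the following sketch (the threshold and function name are illustrative, not the bot's actual code):

```python
import random

ENGINE_FRACTION = 0.4  # route ~40% of checks through the search engine

def should_use_engine():
    # Randomly decide whether this request gets a Google-backed check
    # or an engine-free check, keeping average usage under the quota.
    return random.random() < ENGINE_FRACTION
```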

Has dewiki considered toolforge:copypatrol? — JJMC89 (T·C) 01:51, 8 May 2019 (UTC)
 * Hm, it's not active there; what are the criteria for getting it activated? Best regards, Luke081515 18:53, 10 May 2019 (UTC)


 * I'm not sure, but should know. — JJMC89 (T·C) 03:11, 11 May 2019 (UTC)
 * You have to get community consensus and then create a Phabricator task (for example, https://phabricator.wikimedia.org/T151609). Kaldari (talk) 05:03, 11 May 2019 (UTC)

May 22: WikiWednesday Salon and Skill-Share NYC
(You can subscribe/unsubscribe from future notifications for NYC-area events by adding or removing your name from this list.)

The Signpost: 31 May 2019
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 02:12, 31 May 2019 (UTC)

#invoke:AfC on Template:AFC statistics
For the past few weeks, Template:AFC statistics, which EarwigBot task 2 updates hourly, has rendered with many repetitions of "#invoke:AfC". I assume this is for the same reason that the page is in Category:Pages where template include size is exceeded, and that the root problem is the volume of drafts.

Is there a way of simplifying or restructuring the page so that it can handle the increasing number of drafts, perhaps by splitting pending, declined, and accepted drafts into three separate pages? --Worldbruce (talk) 15:51, 31 May 2019 (UTC)


 * Hi Worldbruce. You're correct that this is caused by too many drafts on the page. I don't necessarily have an easy solution for it—the usual answer is just to review more drafts to reduce the backlog, but obviously this isn't a quick fix in general. The problem with splitting up is that the pending chart is still the largest by far, so it wouldn't necessarily help very much. Some other ideas I've had are only showing a random subset of drafts at a time and having the full list visible elsewhere (like through a Toolforge page), but I haven't implemented this. —  Earwig   talk 04:24, 4 June 2019 (UTC)

June 19: WikiWednesday Salon and Skill-Share NYC (stay tuned for Pride on weekend!)

Copyvio Detector Not Ignoring Fork
Hi! Please add Infogalactic.com to the exclusion list. It's already listed as a fork on WP:Mirrors and forks/GHI, but the detector isn't excluding it: []. Thank you!  Orville1974  (talk) 22:46, 19 June 2019 (UTC)
 * Thanks for pointing that out, Orville1974. It turns out the tool was completely ignoring mirrors on that page. It should be fixed now. —  Earwig   talk 03:59, 21 June 2019 (UTC)

Sunday June 23: Wiki Loves Pride @ Metropolitan Museum of Art

The June 2019 Signpost is out!
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 15:53, 30 June 2019 (UTC)

ReportsBot issue
Sorry to bother you again with a Reports bot issue, The Earwig; iirc you were kind enough to fix its last glitch a few months back. I'm not seeing it add any articles to WikiProject Women in Red/Metrics/June 2019 for the last three days - history. In that period I've added 100+ properly coded women biog items to wikidata which have thus been roundly ignored in the stats. Any help you can give much appreciated. thx --Tagishsimon (talk) 14:46, 29 June 2019 (UTC)
 * Hi Tagishsimon. This is a bit strange. The bot runs on Wikimedia Toolforge, and it looks like Toolforge has been temporarily banned from querying Wikidata due to excessive usage. My bot only accesses the service once a day, regardless of any errors, so this is presumably the fault of another user. The ban is set to expire tomorrow, so we'll see if it works then. — Earwig   talk  17:28, 29 June 2019 (UTC)
 * Oh dear :). Thanks for looking into it; much appreciated. I'll ping you one way or the other tomorrow (though the ban may be lifted after ~14:00, the point at which Reports bot runs, so we might have to wait until Monday to find out). --Tagishsimon (talk) 17:44, 29 June 2019 (UTC)
 * Nothing doing today (diff). We'll see what tomorrow brings. --Tagishsimon (talk) 17:28, 30 June 2019 (UTC)
 * Monday: still kaput. (diff) --Tagishsimon (talk) 13:22, 1 July 2019 (UTC)
 * Tagishsimon, yep, just checked and the ban has apparently been extended further. Clearly whatever caused the ban in the first place hasn't stopped. I'll need to ask around. — Earwig   talk  01:48, 2 July 2019 (UTC)
 * Tagishsimon, good news, I think I've managed to fix it. When accessing Wikidata, we were sending a default user agent—identifying ourselves generically as Python. I was able to avoid the ban by setting a custom one for Reports bot. I'm guessing that Wikidata bans by user agent, and we happened to get caught up in someone else's bad behavior... — Earwig   talk  02:30, 2 July 2019 (UTC)
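The fix described here (sending an identifying User-Agent instead of the library default) can be sketched roughly as follows; the agent string and helper function are illustrative, not ReportsBot's actual code:

```python
import urllib.parse
import urllib.request

# The Wikidata Query Service blocks generic default agents (such as
# "Python-urllib/3.x"), so identify the tool and give a contact point.
USER_AGENT = "ReportsBot/1.0 (https://en.wikipedia.org/wiki/User:The_Earwig)"

def build_request(query):
    """Build a SPARQL request to Wikidata with a custom User-Agent."""
    url = ("https://query.wikidata.org/sparql?format=json&query="
           + urllib.parse.quote(query))
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```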
 * That is good news, The Earwig; excellent sleuthing & thanks for taking the time - we're hugely in your debt. Fingers crossed for tomorrow's run. --Tagishsimon (talk) 02:52, 2 July 2019 (UTC)
 * This appears to be resolved, but some pointers on the UA issue: T224891 was mentioned in Tech News (2019, week 24). A python-request user agent ban was mentioned on the Wikidata list. — JJMC89 (T·C) 05:55, 2 July 2019 (UTC)
 * Thanks, that absolutely explains it... —  Earwig   talk 11:45, 2 July 2019 (UTC)


 * And with that (diff) we can all go back to forgetting about Report Bot's operations until next time; success. Thank you once again, The Earwig. I'll hope not to darken your doors again. --Tagishsimon (talk) 16:07, 2 July 2019 (UTC)

Completely new issue, The Earwig. Feel free to decline. Emijrpbot's owner has retired after getting blocked for a 3RR, and their bot no longer operates. Amongst the ridiculously many things the bot did, there is Requests for permissions/Bot/Emijrpbot 6 which points to code which adds new wikidata items for new en.wiki biographies and/or codes items with human and/or gender. Historically it's done a lot of the heavy lifting in this area, and Women in Red, especially, miss it. Don't suppose there's any possibility of you adopting it, or suggesting some other adoptive user? --Tagishsimon (talk) 10:24, 2 July 2019 (UTC)
 * That is unfortunate to hear, but I'm afraid I don't have much time to adopt a new bot these days, especially on a wiki I'm not very familiar with. The usual people I would suggest probably do not have much free time either. —  Earwig   talk 11:45, 2 July 2019 (UTC)
 * Not surprised. But shy bairns get nowt, as they say :) --Tagishsimon (talk) 11:58, 2 July 2019 (UTC)

Sunday July 14: Annual NYC Wiki-Picnic @ Roosevelt Island

AfC statistics
Hi! For the time being (while we're way past the template expand limit) on AFC statistics, could each of the tables be put on a page by themselves, so we could at least see some of them? Thanks! Enterprisey (talk!) 08:24, 27 July 2019 (UTC)
 * Oh man, it's pretty bad, isn't it... okay, Enterprisey, I split it up. We now have five subpages. The main Template:AFC statistics page will be mostly useless until the backlog goes down. —  Earwig   talk 06:44, 28 July 2019 (UTC)
 * Nice, thank you! Enterprisey (talk!) 07:18, 28 July 2019 (UTC)
 * Any way to make those notices use noping? Headbomb {t · c · p · b} 07:45, 28 July 2019 (UTC)
 * My apologies, Headbomb, I suppose this is an issue now because the pages are smaller. I've changed the template to hopefully avoid this now. —  Earwig   talk 03:23, 29 July 2019 (UTC)

The Signpost: 31 July 2019
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 16:18, 31 July 2019 (UTC)

Barnstar of Awesomeness

 * Thank you! —  Earwig   talk 18:39, 17 August 2019 (UTC)

August 28: WikiWednesday Salon and Skill-Share NYC (+editathons before and after)

The Signpost: 30 August 2019
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 23:42, 30 August 2019 (UTC)

Copyvio down?
Traceback (most recent call last):
  File "/data/project/copyvios/www/python/src/app.py", line 38, in inner
    return func(*args, **kwargs)
  File "/data/project/copyvios/www/python/src/app.py", line 103, in index
    query = do_check
  File "./copyvios/checker.py", line 41, in do_check
    _get_results(query, follow=not _coerce_bool(query.noredirect))
  File "./copyvios/checker.py", line 52, in _get_results
    page.get  # Make sure that the page exists before we check it!
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/page.py", line 587, in get
    rvprop="content|timestamp", rvslots="main")
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/site.py", line 716, in api_query
    return self._api_query(kwargs)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/site.py", line 254, in _api_query
    return self._handle_api_result(response, params, tries, wait, ae_retry)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/site.py", line 295, in _handle_api_result
    raise exceptions.APIError(e)
APIError: API query failed: JSON could not be decoded.
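For context, the final error here is what EarwigBot raises when the MediaWiki API returns something other than JSON, such as an HTML error page from a proxy. A minimal illustration of the underlying failure (not the bot's actual code):

```python
import json

# A 502/504 from an intermediate proxy returns an HTML error page;
# feeding that to the JSON decoder is what produces
# "JSON could not be decoded".
body = "<html><body><h1>502 Bad Gateway</h1></body></html>"
try:
    data = json.loads(body)
except json.JSONDecodeError:
    data = None  # the bot wraps this case in an APIError
```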

-- ∯ WBG converse 09:31, 14 August 2019 (UTC)


 * I'm still investigating this. It seems like something genuinely wrong with the API, but I'm not sure what's causing it yet. —  Earwig   talk 05:15, 15 August 2019 (UTC)
 * Hi, I also get this error very often in the last months. I hope you can investigate this. Regards Doc Taxon (talk) 15:17, 16 August 2019 (UTC)
 * It should be fixed now. —  Earwig   talk 01:45, 17 August 2019 (UTC)
 * No problems any more since August 18th. Thank you. Doc Taxon (talk) 10:01, 3 September 2019 (UTC)

Error in Copyvio Detector?
Hi! The URL https://tools.wmflabs.org/copyvios/?lang=de&project=wikipedia&title=&oldid=191835348&use_engine=0&use_links=0&turnitin=0&action=compare&url=fr.facebook.com%2FEsWirdBesserFilm%2Fabout%2F finds a copyvio of 76.7%, but this URL does not: https://tools.wmflabs.org/copyvios/?lang=de&project=wikipedia&title=&oldid=191835348&action=search&use_engine=1&use_links=1&turnitin=1

What's wrong? Doc Taxon (talk) 09:59, 3 September 2019 (UTC)
 * Hi Doc Taxon. Sometimes a website won't respond to the tool quickly enough or will generate some error instead of a valid webpage. This might cause us to report it as a 0% match. I tried again, bypassing the cache, and this time it found the match. Unfortunately I don't have a great solution for this problem in general. However, I can work on having the tool indicate that a source failed to load, so you can know to try checking it again. —  Earwig   talk 03:36, 4 September 2019 (UTC)
 * Oh, I forgot to bypass the cache. Sometimes the tool runs into the timeout. Can you raise the timeout a little, please? Doc Taxon (talk) 07:31, 4 September 2019 (UTC)
 * The overall timeout is not really something I can control. —  Earwig   talk 11:21, 4 September 2019 (UTC)

Saturday Sept 7: Met Fashion Edit-a-thon @ Metropolitan Museum of Art

Copyvios Detector TimeOut
Hi! Every query runs in a 504 Gateway Time-out. What's going on? Doc Taxon (talk) 22:41, 10 September 2019 (UTC)
 * I don't see any errors in the logs and a test query I just tried worked fine. Do you have an example that isn't working? —  Earwig   talk 03:43, 11 September 2019 (UTC)
 * Yes, it's working fine again. Possibly it was a server or network error for some hours. Doc Taxon (talk)

Archive.org sites timing out on earwig
Archive.org results take forever to load on earwig, and never do. Here's an example; it seems to do this everywhere. Any way this could be fixed? Thanks a ton for making the tool in the first place, 💵Money💵emoji💵💸 20:54, 15 September 2019 (UTC)
 * Hi Money emoji. I'm not seeing an issue with your link—the page generates in under a second. Is the problem still happening for you? How long ago did you notice it and how consistent did it seem to be? If it wasn't very long, maybe archive.org was doing some maintenance or had temporarily blocked the tool's IP (which is shared with many other tools). —  Earwig   talk 01:42, 18 September 2019 (UTC)
 * Yeah, the page now generates instantly for me. I noticed the problem about 2 days ago or so, and it affected every archive.org result. Since it wasn't very long, I'm guessing your hypothesis is correct. Thank you for taking your time to look into this, 💵Money💵emoji💵💸 02:01, 18 September 2019 (UTC)

The Signpost: 30 September 2019
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 11:07, 30 September 2019 (UTC)

Oct 23: WikiWednesday Salon NYC

The Signpost: 31 October 2019
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 16:12, 31 October 2019 (UTC)

User:The_Earwig/copyvios.js - user script
Hi Earwig, I have used Earwig's Copyvio Detector via the web. I have just installed your user script (here) in my common.js; however, I could not find where the tool is placed in the top menu (other user scripts are in the "More" drop-down list in the menu). Kindly advise, and thanks in advance. 05:52, 1 November 2019 (UTC)
 * Hi CASSIOPEIA. The script should appear in the sidebar under "tools"—see here for an example. —  Earwig   talk 15:59, 2 November 2019 (UTC)


 * Thank you, Earwig. Found it. I want to take this opportunity to thank you for creating the script; it is so useful for checking copyvios when I review new articles. Thank you. CASSIOPEIA(talk) 00:41, 3 November 2019 (UTC)
 * Thanks, I appreciate that! —  Earwig   talk 04:11, 3 November 2019 (UTC)

Saturday Nov 16: Wikipedia Asian Month Edit-a-thon @ Metropolitan Museum of Art

Nov 20: WikiWednesday Salon NYC

Template:AfC suspected copyvio
Hey, is AfC suspected copyvio still a thing or can this be TfD? --Gonnym (talk) 11:13, 19 November 2019 (UTC)
 * Gonnym, I think it should be fine to delete this template. —  Earwig   talk 01:39, 20 November 2019 (UTC)

Nomination for deletion of Template:AfC suspected copyvio
Template:AfC suspected copyvio has been nominated for deletion. You are invited to comment on the discussion at the template's entry on the Templates for discussion page. Gonnym (talk) 12:13, 20 November 2019 (UTC)

The Signpost: 29 November 2019
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 22:24, 29 November 2019 (UTC)

Dec 18: WikiWednesday Salon NYC

"Fadd disambiguation" listed at Redirects for discussion
An editor has asked for a discussion to address the redirect Fadd disambiguation. Since you had some involvement with the Fadd disambiguation redirect, you might want to participate in the redirect discussion if you wish to do so. DannyS712 (talk) 01:06, 25 December 2019 (UTC)

The Signpost: 27 December 2019
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 12:38, 27 December 2019 (UTC)

Jan 22: WikiWednesday Salon NYC

Copyvios on tools.wmflabs
Hi, the tool isn't working, I'm getting a 502. What's the problem? --Hispano76 (talk) 19:24, 14 January 2020 (UTC)
 * Hey Hispano76. I'm not sure what the issue is, but I've restarted it and it seems to be working now. Thanks! —  Earwig   talk 04:13, 15 January 2020 (UTC)
 * It stopped working again about an hour ago. I can't get the page to load; it just sits there and spins. Thanks, — Diannaa (talk) 14:47, 19 January 2020 (UTC)
 * I'm seeing a lot of network errors in the logs, so maybe it's something on Wikimedia Cloud's side? I'm going to restart it, but I have to head outside in 10 minutes, so I can't do any further debugging until tonight. —  Earwig   talk 14:51, 19 January 2020 (UTC)
 * Thanks very much. — Diannaa (talk) 14:54, 19 January 2020 (UTC)
 * Hi Ben; the tool has stalled again. Could you give it a re-start? Thanks, — Diannaa (talk) 13:49, 21 January 2020 (UTC)

Saturday Jan 25: Met 'Understanding America' Edit-a-thon @ Metropolitan Museum of Art

Copyvio detector
Hey, I was wondering if it would be possible to configure your copyvio detector tool to also run using other search engines? I've been running up against the daily query limit with Google quite a bit lately, and was hoping that we could work around this by using say DuckDuckGo or another service as an alternative, at least when Google has had enough of us. signed,Rosguill talk 23:11, 23 January 2020 (UTC)
 * Hi Rosguill. That's not a bad idea; in the past, the tool has used other search engines, but it's never supported multiple at once. DuckDuckGo is appealing but I don't think we can use them due to how their API works. Bing would probably be the second-best option after Google, but the free tier is fairly limited—we could only check 100-200 articles a month, so I'd have to see if the WMF would want to help pay for that or make a deal with Microsoft. In the past I've used other engines like Yandex but they're usually not as good. Anyway, I'll add this suggestion to my backlog and see what I can do. —  Earwig   talk 01:50, 24 January 2020 (UTC)
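A fallback chain across engines might look roughly like this; the engine names and per-day budgets are illustrative only, not the tool's actual configuration:

```python
# Try each engine in order of preference until one has quota left.
# Budgets are per-day request counts, reset elsewhere at midnight.
ENGINES = [
    ("google", 10_000),
    ("bing", 100),  # free tier is far smaller
]

usage = {name: 0 for name, _ in ENGINES}

def pick_engine():
    for name, quota in ENGINES:
        if usage[name] < quota:
            usage[name] += 1
            return name
    return None  # all engines exhausted; fall back to link-only checks
```

One design wrinkle: because different engines rank results differently, falling back mid-check could make results less reproducible, so per-request (rather than per-query) engine selection would probably be preferable.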

The Signpost: 27 January 2020
 * Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 02:10, 27 January 2020 (UTC)

Copyvio detector idea -- rapid grant
Hi. As you know, we've been hitting our cap for Google on the copyvio detector. One thing that you might wish to consider is applying for a rapid grant from the WMF (and then perhaps a longer-term one later). If you apply between the 1st and the 15th of February, you can get a grant of $500-$2000 by the 15th of March. To make that money go further, using Bing would cost $3/1k searches vs Google's $5/1k searches, although you might wish to set a cap if you get that grant. Cheers, and thanks for making this tool, Mdaniels5757 (talk) 15:37, 26 January 2020 (UTC)


 * I was coming here to suggest that (though Google is enough better that we should perhaps stick with it), but Earwig suggests in his last comment above that it's not the finances where the problem lies (I imagine the WMF consider Earwig (the tool) pretty good value and don't mind covering the cost; it's a fix, not finances, that is the current issue). Nosebagbear (talk) 22:50, 26 January 2020 (UTC)
 * Might Google make extra allowances for this specific use – helping to enforce copyrights? It would seem to be something that they would want to help us with. —[  Alan M 1 (talk) ]— 01:22, 27 January 2020 (UTC)
 * I have often thought about that, considering the immense value Wikipedia provides to Google, but past attempts to get Google to support the copyvio tool have not been successful as far as I know. At the most, they might be giving us a discount, and even that I'm not sure of. — Earwig   talk  01:48, 27 January 2020 (UTC)
 * I think the first step is to determine whether we truly are hitting the limit, or if there is a software glitch. If it turns out not to be a software glitch, I think someone ought to reach out to Google. I would hope they would be in complete agreement that using the tool to help remove copyright violations is something they'd be happy to provide for free, if only for the positive press. S Philbrick  (Talk)  16:54, 27 January 2020 (UTC)
 * This is the type of win that could actually be achieved by appeals to Jimbo, as unlike most complaints it is something that could possibly be actionable coming from him (and potentially wouldn't even come at a financial cost to the WMF, so a win-win). We would need to have a very good proposal ready to go to get the most traction. —  xaosflux  Talk 17:37, 27 January 2020 (UTC)
 * Maybe, but can we hold that thought? I see a small update at Phabricator. On the one hand, it could hardly be smaller (all it does is retitle the issue), but the revised title suggests it is a software glitch, not truly an issue of running into the limit. Let's just see what happens there first (although I do concur that, if necessary, this sounds like a good example of where an appeal to Jimbo might be fruitful). S Philbrick  (Talk)  20:45, 27 January 2020 (UTC)
 * Sphilbrick, apologies if it was not clear from the thread above, but yes, I am reasonably confident this is a software glitch. After things settle down, I'm excited by the idea of reaching out to Google to see if we can work out a better arrangement than what we currently have, with the WMF's help. —  Earwig   talk 04:52, 28 January 2020 (UTC)
 * Since this is such a useful tool, it would be good to have alternatives, even another instance running on another server. Or perhaps there could be some kind of javascript version that runs off the invoker's PC. Then it could use the free Google allowance. Graeme Bartlett (talk) 22:36, 27 January 2020 (UTC)
 * I like the idea of a local Javascript version, but it would be a bit of work to set up. I've had ideas for stability/usability improvements to the normal tool for a while, but I haven't found the free time to work on it. With another instance running on a different server, we need to be careful to avoid the current single-point-of-failure in the Google API proxy, which ideally means a separate API key with separate billing. —  Earwig   talk 04:52, 28 January 2020 (UTC)
 * The problem with client-side is that your private information (IP + user agent) is sent directly to Google. This isn't great for privacy, and indeed it might violate the Cloud Services' Terms of Use. I think the CSP headers will eventually block all requests outside the WMF anyway (for on-wiki user scripts, possibly with a way to manually whitelist domains), so you'd have to host the tool elsewhere, I suppose. The issue here with the IP changing (now fixed) is a rare event, but we were/are regularly hitting the API quota. I think we should focus on figuring that out, since from my recollection this was also a rare event up until recently? — MusikAnimal  talk  07:33, 29 January 2020 (UTC)

Feb 19: WikiWednesday Salon NYC

copyvios
Today I was getting 504 Gateway timeout when accessing https://tools.wmflabs.org/copyvios

I reported it on IRC at #wikimedia-cloud where @arturo investigated. He restarted the webservice because he saw in the logs `uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100)`

He also saw the following error:
 * open("/usr/lib/uwsgi/plugins/python3_plugin.so"): No such file or directory [core/utils.c line 3664]
 * !!! UNABLE to load uWSGI plugin: /usr/lib/uwsgi/plugins/python3_plugin.so: cannot open shared object file: No such file or directory !!!
 * [uWSGI] getting INI configuration from /data/project/copyvios/www/python/uwsgi.ini

arturo pointed out that metrics may indicate this tool needs higher CPU limits. Curb Safe Charmer (talk) 10:32, 17 February 2020 (UTC)
 * I came here to say that this very useful tool is down, and noticed that Curb Safe Charmer has taken several steps more than I could. I want to emphasize how important this tool is and express that I hope it'll be brought back soon. huji— TALK 16:39, 17 February 2020 (UTC)

It looks like phabricator T245426 is now tracking this aspect of the copyvios failures. David Brooks (talk) 16:43, 17 February 2020 (UTC)

Copyvio Detector not working
Hi Ben. The tool https://tools.wmflabs.org/copyvios seems to have once again stalled. Could you have a look if you have a minute? Thanks, — Diannaa (talk) 19:14, 24 January 2020 (UTC)
 * Seems to be back up for the moment, though Google's quota has been exceeded. I'll take a look in a bit to see if there's any explanation for the recent downtimes. —  Earwig   talk 02:41, 25 January 2020 (UTC)
 * Google quota was used up quite early the last two days. The tool was working only intermittently this afternoon, and was frequently timing out on the types of pages where it normally reports back promptly. Hope that helps. Thanks, — Diannaa (talk) 02:52, 25 January 2020 (UTC)
 * Our Google quota is already used up for the day. That's not normal. I think this problem has occurred before, though I couldn't find a Phabricator ticket. Perhaps there's a way to put a throttle on the tool so there's a limit set on any given IP as to how many searches they're allowed in a 24 hour period? — Diannaa (talk) 14:32, 25 January 2020 (UTC)
 * Sometimes I hit it near the end of the day (EST) if I'm going heavy on copy checks, but I haven't been able to use it once this week. I highly doubt the Blofeld CCI is taking up all that, since there's not much to check, so I'm not sure why it's been capped. Hopefully it's an easily fixable issue on the back end (luckily for me, I have CCIs with cited sources to keep me occupied, since that half of Earwig still works just fine). Wizardman  00:09, 26 January 2020 (UTC)
 * Again this morning we have no credits, and it's only 5 AM on the West Coast, where the Google offices are located. Either they are being used up at a frantic pace, or we are not actually able to access our credits, or we are not receiving any credits. — Diannaa (talk) 13:14, 26 January 2020 (UTC)

Toolforge does not give me access to user IPs for privacy reasons. I can throttle individual users by adding a login requirement using OAuth, but that will break API usage without a more complex setup. One thing I could try is leaving the API public and having a separate quota for it and normal tool usage. Recent logs suggest not many people are using the API with searches enabled, so giving the API the same quota as a single user might be safe.
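The per-user throttle described above (one shared daily bucket for the public API, individual buckets for logged-in users) could be sketched roughly like this. This is an illustrative sketch only, not the tool's actual code; the class and key names are hypothetical:

```python
from collections import defaultdict
from datetime import date


class DailyQuota:
    """Per-key usage counter that resets once per day (sketch)."""

    def __init__(self, limit):
        self.limit = limit
        self.day = date.today()
        self.counts = defaultdict(int)

    def allow(self, key):
        """Record one use for `key`; return False once the daily limit is hit."""
        today = date.today()
        if today != self.day:  # new day: reset all counters
            self.day = today
            self.counts.clear()
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


# The anonymous API could share one bucket with the same limit as a single user:
quota = DailyQuota(limit=2)
print(quota.allow("api"))  # True
print(quota.allow("api"))  # True
print(quota.allow("api"))  # False
```

A real deployment would key logged-in users by OAuth identity and persist the counters (e.g. in the tool's database) rather than in process memory.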

It's not clear to me exactly when the quota rolls over (some of Google's docs say 12 AM Pacific Time, which would be 8:00 UTC, but the logs suggest it's more like 7:00). Looking at just the logs between 7:00 UTC today and your message at 13:14 when the quota had been exceeded, there were only about 375 queries made. Each query is limited to 8 Google searches, so in the absolute worst case this means 3,000 searches, whereas I expect the quota to be closer to 10,000, so something is definitely wrong. Kaldari, are you able to check this on your end? Can we tell if something is wrong with Google?
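The worst-case arithmetic above can be spelled out explicitly (the 375 and 8 are the figures from the logs as described; the 10,000 is the expected daily quota):

```python
# Worst-case search usage implied by the logs (illustrative arithmetic)
queries = 375           # tool invocations between ~7:00 and 13:14 UTC
searches_per_query = 8  # maximum Google searches one check can issue
daily_quota = 10_000    # expected paid-tier quota

worst_case = queries * searches_per_query
print(worst_case)              # 3000
print(worst_case < daily_quota)  # True: usage alone cannot explain the cutoff
```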

Separately it seems we are having an issue with many requests taking a ridiculous amount of time, on the order of five minutes or more, even very simple ones. It is not clear to me why this would happen, because the request rate is not very high and should not exhaust the capacity of the tool's workers. I need to do further investigation to figure this out.

—  Earwig   talk 17:29, 26 January 2020 (UTC)
 * Are you sure the quota is 10,000? Looking at your source, it looks like you're using this API, which says it has a free quota of 100 queries per day (after that, it's $5 per 1,000 searches, up to 10k searches a day, which may be what you're thinking of). Mdaniels5757 (talk) 17:44, 26 January 2020 (UTC)
 * Yes, the WMF pays for this; we are not supposed to be using the free tier. It's not clear to me exactly what the budget is, but as you point out, the limit is 10k. —  Earwig   talk 18:02, 26 January 2020 (UTC)


 * Ahh, I see what's going on:

{
  "error": {
    "code": 403,
    "message": "The supplied API key is not configured for use from this IP address.",
    "errors": [
      {
        "message": "The supplied API key is not configured for use from this IP address.",
        "domain": "global",
        "reason": "forbidden"
      }
    ],
    "status": "PERMISSION_DENIED"
  }
}

(I should've been reading these errors from Google... my bad.) I'm not sure if this is because Wikimedia Cloud's IP range has changed or what, but I think this needs to be fixed on Kaldari's end. —  Earwig   talk 18:41, 26 January 2020 (UTC)
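Surfacing this class of error instead of silently treating it as "no results" is straightforward, since the Custom Search JSON API wraps failures in an `error` object as quoted above. A minimal sketch (not the tool's actual code; `ERROR_BODY` is just the response quoted in this thread):

```python
import json

# The 403 body quoted above, abbreviated to the fields we inspect
ERROR_BODY = """{
  "error": {
    "code": 403,
    "message": "The supplied API key is not configured for use from this IP address.",
    "status": "PERMISSION_DENIED"
  }
}"""


def api_error(body):
    """Return (code, message) if the response body is an API error, else None."""
    data = json.loads(body)
    err = data.get("error")
    if err:
        return err.get("code"), err.get("message")
    return None


result = api_error(ERROR_BODY)
print(result[0])  # 403
```

Logging or raising on a non-None result would have flagged the IP-restriction problem on the first failed request rather than after the quota appeared exhausted.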
 * I'm currently traveling on vacation. Has anyone filed a Phabricator task about the problem? That would probably be the quickest way to get it resolved. Kaldari (talk) 09:29, 27 January 2020 (UTC)
 * Looks like T243736 was filed. —  Earwig   talk 12:33, 27 January 2020 (UTC)
 * I put in a temporary fix for this yesterday, but we're still trying to figure out what's wrong. Kaldari (talk) 17:44, 29 January 2020 (UTC)
 * Thanks, I tried it out today and it did work. I am following the ticket as well — Diannaa (talk) 00:16, 30 January 2020 (UTC)


 * I'm not able to access it this morning, though with a different issue (a 504 Gateway Time-out). I know there was an issue on Saturday; no idea if it was the same as before or this one. Could someone else confirm/deny whether they're getting the same issue? Nosebagbear (talk) 09:20, 12 February 2020 (UTC)
 * It worked for me just now. Graeme Bartlett (talk) 11:39, 12 February 2020 (UTC)
 * But now I get the 504 Gateway Time-out error "openresty/1.15.8.1"
 * It was working great, but again we are getting the 504 Gateway Time-out error. Thanks, — Diannaa (talk) 00:11, 14 February 2020 (UTC)
 * Sorry for the recent trouble; it's been a busy couple of weeks, so I haven't had a chance to follow up. We're on a new API key, and we managed to identify some bots/crawlers that were making undesired requests. They've been blocked, so we should have fewer issues with the quota going forward. The current issue with the timeouts might be due to a change in the memory limit, because we just moved to slightly newer infrastructure. I'm going to raise the memory limit and we'll see if this helps. —  Earwig   talk 02:56, 14 February 2020 (UTC)
 * Today I'm again repeatedly getting the tool failing to load due to 504 Gateway Time-out error.— Diannaa (talk) 13:13, 14 February 2020 (UTC)
 * Me too. It's been non-responsive (except for the occasional breakthrough) for several days. I don't use it for Google searches, only ever the specific URL match (it's for text matching when attributing to a specific PD page). Even the query builder page is refusing to load - is the tool server itself sick? David Brooks (talk) 22:48, 15 February 2020 (UTC)
 * I'll look tonight. —  Earwig   talk 22:49, 15 February 2020 (UTC)
 * The statistics shown here are deceptive; even when the tool will load, I am getting a 504 gateway time-out error almost every time I try to perform a comparison. — Diannaa (talk) 14:00, 16 February 2020 (UTC)

I have created a new Phabricator task for the gateway time-out issue: T243736 — Diannaa (talk) 12:40, 17 February 2020 (UTC)
 * It's baaaack... direct URL comparison, at any rate. It does seem a little slower than usual, though. Thanks, whoever fixed it. David Brooks (talk) 20:19, 17 February 2020 (UTC)
 * And it's gone again. No response at all. David Brooks (talk) 23:36, 17 February 2020 (UTC)

Earwig's Copyvio Detector in Czech
Hello, can I translate Earwig's Copyvio Detector to Czech language? I am translator at Translatewiki.net. --Patriccck (talk) 12:59, 30 January 2020 (UTC)
 * Is it possible? Patriccck (talk) 14:56, 21 February 2020 (UTC)
 * Hi Patriccck, sorry for not responding earlier. Thank you for offering to help. I don't have the tool set up to be translated right now, but it's on my to-do list after I fix the current reliability issues. As soon as it's ready, I'll let you know. —  Earwig   talk 01:25, 22 February 2020 (UTC)


 * Thank you. Patriccck (talk) 12:09, 22 February 2020 (UTC)

Earwig question but not about timeouts
I am working on the Dr Blofeld CCI, which is challenging. Many items can be cleared easily, but to say there are many, many items to check understates the magnitude. Checking for potential copyright issues in edits over 10 years old is quite challenging, and the Earwig is indispensable. Let me quickly state that I'm subscribed to Phabricator T245426 and this is not about the outages. I have some questions that arise when it appears to be working.

I interacted with Diannaa (here, but I will summarize everything).

This article, Economy of Bács-Kiskun, had an initial edit resulting in oldid=88202525, which I ran through Earwig (Earwig run).

The results suggested the current Wikipedia article Economy of Bács-Kiskun, but stated that a violation is unlikely, with 0.0% confidence.

That's very puzzling as there is a lot of overlap.

However when Diannaa ran the revision ID and selected the URL comparison using the current URL, she got a 98.4% match.

That makes a lot of sense.

Diannaa's run

My concern is that I have cleared a few items because I ran the tool and got 0.0% confidence, but now I'm worried that something went wrong. Do you have any insight into why my use of the tool generated a 0.0% result? Am I using it wrong?-- S Philbrick (Talk)  20:28, 28 February 2020 (UTC)


 * Hi Sphilbrick. In your initial search, note that under "Checked Sources", all URLs have a confidence of "Excluded". This means they are present in the excluded URL list, so the tool did not actually download those pages nor perform any comparisons against them. Wikipedia itself is in the exclusion list because it's freely licensed, results in many false positives, and users are typically not interested in comparing an article to itself. I recognize your use case here is different from normal. To bypass the exclusion list, you need to do a direct URL comparison, as you noted—the easiest way is to click on the "Compare" link next to "Excluded". — Earwig   talk  08:45, 29 February 2020 (UTC)
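The exclusion behavior described above amounts to a host check performed before any page is downloaded. A minimal sketch of the idea (the domain list and function names here are hypothetical; the real tool's exclusion list is larger and maintained separately):

```python
from urllib.parse import urlparse

# Hypothetical exclusion list; Wikipedia is excluded because it's freely
# licensed and comparing an article to itself produces false positives.
EXCLUDED_DOMAINS = {"wikipedia.org"}


def is_excluded(url):
    """True if the URL's host is (a subdomain of) an excluded domain,
    in which case the tool skips downloading and comparing the page."""
    host = urlparse(url).netloc.lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in EXCLUDED_DOMAINS
    )


print(is_excluded("https://en.wikipedia.org/wiki/Foo"))  # True
print(is_excluded("https://example.com/page"))           # False
```

A direct URL comparison bypasses this check entirely, which is why running the same page through "Compare" produced the 98.4% match.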
 * Thanks, that explains why I didn't get a match when I thought for sure it should match. S Philbrick  (Talk)  13:37, 29 February 2020 (UTC)