Wikipedia:Bots/Requests for approval/Cyberbot II 5a


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Cyberbot II 5a
Operator:

Time filed: 01:46, Tuesday, March 15, 2016 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): PHP

Source code available: here

Function overview: Addendum to 5th task. Cyberbot will now review links and check to see if they are dead. Based on the configuration on the config page, Cyberbot will look at a link and retrieve a live status from the given source. It will update a DB value, default 4, about that link.

Links to relevant discussions (where appropriate): none

Edit period(s): continuous

Estimated number of pages affected: Analyzes 5 million articles; the initial run will probably affect half of that.

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: When the bot checks a link, it checks the result against the bot's DB and assigns the link a value from 0 to 4. 0 represents the site being dead, 1-3 represent the site being alive, and 4 indicates an unknown state and is the default value. On every pass the bot makes over a URL, if the URL is found to be dead at that moment, the integer is decreased by 1. If it is found to be alive, the value is reset to 3. Once the value is 0, the bot no longer checks whether the link is alive, since a site found to be dead at least 3 times is most likely going to remain dead, and the bot thus conserves resources.—cyberpower  Chat:Online 01:46, 15 March 2016 (UTC)
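
The counter described in the function details can be sketched as a small state function. This is a minimal illustration in Python rather than the bot's own PHP, and the name `next_live_state` is hypothetical. One assumption is labeled in the code: a link in the unknown state (4) is treated like a fresh 3 on its first failure, so that three failed passes reach 0.

```python
DEAD = 0        # dead; the bot stops checking
ALIVE_MAX = 3   # value a link is reset to whenever it is found alive
UNKNOWN = 4     # default value for a link that has never been checked


def next_live_state(state: int, found_dead: bool) -> int:
    """Return the new DB value for a link after one pass."""
    if state == DEAD:
        # A link at 0 is no longer re-checked, so it stays at 0.
        return DEAD
    if not found_dead:
        # Any live response resets the counter.
        return ALIVE_MAX
    # ASSUMPTION: an unknown link (4) counts down from 3 on its first failure.
    base = ALIVE_MAX if state == UNKNOWN else state
    return base - 1


state = UNKNOWN
for _ in range(3):
    state = next_live_state(state, found_dead=True)
print(state)  # 0
```

Under this sketch a single live response at any point before 0 puts the link back at 3, which is why only consecutive failures lead to a dead classification.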

Discussion

 * Checking if a link is really dead or not is a million dollar question because of soft 404s which are common. There is a technique for solving this problem described here and code here (quote):
 * Basically, you fetch the URL in question. If you get a hard 404, it’s easy: the page is dead. But if it returns 200 OK with a page, then we don’t know if it’s a good page or a soft 404. So we fetch a known bad URL (the parent directory of the original URL plus some random chars). If that returns a hard 404 then we know the host returns hard 404s on errors, and since the original page fetched okay, we know it must be good. But if the known dead URL returns a 200 OK as well, we know it’s a host which gives out soft 404s. So then we need to test the contents of the two pages. If the content of the original URL is (almost) identical to the content of the known bad page, the original must be a dead page too. Otherwise, if the content of the original URL is different, it must be a good page.
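
The quoted technique could be sketched roughly as follows. This is not the linked code; it is a hedged Python illustration in which `fetch` is a hypothetical callable returning a status code and a body, the 0.9 similarity threshold is an assumed stand-in for "(almost) identical", and only the 404 and 200 cases from the quote are handled.

```python
import random
import string
from difflib import SequenceMatcher
from typing import Callable, Tuple
from urllib.parse import urlsplit, urlunsplit

# Hypothetical fetcher: takes a URL, returns (status_code, body_text).
Fetch = Callable[[str], Tuple[int, str]]


def known_bad_url(url: str, length: int = 20) -> str:
    """Build a URL that should not exist: parent directory plus random chars."""
    parts = urlsplit(url)
    parent = parts.path.rsplit("/", 1)[0]
    junk = "".join(random.choices(string.ascii_lowercase, k=length))
    return urlunsplit((parts.scheme, parts.netloc, f"{parent}/{junk}", "", ""))


def is_probably_dead(url: str, fetch: Fetch, threshold: float = 0.9) -> bool:
    status, body = fetch(url)
    if status == 404:
        return True  # hard 404: the page is dead
    probe_status, probe_body = fetch(known_bad_url(url))
    if probe_status == 404:
        return False  # host sends hard 404s, so the original page is good
    # Host sends soft 404s: compare the page with the known-bad page.
    similarity = SequenceMatcher(None, body, probe_body).ratio()
    return similarity >= threshold  # ASSUMED cut-off for "almost identical"
```

Passing the fetcher in as a parameter keeps the comparison logic testable without network access; a real checker would also need timeouts and handling for other status codes.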


 * -- Green  C  04:40, 27 March 2016 (UTC)
 * Hi. That is a good point. We've discussed this and decided, for now, to not check for soft 404s. It's never going to be 100% reliable. So for now, we're checking for: hard 404s (and other bad response codes) and redirects to domain roots only. It's less than optimal, but at least we can be sure we don't end up tagging non-dead links as dead. It turns out it's quite easy for a search engine or big web scrapers to detect soft 404s and various other kinds of dead links (ones replaced by link farms etc.). For this reason, we're seeking Internet Archive's help on this problem. They've been very helpful so far and promised to look into this and share their code/open an API for doing this. -- NKohli (WMF) (talk) 03:16, 28 March 2016 (UTC)
 * That would be super to see when available as I could use it as well. Some other basic ways of detecting 404 redirects are to look for these strings in the new path (mixed case): 404 (e.g. 404.htm, or /404/ etc.), "not*found" (variations such as Not_Found etc.), /error/. I've built up a database of around 1000 probable soft 404 redirects and can see some repeating patterns across sites. It's very basic filtering, but catches some more beyond root domain. -- Green  C  04:10, 28 March 2016 (UTC)
 * Awesome, thanks! I'll add those filters to the checker. -- NKohli (WMF) (talk) 04:23, 28 March 2016 (UTC)
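
The redirect filters mentioned in this thread (redirects to a domain root, plus the path strings 404, "not*found" and /error/) could look something like the sketch below. The function names and the exact regular expressions are assumptions made for illustration, not the checker's actual rules.

```python
import re
from urllib.parse import urlsplit

# ASSUMED regexes for the strings listed in the discussion.
SOFT_404_PATTERNS = [
    re.compile(r"404"),                            # 404.htm, /404/ ...
    re.compile(r"not[\W_]*found", re.IGNORECASE),  # NotFound, Not_Found, not-found
    re.compile(r"/error/", re.IGNORECASE),
]


def redirects_to_root(original_url: str, final_url: str) -> bool:
    """True if a non-root URL was redirected to a domain root."""
    original_path = urlsplit(original_url).path
    final_path = urlsplit(final_url).path
    return original_path not in ("", "/") and final_path in ("", "/")


def looks_like_soft_404(final_url: str) -> bool:
    """True if the redirect target's path matches a known soft-404 pattern."""
    path = urlsplit(final_url).path
    return any(pattern.search(path) for pattern in SOFT_404_PATTERNS)
```

As noted in the thread, this is very basic filtering: the bare `404` pattern will also match unrelated paths that happen to contain those digits, so a match is a hint rather than proof.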


 * Are the checks spaced out a bit? Something could be down for a few days and then come back up for a while. Also, can we clarify the goal here; is this to add archival links to unmarked links, or to tag unmarked links as dead which have no archival links, or to untag marked-as-dead links? — Earwig   talk  05:57, 29 March 2016 (UTC)
 * Cyberbot can do all three, but the onwiki configuration only allows for the first two. Since Cyberbot is processing a large wiki, the checks are naturally spaced out.—cyberpower  Chat:Online 14:44, 29 March 2016 (UTC)
 * BAGAssistanceNeeded Can we move forward with this?—cyberpower  Chat:Online 14:03, 5 April 2016 (UTC)
 * "naturally spaced out" I would want some sort of minimum time here in the system...  ·addshore·  talk to me! 06:58, 8 April 2016 (UTC)
 * I can program it to wait at least a day or 3 before running the check again. That would give the link 3 or 9 days, in case it was temporarily down.—cyberpower  Chat:Online 15:31, 8 April 2016 (UTC)
 * Let's try 3 days of spacing. Is it easy to trial this component as part of the bot's normal runtime? Can you have it start maintaining its database now and after a week or two we can come back and check what un-tagged links it would have added archival links for or marked as dead? — Earwig   talk  23:28, 9 April 2016 (UTC)
 * Unfortunately the bot isn't designed that way. If the VERIFY_DEAD setting is off, it won't check anything, nor will it tag anything.  If it's on it will do both of those things.  I can create a special worker to run under a different bot account so we can monitor the edits more easily.—cyberpower Chat:Limited Access 23:36, 9 April 2016 (UTC)
 * How often does the bot pass over a URL? (Ignoring any 3-day limits.) In other words, are you traversing through all articles in some order? Following transclusions of some template? — Earwig   talk  01:32, 10 April 2016 (UTC)
 * Ideally, given the large size of this wiki, there would be unique workers each handling a list of articles beginning with a specific letter. Due to technical complications, there is only one worker that traverses all of Wikipedia, and one that handles only articles with dead links.  So it would likely take much longer than 3 days to return to each URL, until the technical complication is resolved.  What I can do is start up the checking process, and compile a list of URLs that have a dead status of 2 or 1, which means the URL failed the first and/or second passes.—cyberpower Chat:Limited Access 02:10, 10 April 2016 (UTC)
 * That's similar to what I meant by "Can you have it start maintaining its database now...", though as you suggest it might make more sense to check what's been identified as dead at least once so we don't need to wait forever. Okay, let's try it. —  Earwig   talk  19:50, 10 April 2016 (UTC)
 * —cyberpower  Chat :Online 19:27, 24 April 2016 (UTC)
 * BAGAssistanceNeeded The bot has proven to be reasonably reliable, and the mentioned issues below have been addressed and installed.—cyberpower  Chat :Online 01:51, 4 May 2016 (UTC)
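
The minimum spacing agreed on in this thread (3 days between checks of the same URL) could be enforced with a guard along these lines. It is only a sketch: `due_for_check` and its parameters are hypothetical names, and storing a last-checked timestamp per URL is an assumption about the DB.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_GAP = timedelta(days=3)  # spacing suggested in the discussion


def due_for_check(last_checked: Optional[datetime], now: datetime,
                  live_state: int) -> bool:
    """Decide whether a URL may be checked again on this pass."""
    if live_state == 0:
        return False  # fully dead links are no longer checked
    if last_checked is None:
        return True   # never checked before
    return now - last_checked >= MIN_GAP


now = datetime(2016, 4, 12, 12, 0)
print(due_for_check(datetime(2016, 4, 10, 12, 0), now, 2))  # False
print(due_for_check(datetime(2016, 4, 9, 12, 0), now, 2))   # True
```

With a guard like this, a slow traversal simply checks less often; it can never check more often than the minimum gap.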

DB Results
In an effort to more easily show what is going on in Cyberbot's memory, I have compiled a list of URLs with a live status of 2 or 1, which indicate they have failed their first, or second, pass respectively.

Bots/Requests for approval/Cyberbot II 5a/DB Results

I've looked through the first chunk of these results. It looks like there are several false positives. The 2 most common types appear to be:
 * 1) Redirects that add or remove the 'www' hostname. This is a bug in the soft-404 detection code. I'll create a Phabricator ticket for it.
 * 2) Timeouts. Several pages (and especially PDFs) seem to take longer than 3 seconds to load. We should consider increasing the timeout from 3 seconds to 5 or 10 seconds. We should also just exclude PDFs entirely. I gave up on http://www.la84foundation.org/6oic/OfficialReports/1924/1924.pdf after waiting 3 minutes for it to load.
There are also some weird cases I haven't figured out yet:
 * http://au.eonline.com/news/386489/2013-grammy-awards-winners-the-complete-list sometimes returns a 405 Method Not Allowed error and sometimes returns 200 OK when accessed via curl. In a browser, however, it seems to always return 200 OK.
 * http://gym.longinestiming.com/File/000002030000FFFFFFFFFFFFFFFFFF01 always returns a 404 Not Found error when accessed via curl, but always returns 200 OK from a browser.
I confirmed that these are not related to User Agent. Maybe there is some header or special cookie handling that we need to implement on the curl side. Kaldari (talk) 00:28, 14 April 2016 (UTC)
 * According to Cyberpower, the bot is actually using a 30 second timeout (and only loading headers). I'll retest with that. Kaldari (talk) 00:45, 14 April 2016 (UTC)
 * Timeouts should be handled in a sane and fail-safe way. If something times out, any number of things could be going on, including bot-side, host-side, and anything in between.  Making a final "time to replace this with an archive link" call is premature if you're not retrying these at least a couple of times over the course of several days.  Also, you might try to check content-length headers when it comes to binaries like PDFs. If you get back a content-length that's over 1MB or a content-type that matches the one you're asking for (obviously apart from things like text/html, application/json), chances are the file's there and valid; it's highly unlikely that it's a 404 masquerading as a 200.  Similarly, if an image request returns something absurdly tiny (such as a likely transparent pixel), it might also be suspicious. -- slakr  \ talk / 04:14, 16 April 2016 (UTC)
 * Actually, it looks like yields two back-to-back 301 redirects. Following up to 5 redirects is probably enough for 99.99% of links, I would guess. For example, if you're using curl, it's most likely CURLOPT_FOLLOWLOCATION + CURLOPT_MAXREDIRS, or on the command line,  . -- slakr  \ talk / 06:03, 16 April 2016 (UTC)
 * I'm not sure I follow with the timeouts. If it is a temporary thing, the second pass will likely not time out, and the status resets.  When the bot checks a URL, it needs to receive a TRUE response 3 times consecutively, where each check is spaced apart at least 3 days, for it to be officially classified as dead and the bot to act on it.—cyberpower  Chat :Offline 04:24, 16 April 2016 (UTC)
 * The timeouts were a result of me testing URLs with checkDeadlink which was the wrong function to test with, and having a very slow internet connection (since I'm in Central America right now). There should be no timeout issue with the actual bot as it's using a 30 second timeout and only downloading the headers. It looks like the real issue with http://www.la84foundation.org/6oic/OfficialReports/1924/1924.pdf is the user agent string, which will be fixed by . As soon as you pass it a spoofed user agent, it returns a 200. I still have no idea what's happening with http://www.eonline.com/au/news/386489/2013-grammy-awards-winners-the-complete-list, though. I'm not sure how it's returning a different status code for curl than for web browsers (although it isn't 100% consistent). Kaldari (talk) 15:56, 18 April 2016 (UTC)
 * There could be bot detection mechanisms at work. Google bot detection and mitigation. There are some techniques to convince remote sites you are not a bot: a legitimate-looking agent string helps, as does not making too many repeat requests to the same site, and not making them too fast. -- Green  C  14:48, 23 April 2016 (UTC)
 * Cyberbot only scans each page every 3 days. That should be spaced apart far enough.—cyberpower Chat :Limited Access 15:28, 23 April 2016 (UTC)
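
Two of the suggestions in this thread, capping redirect-following at 5 and using content-type and content-length headers as a sanity check, could be combined in a header-only check like the sketch below. This is an illustration under stated assumptions, not the bot's code: `head` is a hypothetical callable that performs a HEAD-style request and returns a status code plus a dict of lower-cased header names, and the 100-byte "tiny image" threshold is invented for the example.

```python
import mimetypes
from typing import Callable, Dict, Tuple
from urllib.parse import urljoin

# Hypothetical header fetcher: URL -> (status_code, lower-cased headers).
Head = Callable[[str], Tuple[int, Dict[str, str]]]

MAX_REDIRECTS = 5                 # cap suggested in the discussion
ONE_MB = 1024 * 1024
TINY_IMAGE_BYTES = 100            # ASSUMED threshold for "absurdly tiny"
GENERIC_TYPES = {"text/html", "application/json"}
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve(url: str, head: Head, max_redirects: int = MAX_REDIRECTS):
    """Follow Location headers, giving up after max_redirects hops."""
    for _ in range(max_redirects + 1):
        status, headers = head(url)
        location = headers.get("location")
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be relative
            continue
        return url, status, headers
    raise RuntimeError("too many redirects")


def assess(url: str, headers: Dict[str, str]) -> str:
    """Classify a 200 response as likely-valid, suspicious, or unknown."""
    actual = headers.get("content-type", "").split(";")[0].strip().lower()
    raw_length = headers.get("content-length", "")
    length = int(raw_length) if raw_length.isdigit() else None
    expected, _ = mimetypes.guess_type(url)  # type implied by the extension
    if actual.startswith("image/") and length is not None and length < TINY_IMAGE_BYTES:
        return "suspicious"    # e.g. a placeholder pixel
    if length is not None and length > ONE_MB:
        return "likely-valid"  # large body: unlikely to be an error page
    if expected and actual == expected and actual not in GENERIC_TYPES:
        return "likely-valid"  # got the binary type that was asked for
    return "unknown"
```

An "unknown" result here only means the headers give no extra evidence; it says nothing about whether the link is dead.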

These exist but are in the "dead" list:
 * http://thomas.gov/cgi-bin/bdquery/z?d112:SN00968:@@@P
 * http://thomas.loc.gov/cgi-bin/bdquery/z?d107:SJ00046:@@@P
 * http://thomas.loc.gov/cgi-bin/query/B?r112:@FIELD%28FLD003+s%29+@FIELD%28DDATE+20111217%29
 * http://timesofindia.indiatimes.com/entertainment/hindi/movie-reviews/Vicky-Donor/movie-review/12729176.cms
 * http://mathrubhuminews.in/ee/ReadMore/19766/indian-made-gun-makes-waves-in-expo/E
 * http://mfaeda.duke.edu/people
 * https://www.yahoo.com/music/bp/chart-watch-extra-top-christmas-album-234054391.html
 * http://news.bbc.co.uk/2/hi/uk_news/england/coventry_warwickshire/6236900.stm
 * This one has a throttle on "suspected robots" when I'm proxying off a datacenter. Perhaps exceptions should be made for similar patterns of text.
The first thomas hit was random; when I clicked the other thomas ones, I was looking for @@s. The following ones, from different domains, were semi-pseudo-random (I just started going down the list clicking random new domains), at a rate of 2 out of 5 falsely marked "dead". That's a high false-positive rate, and this list is very certainly not exhaustive. -- slakr \ talk / 05:21, 6 May 2016 (UTC)
 * Several updates were deployed during and after the trial. I just ran every link through phpunit, including the throttled one, and they all came back as alive.  I re-ran the throttled one persistently, and kept getting a live response.  So the flagged links are not going to be considered dead by the bot.—cyberpower  Chat :Offline 06:23, 6 May 2016 (UTC)
 * Basically the same as before. -- slakr \ talk / 06:27, 6 May 2016 (UTC)
 * Sorry, I look at this now and see it was outputting to Bots/Requests for approval/Cyberbot II 5a/DB Results and not userspace; that's totally fine; I meant you can keep writing wherever it was writing before, too. It doesn't just have to be userspace; the main thing is not to have it actually editing articles. -- slakr  \ talk / 01:52, 7 May 2016 (UTC)
 * —cyberpower  Chat :Online 15:38, 26 May 2016 (UTC)
 * It would seem it has an issue with allafrica.com. Tracked in Phabricator.—cyberpower Chat :Online 15:56, 26 May 2016 (UTC)
 * Never mind. It's a paywall, and checkIfDead is not reliable with paywalls.  A feature is in the works, unrelated to the checkIfDead class, so this domain can be ignored during this trial.—cyberpower Chat :Limited Access 16:19, 26 May 2016 (UTC)

DB Results 2
So we have currently marked all the false positives. Some links are also behind paywalls; handling those is a feature in the works right now. So the community tech team is currently analyzing why the false positives are false positives, while I am testing and debugging the new paywall addon.
 * Paywall detection has now been implemented. The bot relies on the  tag on already cited sources.  When it's detected, its domain gets flagged and all subsequent URLs with that domain get skipped, even if they're not tagged.  This tag only results in an internal operation, and has no externally visible result, other than links in a given domain not being checked.  Users can still tag those links as dead, however, and the bot will respond to that.
 * If I can draw the attention of a BAGger to which tracks the current results.  I'd say the results are pretty good.  Can BAG comment here too?—cyberpower Chat :Limited Access 18:44, 18 June 2016 (UTC)
 * To draw more attention, I think this is ready for approval. This class has been upgraded and extensively tested.  Our current estimated false positive rate is 0.1%.—cyberpower Chat :Online 15:46, 23 June 2016 (UTC)
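
The paywall handling described in this section (one tagged citation flags the whole domain, and later URLs on that domain are skipped) might be sketched as follows. The class and method names are hypothetical, and matching on the exact hostname is an assumption; the bot may normalize domains differently.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lower-cased hostname of a URL, or '' if there is none."""
    return urlsplit(url).hostname or ""


class PaywallRegistry:
    """Remembers domains seen with a paywall tag (hypothetical helper)."""

    def __init__(self) -> None:
        self._domains = set()

    def flag(self, url: str) -> None:
        # Called when an already cited source carries the paywall tag.
        self._domains.add(domain_of(url))

    def should_skip(self, url: str) -> bool:
        # ASSUMPTION: exact hostname match, so "www." variants are distinct.
        return domain_of(url) in self._domains


registry = PaywallRegistry()
registry.flag("http://allafrica.com/stories/1.html")
print(registry.should_skip("http://allafrica.com/stories/2.html"))  # True
print(registry.should_skip("http://news.bbc.co.uk/2/hi/uk_news/"))  # False
```

Skipping only affects the automatic liveness check; as the thread notes, links that users tag as dead by hand are still handled.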

how many edits do you expect? -- Magioladitis (talk) 17:09, 28 June 2016 (UTC)
 * It's really hard to say. The DB reports roughly 250,000 untagged links as dead.  Our false positive rate is currently estimated at 0.1%.  At an average of maybe 3 links per page, roughly 85,000 pages will have links that get tagged.  This runs in conjunction with the already functioning and approved task of rescuing those links before resorting to tagging them, and of fixing already tagged dead links.—cyberpower Chat :Limited Access 17:12, 28 June 2016 (UTC)
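
The estimate in this reply can be reproduced with a few lines of arithmetic. All three inputs are the rough figures quoted in the reply; the variable names are only for illustration.

```python
untagged_dead_links = 250_000   # links the DB reports as dead
false_positive_rate = 0.001     # estimated 0.1%
links_per_page = 3              # rough average given in the reply

pages_affected = untagged_dead_links / links_per_page
expected_false_positives = untagged_dead_links * false_positive_rate

print(round(pages_affected))            # 83333
print(round(expected_false_positives))  # 250
```

The page count comes out slightly under the "roughly 85,000" quoted, which is consistent with the inputs themselves being approximations.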

due to the large number of edits I would like an additional bot trial. Is this OK with you? For example, tag 1,000 pages this time and wait 5 days for any remarks? -- Magioladitis (talk) 17:54, 28 June 2016 (UTC)
 * The only problem with that is that it would be hard to follow which pages it tagged, since they would get lost in the numerous other edits it's already making. Otherwise I have no objections.—cyberpower Chat :Limited Access 18:49, 28 June 2016 (UTC)
 * the trick is this time I hope that the problems will come to you instead of you finding them! We won't check every single page out of the 1,000. This will give us a large sample to let the community check the problems. We can also double check a smaller portion. -- Magioladitis (talk) 21:30, 28 June 2016 (UTC)

-- Magioladitis (talk) 21:32, 28 June 2016 (UTC)
 * Based on this spike, which is when I deployed the update, Cyberbot has now probably edited more than 1000 pages. Cautiously, I'm going to say .  An immediate response would be much appreciated.—cyberpower Chat :Offline 06:02, 2 July 2016 (UTC)
 * nice work! Ping me in 4-5 days so we can give the final approval. We just have to sit and wait to see if anyone in the community spots an error. I'll start checking some of the links myself too. -- Magioladitis (talk) 06:33, 2 July 2016 (UTC)
 * I've spotted a couple of remnant false positives that were fixed in the latest commit, which I deployed before starting the trial. When a link is seen as fully dead, the status needs to be manually reset in the DB, as the bot no longer checks links marked as fully dead.  There shouldn't be many false positives left over, as I did a massive DB purge to reset those caught in error.  Other cases of dead links can be explained by tags placed elsewhere.  Because TAG_OVERRIDE is set to 1, a dead tag placed elsewhere immediately sets the live_state to 0, meaning fully dead.  In other instances, where archives are already present elsewhere, the bot assumes the URL is implied to be dead, and starts to attach those archives to identical links elsewhere.  I have recently overhauled the bot's archive management routines quite significantly, to improve functionality and to clean up the code a bit more.  So far, I haven't seen anything significant, aside from a few stray lingering false positives. :-)—cyberpower Chat :Offline 06:45, 2 July 2016 (UTC)
 * It is time. I'm back.  I have received a few bug reports, but none related to the checking of dead links.  I'm working to patch the 2 reported bugs on my talk page right now.—cyberpower Chat :Limited Access 11:27, 6 July 2016 (UTC)

does this mean that the bot chose 1000 urls, did 3 passes on them in a period of 3 days and found them reporting 404 in each of these 3 passes? -- Magioladitis (talk) 06:54, 7 July 2016 (UTC)
 * It actually does a check on the URL every 3 days. During the trial a bunch of URLs only received their second pass.  After the 3rd, if the URL is still coming back dead, it will begin to mark those too.
 * ping again. While waiting, I knocked out the reported bugs unrelated to this BRFA, but related to the bot.  AFAIK, approval of this class is the last piece to complete IABot.—cyberpower Chat :Limited Access 21:44, 10 July 2016 (UTC)

-- Magioladitis (talk) 07:01, 11 July 2016 (UTC)

Thanks for fixing the bug. In fact, you were right. I wanted to ask whether it was related to this BRFA before approving :) -- Magioladitis (talk) 07:02, 11 July 2016 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.