Wikipedia:Bots/Requests for approval/BlevintronBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.

BlevintronBot
Operator:

Time filed: 16:21, Monday March 26, 2012 (UTC)

Automatic, Supervised, or Manual: Supervised during trial period.

Programming language(s): Ruby

Source code available: the source code is open source

Function overview: Mark broken links in articles; send user talk messages to Wikipedians to request help repairing those links.

Links to relevant discussions (where appropriate): old discussion at the village pump idea lab; new discussion at VP.

Edit period(s): continuous, with configurable limits (max edits/day, max edit rate, etc)

Estimated number of pages affected: 10 articles/day during trial period; at most three user-talk messages per article edited. All limits are configurable.

Exclusion compliant (Y/N): Yes

Already has a bot flag (Y/N):

Function details
The purpose of the bot is to combat link rot. At a high level, the bot performs these tasks:
 * 1) Scrapes random Wikipedia articles;
 * 2) Checks the links in those articles over a period of several days;
 * 3) Edits articles, marking broken links with a template;
 * 4) Sends user talk messages to authors asking for help;
 * 5) Collects statistics about its prior actions to help measure whether the bot is effective or annoying; and
 * 6) Uploads its source code to its user page.
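The flow of these tasks can be read as the following Ruby sketch (Ruby being the bot's language). Every name here is hypothetical and the steps are injected as callables; this is not the bot's actual source, only a picture of the cycle.

```ruby
# Hypothetical sketch of one bot cycle; task 6 (source upload) was later
# withdrawn, so only tasks 1-5 appear. Each step is passed in as a callable.
def run_cycle(scraper:, checker:, editor:, notifier:, stats:)
  order = []
  articles = scraper.call                 # 1) scrape random articles
  order << :scraped
  dead_links = checker.call(articles)     # 2) re-check links over several days
  order << :checked
  editor.call(dead_links)                 # 3) mark broken links in articles
  order << :edited
  notifier.call(dead_links)               # 4) ask contributors for help
  order << :notified
  stats.call                              # 5) measure effectiveness
  order << :measured
  order
end
```

Because the steps are injected, each one can be stubbed out and the cycle exercised in isolation.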

These three case studies succinctly demonstrate the bot's main actions: Case Study: 'Johnny Unitas Stadium',  Case Study: 'Mohammed Ali Hammadi', and  Case Study: 'Sean Kennard'.

 * Scraping

Articles are selected randomly, and URLs are extracted from those articles. URLs are excluded if (i) they are already marked dead, or (ii) they have an archiveurl= alternative URL specified.

 * Checking

Random selection helps to address articles from the long tail which might otherwise be neglected. Links are checked at least 3 times over a period of 5 days, from a good network position (a university in the USA). Repeated tests help to avoid false positives.

 * Editing

After the link trial period, the bot tags any broken link that is present in the latest revision of the article. 'Broken links' are those which consistently demonstrated (i) a timeout error, (ii) a DNS error, (iii) an HTTP 404 error, or (iv) an HTTP 5xx error during their trial period.
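The dead-link criteria above could be sketched as follows (illustrative Ruby; the symbols and function names are mine, not the bot's actual source):

```ruby
# A single probe's outcome is reduced to a symbol; a link is treated as
# broken only if every probe over the trial period (at least 3 checks)
# failed with one of the four error kinds listed above.
DEAD_OUTCOMES = [:timeout, :dns_error, :http_404, :http_5xx]

# `code` is nil when no HTTP response was received (timeout / DNS failure
# is passed in as `error` instead).
def classify_probe(code, error = nil)
  return error if [:timeout, :dns_error].include?(error)
  return :http_404 if code == 404
  return :http_5xx if code && (500..599).include?(code)
  :alive
end

def consistently_dead?(outcomes)
  outcomes.size >= 3 && outcomes.all? { |o| DEAD_OUTCOMES.include?(o) }
end
```

A single successful probe anywhere in the trial clears the link, which is what keeps transient outages from being tagged.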

UPDATE TO ADD: If a suitable replacement link is found in the archives, then the bot will automatically update citation templates (Citation, Cite web, etc.) with archiveurl and archivedate. By 'suitable', I mean that the archive was captured +/- 6 months of the date reported in the accessdate parameter, or if absent, the date of the first article revision that includes the URL. The bot will not send User talk messages in such cases. (see discussion with User:Hellknowz, below). Blevintron (talk) 02:35, 28 March 2012 (UTC)
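The +/- 6 month "suitable replacement" rule above amounts to a simple date-window check; a sketch (the function name is hypothetical):

```ruby
require 'date'

# The archive snapshot must fall within +/- 6 months of the cited access
# date (or, when accessdate is absent, of the first revision with the URL).
def suitable_archive?(archive_date, access_date, window_months: 6)
  earliest = access_date << window_months  # Date#<< steps back N months
  latest   = access_date >> window_months  # Date#>> steps forward N months
  archive_date.between?(earliest, latest)
end
```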

There are strict limits on (i) the rate of article edits, (ii) the number of articles edited in one calendar day, (iii) the number of links to correct during a single edit (to make reviews simpler), and (iv) the minimum timeout before the bot will re-edit the same article (to avoid edit wars with humans or other bots). The bot respects exclusions and the  template.
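A throttle guard along these lines would cover limits (i), (ii), and (iv); the parameter names and values are illustrative placeholders, not the bot's real configuration keys:

```ruby
# Hypothetical limits mirroring those described above.
LIMITS = {
  max_edits_per_day:   10,            # articles edited per calendar day
  min_edit_gap_secs:   60,            # spacing between any two edits
  reedit_cooloff_secs: 7 * 24 * 3600  # before re-editing the same article
}

def may_edit?(edits_today:, secs_since_last_edit:, secs_since_article_edit:,
              limits: LIMITS)
  edits_today < limits[:max_edits_per_day] &&
    secs_since_last_edit >= limits[:min_edit_gap_secs] &&
    secs_since_article_edit >= limits[:reedit_cooloff_secs]
end
```

The per-article cool-off is what prevents the edit wars with humans or other bots mentioned above.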

 * User talk messages

Every time it marks a broken link, the bot scans the article history to find the user who first added that link to the article. It sends a polite user talk message asking that user to help correct the broken link. If available, that message includes a possible match from the archives. The bot respects exclusions on user pages and user talk pages, and advertises this opt-out feature at the end of every user talk message. The bot puts a strict limit on the number of messages it will send to one user in a calendar day. The bot will not send these to IP users or to accounts marked as a bot.
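The attribution scan can be sketched naively as below. The `revisions` array (oldest first) stands in for data fetched via the MediaWiki API (prop=revisions with rvprop=user|content); the field names are illustrative. IPs and bot accounts get no message, per the rules above.

```ruby
# Walk revisions oldest-first and credit the first revision whose wikitext
# contains the URL; return nil for IP (anon) or bot accounts.
def first_adder(revisions, url)
  rev = revisions.find { |r| r[:content].include?(url) }
  return nil unless rev
  return nil if rev[:anon] || rev[:bot]
  rev[:user]
end
```

Note that this naive scan credits whoever's revision first contains the URL, so a vandalism revert that restores a link would be credited to the reverter; a more careful implementation has to look past such revisions.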

Examples of these communications can be found in these case studies: Case Study: 'Johnny Unitas Stadium',  Case Study: 'Mohammed Ali Hammadi', and  Case Study: 'Sean Kennard'.

 * Collecting Statistics

To demonstrate its effectiveness, the bot will review its edits after 96 hours. It will measure (i) whether the links have been corrected, (ii) whether its edits have been reverted, (iii) whether it has been blocked from the article or from user talk pages, and (iv) total participation on that article. These statistics are publicly tabulated and discussed.

If these statistics suggest the bot is not helpful, or is a burden to the community, I will withdraw this bot approval request at the end of the trial period.

 * Uploading its Source Code

The latest source code is uploaded to the bot's user space at most once per day. UPDATE: Per the suggestion by User:Hellknowz in the discussion section below, I withdraw this task. I will instead find another place to host the bot source code. Blevintron (talk) 18:59, 26 March 2012 (UTC)


 * Additional information
 * I've written a long description of the bot.
 * I publish the source code.

Discussion
I would suggest you limit the number of source code posting edits the bot is making in the userspace. Wikipedia is really not a source code repository. Technically, even if you make notes about some other license for the code, anything you post on the pages is still under the CC-BY-SA 3.0 and GFDL licenses. — HELL KNOWZ  ▎TALK 16:53, 26 March 2012 (UTC)


 * Noted. My intention was simply to make it available, not to use Wikipedia as version control. I will find a new place to publish the source code. Blevintron (talk) 17:25, 26 March 2012 (UTC)

I would also note that the bot's message of "I'm just a bot, so I don't really know how to fix the problem" is not really true. There are at least several bots approved ([1] [2] [3] ([4])) to retrieve archived copies from Internet Archive and WebCite. Placing a dead link instead of archiveurl or Wayback implies the bot has failed to retrieve the archive. — HELL KNOWZ  ▎TALK 17:00, 26 March 2012 (UTC)


 * Even if a bot finds an archived copy from the same date, the bot cannot be certain that the archived copy is exactly the same as the copy referenced by the Wikipedia author. Also, the bot will list lower-confidence archive copies, i.e. those from a nearby date. The "I'm not sure but..." and "I'm just a bot so..." messages are intended to provide a helpful link, but encourage healthy skepticism. But, of course, all of this text is subject to adjustment... Blevintron (talk) 17:24, 26 March 2012 (UTC)


 * I wasn't really commenting on the wording itself. The question I was implying is -- why doesn't the bot retrieve the archives? The community has somewhat come to expect bots to try to retrieve the archive before just tagging dead links. In fact, if this bot notifies about a dead link but another bot repairs it meanwhile, this gets kind of inconsistent. A dead link usually implies bots could not find archived copies, so these links require manual attention or replacement/removal. Hence a dead link is generally accompanied by the bot parameter of who tagged it. — HELL KNOWZ  ▎TALK 17:31, 26 March 2012 (UTC)


 * Okay, I think I understand. First, the bot does set the bot parameter of the dead link tag. Second, the bot does check archive.org, and adds the replacement link to the user message if it's found. If I understand correctly, you are saying that the bot should automatically update the article with the archive copy. I can do that, but how good of a match must the archive copy be? Same day? Within a week? I am reluctant to update the article if the replacement was archived one month before/after the link's access date, since the archived content may differ substantially. What does the community consider to be close enough? Blevintron (talk) 18:54, 26 March 2012 (UTC)


 * Well, I did link a few previous BRFAs, and up to 6 months is acceptable. The closer the better, but some sites don't get updated and thus the archive doesn't get updated either. A month forward and a few backward is OK, there are usually very few false positives (or at least there haven't been any major complaints as far as I know). — HELL KNOWZ  ▎TALK 19:13, 26 March 2012 (UTC)


 * Thanks. I took a look at them. DASHBot proposed +/- 6 months but now only replaces if +/- 1 month. H3llBot no longer lists this as an active task. AnomieBOT proposed any archive older than the access date, but now only does it on demand (ReplaceExternalLinks5). WebCiteBot tries to prevent link rot, not repair it. I can speculate why 6-month windows narrowed, but instead I'm going to ask those authors why they changed their bots. I'll report back what I hear. Blevintron (talk) 19:57, 26 March 2012 (UTC)


 * My last comment was unintentionally misleading. I said 'but now only does it on demand', which suggests that the author had decided to change this feature of AnomieBOT, but I have no evidence of that. Blevintron (talk) 20:22, 26 March 2012 (UTC)


 * And I believe it was misleading about H3llBot too. I mistook an 'inactive' task for 'no longer active'. It appears that H3llBot task 2 is under development. Sorry. Blevintron (talk) 20:32, 26 March 2012 (UTC)

Per User:Hellknowz' suggestion, I have updated the proposal so that the bot will automatically update citation templates when an archive copy can be found +/- N months of access date. I do this in interest of compatibility with other bots. My bot should not mark links as dead if another bot would repair the link. It may take a few days to implement and test this change. Blevintron (talk) 02:35, 28 March 2012 (UTC)

I have a feeling you may be underestimating the number of dead links. I don't have exact data (and I should really get the bot running), but it appears at least 1 in 30 links is dead. When I ran the bot late 2010/early 2011 it tagged over 100k articles with dead links within 3 months (these are the ones it couldn't fix automatically). That only covered a part of the pages we have and mostly worked on citations, and didn't process missing access dates or bare links. Optimistically, we can expect at least that many more tagged. So how many notifications would that make, because 100k pages == 100k notifications? And how many would the same users be getting? — HELL KNOWZ  ▎TALK 12:36, 28 March 2012 (UTC)


 * The number of notifications is less important than: (i) the load that notifications cause on users, and (ii) the load they cause on servers. The bot features several throttling parameters to keep both loads small, including a limit on the number of notifications that any user will receive per day, and prominent opt-out.  We could debate concrete values, but first: do you agree that there exist some limits that won't melt the users and the servers? I hypothesize that a small number of notifications will cause large article improvement, but I intend to demonstrate that with measurements.  Blevintron (talk) 00:54, 29 March 2012 (UTC)


 * Don't worry about the load on servers, you are very unlikely to cause issues. Just use a sensible maxlag and epm rate. However, from past experience, several messages a day to the same user is a lot. You really need to get broader consensus on messaging a lot of users potentially a lot of times before we can trial/approve the bot; perhaps ask for more input on VP and WP:EL/Webcitebot2. Although it would be really interesting to see how messaging users affects repair rates. — HELL KNOWZ  ▎TALK 08:15, 29 March 2012 (UTC)


 * That's a fair criticism. I'll start a thread on VP. Blevintron (talk) 16:12, 29 March 2012 (UTC)


 * Also, some numbers may be interesting. Judging from the data I already have, I think the bot would be able to contact users no more frequently than once a week, while maintaining an edit rate of a couple hundred articles/day.  Blevintron (talk) 16:12, 29 March 2012 (UTC)

I think we could have a little technical, proof-of-concept trial soon. There are a few notes I'd like to list meanwhile:


 * Firstly, please use Dead link as Broken link is a redirect :)


 * Edit summary character limit is 255 (to be precise I think it's 250 or something, I always forget), so the examples of the edit summaries you have under the case studies would exceed this limit and be truncated


 * "limit.. the number of articles edited in one calendar day" -- that's not an issue, as long as the bot doesn't edit too fast.


 * "limit.. the number of links to correct during a single edit (to make reviews simpler)" -- there's a general consensus that bots should do all their tasks at once and not return to the same page several times. While it is sometimes unavoidable due to task complexity, I don't think you should split article edits into several parts. If there are dead links, they all should be marked; there have been pages with 50+ dead links.


 * Could you post the message you intend to deliver to the users? I suggest you create a separate page for this (in bot's userspace probably) so it can be edited by others and is substed/transcluded on messages, like uw-joke1 or something.


 * Dead page detection: current bots all use 404 pages (some use 301 as well, I think). 5xx codes don't necessarily mean dead. DNS lookup errors might be reliable, given enough time to account for propagation. Connection timeouts and refusals are somewhat borderline.


 * Does your bot use a proper referrer/agent string? Some sites tend to ignore/redirect/fail requests with an empty/unknown referrer/agent.


 * Does the bot follow automatic redirects? Some sites return 404 but immediately redirect to a live 200. Similarly, some show a 200 but then redirect to a 404.


 * Finally, the bot should probably respect use dmy dates and use mdy dates for archivedate. While it's not required to respect the citation's date format or field whitespace formatting, it's nice to have. — HELL KNOWZ  ▎TALK 19:14, 29 March 2012 (UTC)


 * Those are all good points. I have fixed some of those, others are still TODO.
 * Referer/User-agent when checking links: yes, the referer is the Wikipedia article URL, and the user-agent looks like Firefox 10 on Linux. (The user agent is honest when contacting Wikipedia, archive.org, or webcitation.org.)
 * Re: 404 pages that also redirect (e.g. via a meta tag) -- I detect that case, but treat them just like any other 404. I don't know the best action for that case. The 404 code is a clear indication that the link is not reliable, but when users visit in a web browser it doesn't look broken.
 * Re: (3xx) redirects -- I detect them, and I don't touch them. It's difficult to write up a good policy for redirects. Consider these counterexamples: de.youtube.com redirects to www.youtube.com because I'm in the USA; nytimes.com articles redirect to a login page; some redirects lead to a 404 page; DOI URLs are intended as permalinks and should remain unchanged. If you have some insight, I'll listen.
 * Re: javascript redirects -- I will never detect them ;)
 * Blevintron (talk) 20:38, 30 March 2012 (UTC)


 * While generic 3xx responses shouldn't be used as indications, a 301 (moved permanently) that doesn't redirect is something that's almost always a dead link. Anyway, personally I treat redirects to 404s as dead links, even if the original page is not 301. And I treat 404s/301s that redirect to 200 pages as not dead. This means, for example, paywalls and content relocations don't trigger dead links, but error pages and special notices do. I haven't had any obvious problems with this any more than with response code misuse on any other page, if that's any indication of reliability. Then again I'm too lazy to manually collect some empirical data... You seem motivated enough to do some actual work :)
 * By the way, I didn't mean javascript redirects, only header redirects. I don't follow meta redirects either, I really don't know if one should. Probably yes, since that would be what happens client-side, but that needs real-life testing. Here's an excerpt from URL redirection: "Meta Refresh with a timeout of 0 seconds is accepted as a 301 permanent redirect by Google".
 * Also, I could check out my bot's contributions and manually collect info on whether and how fast the links got repaired without any user notifications. Maybe that would help you decide on notification frequency/amount details. — HELL KNOWZ  ▎TALK 21:26, 30 March 2012 (UTC)


 * Regarding collecting stats: I think that every bot would benefit from collecting stats. In terms of notification frequency: don't worry about it. My bot has a pretty big pool of pending changes. If the bot schedules its edits in a clever way, it can make progress while keeping notification rates low. So, I think the notification rate should be decided by community opinion, not technical constraints. I need to start a VP thread about that... Blevintron (talk) 13:56, 1 April 2012 (UTC)

I've started a new thread at VP. Also, I added that link to the 'relevant discussions' field in the application above. Blevintron (talk) 15:09, 1 April 2012 (UTC)

Here's a fun and real case to consider about meta redirects. (1) http://www4.ncdc.noaa.gov/cgi-win/wwcgi.dll?wwevent~ShowEvent~494533 redirects to (2) http://www.ncdc.noaa.gov/oa/about/stormdown.html via 0 second meta refresh. Now, (1) is dead -- 404. However, (2) is live -- 200 -- and shows some warning about maintenance. So if the bot doesn't follow the redirect and believes the first 404, it will falsely tag all these links. — HELL KNOWZ  ▎TALK 10:09, 2 April 2012 (UTC)


 * Yes, those are tricky. Over the last few days, I've looked at this case and come to the same conclusion.  New behavior: when my bot encounters a 404+redirect (either via Location header, Refresh header, or Meta-Location, or Meta-Refresh), then it will not modify them.  I also had a wacky idea: can link checking be turned into a GWAP? I digress... Blevintron (talk) 14:14, 2 April 2012 (UTC)
 * I will keep a list of these and see if there are any false positives. Do you have any data you can share? Also, what's a "Meta-Location" redirect? — HELL KNOWZ  ▎TALK 14:30, 2 April 2012 (UTC)


 * I don't know if meta-location redirects occur in practice. As I read the standards, an HTML document may contain <meta http-equiv="x" content="y"> tags in the head section, and those are considered equivalent to an HTTP header x with contents y. Since HTTP has a Location header, used for redirects, I put in a case to detect x=Location. I'm no longer surprised by all the weird things web servers might do... Blevintron (talk) 14:36, 2 April 2012 (UTC)
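The header-equivalence reading described in this thread could be sketched as below. This is illustrative only: header keys are assumed already downcased, real pages vary attribute order and quoting, and these regexes are not robust HTML parsing.

```ruby
# Spot the redirect styles discussed above: a real Location header, a
# <meta http-equiv="refresh" content="0; url=..."> tag, or a
# <meta http-equiv="location"> tag treated as its HTTP header equivalent.
def embedded_redirect(headers, body)
  return headers['location'] if headers['location']
  if body =~ /<meta\s+http-equiv=["']refresh["']\s+content=["']\s*\d+\s*;\s*url=([^"']+)["']/i
    return Regexp.last_match(1)
  end
  if body =~ /<meta\s+http-equiv=["']location["']\s+content=["']([^"']+)["']/i
    return Regexp.last_match(1)
  end
  nil
end
```

Applied to the NOAA example below in this thread, a 404 body carrying a zero-second meta refresh would yield the live target rather than being trusted as dead.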


 * As for data: I could collect some. Right now, I just discard those links that I don't intend to edit. What do you want, URLs that 404 and redirect? Blevintron (talk) 14:36, 2 April 2012 (UTC)


 * Anything that redirects might help in deciding what to do with it. 404 redirects in particular. No biggie though, just thought it may be useful. — HELL KNOWZ  ▎TALK 14:39, 2 April 2012 (UTC)


 * Sure, n/p. I'll collect info about redirects for a day, and give you a copy tomorrow. Blevintron (talk) 14:49, 2 April 2012 (UTC)


 * I have 2000 redirect links for you. I tried to upload them to Wikipedia, but it triggered the spam filters ;) Blevintron (talk) 16:12, 3 April 2012 (UTC)


 * "Your download will begin in 295 seconds". Urgh :) Ok, thanks, I'll see if I can use it. — HELL KNOWZ  ▎TALK 16:23, 3 April 2012 (UTC)

I also wanted to clarify your comments about Use dmy dates vs Use mdy dates: did you mean that the presence of these templates controls how the bot parses dates? Or, does it only control how the bot emits dates into the document? Blevintron (talk) 14:14, 2 April 2012 (UTC)
 * Emits. It specifies what date format the bot probably should use. I say "probably" because there are three sides of the fence. One says bots must follow this. Another says accessdate and archivedate should/can use shorter formats. A third says bots should use whatever the article already uses. Personally, I follow these templates. If none are present, I use the format from accessdate. So far, I haven't had any problems reported. — HELL KNOWZ  ▎TALK 14:24, 2 April 2012 (UTC)
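The convention described here (templates first, then mirror accessdate, then a fallback) could be sketched as follows; the regexes are rough heuristics, not the bot's code, and the ISO fallback is my assumption:

```ruby
# Pick the output format for archivedate: follow {{Use dmy dates}} /
# {{Use mdy dates}} when present, otherwise mirror the accessdate format.
def archivedate_format(wikitext, accessdate = nil)
  return :dmy if wikitext.match?(/\{\{\s*Use dmy dates/i)
  return :mdy if wikitext.match?(/\{\{\s*Use mdy dates/i)
  return :mdy if accessdate&.match?(/\A[A-Z][a-z]+ \d{1,2}, \d{4}\z/)  # "April 2, 2012"
  return :dmy if accessdate&.match?(/\A\d{1,2} [A-Z][a-z]+ \d{4}\z/)   # "2 April 2012"
  :iso  # fall back to YYYY-MM-DD
end
```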


 * ok Blevintron (talk) 14:36, 2 April 2012 (UTC)

Trial
I don't see any major problems with this task. The only potentially controversial bit was the user notification, and the linked discussions so far show no objections. All the technical details have been clarified (extensively, I might add), so let's do a short technical trial, so you can get all the stuff working and it's easier to review what the bot will do with live examples. This can serve as case studies for the VP discussion as well. — HELL KNOWZ  ▎TALK 14:44, 2 April 2012 (UTC)


 * Cool, thanks. Fingers crossed ;) Blevintron (talk) 16:19, 2 April 2012 (UTC)

The trial has completed. Most edits were fine, with three exceptions listed here:


 * 1) There was a bug during the first edit to Pebas District.  This bug prevented the bot from recording local metadata.  Wikipedia was not affected, but as a result, it miscounted its trial period.  Thus, it did 16 edits instead of 15.  Sorry.  I have fixed this bug.
 * 2) There was a bug during the edit to The Bloody Chamber. Specifically, an off-by-one error caused the bot to incorrectly delete the character immediately following a citation template. As a result, the closing </ref> was changed to /ref>. This occurred in one place. I corrected the article, and fixed the bug.
 * 3) In the article Halsey Stevens, the bot misidentified which user added the link.  In particular, it detected that the link was introduced by ClueBot.  Although ClueBot indeed added the link, ClueBot was reverting vandalism.  The real introduction of that link occurred in an earlier revision.  No notification message was sent, since my bot will not send notifications to bot accounts.   I have not fixed this bug yet.
 * I think I fixed this. Blevintron (talk) 15:51, 3 April 2012 (UTC)

In all cases, I manually confirmed that all dead links appear to be broken. No problems.

No notifications were sent. This is because:
 * 1) By luck of the draw, only one action was placed into experiment case (+E+S). All others were placed into (-E-S) or (+E-S), in which notifications are not sent.
 * 2) The one action in case (+E+S) was the edit to Halsey Stevens, listed above.

Blevintron (talk) 19:34, 2 April 2012 (UTC)

Also, there was a bug report about date formats in Whitley Bay High School. Blevintron (talk) 19:48, 2 April 2012 (UTC)
 * I fixed this. Blevintron (talk) 15:51, 3 April 2012 (UTC)

Some issues:


 * First issue is WP:CITEVAR -- you shouldn't be changing referencing style and changing plain links into citations is considered a style change. here you could have used Wayback.


 * I have corrected the article. Blevintron (talk) 20:24, 2 April 2012 (UTC)
 * I have changed this behavior. Blevintron (talk) 20:24, 2 April 2012 (UTC)


 * Current bots don't add yes if the archive parameters are set, because that is the default behavior already.


 * I have changed this behavior. Blevintron (talk) 20:24, 2 April 2012 (UTC)


 * There is no bot field in any citation templates. You can use a comment in one of the archive fields, such as.


 * I'm confused: I just double checked, and every citation, cite web, webcite and wayback that the bot added or modified includes bot. Blevintron (talk) 20:24, 2 April 2012 (UTC)
 * Sorry if I was unclear; I meant the other way around. Citation templates don't implement a bot parameter that the bot adds. It's kind of pointless because lots of bots edit citations and do lots of stuff to them, and the field doesn't really help identify the changes. That's why using bot in wayback or dead link is straightforward, but why we add to citations. It's not actually required and it has been very marginally useful to me personally, but there also isn't any rule against it. — HELL KNOWZ  ▎TALK 20:29, 2 April 2012 (UTC)
 * I've changed this behavior. For completeness, I should note that WebCite and Wayback  also do not list bot in their documentation. Blevintron (talk) 21:47, 2 April 2012 (UTC)
 * No, they don't, perhaps I should've clarified. Dead link mentions it, but the field doesn't do anything. Its only use currently is in markup for identification. — HELL KNOWZ  ▎TALK 21:56, 2 April 2012 (UTC)


 * Oh, I misread your comment. Now I understand: there should not be a bot parameter for any of the citation templates. Blevintron (talk) 20:29, 2 April 2012 (UTC)
 * I got an auto-resolved edit-conflict as I posted that... — HELL KNOWZ  ▎TALK 20:30, 2 April 2012 (UTC)


 * Adding access date. There is no current obvious consensus to do so, and I have asked about this on VP before: here and here. Additionally, it duplicates a manually written access date: "Retrieved April 11, 2008."


 * I have changed this behavior. Blevintron (talk) 20:24, 2 April 2012 (UTC)


 * The bot seems to change whitespace around existing field names.


 * Noted. I will work on this.
 * I fixed this. Blevintron (talk) 15:51, 3 April 2012 (UTC)


 * You already mentioned removed characters, but here, for example, more than one case happened.


 * Nice catch, thank you. I've corrected that article. Blevintron (talk) 20:24, 2 April 2012 (UTC)


 * As a minor note, I would say that the date is not really necessary in the comment, but I also don't have anything against it if it is useful for you, although adding too much "clutter" can be seen negatively (from experience). — HELL KNOWZ  ▎TALK 19:53, 2 April 2012 (UTC)


 * I'm not sure what date you are referring to. Blevintron (talk) 20:24, 2 April 2012 (UTC)
 * The one in BlevintronBot/2012-04-02. Again, no biggie, just thought I'd mention. — HELL KNOWZ  ▎TALK 20:29, 2 April 2012 (UTC)
 * I've changed this behavior Blevintron (talk) 21:47, 2 April 2012 (UTC)

A larger test. The numbers are arbitrary/approximate, so don't worry about those or going over the limit a bit. — HELL KNOWZ  ▎TALK 21:57, 2 April 2012 (UTC)


 * Thanks. I wanna do some more testing of these date format issues first.  The bot will probably start editing again tomorrow. Blevintron (talk) 22:07, 2 April 2012 (UTC)


 * A common practice is to subst: the template messages ({{subst:template}}) on user talk pages, so that future template changes don't affect past messages, and to make them readable in markup and for new users. I have a couple of other suggestions for the user message, but this probably isn't worth adding even more walls of text to a BRFA no other BAGgers are attempting to comment on.


 * That's a good idea. I was unaware of that feature, but I'm going to make it happen. Blevintron (talk) 21:01, 3 April 2012 (UTC)


 * You check links four times, very thorough :) I'd say 2 is enough, but I won't stop you from being careful. Although it's hard to imagine you would get many false positives if you didn't do the intermediary checks. Out of curiosity, do you have any data on this? — HELL KNOWZ  ▎TALK 20:24, 3 April 2012 (UTC)


 * No data on the number of trials. I worry more about time between checks.  Any page could be down for a day or two; five days seems permanent to me.
 * The bot checks at least 3 times over 5 days. If links wait in the backlog too long, the bot checks them again before edits.  It's useful from an engineering perspective: it ensures that the bot applies the latest definition of 'dead link', even when I change a lot of code.
 * There have been a few opinions on what the time should be, mostly without any empirical evidence. 5 seems like more or less enough nowadays. I live on the edge, so I use 3. ^-^ Anyway, for BRFA purposes this is more than enough and you can obviously tweak and adjust as you see fit. — HELL KNOWZ  ▎TALK 21:10, 3 April 2012 (UTC)

One oddity during the trial: in Sir Walter Raleigh Hotel, my bot marked a link with Dead link. It turns out, that link was already marked with NRIS dead link. That adds pages to Category:All NRHP articles with dead external links, which is not a sub-category of Category:Articles with dead external links. I'm not really sure what the ideal behavior is, but it seems that adding Dead link is not redundant, since the page categories are disjoint. Blevintron (talk) 21:01, 3 April 2012 (UTC)
 * Pff, another exception... hopefully that's the only one. My guess is to only add one, and treat this like dead link. Then again... — HELL KNOWZ  ▎TALK 21:05, 3 April 2012 (UTC)

I checked the edits; they look fine. Unfortunately, I don't see any user having repaired any links. — HELL KNOWZ  ▎TALK 13:11, 5 April 2012 (UTC)

So far, there is no indication that notifications are effective, but the sample size is also very small. Two of the notified users are very inactive (User:Kumarajiva, last edit May 2010; User:Glasstowerpress, last edit June 2010). Four are marginally inactive (User:Shudde, last edit January 2012; User:Dcmacnut, last edit February 2012; User:Dickeybird and User:Sadads, last edit March 2012). The others (User:Deinocheirus, User:Bwmoll3, User:Calistemon, User:Arsenikk) are recently active but have not acted on the notification.

There is no indication that any were bothered by the notifications (none have opted-out, no bug reports, and I've received no communication from them).

I think two things could be improved:
 * Better notification message (still a work in progress).
 * Larger sample size.

If BAG is willing, I would like to do a larger trial run over the weekend. Blevintron (talk) 15:38, 6 April 2012 (UTC)

A couple editors on VP suggested pinging only users who recently edited. Do you check for recent user activity before notifying? — HELL KNOWZ  ▎TALK 16:15, 6 April 2012 (UTC)


 * I don't yet; I could: maybe users who contributed in the last month...
 * That may improve my per-notification metrics. It won't improve the overall dead link repair rate.
 * But fewer notifications == less load on servers, so I probably should implement that. Blevintron (talk) 16:43, 6 April 2012 (UTC)
 * Again, you are unlikely to cause major load on the servers, so don't worry about performance. Anyway, if skipping inactive users doesn't improve dead link repair, then there's no real reason to bother. No point notifying long-gone users, whose pages are generally long lists of existing bot messages. 1 month is very conservative though, and you can surely go up to 6+. Anyway, not a biggie; do what you think would yield the most results and fewest unseen messages. — HELL KNOWZ  ▎TALK 16:52, 6 April 2012 (UTC)

Larger trial
OK, more samples, plus whatever article edits will happen. — HELL KNOWZ  ▎TALK 16:52, 6 April 2012 (UTC)

The bot has finished editing. 236 articles / 40 notifications. Preliminary results are much better than last time. It might take a while. Blevintron (talk) 16:32, 7 April 2012 (UTC)

Here are a few cases where the bot notified >1 person: (wall of text redacted) Is that intentional? — HELL KNOWZ  ▎TALK 16:39, 7 April 2012 (UTC)


 * Yes that is intentional. In those cases, multiple users contributed distinct links that have died.  Each user was only notified about the links they contributed. Blevintron (talk) 16:49, 7 April 2012 (UTC)


 * I see, I somehow thought it was the same link; I must have looked at the start of the diff and not the link... My bad. — HELL KNOWZ  ▎TALK 16:51, 7 April 2012 (UTC)

Very preliminary results. Over the last 16 hours,

...Users who received notification fixed one or more dead links in six articles:
 * User:Brianboulton fixed Clements Markham
 * User:Mjroots fixed List of windmills in Loire-Atlantique
 * User:Stunteltje fixed IMO ship identification number
 * User:Geschichte fixed Rolf Hansen (athlete)
 * User:Big iron fixed John Cawthra
 * User:DanTD fixed List of paved Florida bike trails

...One case is an 'almost fix':
 * User:NE2 did not edit Interstate 215 (California), but instead replied to the notification with a replacement link.

...One possible annoyance:
 * User:DAJF reverted the notification, but did not opt-out with Bots.

...One definite false positive:
 * Cantor set: the link is not dead. I am investigating.


 * This one returns a 500 response code. — HELL KNOWZ  ▎TALK 16:59, 7 April 2012 (UTC)

...Reported bugs:
 * User:E8 reported that the bot failed to find an archive copy for Wave power. This is expected behavior: the archive copy was more than 6 months from the access date. (The bot had not contacted E8.)

Blevintron (talk) 16:54, 7 April 2012 (UTC)

I'll look through the edits at a later time. For now, let's wait a while for feedback. Also, does the bot notify about >1 dead link added by the same person? — HELL KNOWZ  ▎TALK 16:57, 7 April 2012 (UTC)


 * Yes, here is an example. There are minor variations, depending on whether all those links were added in one revision or over several... Blevintron (talk) 17:00, 7 April 2012 (UTC)

I think you forgot to explicitly mention that "you" are a bot in the user messages. — HELL KNOWZ  ▎TALK 09:36, 8 April 2012 (UTC)


 * I didn't know it was required. Neither DPL bot nor SineBot says 'I am a bot' in its notifications.
 * But if it's a problem, I will make it explicit. Blevintron (talk) 13:33, 8 April 2012 (UTC)
 * I don't think it's required, but seeing as at least a couple of users responded to the message, maybe it is beneficial. Your call, just throwing out suggestions. — HELL KNOWZ  ▎TALK 13:36, 8 April 2012 (UTC)

I reviewed all edits. I observed these problems and fixed the articles:
 * False positives in link detection: Cantor set: the link returns a 500 but renders content; Open science data: one of the three links is okay; Gay Flag of South Africa: the link is a borderline timeout.
 * Broken edits: Marian art in the Catholic Church: the bot placed Wayback in an ambiguous location in a citation that contains several links; Parkview High School (Orfordville, Wisconsin): the bot placed Wayback inside a URL; Rana Gurjeet Singh: the bot placed Dead link in a position that broke the layout of Infobox Indian politician.

Data is available for last week's edits. The highlights:
 * Notification rate: The bot sends about 4 notifications per 10 edits on average.
 * Participation: About 1 in 5 notified users contribute to the article within a week.
 * Annoyance metrics: the bot was not blocked (via Bots) from any article or any User talk page, though one user reverted the notification.
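For context, the opt-out mechanism referred to here is the standard {{bots}}/{{nobots}} exclusion convention. A simplified sketch of such a check in Ruby (not the bot's actual code; the real templates support more parameters such as allow= and optout=):

```ruby
# Simplified exclusion-compliance check: returns true if the page's wikitext
# asks this bot (or all bots) to stay away. Real {{bots}} usage supports more
# parameters than this sketch handles.
def excluded?(wikitext, bot_name)
  return true if wikitext =~ /\{\{nobots\}\}/i
  if wikitext =~ /\{\{bots\s*\|\s*deny\s*=\s*([^}|]*)\}\}/i
    denied = $1.split(',').map { |s| s.strip.downcase }
    return denied.include?('all') || denied.include?(bot_name.downcase)
  end
  false
end

excluded?('{{bots|deny=BlevintronBot}}', 'BlevintronBot')  # => true
excluded?('{{bots|deny=OtherBot}}', 'BlevintronBot')       # => false
```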

The link improvement metric shows a big difference between the three cases:
 * Control: 0% of the dead links improve after a week.
 * No notifications: 42% improve after a week.
 * With notifications: 58% improve after a week.

This is misleading. Most of the improvement is due to the archive URLs that the bot automatically finds and adds to the articles. Comparing the archive rate with the mark-dead rate shows that about 0%, 42%, and 56% of links, respectively, were archived by the bot in those cases. So the improvement attributable to notifications is probably closer to 2%.
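The subtraction can be made explicit (figures from the results above; the method name is mine):

```ruby
# Residual improvement: the share of dead-link improvement NOT explained by
# the bot's own archive-URL edits (all figures in percent).
def residual_improvement(improved_pct, bot_archived_pct)
  improved_pct - bot_archived_pct
end

# With notifications: 58% improved, ~56% archived by the bot itself.
residual_improvement(58, 56)  # => 2
# Without notifications: 42% improved, 42% archived by the bot.
residual_improvement(42, 42)  # => 0
```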

Conclusions: The false positive and broken edit rate is still too high for deployment. The experiment suggests that notifications do not annoy most users. Notifications have a small, positive effect on dead link remediation.

My initial hypothesis was that notifications would have a large effect. I have invalidated this hypothesis, and now see no benefit of this bot over other dead link bots. I withdraw this BRFA. Blevintron (talk) 14:05, 14 April 2012 (UTC)
 * 2% over millions of edits doesn't seem like a trivial improvement... Headbomb {talk / contribs / physics / books} 14:40, 14 April 2012 (UTC)
 * Indeed. And even if you don't want to post notifications, dead link marking and archiving are highly useful tasks. We have millions of articles and most have external links, so this is a task where even a dozen bots would struggle. Given its complexity and the fact that you've worked out 90% of the issues, I suggest you still run it. — HELL KNOWZ  ▎TALK 14:51, 14 April 2012 (UTC)


 * Ok, I'll continue the BRFA. I suppose I have to fix those bugs now ;) Blevintron (talk) 15:17, 14 April 2012 (UTC)
 * Or maybe the messaging condition could be tweaked to wait a week after the dead links tagging? Headbomb {talk / contribs / physics / books} 15:41, 14 April 2012 (UTC)
 * That's an interesting idea. I'd have to think about how to guarantee the notification-per-user rates... Blevintron (talk) 16:09, 14 April 2012 (UTC)


 * Yes, please! Don't stop with your bot. This can be another useful bot similar to DPL bot. mabdul 19:17, 14 April 2012 (UTC)

I've fixed several editing bugs and false-positive dead links. I've tweaked the notification messages to sound less human. I think I'll be ready for another trial this weekend. Blevintron (talk) 14:57, 19 April 2012 (UTC)

(plus whatever notifications). — HELL KNOWZ  ▎TALK 16:26, 19 April 2012 (UTC)

The trial was largely good. There were two classes of bugs, both due to mis-parsing links in wikitext.
 * Trailing right parenthesis at the end of a URL.
 * Missing space between a URL and its title in a bracketed external link.

Statistics tell more or less the same story. There were no bug reports or complaints. One user has opted out of notifications from this (and several other) bots.

I've corrected the affected articles, where appropriate (some of those links are broken even if correctly parsed).

I read the MediaWiki source code to figure out how Wikipedia deals with trailing parentheses, and fixed my bot to parse them the same way.
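Roughly, MediaWiki keeps a trailing ')' on a free external link only when it closes a '(' inside the URL. A minimal sketch of that rule (illustrative only, not MediaWiki's or the bot's actual code):

```ruby
# Trim trailing right parentheses from a URL unless they are balanced by an
# opening parenthesis inside the URL, mimicking MediaWiki's free-link rule.
def trim_trailing_paren(url)
  while url.end_with?(')') && url.count('(') < url.count(')')
    url = url[0..-2]
  end
  url
end

trim_trailing_paren('http://example.com/page)')      # ')' trimmed: no matching '('
trim_trailing_paren('http://example.com/Foo_(bar)')  # ')' kept: closes the '('
```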

Summary: good progress but more to do. Blevintron (talk) 15:57, 30 April 2012 (UTC)


 * A second user has opted-out of notifications. This did not appear in the stats because of the funny use of nowiki. Blevintron (talk) 01:37, 4 May 2012 (UTC)

I've fixed those bugs, found and fixed a few more. I've studied the bot's offline edits and prepared for the next bug before it happens. I've improved the edit rate and decreased (per edit) bandwidth usage. Finally, I have some tools to help me review larger trials more quickly. I'm ready for another trial if you have the patience. Blevintron (talk) 00:14, 4 May 2012 (UTC)

BAG assistance needed Blevintron (talk) 15:28, 5 May 2012 (UTC)

Trial 5 May
— HELL KNOWZ  ▎TALK 15:58, 5 May 2012 (UTC)

248 edits were good; the two bad edits below make an overall bad edit rate of 0.8% for this trial.

Bad edit 1: Jarrett Bellini:
 * The bot marked two links as dead, but they were not.
 * Tech details: some virtual hosts are sensitive to the case of the host name given in the HTTP Host header.
 * The article included capital letters in the host name,
 * my bot did not downcase the host name, so it sent 'Host: www.JarrettBellini.com',
 * the server is only configured to recognize 'Host: www.jarrettbellini.com', and
 * the server reported 404.
 * I have fixed the article and the bot's code.
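The fix amounts to normalizing the host before issuing the request. A minimal sketch, assuming the bot builds requests from `URI` objects (names are illustrative):

```ruby
require 'uri'

# DNS names are case-insensitive, but some virtual-host configurations match
# the HTTP Host header literally, so lowercase the host before the request.
def normalized_uri(raw_url)
  uri = URI.parse(raw_url)
  uri.host = uri.host.downcase if uri.host
  uri
end

normalized_uri('http://www.JarrettBellini.com/page.html').host
# => "www.jarrettbellini.com"
```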

Bad edit 2: Revival Centres International Blevintron (talk) 16:21, 6 May 2012 (UTC)
 * Wikipedia parses: http://aps.webjournals.org/default.asp?id={D78783D5-CCB1-46C0-A7EE-628757FBF743} but
 * the bot parsed only: http://aps.webjournals.org/default.asp?id=
 * (Note: neither RFC 1738 nor RFC 3986 allows curly braces in a URL's query.)
 * The bot placed Dead link in the middle of the URL.
 * I have fixed the article. (The link is dead under either parse.)
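The mismatch can be illustrated with a strict RFC 3986 scanner, which stops at the first character outside the spec's allowed set (curly braces are not in it), while MediaWiki's link parser is more permissive. This is an illustrative sketch, not the bot's code:

```ruby
# Characters permitted in a URL by RFC 3986 (unreserved, reserved, and '%'
# for percent-encoding). '{' and '}' are absent, so a strict scanner stops
# at the first curly brace.
STRICT_URL_CHARS = %r{\A[A-Za-z0-9\-._~:/?\#\[\]@!$&'()*+,;=%]+}

def strict_url_prefix(text)
  text[STRICT_URL_CHARS]
end

strict_url_prefix('http://aps.webjournals.org/default.asp?id={D78783D5}')
# => "http://aps.webjournals.org/default.asp?id="
```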

Need to suspend this BRfA
tl;dr: I'd like to withdraw this BRfA for the moment, with the intention of returning to it later.

Here's the story, Blevintron (talk) 22:28, 30 May 2012 (UTC)
 * 1) I've tested >700 edits in my userspace (in addition to >700 in the article namespace), and have a good understanding of what I need to do next.
 * 2) Specifically, I need to replace the parser to address several corner cases.
 * 3) It will not be fun code to write, and so I've been dragging my heels...
 * 4) After I get that done, I expect a bad-edit rate of about 0.2%.
 * 5) That rate would include mis-identification of dead links as well as edits which don't render correctly (e.g. because of weird templates).
 * 6) Work is sending me to Asia and then to Europe.  I won't have much of a chance to write that code for at least a month.
 * 7) I still want to finish the bot, but it won't be soon.
 * So, yeah, I'm letting you know so you don't think I've forgotten about it.
 * Feel free to close this BRfA, or leave it open... whatever is most appropriate. I won't be offended and will re-open if necessary.

Per above -- Chris 03:40, 2 June 2012 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.