Wikipedia:Bots/Requests for approval/DASHBot 11


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

DASHBot 11
Operator:

Automatic or Manually assisted: Automatic

Programming language(s): Python

Source code available: I need to figure out SVN

Function overview: Find suitable archived copies for dead links on the Internet Archive

Links to relevant discussions (where appropriate): I have to find them all, but it's there.

Edit period(s): Every Night

Estimated number of pages affected: N/A

Exclusion compliant (Y/N): Yes

Already has a bot flag (Y/N): Yes

Function details:

* Some pages return 404 on the first try because their disks are spinning up.
** I have asked for permission to query wikiblame and am waiting for a reply.
*** The people at the Internet Archive told me I could do this provided I use an identifiable user-agent (with email and such).
 * 1) Get all urls that are used between ref tags.
 * 2) Find those that return error 404 (if they do, test them again 5 seconds later) *
 * 3) Check to see if they are associated with an |accessdate. If so, skip to step 5.
 * 4) Query wikiblame, Find approximate date of insertion. **
 * 5) Check the Internet Archive for any archived copy within 6 months (either direction) of our date. ***
 * 6) If so, update all references using the url (with |archiveurl if possible, otherwise Wayback)
 * 7) Tag all non-fixed urls with dead link
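
Step 5 can be illustrated with a minimal sketch. It is not the bot's actual code: it assumes a list of Wayback Machine capture timestamps (14-digit YYYYMMDDhhmmss strings) has already been fetched, approximates "6 months" as 183 days, and uses the hypothetical names pick_snapshot and wayback_url.

```python
from datetime import datetime, timedelta

# Assumption: "6 months" is approximated as 183 days in either direction.
WINDOW = timedelta(days=183)


def pick_snapshot(timestamps, target, window=WINDOW):
    """Return the capture timestamp closest to `target` within `window`, or None."""
    best = None
    best_gap = None
    for ts in timestamps:
        captured = datetime.strptime(ts, "%Y%m%d%H%M%S")
        gap = abs(captured - target)
        if gap <= window and (best_gap is None or gap < best_gap):
            best, best_gap = ts, gap
    return best


def wayback_url(timestamp, url):
    """Build the archive link for a given capture of `url`."""
    return "https://web.archive.org/web/{}/{}".format(timestamp, url)
```

A capture that falls outside the window is ignored even if it is the only one available, which keeps the bot on the timid side.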

Discussion
I did testing (under my own account) in my user space, and did one little edit in the real world to make sure everything worked. Tim 1357  talk  02:37, 3 May 2010 (UTC)
 * Looks helpful. Why "Archive automatically found by DASBHBot"? Josh Parris 03:16, 3 May 2010 (UTC)
 * What do you think would be a better note to leave (if any). Tim  1357  talk  03:17, 3 May 2010 (UTC)
 * Please provide links to consensus discussions. Josh Parris 03:29, 3 May 2010 (UTC)
 * When dead link is inserted there seems to be a trailing space on the preceding citation template. How come? Josh Parris 03:32, 3 May 2010 (UTC)
 * I think "Bot:Fixing dead links" is a more explicit edit summary. If you were bored you could include a count of how many were fixed. If dead link is used, the summary should also include something like "marking dead links"; this also could include a count. Josh Parris 03:37, 3 May 2010 (UTC)

Needs to be more than a day that you wait. More like a week or so. So you'll need to store the dead URLs for that period of time. Shouldn't be too difficult. --MZMcBride (talk) 04:20, 3 May 2010 (UTC)
 * I agree. I'll have a bot pre-parse for dead links beforehand. Then, in 5 days, go back to the article and test those dead links again. Tim  1357  talk  03:30, 6 May 2010 (UTC)
 * Isn't there a script built into pywikipediabot that does this? CrimsonBlue (talk) 04:03, 6 May 2010 (UTC)
 * There is a script in pywikipedia that scans the links and creates reports on talk pages of the dead links. I believe it can be set to include a link to the internet archive. Tim  1357  talk  10:27, 6 May 2010 (UTC)


 * Aside from the misspelling of DASHBot, have you considered using the API parse function on each revision, or even better, API-exporting large numbers of revisions and scanning for external links yourself, instead of using Wikiblame? Not only does Wikiblame run on another server, it is also very slow. And what will you do if an editor made 500 edits to the same page in one week? That sometimes will happen. Perhaps the bot should make a good guess which archive it is (for example, the newest few, disregarding "not found" messages like the ones from news sites) if the article has a long history? Remember that the Wayback system's archives frequently do not work (as in "failed to connect to our server" and others), so those probably should be ignored as valid archives (can be identified by the img code for the Internet Archive logo, not sure if it returns an HTTP error code or anything). PleaseStand (talk) 11:17, 6 May 2010 (UTC)
 * To answer one of your questions: yes, the bot checks to make sure the archive works before adding it to the article. Tim  1357  talk  22:50, 6 May 2010 (UTC)
 * I ran the parse sequence last night on some 70 articles. That means I will be ready for a test in 5 days. Thanks Tim  1357  talk  16:02, 9 May 2010 (UTC)
 * My thread at WP:VPR did not have any objections. Tim  1357  talk  17:37, 9 May 2010 (UTC)
 * What of the other questions? Josh Parris 09:49, 10 May 2010 (UTC)

I'm sure that you will work to improve efficiency when you get a handle on where the bottlenecks are. Do you have a way of measuring where the bottlenecks are?

Have you figured out Subversion yet? Try http://svnbook.red-bean.com/en/1.0/svn-book.html

What technique will you use to select the pages to operate on? Do you have a target edit rate for the bot? Josh Parris 09:49, 10 May 2010 (UTC)
 * Sure. To answer PleaseStand's question: I use wikiblame because it is something that already exists and is probably more bandwidth efficient (for me at least). However, I wrote something that uses Special:Export to get the accessdate. I believe, however, that it can only parse the latest 1000 revisions.  Tim  1357  talk  01:01, 11 May 2010 (UTC)


 * Josh, I have not done any formal testing on the matter, but I can tell that wikiblame is the slowest cog in my bot. Following behind that is the query to the Internet Archive. Tim  1357  talk  01:01, 11 May 2010 (UTC)


 * Again, Josh. I will use a list of pages that Dispenser generates with his tool checklinks. After I do those, well, I'll cross that page when I get to it.


 * I changed my method of storing dead links from a simple dictionary, (which is memory intensive and not so safe) to a SQL database table. My plan is to build a map of all urls on wikipedia, dead or alive, and check urls the minimum amount I have to. Josh, I know you are good with SQL so you might be able to help me with a method of finding articles with the most dead links, using my database. Tim  1357  talk  20:20, 13 May 2010 (UTC)
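
As a rough illustration of the kind of query asked about above, here is a sketch using Python's sqlite3 module. The table layout (links with article, url, status, checked columns) and the function names are assumptions made for the example, not the bot's real schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of checked urls (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links (article TEXT, url TEXT, status INTEGER, checked TEXT)"
    )
    rows = [
        ("Article A", "http://example.com/1", 404, "2010-05-13"),
        ("Article A", "http://example.com/2", 200, "2010-05-13"),
        ("Article B", "http://example.com/3", 404, "2010-05-13"),
        ("Article B", "http://example.com/4", 404, "2010-05-13"),
    ]
    conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?)", rows)
    return conn


def articles_by_dead_links(conn, limit=10):
    """Return (article, dead_count) pairs, most dead links first."""
    cur = conn.execute(
        "SELECT article, COUNT(*) AS dead FROM links "
        "WHERE status = 404 GROUP BY article "
        "ORDER BY dead DESC, article LIMIT ?",
        (limit,),
    )
    return cur.fetchall()
```

Grouping by article and counting only the rows with status 404 gives the worst pages first, so the bot could work through them in that order.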

At #1, you restrict the choice of urls to check, to those between ref tags. Is there some reason for this, or could the bot also check out, named and http://www.nakedurls.org too? Also, to reduce false deadlink tripping on momentary server downtime, you might consider checking the google cache for its timestamp if not its content. LeadSongDog come howl!  17:09, 13 May 2010 (UTC)


 * The bot only checks urls that are in reference tags because urls outside of references are not fit to have links to archives. For example, while this section does not follow the manual of style, it still happens: "Foo bar, baz, spam spam spam lorem foo" should not be replaced with "Foo bar, baz, spam spam spam lorem foo".


 * Good idea about the Google cache. I just discovered it yesterday. I am wary, however, of making large-scale calls to their cache without Google's consent first. Tim  1357  talk  20:20, 13 May 2010 (UTC)
 * I get your point. I was mainly thinking of Cite used as references. When they rot they are the worst-case for loss of ref information, with no backup title, author, etc, to work from. Accordingly they arguably should be the highest priority for fixing when they go dead, though of course it would be much better to flesh them out in advance of that event. I would think that the pattern . could safely be replaced by archived ref or some such, until human editors can follow up. Too problematic? LeadSongDog come howl!  21:44, 13 May 2010 (UTC)

What will happen when run against Why Is Sex Fun? Josh Parris 09:59, 17 May 2010 (UTC)
 * The bot will skip the page because there are no External Links used within references. Tim  1357  talk  10:46, 17 May 2010 (UTC)
 * To clarify, when I say "between two ref tags" I mean used anywhere between two ref tags. That includes all citation templates such as Cite Web. Tim  1357  talk  10:55, 17 May 2010 (UTC)
 * Ah crap. I thought that was a reference, not an external link.  It won't 404 anyway, it will just ask for log-in details.  Okay, moving on... Josh Parris 11:08, 17 May 2010 (UTC)
 * Yep, anything except for 404 is considered alive. Better have the bot be too timid about messing with links than to be over-ambitious with archiving. Tim  1357  talk  22:42, 17 May 2010 (UTC)
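
The "only 404 counts as dead" rule, together with the 5-second retry from the function details, could look roughly like the sketch below. It is an illustration under assumptions, not the bot's code: the user-agent string and the names fetch_status and is_dead are hypothetical, and the later multi-day recheck is not shown.

```python
import time
import urllib.error
import urllib.request

# Hypothetical identifiable user-agent with a contact address.
USER_AGENT = "ExampleDeadLinkBot/0.1 (operator@example.org)"


def fetch_status(url, timeout=30):
    """Return the HTTP status code for `url`, or None if no HTTP reply arrived."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError):
        return None  # timeouts, DNS failures and the like are not a 404


def is_dead(url, fetch=fetch_status, retry_delay=5, sleep=time.sleep):
    """Treat a url as dead only if it returns 404 twice, `retry_delay` seconds apart."""
    if fetch(url) != 404:
        return False
    sleep(retry_delay)
    return fetch(url) == 404
```

Passing fetch and sleep as parameters lets the retry logic be exercised without touching the network.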

Is Wayback's star "*" notation for changed revisions reliable enough to use links outside the 6 month window? — Hellknowz ▎talk 15:55, 18 May 2010 (UTC)

Trial
Let's see the bot in action on a larger sample set. Josh Parris 02:41, 18 May 2010 (UTC)
 * Has the trial been undertaken? Josh Parris 11:15, 25 May 2010 (UTC)
 * My first successful edit. Keep in mind that I had 'remove duplicate references' turned on, which seems to have done more harm than good. Because of this, I turned that part of the bot off. Additionally, I switched the bot to use only Special:Export to find the insertion dates, as I found I was relying too much on wikiblame. Tim  1357  talk  20:21, 30 May 2010 (UTC)
 * 32 edits. Sorry I went a bit over, I kept going until I was confident I had worked out all the bugs. The more recent edits are better representatives of the bot's ability. Tim  1357  talk  04:05, 31 May 2010 (UTC)
 * So which edits ought I ignore? Josh Parris 04:09, 31 May 2010 (UTC)

Well, let me expand a bit. The Internet Archive acts weirdly when you give it a date range. There is an html comment on the pages that is supposed to give the exact archive date/time. However, this date/time is incorrect and changes each time one does the query. I made a work-around so that the date is more reliable, and went back and re-did the pages I had already tested on. For that reason, I'd say pay attention to the edits that are still marked as (top). Tim 1357  talk  04:19, 31 May 2010 (UTC)
 * Date-determination

Perhaps the thing I most struggled with is determining the accessdate of a URL. For that reason, I thought it'd be nice to expand on how I go about determining the date of insertion.


 * 1) Look for an |accessdate, or use regex to find a string like "Accessed on ".
 * 2) Separate this into three slices, candidates for year, month, and day.
 * 3) The 4 digit number is the year, obviously.
 * 4) If there is a named month, then obviously that is the month, and the other 1-2 digit number is the day.
 * 5) If one of the numbers is above 12, then it is obviously the day and the other is obviously the month.
 * 6) If there is still uncertainty after this, the bot assumes the first 1-2 digit number is the month, and the second 1-2 digit number is the day.
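
The steps above can be sketched roughly as follows. This is an illustration only, assuming the date string has already been extracted from the |accessdate or "Accessed on" text; the function name parse_accessdate and the month table are made up for the example.

```python
import re
from datetime import date

_NAMES = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
# Accept full month names and their three-letter abbreviations.
MONTHS = {name: i for i, name in enumerate(_NAMES, start=1)}
MONTHS.update({name[:3]: i for i, name in enumerate(_NAMES, start=1)})


def parse_accessdate(text):
    """Guess a date from free-form text; return None if no date can be built."""
    year = month = None
    small = []  # candidate 1-2 digit numbers, in order of appearance
    for tok in re.findall(r"[A-Za-z]+|\d+", text):
        if tok.isdigit():
            if len(tok) == 4 and year is None:
                year = int(tok)  # the 4 digit number is the year
            elif len(tok) <= 2:
                small.append(int(tok))
        elif month is None:
            month = MONTHS.get(tok.lower())  # a named month settles the month
    if year is None:
        return None
    if month is not None:
        if not small:
            return None
        day = small[0]
    else:
        if len(small) < 2:
            return None
        first, second = small[0], small[1]
        if first > 12:
            day, month = first, second  # a number above 12 must be the day
        else:
            month, day = first, second  # otherwise assume month comes first
    try:
        return date(year, month, day)
    except ValueError:
        return None  # e.g. month 13 or day 32
```

With this ordering an ISO date such as 2010-05-03 and a US-style 05/03/2010 both come out as 3 May 2010, while 25/12/2009 is read as 25 December.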

However, if there is no available accessdate associated with the url, then it scans the article's recent history (1000 revisions) to find the closest date of insertion. Tim 1357  talk  04:19, 31 May 2010 (UTC)
 * Another technique you could use is to look at the date of the edit inserting the reference, the accessdate will be similar to that. Josh Parris 08:25, 31 May 2010 (UTC)
 * Yep, I do that (see the line above). Tim  1357  talk  15:23, 31 May 2010 (UTC)
 * Do you deal with vandalism? For example, user removes the content then spends 5 edits randomly posting lolcat pictures. Finally, someone restores the content. I assume you scan the revisions from oldest to newest so this shouldn't be an issue. Also, doesn't full revision retrieval take forever? I don't know about export, but API doesn't let downloading too many revision at a time if the page is large. — Hellknowz ▎talk 15:49, 31 May 2010 (UTC)


 * Yep, because it scans from old to new, that shouldn't be a problem. Special:Export is slow, but it's not so bad. I'm in no hurry and I'm not paying for the bandwidth :). Tim  1357  talk  02:20, 3 June 2010 (UTC)
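
To show why an old-to-new scan is not thrown off by later blanking and restoration, here is a minimal sketch. It assumes the revisions have already been downloaded and parsed into (timestamp, wikitext) pairs ordered oldest first; the name first_insertion is hypothetical.

```python
def first_insertion(revisions, url):
    """Return (index, timestamp) of the oldest revision containing `url`, or None.

    `revisions` is an iterable of (timestamp, wikitext) pairs, oldest first.
    """
    for index, (timestamp, text) in enumerate(revisions):
        if url in text:
            return index, timestamp
    return None
```

Because the scan stops at the first hit, a later removal and restoration of the same url has no effect on the result. An index of 0 means the url was already present in the oldest revision scanned.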

the URL linked in this edit to England national football team manager is a 404 (sorta). Do you have a mechanism to check if any of the other edits linked to not-helpful archives like this one? Josh Parris 08:25, 31 May 2010 (UTC)
 * Oh, I have since turned the WebCitation archive checker off, because their service is so spotty (long outages, wonky server responses, etc.). Maybe when they stabilize, I will turn the feature back on, but for the time being, it's only the Internet Archive. Tim  1357  talk  15:23, 31 May 2010 (UTC)

This edit claims genfixes; none are made. Josh Parris 08:33, 31 May 2010 (UTC)
 * It does, it removes whitespace on line 172. Tim  1357  talk  15:23, 31 May 2010 (UTC)

This edit doesn't mention marking dead links; perhaps Found archives for 5 of 17 dead links? Josh Parris 09:06, 31 May 2010 (UTC)
 * Wow, that's actually in my code, but I never really noticed that it wasn't working. I'll fix it. Tim  1357  talk  15:23, 31 May 2010 (UTC)
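
One possible shape for a summary that carries both counts, pieced together from the wording suggested in this discussion, is sketched below. The function name edit_summary and the exact phrasing are assumptions, not what the bot actually emits.

```python
def edit_summary(found, dead):
    """Build an edit summary from the number of archives found and dead links seen."""
    parts = ["Bot: Fixing dead links"]
    if dead:
        parts.append("found archives for {} of {} dead links".format(found, dead))
    unfixed = dead - found
    if unfixed > 0:
        parts.append("marking {} dead link(s)".format(unfixed))
    return "; ".join(parts)
```

The "marking" clause is only added when some dead links were left without an archive.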

As a general comment, it would be nice if the bot could explain a bit more in the summary, maybe give a link to the task description page. — Hellknowz ▎talk 15:49, 31 May 2010 (UTC)


 * I wrote a description at the shutoff page, so I'll add a note about that in the edit summary. Tim  1357  talk  23:06, 1 June 2010 (UTC)

Another note, undated references added in the very first revision have a high chance of being copied/split from another article. This means the addition date is not the access date. For example, 2007 suicide bombings in Iraq, first revision. — Hellknowz ▎talk 15:01, 1 June 2010 (UTC)
 * If the link exists in the first available revision for a page, the bot does not search for an archive of the url in question and simply marks the url as being dead. Tim  1357  talk  23:06, 1 June 2010 (UTC)


 * Let's wrap things up here

Are there any other concerns that I have not met? Tim 1357  talk  02:11, 3 June 2010 (UTC)
 * Oh yea, and I fixed the thing that makes the comments. Tim  1357  talk  21:58, 5 June 2010 (UTC)

Good, good… go break a leg! The Earwig   (talk)  21:04, 9 June 2010 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.