Wikipedia:Bots/Requests for approval/WaybackBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.

WaybackBot
Operator: Tim1357

Automatic or Manually assisted: Manually assisted at first. The bot would run under close supervision until it has enough of a track record to run by itself.

Programming language(s): Python, using the pywikipedia framework.

Source code available: here is a link to the code (it updates automatically every time I change it). It needs work; I keep getting little formatting errors that I need some programmers to help me with.

Function overview:

WaybackBot would (intelligently) check the Internet Archive for archived copies of dead external links.

Links to relevant discussions (where appropriate): There are a lot. Some are:


 * 1) Bot Request
 * 2) Bot Request #2
 * 3) Bot request #3

Edit period(s):

At first, I will babysit the bot and check every edit it makes, until I am confident enough to let it run unsupervised.

Estimated number of pages affected: An estimated 10% of all external links on Wikipedia are, in some way, dead. If there are 2.5 million links on Wikipedia (there were in 2006), then roughly 250,000 are dead. That's a lot of pages.

Exclusion compliant (?): I'm not sure; does pywikipedia handle exclusion compliance automatically?

Already has a bot flag (Y/N):

Function details: The bot's workflow looks like this:

 * 1) Load a page (from an XML dump).
 * 2) Extract all of the external links.
 * 3) Check each link and flag it as dead (defined as HTTP status code 404 or 401).
 * 4) If a link is dead, look for its corresponding accessdate; if none exists, use WikiBlame to find one.
 * 5) Build a range of acceptable archive dates. (Right now an archive is acceptable if it falls within 2 months of the original accessdate; I am willing to change that. Remember that a larger range makes finding an archive more likely.)
 * 6) If the URL is cited with {{cite web}} and does not already have an archive, add the archive-url and archive-date parameters.
 * 7) If there is no {{cite web}}, append {{wayback}} to the reference using the |date and |url parameters.
 * 8) If the Internet Archive has no copy, mark the reference with {{dead link}} using the |date and |bot parameters.
 * 9) Start over, caching links already checked as dead or alive so they are not checked again. I will add a function to the script to clear the cache.

Whew, I think that's it. If you want a more nitty-gritty explanation of what the bot does, look at the source code; pretty much every line has a comment. Note that the source is hosted from my home computer, so it might not be up when the computer is off.
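To make steps 3 and 5 concrete, here is a minimal sketch of the dead-link check and the archive lookup. This is illustrative only, not the bot's actual code: the helper names are mine, and I am assuming the Internet Archive's availability endpoint at archive.org/wayback/available, which returns JSON describing the snapshot closest to a given timestamp.

```python
import json
import urllib.parse
import urllib.request
from urllib.error import HTTPError, URLError

# Status codes the bot treats as "dead" (step 3).
DEAD_CODES = {401, 404}

def is_dead(url, timeout=15):
    """Return True only when the server answers with a dead status code."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return False
    except HTTPError as err:
        return err.code in DEAD_CODES
    except URLError:
        # Network trouble is inconclusive, not proof that the page is gone.
        return False

def availability_query(url, timestamp):
    """Build an availability-API query for the snapshot closest to the
    given YYYYMMDD timestamp (assumed endpoint, see note above)."""
    return ("https://archive.org/wayback/available?url="
            + urllib.parse.quote(url) + "&timestamp=" + timestamp)

def find_archive(url, timestamp):
    """Return (archive_url, archive_timestamp), or None if no snapshot."""
    with urllib.request.urlopen(availability_query(url, timestamp)) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"], snap["timestamp"]
    return None
```

Note that `is_dead` deliberately treats connection errors as inconclusive rather than dead, so temporary outages do not trigger a rewrite.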

Discussion
Some Stuff You Should Know:


 * See this screw-up I made (still very sorry).


 * I am pretty new to Python; this is my first big project, so I need some help.


 * The Internet Archive does not show archives until about 6 months after they are crawled (right now it is still processing archives from June), which means that if the bot requests an archive for a page that was accessed today, it will not get any results.


 * I support a larger archive range, but I will leave it up to consensus here.
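The date arithmetic behind step 5 and the six-month lag above could be sketched like this. The 61-day and 183-day constants are my own rough approximations of "two months" and "six months"; the function names are illustrative:

```python
from datetime import date, timedelta

ARCHIVE_WINDOW = timedelta(days=61)   # "within 2 months" of the accessdate (step 5)
CRAWL_LAG = timedelta(days=183)       # snapshots surface roughly 6 months after capture

def worth_querying(accessdate, today=None):
    """Skip links accessed too recently for any snapshot to be visible yet."""
    today = today or date.today()
    return accessdate <= today - CRAWL_LAG

def within_window(archive_date, accessdate):
    """Is this snapshot close enough to the original accessdate to use?"""
    return abs(archive_date - accessdate) <= ARCHIVE_WINDOW
```

Widening `ARCHIVE_WINDOW` is exactly the "larger archive range" trade-off: more links get an archive, but the archived copy may differ more from the version the editor originally cited.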

 Things to do: I'd like to put this on hold for a while. User:Dispenser gave me some points about the bot's concept that I hadn't thought about. I am going to tweak the code to make it more fail-safe, and so that the bot gives a dead link two tries before it looks for an archive (some links are only dead for a bit, then come back to life). Thanks Tim1357 (talk) 02:05, 10 December 2009 (UTC)
 * add logging that is similar to the logs of User:WebCiteBOT ✅ (still need to write the code that uploads the log)
 * make bot exclusion compliant
 * auto-clear cached links
 * add synonyms for templates (citeweb = Citeweb = cite web), etc.
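On the exclusion-compliance item: pywikipedia may already provide a check for this, but if not, a hand-rolled version of the standard {{bots}}/{{nobots}} template test is small. This is a rough sketch, not tested against every form the templates can take:

```python
import re

def allowed_on_page(text, botname="WaybackBot"):
    """Crude exclusion-compliance check: honor {{nobots}} and
    {{bots|deny=...}} in the page's wikitext."""
    # {{nobots}} bans all bots from the page.
    if re.search(r"\{\{\s*nobots\s*\}\}", text, re.IGNORECASE):
        return False
    # {{bots|deny=...}} bans the listed bots (or "all").
    m = re.search(r"\{\{\s*bots\s*\|\s*deny\s*=\s*([^}]*)\}\}", text, re.IGNORECASE)
    if m:
        denied = [name.strip().lower() for name in m.group(1).split(",")]
        return "all" not in denied and botname.lower() not in denied
    return True
```

The bot would call this on the page text before saving and skip the page when it returns False.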
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.