Wikipedia:Bots/Requests for approval/PhuzBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Request Expired.

PhuzBot
Operator:

Automatic or Manually assisted: Automatic

Programming language(s): Python

Source code available: Standard pywikipediabot

Function overview: Scan Wikipedia using the included weblinkchecker.py script for dead external links and report them on article talk pages.

Links to relevant discussions (where appropriate):

Edit period(s): Weekly

Estimated number of pages affected: Initially, probably close to 50,000 pages. After the initial mass-scan completes and dead links are more noticed, this number will most likely drop significantly (to probably less than 2,000 per week). These are guesstimates, however, and I cannot guarantee that these numbers will not be exceeded.

Exclusion compliant (Y/N): Y

Already has a bot flag (Y/N): N

Function details: This bot will run the weblinkchecker.py script included with pywikipediabot on a weekly basis, scanning Wikipedia's external links and checking whether each link is dead. During the initial scan, only a list of dead links will be compiled (on the machine running the bot), and no changes will be made to Wikipedia. After one week has passed, the script will run again and re-check the dead links found in the previous scan to see whether each link was only temporarily down or is still unavailable. If a link is still unavailable, the bot will report on the article's talk page that the article contains dead links, and list them so that they can be fixed or removed.
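The two-pass logic described above can be sketched as follows. This is a minimal illustration of the idea, not the actual weblinkchecker.py implementation; the function and variable names (`update_dead_links`, `history`, `REPORT_AFTER`) are hypothetical:

```python
import datetime

# A link is only reported once it has been found dead on two scans
# at least a week apart, to filter out temporary outages.
REPORT_AFTER = datetime.timedelta(days=7)

def update_dead_links(history, scan_results, now):
    """Record dead links and return those dead long enough to report.

    history: dict mapping URL -> datetime the link was first seen dead
             (persisted between scans on the machine running the bot)
    scan_results: dict mapping URL -> bool (True if the link is dead now)
    now: datetime of the current scan
    """
    to_report = []
    for url, is_dead in scan_results.items():
        if not is_dead:
            history.pop(url, None)   # link recovered: forget it
        elif url not in history:
            history[url] = now       # first failure: just remember it
        elif now - history[url] >= REPORT_AFTER:
            to_report.append(url)    # still dead a week later: report it
    return to_report
```

On the first scan every dead link is merely recorded, so nothing is written to Wikipedia; a week later, links that are still dead are returned for posting to article talk pages.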

Discussion

 * Are you intending to run this over all articles, or all pages on Wikipedia? Either way, it would probably be much better to use a toolserver query or a database dump rather than trying to download millions of pages from the live site. Anomie⚔ 15:18, 13 August 2010 (UTC)


 * I'll start with the Main namespace, but if the bot becomes more popular, I can run it over other namespaces as well. I've been running it in read-only mode, where it won't actually write to talk pages yet, and in 24 hours it's only into the B pages. It's chugging along quite slowly right now, but I expected the initial run to take anywhere between 1 and 3 weeks. After the initial run completes, I can more accurately estimate how long runs will take. Worst case, I could run this every month or two and report the results then. My only concern is that if the runs get too spread out, more and more links will be added that need to be checked, increasing the time and computing power required for each run. At its current rate, it pulls 240 pages at a time, roughly every 1-3 minutes. Due to the location of my server, I get routed to the Amsterdam cluster rather than the Tampa cluster, so US traffic shouldn't be heavily affected by the reads/edits. If you'd like, I can rate-limit the reading or writing to accommodate European traffic's peak times, and only run at full speed during off-peak hours for the Kennisnet cluster.


 * I'm willing to take any suggestions on how to make the bot as effective as possible without disrupting Wikipedia. If anyone has any suggestions, I'm all ears. Phuzion (talk) 16:59, 13 August 2010 (UTC)


 * I might be missing something, but for this you are going to have to show that there is community consensus. If you have not already started a discussion, I suggest starting one and asking for comment at WP:VPR, WT:EL, and WT:LINKROT. Tim  1357  talk  23:06, 16 August 2010 (UTC)


 * Requests for input submitted at WP:VPR, WT:EL, and WT:LINKROT. Phuzion (talk) 14:38, 17 August 2010 (UTC)

Also, if you could send the data file to me when you are done with the run (I believe it saves in a subdirectory of the pywikipedia directory), it would make the over-analytical, data-hungry nerd in me supremely happy. If it is under 10MB you can email it to me at this address (hidden to protect against spambots). If it is larger than 10MB, well, we'll figure that out later. I've wanted to do a dead link run like this for a long time, but I can't afford the bandwidth and I don't want to do it on my toolserver account. Thanks  Tim  1357  talk  23:14, 16 August 2010 (UTC)
 * Certainly. Phuzion (talk) 14:38, 17 August 2010 (UTC)

weblinkchecker.py only checks the Wayback Machine. Could you add support for WebCite? How can I find reported dead links for a subject area using CatScan (e.g. by inserted template or category)? Merlissimo 07:24, 18 August 2010 (UTC)


 * Any news?  MBisanz  talk 05:40, 5 September 2010 (UTC)

Mr.Z-man 04:05, 26 September 2010 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.