Wikipedia:Bots/Requests for approval/H3llBot 2


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

H3llBot 2
Operator:

Automatic or Manually assisted: Automatic

Programming language(s): C# .NET via MediaWiki API, Not English

Source code available: No (unless requested)

Function overview: Archive dead reference citation external links via Archive.org's Wayback Machine.

Links to relevant discussions (where appropriate): Same as Tim1357's BRFA, same recent VP link, WebCiteBOT BRFA, and some more older request links.

Edit period(s): Continuous ~11pm-2am GMT, run from my PC (I may inquire with Toolserver about hosting .exe files)

Estimated number of pages affected: Given dead URL caching, much less than read capacity; estimating 5 edits per minute (epm)

Exclusion compliant (Y/N): Y

Already has a bot flag (Y/N): Y

Function details:


 * Select a scheduled page or a page from a selected list [1] for reading
 * Read the page and find all tagged external links or citation templates with url= and accessdate= [2]
 * Add all newly seen links to the link repository [3] for an immediate check and a repeat check 5 days later
 * Find a Wayback Machine URL [4] within a 6-month range for all dead links
 * Update the corresponding URLs with the Wayback URLs found and remove dead link tags if any; otherwise mark the links as dead
 * Schedule the page for future processing if first-seen 404 links are encountered, and proceed with the next page

[1] FA/GA and Articles with dead links are priorities; otherwise I have classes for category/template/whatever parsing.
[2] For finding the revision in which a link was added, Wikiblame is extremely slow, and a proper API revision search is slow to implement but is in progress.
[3] A link storage and 404-state checker with schedules for the next 404 check.
[4] Retrieve archive.org's query result page for the selected URL and parse it for links with acceptable dates.
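
The snapshot choice in step 4 can be illustrated with a minimal sketch. It is written in Python for brevity and is not the bot's C# code. It assumes the candidate snapshots have already been extracted as 14-digit Wayback timestamps, and it follows the rule stated later in the discussion: take the older snapshot closest to the access date, within 180 days. The names `pick_snapshot` and `wayback_url` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Wayback Machine URLs embed a 14-digit timestamp: YYYYMMDDhhmmss.
WAYBACK_FORMAT = "%Y%m%d%H%M%S"


def pick_snapshot(timestamps: Iterable[str],
                  access_date: datetime,
                  max_age_days: int = 180) -> Optional[str]:
    """Return the snapshot taken at or before access_date that is closest
    to it, ignoring snapshots more than max_age_days older.
    Returns None when no snapshot qualifies."""
    window = timedelta(days=max_age_days)
    best = None  # (snapshot datetime, original timestamp string)
    for ts in timestamps:
        taken = datetime.strptime(ts, WAYBACK_FORMAT)
        age = access_date - taken
        # Keep only snapshots that are older than (or equal to) the
        # access date and still inside the allowed window.
        if timedelta(0) <= age <= window:
            if best is None or taken > best[0]:
                best = (taken, ts)
    return best[1] if best else None


def wayback_url(timestamp: str, original_url: str) -> str:
    """Build the archive URL for a chosen snapshot."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"
```

Under these assumptions a link with no qualifying snapshot yields `None`, which corresponds to the case where the link is marked as dead instead of being replaced.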

Dead links are considered to be pages returning a 404 error twice, 5 days apart (I have not yet encountered 404s caused by a server disk spinning up, as pointed out by Tim1357, but I can set up double checks to see how many servers actually behave like that). The parser builds page tree structures and omits pages it cannot process so no or  processing.
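
The two-strike rule above could be tracked roughly as follows. This is a hedged Python sketch, not the bot's implementation: `LinkRecord` and `record_check` are hypothetical names, the HTTP status is passed in by the caller instead of being fetched, and treating any non-404 response as a reset is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RECHECK_DELAY = timedelta(days=5)  # gap between the two 404 checks


@dataclass
class LinkRecord:
    url: str
    first_404: Optional[datetime] = None   # when the first 404 was seen
    next_check: Optional[datetime] = None  # when to check again
    dead: bool = False


def record_check(rec: LinkRecord, status: int, now: datetime) -> LinkRecord:
    """Update a link record with the result of one HTTP check."""
    if status != 404:
        # Assumption: any other status clears the pending 404 state.
        rec.first_404 = None
        rec.next_check = None
        rec.dead = False
    elif rec.first_404 is None:
        # First strike: remember it and schedule the repeat check.
        rec.first_404 = now
        rec.next_check = now + RECHECK_DELAY
    elif now - rec.first_404 >= RECHECK_DELAY:
        # Second strike at least 5 days later: consider the link dead.
        rec.dead = True
        rec.next_check = None
    return rec
```

A 404 seen again before the delay has passed changes nothing in this sketch, so a page re-read early does not mark its links dead prematurely.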

Discussion
Regarding the same task request: the way I see it, the article pool is very large, this is a growing problem, and diminishing returns or redundancy should not arise even with multiple bots. Current progress covers almost all of the above functionality. — Hellknowz ▎talk 00:42, 19 May 2010 (UTC)
 * Are you willing and able to publish your source code?
 * Let's see the bot in action on a small sample set. Josh Parris 06:42, 19 May 2010 (UTC)
 * Have you undertaken the trial yet? Josh Parris 11:19, 25 May 2010 (UTC)
 * Sorry that this is taking so long. I had a big deadline yesterday, so this has taken longer than anticipated. I am currently coding the revision retrieval functionality. I really do not wish to hurry with this implementation, as any unhandled exceptions may lead to nasty side effects. — Hellknowz ▎talk 11:33, 25 May 2010 (UTC)
 * Just so long as you haven't forgotten about this, all's well. Josh Parris 11:40, 25 May 2010 (UTC)
 * About publishing source code: do you reckon it would be useful to make my custom C# API available? There is currently only one I know of and it lacks many important features. — Hellknowz ▎talk 14:24, 29 May 2010 (UTC)
 * Yes, I encourage you to do so. Feel free to mention it in the appropriate lists. Josh Parris 02:26, 31 May 2010 (UTC)

Special:Contributions/H3llBot. — Hellknowz ▎talk 16:36, 1 June 2010 (UTC)
 * Question - If there are multiple archived versions of a link in the WayBack Machine, and the text is different on each one, how does the bot know the right one to pick?--Rockfang (talk) 18:33, 1 June 2010 (UTC)
 * It always picks the older version closest to the original access date (link addition date) up to 180 day range. — Hellknowz ▎talk 18:57, 1 June 2010 (UTC)
 * Only five edits; let's do one more big test, because I've found citation setup varies a lot from article to article. Tim  1357  talk  21:33, 7 June 2010 (UTC)


 * (Just make sure you babysit the bot). Tim  1357  talk  21:33, 7 June 2010 (UTC)

Contributions/H3llBot
 * One funny issue was this: ; where the bot applied a fix to bad template syntax (a vertical pipe before the closing brackets). This comes from a super-subtle bug in the page structure parser. — Hellknowz  ▎talk 22:55, 7 June 2010 (UTC)


 * Very well done! Tim  1357  talk  23:04, 7 June 2010 (UTC)


 * Tim 1357  talk  23:04, 7 June 2010 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.