User:WebCiteBOT

Update (12/23/14) - I am working on getting the bot up and running again. WebCite seems to be having some technical problems at the moment (requests return timeout messages but actually complete). I am working on a workaround in my code. --ThaddeusB (talk) 18:58, 23 December 2014 (UTC)

WebCiteBOT's purpose is to combat link rot by automatically WebCiting newly added URLs. It is written in Perl and runs automatically with only occasional supervision.

A complete log of the bot's activity, organized by date, can be found under User:WebCiteBOT/Logs/. Some interesting statistics related to its operation can be found at User:WebCiteBOT/Stats.

Operation: WebCiteBOT monitors the URL addition feed at IRC channel #wikipedia-en-spam and notes the time of each addition, but takes no immediate action. After 48 hours (or more) have passed it goes back and checks the article to see if the new link is still in place and if it is used as a reference (i.e. not as an external link). These precautions help prevent the archiving of spam/unneeded URLs.

Articles that have been tagged/nominated for deletion are skipped until the issue is resolved.
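The waiting period and the reference check described above can be sketched roughly as follows. This is an illustration in Python, not the bot's actual Perl code; the names `PendingLink`, `due_links`, and `is_used_as_reference` are hypothetical, and the `<ref>` matching is a deliberately crude assumption about how a reference might be detected.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

WAIT = timedelta(hours=48)  # minimum delay stated in the Operation section

@dataclass
class PendingLink:
    page: str
    url: str
    added_at: datetime  # time the addition was seen on the feed

def due_links(pending, now):
    """Return the links that have waited at least WAIT since being reported."""
    return [p for p in pending if now - p.added_at >= WAIT]

def is_used_as_reference(wikitext, url):
    """Crude check: does the URL appear between <ref ...> and </ref>?

    Self-closing tags such as <ref name="x" /> and <references /> are ignored.
    """
    pattern = r"<ref(?:\s[^>/]*)?>(.*?)</ref>"
    for match in re.finditer(pattern, wikitext, re.DOTALL | re.IGNORECASE):
        if url in match.group(1):
            return True
    return False
```

A link that only appears in an "External links" list would fail `is_used_as_reference` and be left alone.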

For each valid reference it finds, WebCiteBOT first checks its database to see if a recent archive was made. If not, it checks the functionality of the link. Valid links are submitted for archiving at WebCitation.org, while dead links are tagged with {{dead link}}. After the archival attempt has had time to complete, the bot checks the archive's status and updates the corresponding Wikipedia page if the archive was completed successfully. It will also attempt to add title, author, and other metadata that weren't supplied by the human who added the link.
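The per-reference decision described above might look something like this in outline. This is a Python sketch under stated assumptions, not the bot's real code: `decide_action`, `archive_db`, and `link_is_alive` are hypothetical names, and the 30-day definition of "recent" is an assumption, since this page does not say how recent an archive must be.

```python
from datetime import datetime, timedelta

RECENT = timedelta(days=30)  # assumption: "recent" is not defined on this page

def decide_action(url, archive_db, link_is_alive, now):
    """Pick the next step for one reference URL.

    archive_db    -- dict mapping URL -> datetime of the last known archive
    link_is_alive -- callable returning True if the URL currently works
    """
    last_archived = archive_db.get(url)
    if last_archived is not None and now - last_archived <= RECENT:
        return "reuse_archive"          # a recent archive already exists
    if link_is_alive(url):
        return "submit_for_archiving"   # working link: send it to the archive
    return "tag_dead_link"              # broken link: tag it instead
```

Passing the liveness check in as a callable keeps the decision logic separate from any network access, which makes it easy to try out with canned data.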

Features not yet implemented:
 * Ability to archive all links on a specific page on demand
 * Build database of "problem" sites to save time
 * Tag invalid links with {{dead link}} (implemented June 6, 2009)
 * More robust capture of metadata; build a database of human-supplied metadata to assist the bot in determining certain items (update: the bot is now capturing human-entered data for each page it loads in order to build this database)
 * Attempt to locate archive for older links when updating a page (maybe)

Known Issues/Limitations:
 * Some link additions are not reported to #wikipedia-en-spam (likely because there are too many edits for the reporting bot to examine every one) and thus are not caught by WebCiteBOT.
 * The link reporting bot will "un-encode" characters that are URL encoded (e.g. "%80%99"), which makes my bot unable to find the link in the wikitext, so it reports the link as "removed". (A workaround was added to the code on February 26, 2012 to "save" a few of these.)
 * WebCiteBOT is not able to distinguish between true new additions and additions caused by reverts and such. Thus, sometimes a "new" link is actually fairly old and the archived version may not match the version the original editor saw.
 * WebCitation.org does not archive some pages due to robot restrictions. A small number of additional pages are archived incorrectly. (WebCiteBOT normally catches these and doesn't link to them.)
 * WebCiteBOT does not follow redirects. This means if a page is moved after a link is added, but before the bot looks at it, it will be reported as "(link) has been removed".  It is not clear to me whether following redirects would be a desirable behavior or not.
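To illustrate the URL-encoding limitation above, here is one way a lookup could try several spellings of a reported URL before concluding the link was removed. This is a Python sketch and not the workaround actually added to the bot, whose details are not given here; `find_link` and `SAFE` are hypothetical names, and it assumes the wikitext uses UTF-8 percent-encoding with uppercase hex digits.

```python
from urllib.parse import quote, unquote

# Characters left untouched when re-encoding; "%" is included so that
# escapes already present in the URL are not encoded a second time.
SAFE = ":/?#[]@!$&'()*+,;=%~"

def find_link(wikitext, reported_url):
    """Return the spelling of reported_url found in wikitext, or None.

    Tries the URL as reported, a percent-encoded form, and a decoded form.
    """
    candidates = [
        reported_url,
        quote(reported_url, safe=SAFE),
        unquote(reported_url),
    ]
    for candidate in candidates:
        if candidate in wikitext:
            return candidate
    return None
```

A link is only treated as removed when none of the spellings is present. Lowercase hex escapes or non-UTF-8 encodings would still be missed by this sketch.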

Feel free to make a suggestion to improve the bot.

Frequently Asked Questions
Q. I just added a new URL to some page; what should I do now?
 * A. You don't have to do anything. The bot constantly monitors an IRC feed which reports most link additions. It stores every link reported there and archives them after 2-3 days. A feature is currently in the works to allow on-demand archiving of very time-sensitive links, but for now the bot relies entirely on the IRC feed.

Q. Why wasn't a particular link archived?
 * A. The most common reason is the website in question has a robots restriction in place that asks robots not to cache their content. However, there are a number of other possibilities as well. (See known limitations section above.)

Q. Why are there screwed up UTF-8 characters in the log?
 * A. Unfortunately the IRC feed the bot relies on sometimes messes up two-byte characters. WebCiteBOT has been programmed to try an alternative title, in which "messed up" characters are corrected based on common patterns, if the first title it tries doesn't exist. It can only do this after first checking the title as provided, as sometimes titles that look messed up actually aren't. The log always reflects the first title tried, but the actual operation of the bot uses the corrected title when it can figure one out.
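The "try the title as given, then a corrected one" order described in this answer can be sketched as below. This is Python for illustration only; the bot's actual correction patterns are not listed on this page, so the sketch assumes one common kind of damage (UTF-8 bytes misread as Latin-1). `repair_mojibake`, `resolve_title`, and `page_exists` are hypothetical names.

```python
def repair_mojibake(title):
    """Undo UTF-8 text that was wrongly decoded as Latin-1, if possible."""
    try:
        return title.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # the title does not fit this damage pattern

def resolve_title(title, page_exists):
    """Return the title to use, or None if no existing page is found.

    page_exists -- callable returning True if a page with that title exists
    """
    # Always try the title as provided first: it may only look damaged.
    if page_exists(title):
        return title
    fixed = repair_mojibake(title)
    if fixed is not None and fixed != title and page_exists(fixed):
        return fixed
    return None
```

Checking the original title first matters because a title containing characters such as "Ã" may be perfectly valid.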