Wikipedia:Bots/Requests for approval/BOTijo 10


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Request Expired.

BOTijo 10
Operator:

Time filed: 16:33, Thursday April 21, 2011 (UTC)

Automatic or Manual: automatic unsupervised

Programming language(s): Python

Source code available: Yes, it is free software (GPL3), available at Google Code

Function overview: archive ref links, see examples in the conversion table below

Links to relevant discussions (where appropriate):
 * Bots/Requests for approval/WebCiteBOT
 * WikiProject External links/Webcitebot2
 * Wikipedia talk:Link rot
 * WikiProject Council/Proposals/Dead Link Repair
 * http://www.webcitation.org/doc/WebCiteBestPracticesGuide.pdf

Edit period(s): continuous

Estimated number of pages affected:

Exclusion compliant (Y/N): Yes

Already has a bot flag (Y/N): Yes

Function details:


 * 1) Bot checks all the articles (not only new ones), searching for links introduced in the last X days
 * : Which age limit? 7 days? 30 days? 365 days?
 * 2) If there is an archived copy of the website less than 30 days old at WebCite,
 * 3) the bot links to this available archived copy
 * 4) else,
 * 5) if the website is online (no 404, 403, etc. errors),
 * 6) the bot archives it at WebCite and...
 * 7) ...links to this new archived copy
 * 8) else,
 * : the website was lost forever(?); do nothing
 * 9) Submit the changes to the article using the conversion table below
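The decision logic above can be sketched in Python. This is a minimal sketch only: `choose_action`, its parameters, and the returned action strings are illustrative names (not the bot's real code), and the actual WebCite lookup and submission calls are left out.

```python
import datetime

# Reuse a pre-existing WebCite copy only if it is at most this old.
ARCHIVE_MAX_AGE = datetime.timedelta(days=30)

def choose_action(existing_archive_date, site_is_online, today):
    """Decide what to do with one reference link.

    existing_archive_date -- date of the newest WebCite copy, or None
    site_is_online        -- True if the URL answered without 404/403 etc.
    """
    if (existing_archive_date is not None
            and today - existing_archive_date <= ARCHIVE_MAX_AGE):
        return "link-existing-archive"   # reuse the recent archive
    if site_is_online:
        return "archive-and-link"        # archive now, then link to it
    return "do-nothing"                  # page is (probably) lost forever
```

For example, a link whose newest archive is three weeks old would be pointed at that copy, while a live link with no recent archive would be archived first.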

Features to discuss:
 * see below
 * Auto-generated titles using the HTML <title> tag?

Discussion
Citation templates should not be introduced in references given as plain text (WP:CITEVAR); instead, you can use WebCite. — HELL KNOWZ  ▎TALK 16:41, 21 April 2011 (UTC)


 * WebCite is easier, but cite web offers more information (WP:CITEVAR says so: "Generally considered to be helpful improvements: Replacing bare URLs with full bibliographic citations: an improvement because it provides more information to the reader and fights linkrot"). So, more opinions? Regards. emijrp (talk) 17:10, 21 April 2011 (UTC)


 * I meant "To be avoided unless there is consensus: Switching between major citation styles". Introducing cite web in an article that exclusively uses plain-text referencing is considered a reference style alteration. — HELL KNOWZ  ▎TALK 17:18, 21 April 2011 (UTC)


 * Can you help me to build a table like this one? Thanks. emijrp (talk) 18:08, 21 April 2011 (UTC)

I oppose automatic generation of titles. While something sensible may happen in a large proportion of cases, this has been done before and it led to countless inappropriate examples and quite a few complaints. Web page titles are often full of spammy keywords, are sometimes identical across the whole site, and sometimes contain bafflingly irrelevant information. I would be fine with bot-assisted human review, so that the obviously rubbish titles can be rejected, but not with a fully automatic process. Better to leave a bare URL than risk nonsense titles. SpinningSpark 17:55, 21 April 2011 (UTC)


 * We have 12M links (although not all are refs), so this bot can't be manually assisted. If auto-generated titles are not desired, I can leave that value empty or fill it with the same URL (or http://google.com). emijrp (talk) 18:13, 21 April 2011 (UTC)


 * In that case I think you should leave the ref as a bare URL; at least that tells the reader for certain where they are being taken. Autogeneration will produce too many spurious results. I also agree that cite web should not be introduced into articles where it is not already the preferred style. I would, however, support changing links that have no text into a bare URL, as it is a disservice to readers not to make clear where they are being taken. SpinningSpark 14:10, 26 April 2011 (UTC)


 * Question from SpinningSpark: if an editor informs you that they think your bot has made a mistake, what action will you take? SpinningSpark 17:55, 21 April 2011 (UTC)


 * If it is obvious that they are wrong, I will leave a link to this BRFA (where the issue was discussed before) on their talk page. If it is not clear, I will stop the bot and discuss. emijrp (talk) 18:19, 21 April 2011 (UTC)

Have you talked to the WebCite people about this, so you can conform to any restrictions they may have? Will you be adding deadurl=no along with archiveurl=, in case anyone ever gets around to adding that to cite web? Anomie⚔ 10:33, 22 April 2011 (UTC)


 * I have not talked to the WebCite people yet, because I prefer to get this bot approved on Wikipedia first, and later ask them to add my IP (or the Toolserver IP) to the white list. Currently, an IP is banned once it has archived 100 URLs, and you have to wait some hours to start archiving new URLs again.


 * From the FAQ: "If you are a programmer or a student looking for a project, here are some ideas for programming projects. [...] develop a wikipedia bot which scans new wikipedia articles for cited URLs, submits an archiving request to WebCite®, and then adds a link to the archived URL behind the cited URL." So, I think there is no problem with the white listing stuff.


 * My idea is to add deadurl=no/yes and archiveurl= to every single cite web occurrence. Is that OK? Sorry, but I have not understood your question. emijrp (talk) 12:07, 22 April 2011 (UTC)
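Appending deadurl= and archiveurl= to cite web templates that lack them could be sketched with a regular expression as below. This is illustrative only (the function name and parameter layout are assumptions, not the bot's source); a real bot would use a proper wikitext template parser to survive nested templates and unusual formatting.

```python
import re

def add_archive_params(wikitext, archive_url, archive_date, dead="no"):
    """Append |archiveurl=, |archivedate= and |deadurl= to every
    {{cite web}} template that does not already have an archive (naive sketch)."""
    def repl(match):
        body = match.group(1)
        if "archiveurl" in body:
            return match.group(0)  # already archived: leave untouched
        extra = "|archiveurl=%s |archivedate=%s |deadurl=%s" % (
            archive_url, archive_date, dead)
        return "{{cite web" + body + " " + extra + "}}"
    # Naive pattern: assumes no nested templates inside the citation.
    return re.sub(r"\{\{cite web(.*?)\}\}", repl, wikitext,
                  flags=re.IGNORECASE | re.DOTALL)
```

The skip inside `repl` keeps the bot from clobbering a citation that already carries an archive link.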

A few pitfalls that I have thought of:
 * Will you be checking that the reference does not already include a link to WebCite? There is a danger that you might replace a link to a valid archive with a link to one that has entirely changed since the ref was created.  If there is an earlier archive link, that one should be retained as it is closer to the actual page from which the information in the article was taken.
 * I don't like your algorithm for using pre-existing WebCite archives (30 days). If this archive is different from the one looked at by the editor you will be archiving the wrong thing.  You should only use the pre-existing archive if it is younger than the date the reference was inserted.  You won't, in any case, end up with duplicated archives at WebCite.  Their software will return the link to the pre-existing page if you attempt to archive a duplicate.
 * Are you searching for references that are not between ref tags such as those in a Bibliography, References or Footnotes sections?
 * Are you taking steps to exclude sites such as Google books, Amazon Search Inside and other sites that display previews of copyrighted print works?
 * Are you checking that the ref is not to a site that is already an archive site such as The Internet Archive or the Wayback Machine?

 SpinningSpark 14:47, 26 April 2011 (UTC)

Hi, my answers are below. Regards. emijrp (talk) 10:14, 27 April 2011 (UTC)
 * 1) Yes. The bot only edits references without archiveurl= and archivedate= parameters. Furthermore, if the url= parameter contains archive.org or webcitation.org, it is skipped.
 * 2) OK, but I guess that younger pre-existing archives are not common. If a ref was added to an article in 2002 and there is no archive at WebCite from before 2002, should I archive a copy today (2011) and add it to the archiveurl= parameter? A lot of websites may have changed since the day the ref was added. By the way, the script adds the archivedate= parameter, so a reader can compare accessdate= (for example 2002-01-01) with archivedate= (for example 2010-12-31) and will know that the archived copy is probably different from the original website (8 years apart).
 * 3) No. The script only edits references between <ref> tags. If you want to add more cases, go to the conversion table below.
 * 4) I don't exclude those domains, but I can black list them. By the way, almost all of the Internet is copyrighted, not only Google Books or Amazon; it is not possible to archive only websites with free licenses. Read the "What about copyright issues?" section at WebCitation.
 * 5) Yes, I exclude the archive.org and webcitation.org domains. Those URLs are not archived.
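The skip rules described above (no edit when archiveurl=/archivedate= already exist, and never touching a link that already points at an archive service) amount to a short check per reference. A sketch, with illustrative names (`should_skip`, `ARCHIVE_HOSTS` are not from the bot's source):

```python
from urllib.parse import urlparse

# Domains that are themselves archives: never re-archive these.
ARCHIVE_HOSTS = ("archive.org", "webcitation.org")

def should_skip(ref_wikitext, url):
    """True if this reference must be left alone."""
    if "archiveurl=" in ref_wikitext or "archivedate=" in ref_wikitext:
        return True  # already carries an archive link
    host = urlparse(url).hostname or ""
    # Match the domain itself and any subdomain (www.webcitation.org, ...).
    return any(host == h or host.endswith("." + h) for h in ARCHIVE_HOSTS)
```

Matching on the parsed hostname rather than a plain substring avoids false positives such as an article URL that merely mentions "archive.org" in its path.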


 * WP:BRION asked for a few minutes of help from all readers with this request list, and I decided to check it out. On WebCite I found this info: "WebCite allows on-demand prospective archiving. It is not crawler-based; pages are only archived if the citing author or publisher requests it."  The bot is not the citing author, and not the publisher (of the cited work).  –89.204.137.195 (talk) 21:08, 22 June 2011 (UTC)
 * The citing author is the Wikipedia community; this bot will only automate the process. Read the "I am a programmer or student looking for a project and ideas - How can I help?" section; they request a Wikipedia bot. --emijrp (talk) 06:08, 12 July 2011 (UTC)
 * I'll just add that Gunther Eysenbach of WebCitation.org was actively involved in our discussion about archiving external websites that Wikipedia uses for references and such. Mr. Eysenbach specifically stated that WebCite wants to work with Wikipedia to get websites archived and he has even offered a $500.00 bounty to encourage people to get this task accomplished. So the concern raised above is not an issue in this case. - Hydroxonium (T•C• V ) 02:35, 13 July 2011 (UTC)
 * IMO the "citing author" is listed in the edit history, and only unregistered users like me can be considered part of some anonymous "community". I don't know Python; does the bot respect robots.txt? Does it use link rel="canonical" to find "canonical" content with a relevant robots.txt? Does it limit its accesses to a given web server somehow? When I tried to create a simple link checker (a kind of bot, unrelated to Wikipedia) some years ago, I managed to flood some web servers with HEAD requests; the damage was fortunately limited by a slow modem connection. –89.204.153.138 (talk) 20:37, 23 July 2011 (UTC)
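For what it's worth, Python's standard library handles the robots.txt part, and a per-host delay avoids the flooding described above. A sketch under those assumptions (the `PoliteFetcher` class, its user agent, and its defaults are illustrative, not taken from the bot's source):

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    """Respect robots.txt and wait between requests to the same host."""

    def __init__(self, user_agent="BOTijo", delay=5.0):
        self.user_agent = user_agent
        self.delay = delay          # minimum seconds between hits per host
        self.parsers = {}           # host -> cached RobotFileParser
        self.last_hit = {}          # host -> timestamp of last request

    def allowed(self, url):
        """Check robots.txt (fetched once per host) before touching a URL."""
        host = urlparse(url).netloc
        if host not in self.parsers:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url("http://%s/robots.txt" % host)
            try:
                parser.read()
            except OSError:
                pass  # unreachable robots.txt: parser stays conservative
            self.parsers[host] = parser
        return self.parsers[host].can_fetch(self.user_agent, url)

    def throttle(self, url):
        """Sleep so that the same host is not hit more often than `delay`."""
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_hit.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_hit[host] = time.time()
```

Calling `throttle()` before each request bounds the per-server load regardless of how many links an article contains.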

Conversion table
Suggest as many conversions as you want.

How will you make sure that the date format used for archivedate= matches the article's format (e.g. a {{use dmy dates}} template or the predominant existing use; see Bots/Requests_for_approval/H3llBot_4 for example)?
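One possible heuristic for this, sketched below (not from the bot's source): honour a {{use dmy dates}}/{{use mdy dates}} template when the article has one, and otherwise count which date style the existing text uses more often.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|"
          r"August|September|October|November|December)")

def article_date_format(wikitext):
    """Guess whether |archivedate= should be written dmy or mdy."""
    if re.search(r"\{\{\s*use dmy dates", wikitext, re.I):
        return "dmy"
    if re.search(r"\{\{\s*use mdy dates", wikitext, re.I):
        return "mdy"
    # No explicit template: count existing dates of each style.
    dmy = len(re.findall(r"\b\d{1,2} " + MONTHS + r"\b", wikitext))
    mdy = len(re.findall(r"\b" + MONTHS + r" \d{1,2},", wikitext))
    return "mdy" if mdy > dmy else "dmy"
```

Ties fall back to dmy here; a real implementation would need a project-appropriate default.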

How do you re-verify dead links? Currently a week is the usual interval before re-checking the site, in case it had temporary maintenance or downtime.

Does the bot follow redirects from 404 pages to live pages?
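Both checks above can be sketched together. `urllib` follows HTTP redirects by itself, so a 404 page that redirects to a live page reports the final status; and a link is only declared dead after failing for a whole week. The names (`final_status`, `is_dead`, `grace_days`) and the injected `recheck` callable are illustrative assumptions, not the bot's real interface.

```python
import datetime
import urllib.error
import urllib.request

def final_status(url, timeout=30):
    """Return the final HTTP status after following redirects,
    or None on network errors (DNS failure, timeout, ...)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode()      # urlopen follows 3xx redirects itself
    except urllib.error.HTTPError as e:
        return e.code                  # 404, 403, 500, ...
    except (urllib.error.URLError, OSError):
        return None

def is_dead(url, recheck, first_seen_dead=None, grace_days=7):
    """Declare a link dead only after it has failed for grace_days,
    to ride out temporary maintenance or downtime.

    recheck -- callable returning the link's current status code
    Returns (is_dead, first_seen_dead)."""
    status = recheck(url)
    if status is not None and status < 400:
        return False, None             # alive (again): reset the clock
    first = first_seen_dead or datetime.datetime.utcnow()
    return (datetime.datetime.utcnow() - first).days >= grace_days, first
```

In use, `recheck` would be `final_status`; the stored `first_seen_dead` timestamp lets the weekly re-check run without any extra state machinery.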

A relevant RfC, Requests for comment/Dead url parameter for citations, addresses marking urls with yes or no, which is relevant to this BRFA. I don't want to create a fait accompli by asking you to do this, but then again I want to wait until the end of the RfC. In any case, I think there should be some indication that the original links are not dead yet and that the supplied WebCite url is pre-emptive archiving. — HELL KNOWZ  ▎TALK 09:25, 22 May 2011 (UTC)

Please note that the above RfC has closed as successful and deadurl will be implemented. Would you be able to add this to your bot's logic? — HELL KNOWZ  ▎TALK 13:54, 11 June 2011 (UTC)
 * Yes, no? SQL Query me!  08:05, 3 August 2011 (UTC)

No response from the operator in 8 days; the operator has not edited this page since mid-July. Without prejudice, please re-open this request whenever you would like. SQL Query me! 05:16, 11 August 2011 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.