Wikipedia:Bots/Requests for approval/RotlinkBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.

RotlinkBot
Operator:

Time filed: 21:04, Sunday August 18, 2013 (UTC)

Automatic, Supervised, or Manual: Supervised

Programming language(s): Scala, Sweble, Wiki.java

Source code available: No. There are actually only a few lines of code (and half of them are Wiki urls and passwords), because it uses two powerful frameworks which do all the work.

Function overview: Find dead links (mostly by looking for dead-link marks next to them) and try to recover them by searching web archives using the Memento protocol.

Links to relevant discussions (where appropriate): User_talk:RotlinkBot, Bot_owners'_noticeboard

Edit period(s): Daily

Estimated number of pages affected: 1000/day (perhaps a bit more in the first few days)

Exclusion compliant (Yes/No): No. It was not exclusion compliant initially, and so far nobody has undone any change or made complaints against it. It can easily be made exclusion compliant.

Already has a bot flag (Yes/No): No

Function details: Find dead links (mostly by looking for dead-link marks next to them) and try to recover them by searching web archives using the Memento protocol.

The current version of the bot software does not work with the other, non-Memento-compatible archives (WebCite, WikiWix, Archive.pt, ...).

During [//en.wikipedia.org/w/index.php?title=Special:Contributions/RotlinkBot&offset=20130818230000&target=RotlinkBot the test run], about 3/4 of the recovered links were found on Internet Archive (because it has the biggest and oldest database), about 1/4 on Archive.is (because of its proactive archiving of new links appearing on the Wikis) and only a few links on the other archives (because of their smaller size and regional specificity).
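The lookup described above can be sketched in a few lines. The bot's actual (unpublished, Scala) code is not shown here; this is an illustrative Python sketch of a Memento TimeGate request, where the hypothetical helper `timegate_request` builds the gate URL and the Accept-Datetime header that asks an archive for the snapshot closest to a given date:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def timegate_request(timegate, dead_url, when):
    """Build the URL and headers for a Memento TimeGate lookup: the
    dead url is appended to the gate, and Accept-Datetime asks the
    archive for the snapshot closest to the given date."""
    url = timegate.rstrip('/') + '/' + dead_url
    headers = {'Accept-Datetime': format_datetime(when, usegmt=True)}
    return url, headers

# Example against the Wayback Machine's TimeGate:
url, headers = timegate_request(
    'http://web.archive.org/web',
    'http://www.reuben.org/NewEngland/news.html',
    datetime(2003, 1, 1, tzinfo=timezone.utc),
)
```

Any Memento-compatible archive answers such a request the same way, which is what lets one code path serve several archives.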

Discussion

 * Comment - I have a concern with this bot in that it has a possibly unintended side effect of replacing valid HTML entities in articles with characters that result in undefined browser behavior. RotlinkBot converted "&gt;" to ">" here and "&amp;" to "&" here. --Marc Kupper|talk 01:41, 19 August 2013 (UTC)


 * Hi.


 * Yes, it was already noticed: User_talk:A930913/BracketBotArchives/_4. Anyway, although this optimisation of entities, which the Wiki.java framework performs on each save, is harmless and does not break the layout (both rendered pages, before and after the change, display ">"), the resulting diffs are confusing for people. I am going to fix it to avoid weird diffs in the future.


 * Also, the rendered HTML (which the browser sees and deals with) has "&amp;" even if the Wiki source has "& without an entity name". This seems to be the reasoning of the framework authors: if the resulting HTML is the same, it makes sense to emit the shortest possible Wiki source. But this nice intention results in weird diffs if the Wiki source was not optimal before the edit.


 * Again, to be clear: the bot makes changes in the Wiki source. The browser does not deal with the Wiki source and cannot hit the "undefined behavior" case. The browser deals with the HTML which is produced from the Wiki source on the server, and single "&"s are converted to "&amp;" at this stage. The only drawback I see here is that the bot makes edits outside the intended scope. And this will be fixed. Rotlink (talk) 03:49, 19 August 2013 (UTC)
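The equivalence being argued here can be shown with a small sketch. The escaping rule below is a rough stand-in for the relevant part of MediaWiki's HTML production, not its actual code, and `wiki_ampersands_to_html` is a hypothetical name:

```python
import re

def wiki_ampersands_to_html(src):
    """Rough stand-in for MediaWiki's HTML production: recognised
    character entities are kept as-is, while a bare '&' is escaped
    to '&amp;'. (Not MediaWiki's real sanitizer.)"""
    return re.sub(r'&(?!(?:[a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)', '&amp;', src)

# Both spellings of the wiki source produce identical HTML, which is
# why the framework feels free to emit the shorter one:
html_a = wiki_ampersands_to_html('AT&T')      # bare ampersand
html_b = wiki_ampersands_to_html('AT&amp;T')  # explicit entity
```

Since `html_a` and `html_b` come out identical, the rendered page is unaffected; only the wikitext diff changes.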

I will consider the many edits already made as a long trial. I'll look through more edits as time permits.

One immediate issue is to have a much better edit summary, with a link to some page explaining the process of the bot.

For more info on how this was handled before, see H3llBot, DASHBot and BlevintronBot and their talk pages, so you are aware of the many issues that can come up.

Just to clarify, does "supervised" mean you will review every edit the bot makes? That is what it means here. If so, I can be more lenient on corner cases, as you will manually fix them all. Hellknowz


 * I look through the combined diff of all edits (it is usually 1-2 lines per article - only the editing point and a few characters around it). Also, the number of unique introduced urls is smaller than the number of edits. Rotlink

For example, how do you determine that links are actually dead? What exactly does "mostly by looking for dead link" mean in practice? For example, it is the consensus that a link should be revisited at least twice to confirm it is not just server downtime or an incorrect temporary 404 message. This has been an issue, as there are false positives and many corner cases. Hellknowz


 * You are absolutely right, detecting dead links is not trivial.
 * That is why I started with the most trivial cases:
 * sites which have definitely been dead for a long time (such as btinternet.{com|co.uk}, the {linuxdevices|linuxfordevices|windowsfordevices}.com family, ...)
 * Google Cache, which Wikipedia has a lot of links to and for which it is easy to check whether an entry was removed from the cache.
 * This job is almost finished, so it seems needless to submit another BRFA for it.
 * I mentioned the dead link mark as a future way to find dead links. Something like   would be a stronger signal of a dead link than a 404 status of the url (which can be temporary). The idea was to collect a list of all urls marked as dead links, to find dead domains with a lot of urls, and to perform on them the same fixes which had been done on btinternet.com (or, generally, not dead domains but dead prefixes; for example, all urls starting with http://www.af.mil/information/bios/bio.asp?bioID= are dead while the domain itself is not). I see no good way to do it fully automatically. The scripts can help to find such hot spots and prioritize them by the number of dead links, so a single manual check can give a start to a massive replace. This also simplifies manual post-checking. Rotlink
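The "hot spot" idea described above can be sketched as follows. This is illustrative Python, not the bot's code; `dead_link_hotspots` is a hypothetical helper that groups by host (the dead-prefix refinement would group by a longer url prefix instead):

```python
from collections import Counter
from urllib.parse import urlsplit

def dead_link_hotspots(dead_urls, min_count=2):
    """Group urls already tagged as dead by host; hosts with many dead
    links are candidates for one manual check followed by a mass fix."""
    counts = Counter(urlsplit(u).hostname for u in dead_urls)
    return [(host, n) for host, n in counts.most_common() if n >= min_count]

tagged_dead = [
    'http://www.btinternet.com/~a/x.html',
    'http://www.btinternet.com/~b/y.html',
    'http://www.btinternet.com/~c/z.html',
    'http://example.org/one.html',
]
hotspots = dead_link_hotspots(tagged_dead)
```

A single manual check of the top host can then green-light a mass replace over all of its tagged urls.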


 * So the bot does not currently browse the webpages itself? You manually specify which domains to work on, based on a human-made assumption that they are all dead? That sounds fine.
 * As a sidenote,  is usually a human-added tag, and humans don't double-check websites (say, if it was just downtime). In fact, a tag added by a bot, say , might even be more reliable, as the site was checked 2+ times over a period of time. At the same time, the bot could make a mistake because the remote website is broken, while a human can tag a site which appears live (200) to the bot. Just something to consider. —  HELL KNOWZ  ▎TALK

One issue is that the bot does not respect existing date formats. archivedate should be consistent, usually with accessdate and definitely if other archive dates are present. They are exempt from use dmy dates and use mdy dates, although date format is a contentious issue. Hellknowz
 * Ok. Guessing the date format was implemented but soon disabled because of mistakes in guessing when a page has many format examples to reuse. It seems that the bot will not create new citeweb templates, and the only point where it will need to know the date format is when adding archivedate to templates which already have accessdate as an example. Rotlink

Further, Wayback (or similar) is the preferred way of archiving bare external links; you should not just replace the original url without any extra info, as this creates extra issues: the url now permanently points to the archive instead of the original. You can create other service-specific templates for your needs, probably for Archive.is. Hellknowz


 * Currently, I try to preserve the original url within the archive url when the url is replaced (like in the diff you pointed to), and use the shortest form of the archive url when the archive url is added next to the dead url. Which way is better can be a discussion topic. Original urls were preserved inside Google Cache urls and it was very easy to recover them. Rotlink


 * But you are still obfuscating the original url, even if it can be parsed back by human or bot. I still argue a template made specifically for this issue should be used. — HELL KNOWZ  ▎TALK


 * I would say that prepending something like http://web.archive.org/web/20021113221928/ produces something less cryptic than two urls wrapped into a template. Also, Wayback renders extra information besides the title. This can break narrow columns in tables, etc. Rotlink


 * I admit this isn't something I have considered because I didn't work on anything other than external links inside reference tags, mainly because of all the silly cases like this, where beneficial changes actually break formatting. Yet another reason to use proper references/citations. But that isn't English wiki, and I cannot speak for them. Here we would convert all that into an actual references. I won't push for you to necessarily implement Wayback and such, but if this comes up later, I did warn you :) — HELL KNOWZ  ▎TALK

Here you change an external url to a reference, which runs afoul of WP:CITEVAR, and bots should never change styles (unless that's their task). I'm guessing this is because the url sometimes becomes mangled as per the previous paragraph? Hellknowz


 * It was a reference; it is inside  .
 * It may look inconsistent with many other bot edits.
 * The former is adding an archive url next to the existing dead url. It tries to preserve the original url, which means:
 * if the dead url is within something like citeweb, it adds archiveurl= to it (the shortest of the archive urls is used)
 * if the dead url stands alone, it forms a citeweb with archiveurl= (the shortest of the archive urls is used)
 * otherwise, it replaces the dead url with an archive url (using the form of archive url with the dead url inside it)
 * The latter just replaces one archive url (Google Cache) with another archive url. So it does not depend on the content. Rotlink
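The branching just described can be sketched as follows. This is hypothetical Python for illustration only; the real rules live in the bot's unpublished Scala code, and the snapshot timestamp is merely an example value:

```python
# Hypothetical snapshot prefix used for illustration only.
ARCHIVE_PREFIX = 'http://web.archive.org/web/20021113221928/'

def rewrite_dead_link(fragment, dead_url):
    """Mirror of the bullets above: inside a cite template the archive
    link is added as |archiveurl=; a bare dead url is replaced by an
    archive url that still contains the original url, so the original
    can be recovered later."""
    archive_url = ARCHIVE_PREFIX + dead_url
    if 'cite web' in fragment:
        return fragment.replace(dead_url, dead_url + ' |archiveurl=' + archive_url)
    return fragment.replace(dead_url, archive_url)

in_template = rewrite_dead_link(
    '{{cite web |url=http://example.org/x |title=X}}', 'http://example.org/x')
bare = rewrite_dead_link('[http://example.org/x X]', 'http://example.org/x')
```

Note that in both branches the original dead url survives verbatim, either as the cite parameter or embedded inside the archive url.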


 * Let me try to clarify. A reference is basically anything that points to an external source. The most common way is to use the tags, but it doesn't have to be. The most common citation syntax is the citation style 1, i.e. using cite web template family, but it doesn't have to be.
 * is a valid reference using manual citation style.
 * is a valid reference using CS1.
 * So if all the references in the article are manual (#1), but a bot adds a cite web template, that is modifying/changing the citation style. Even changing only bare external urls to citations is sometimes contentious, especially if a bot does that. This is what WP:CITEVAR, WP:CONTEXTBOT means. — HELL KNOWZ  ▎TALK
 * Is  interchangeable with ? They render identically. Rotlink
 * No, they cannot be interchanged in code if the article already uses one style or the other (unless you get consensus or have a good reason). Human editors have to follow WP:CITEVAR, let alone bots. You could, for example, convert them into CS1 citations if most other references are CS1 citations. I admit, I much prefer cite xxx templates and they make bot job easy, and I'd convert everything into these, especially for archive parameters. But we have no house style and the accepted style is whatever the article uses. That's why we even have Wayback and such. — HELL KNOWZ  ▎TALK

Other potential issues: a Wayback template already next to the citation, various possible locations of the dead link (inside or outside the ref), archive parameters already in citations or partially missing. Hellknowz

To clarify, does the bot use accessdate, then date, for deciding what date the page snapshot should come from? If no date is specified, does the bot actually parse the page history to find when the link was added and thus accessed? This is how previous bot(s) handled this. Unless consensus changes, we can't yet assume any date/copy will suffice. Hellknowz


 * I have experimented with parsing the page history and this tactic showed bad results. The article (or part of the article) can be translated from a regional wiki (or another site), and the url can already be dead by the time it appears in English Wikipedia. I implemented some heuristics to pick the right date, but only manual checking can be 100% accurate. Or we could consider it a good deal to fix a definitely dead link with a definitely live link which in some cases might be inaccurate (archived a bit before or after the time range when the url had the cited content), preserving the original dead link either within the url or as a template parameter. Rotlink


 * It's still more accurate than assuming the current date to be the one to use. What I mean is that any date before today where the citation already exists is closer to the original date, even if not the original. Translated/pasted text can probably be considered accessed on that day. Pasted text in the first revision can be skipped. I won't push this logic onto you, as I think the community is becoming much more lenient with archive retrieval due to the sheer number of dead links, but it's something to consider. — HELL KNOWZ  ▎TALK


 * The current heuristic is to pick the oldest snapshot of the exact url. By "exact" I mean string equality of the archived link and the dead link in the Wiki article (they are not always equal, because of the presence or absence of the www. prefix; see the urls in the Memento answer in the transcript below). Parsing the page history adds knowledge of the top bound ("the article existed at most at that date"). Knowledge of the top bound wouldn't help the heuristic which simply picks the oldest snapshot. Anyway, we need more ideas here. Perhaps all the snapshots between the oldest and the top bound have to be downloaded and analyzed (if they are similar, then the bot can pick any one; otherwise the decision must be made by a human)... Rotlink
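The oldest-exact-match heuristic can be sketched like this (illustrative Python, not the bot's code; `pick_snapshot` is a hypothetical helper assuming Wayback-style snapshot urls):

```python
def pick_snapshot(mementos, dead_url):
    """Among Wayback-style snapshot urls of the form
    'http://web.archive.org/web/<timestamp>/<original>', keep only
    those whose <original> part is string-equal to the dead link,
    then pick the oldest (14-digit timestamps sort lexicographically)."""
    prefix = 'http://web.archive.org/web/'
    matching = []
    for m in mementos:
        timestamp, _, original = m[len(prefix):].partition('/')
        if original == dead_url:
            matching.append((timestamp, m))
    return min(matching)[1] if matching else None

snapshots = [
    'http://web.archive.org/web/20030322113257/http://reuben.org/newengland/news.html',
    'http://web.archive.org/web/20050210175737/http://www.reuben.org/NewEngland/news.html',
    'http://web.archive.org/web/20021113221928/http://www.reuben.org/NewEngland/news.html',
]
oldest = pick_snapshot(snapshots, 'http://www.reuben.org/NewEngland/news.html')
```

The string-equality filter is what drops the non-www variants, exactly as described for the www. prefix case.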


 * You didn't hear me say this, but we probably don't need such precision. It would be an interesting exercise to compare old revisions. But I think we can just use the oldest date that was found in article history and that would work for 99%+ of cases (at least I did this and I have not received any false positive reports in the past). In fact, humans hardly ever bother to do this and previous bots weren't actually required to. I personally just ignored any cases where I couldn't find the date within a few months, but later bots have pretty much extended this period to any date. — HELL KNOWZ  ▎TALK

The bot does need to be exclusion compliant due to the nature of the task and the number of pages edited. You should also respect inuse templates, although that's secondary. Hellknowz

Can you please give more details on how Memento actually retrieves the archived copy? What guarantees are there that it is a match, what are their time ranges? I am going through their specs, but it is important that you yourself clarify enough detail for the BRFA, as we cannot easily approve a bot solely on third-party specs that may change. While technically an outside project, you are fully responsible for the correctness of the change. Hellknowz


 * Actually the bot does not depend on the services provided by the Memento project. Memento just defines a unified API to retrieve older versions of web pages. Many archives (and Wikis as well) support it. Without Memento, specific code needs to be written to work with each archive (that's what I will do anyway to support WebCite and others in the future).
 * The protocol is very simple and I think one picture could explain it much better than the long spec.
    C:\>curl -i "http://web.archive.org/web/timemap/http://www.reuben.org/NewEngland/news.html"
    HTTP/1.1 200 OK
    Server: Tengine/1.4.6
    Date: Mon, 19 Aug 2013 12:18:06 GMT
    Content-Type: application/link-format
    Transfer-Encoding: chunked
    Connection: keep-alive
    set-cookie: wayback_server=36; Domain=archive.org; Path=/; Expires=Wed, 18-Sep-13 12:18:06 GMT;
    X-Archive-Wayback-Perf: [IndexLoad: 140, IndexQueryTotal: 140,, RobotsFetchTotal: 0, , RobotsRedis: 0, RobotsTotal: 0, Total: 144, ]
    X-Archive-Playback: 0
    X-Page-Cache: MISS

    <http:///www.reuben.org/NewEngland/news.html>; rel="original",
    <http://web.archive.org/web/timemap/link/http:///www.reuben.org/NewEngland/news.html>; rel="self"; type="application/link-format"; from="Wed, 13 Nov 2002 22:19:28 GMT"; until="Thu, 10 Feb 2005 17:57:37 GMT",
    <http://web.archive.org/web/http:///www.reuben.org/NewEngland/news.html>; rel="timegate",
    <http://web.archive.org/web/20021113221928/http://www.reuben.org/NewEngland/news.html>; rel="first memento"; datetime="Wed, 13 Nov 2002 22:19:28 GMT",
    <http://web.archive.org/web/20021212233113/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Thu, 12 Dec 2002 23:31:13 GMT",
    <http://web.archive.org/web/20030130034640/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Thu, 30 Jan 2003 03:46:40 GMT",
    <http://web.archive.org/web/20030322113257/http://reuben.org/newengland/news.html>; rel="memento"; datetime="Sat, 22 Mar 2003 11:32:57 GMT",
    <http://web.archive.org/web/20030325210902/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Tue, 25 Mar 2003 21:09:02 GMT",
    <http://web.archive.org/web/20030903030855/http://reuben.org/newengland/news.html>; rel="memento"; datetime="Wed, 03 Sep 2003 03:08:55 GMT",
    <http://web.archive.org/web/20040107081335/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Wed, 07 Jan 2004 08:13:35 GMT",
    <http://web.archive.org/web/20040319134618/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Fri, 19 Mar 2004 13:46:18 GMT",
    <http://web.archive.org/web/20040704184155/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sun, 04 Jul 2004 18:41:55 GMT",
    <http://web.archive.org/web/20040904163424/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sat, 04 Sep 2004 16:34:24 GMT",
    <http://web.archive.org/web/20041027085716/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Wed, 27 Oct 2004 08:57:16 GMT",
    <http://web.archive.org/web/20050116115009/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sun, 16 Jan 2005 11:50:09 GMT",
    <http://web.archive.org/web/20050210175737/http://www.reuben.org/NewEngland/news.html>; rel="last memento"; datetime="Thu, 10 Feb 2005 17:57:37 GMT"
 * Rotlink
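The application/link-format body shown above can be parsed in a few lines. This is an illustrative Python sketch; `parse_timemap` is a hypothetical minimal parser, not part of any cited library:

```python
import re

def parse_timemap(body):
    """Minimal parser for an application/link-format TimeMap body:
    returns (url, attributes) pairs, e.g. attributes['rel'] and
    attributes['datetime'] for each memento entry."""
    entries = []
    # Entries are comma-separated; commas inside quoted dates are not
    # followed by '<', so split only before a '<'.
    for link in re.split(r',\s*(?=<)', body.strip()):
        m = re.match(r'<([^>]+)>((?:;\s*[\w-]+="[^"]*")*)', link)
        if m:
            attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', m.group(2)))
            entries.append((m.group(1), attrs))
    return entries

sample = (
    '<http://www.reuben.org/NewEngland/news.html>; rel="original",\n'
    '<http://web.archive.org/web/20021113221928/http://www.reuben.org/NewEngland/news.html>; '
    'rel="first memento"; datetime="Wed, 13 Nov 2002 22:19:28 GMT"'
)
entries = parse_timemap(sample)
```

From the parsed entries a bot can read the "first memento"/"last memento" relations directly instead of scraping archive HTML.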


 * This is very cool, I might consider migrating my bot to this service, as manual parsing is... well, let's just say none of the archiving bots are running anymore. — HELL KNOWZ  ▎TALK 13:49, 19 August 2013 (UTC)

Finally, what about replacing google cache with archive links? Do you intend to submit another BRFA for this? — HELL KNOWZ  ▎TALK 09:34, 19 August 2013 (UTC)


 * Hi.
 * Thank you for such a detailed comment!
 * Some of the questions I am able to answer immediately, but for others I need some time to think, and I will answer later (in a few hours or days), so they are skipped for a while. Rotlink


 * I moved your comment inline with mine, so that I can further reply without copying everything. I hope you don't mind. — HELL KNOWZ  ▎TALK 13:18, 19 August 2013 (UTC)


 * About "another BRFA for google cache". I united this question with another one and answered both together. But after the refactoring of the discussion tree it appears again (you can search for "another BRFA" on the page). Rotlink (talk) 16:34, 19 August 2013 (UTC)
 * Oh yeah, my bad. I also thought it was a wholly separate task, whereas you are both replacing those urls the same way as the others and replacing IPs with a domain name. The latter is what I am asking further clarification on, especially since it can produce dead links, which should at least be marked as such. Of course, that implies being able to detect dead urls or reliably assume them to be dead. — HELL KNOWZ  ▎TALK 16:46, 19 August 2013 (UTC)
 * The issue of replacing dead links with other dead links by replacing IPs with a domain name (webcache.googleusercontent.com) was fixed about a week ago. New links are checked to be live. If they are live, the IP is replaced with webcache.googleusercontent.com. Otherwise the search on the archives begins. If nothing is found, links with the dead IP remain.
 * So, these had been two tasks. Then they were joined together so that the bot does not save the intermediate state of pages (which contains a dead link) and does not confuse the people who see it. Rotlink (talk) 17:04, 19 August 2013 (UTC)
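The check-then-replace flow just described can be sketched as follows (hypothetical Python, not the bot's code; `is_live` and `find_archive` are stand-ins for the bot's HTTP liveness check and archive search):

```python
from urllib.parse import urlsplit, urlunsplit

def fix_google_cache_link(ip_url, is_live, find_archive):
    """Mirror of the combined task above: rewrite an IP-based Google
    Cache url to the webcache.googleusercontent.com host only if the
    new url is live; otherwise search the archives; if nothing is
    found, leave the link untouched (never saving a fresh dead link)."""
    parts = urlsplit(ip_url)
    candidate = urlunsplit(parts._replace(netloc='webcache.googleusercontent.com'))
    if is_live(candidate):
        return candidate
    return find_archive(ip_url) or ip_url

# With a stubbed liveness check, the IP host is rewritten:
fixed = fix_google_cache_link(
    'http://72.14.235.104/search?q=cache:abc',
    is_live=lambda u: True,
    find_archive=lambda u: None,
)
```

Because the fallback returns the original url, no intermediate page state with a newly dead link is ever saved, which is the point made above.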


 * Because the bot operator has removed this request on the main page, I am hereby marking this as — cyberpower ChatOnline 19:11, 20 August 2013 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.