Wikipedia:Requests for comment/Archived citations 2

In response to a previous RfC, which was met with positive feedback but a number of requests to start with a small trial, this is a proposal to address the linkrot problem for external links, starting with a trial of an existing solution on a single category.

Proposal:
 * 1) to modify MediaWiki:Common.js for a group of testers so that they will see additional links next to all external links, pointing to archived versions of those webpages.
 * 2) to test the implementation currently used on the French Wikipedia, starting with a single article (Lady Antebellum) and a single category (Category:Musical groups established in 2006), asking Wikiwix to store cached versions of all external links in that category.
 * 3) to revisit the original proposal (to do something similar for all visitors and all articles) after a month, with this additional information in hand.

Summary
Link rot is a major problem on Wikipedia. The French Wikipedia found a solution using cached webpages at Wikiwix.com and implemented it over 2 years ago. A task force on English Wikipedia is proposing the same solution. A board member from the Wikimedia Foundation has reviewed the related discussion and is supportive.


 * There is an on-going discussion of additional backup solutions to avoid having a single point of failure.
 * This change would affect everyone visiting the English Wikipedia with a browser that has javascript enabled, not just registered users.

This proposal suggests implementing this, for all readers, on a single article. That will inform further discussion about how to proceed with a larger-scale test.

Details
Wikipedia relies on verifiable information from reliable sources to ensure that the information it provides is accurate. Wikipedia uses external links to webpages to verify information in articles. There are currently 17.5 million external links used on the English Wikipedia (see list 844 MB download) with a few thousand links added each day. These links often go dead, which is referred to as WP:LINKROT. The number of articles marked with dead links has been increasing and is now over 100,000 (see graph).

A task force was formed to deal with this issue and found that the French Wikipedia implemented a solution in October 2008 (see translated proposal). The solution involves modifying the javascript code that is common to everybody who visits Wikipedia. The modified code adds a blue "[archive]" link to cached versions of webpages from the search engine Wikiwix.com (see translated example; note, however, that in the original proposal the links were green).

These "[archive]" links would be added next to external links for visitors that have javascript enabled.
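The behaviour described above can be sketched roughly as follows. This is a minimal illustration only, not the code in Pmartin's cache.js: the archive address `ARCHIVE_BASE`, the class name `archive-link` and the function names are hypothetical, and the sketch assumes that external links can be selected as `a.external`.

```javascript
// Hypothetical archive endpoint; the real Wikiwix URL scheme is not given in this proposal.
const ARCHIVE_BASE = "https://archive.example.org/cache/?url=";

// Build the address of the cached copy for one external link.
function archiveUrlFor(originalUrl, base = ARCHIVE_BASE) {
  return base + encodeURIComponent(originalUrl);
}

// Insert an "[archive]" link directly after every external link in `doc`.
// Returns the number of archive links that were added.
function addArchiveLinks(doc, base = ARCHIVE_BASE) {
  let added = 0;
  for (const link of doc.querySelectorAll("a.external")) {
    // Skip links that already have an archive link, so a second run adds nothing.
    if (link.nextSibling && link.nextSibling.className === "archive-link") {
      continue;
    }
    const archive = doc.createElement("a");
    archive.href = archiveUrlFor(link.href, base);
    archive.className = "archive-link"; // assumed class name, used only for the check above
    archive.textContent = "[archive]";
    link.parentNode.insertBefore(archive, link.nextSibling);
    added += 1;
  }
  return added;
}

// In a browser this would be called once the page has loaded:
// addArchiveLinks(document);
```

Because the links are added by script after the page loads, readers without javascript would simply see the page unchanged.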

Proposal

 * 1) Have wikiwix add cached pages for external links for the category (Category:Musical groups established in 2006).
 * 2) Ask wikiwix to start following RecentChanges for that category so that new external links are cached instantly
 * 3) Modify page-specific javascript for a test article in that category (Lady_Antebellum) to include fr:Utilisateur:Pmartin/cache.js, so that all visitors see links to cached versions of external links on that page.
 * 4) Invite interested testers to add this javascript to their own personal js, by hand or via a widget, so that they can try it out on articles in that category.
 * 5) Revisit this proposal after a trial period of one month for further discussion, to consider whether it should be expanded.
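
For testers following step 4, one way to restrict the script to the trial scope is sketched below. It is an assumption-laden example rather than an agreed implementation: the helper `inTrialScope` and the two constants are hypothetical names, and the sketch assumes the page name and category list are available through `mw.config` as `wgPageName` and `wgCategories`.

```javascript
// Trial scope as named in this proposal.
const TRIAL_ARTICLE = "Lady_Antebellum";
const TRIAL_CATEGORY = "Musical groups established in 2006";

// True when the current page is the test article or belongs to the test category.
function inTrialScope(pageName, categories) {
  return pageName === TRIAL_ARTICLE || categories.includes(TRIAL_CATEGORY);
}

// In a personal vector.js: only load the archive script inside the trial scope.
// The guard keeps this block harmless when run outside of MediaWiki.
if (typeof mw !== "undefined" && typeof importScript === "function") {
  const pageName = mw.config.get("wgPageName");
  const categories = mw.config.get("wgCategories") || [];
  if (inTrialScope(pageName, categories)) {
    importScript("User:Pmartin/cache.js");
  }
}
```

Gating the import this way avoids showing archive links on articles whose external links have not been cached yet.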

Testing
This solution has been in use at fr.wikipedia for over 2 years and has been thoroughly tested.

Example: the 4th reference in Lady_Antebellum is http://www.nbcaugusta.com/news/entertainment/18338164.html/ The Wikiwix archive link would look like this:
 * [ archive]

To test the javascript yourself, add the following line to your vector.js:
 importScript("User:Pmartin/cache.js")
At the moment, this code can only show what the links will look like; the actual Wikiwix archive is not yet enabled.

Should we run this test?
Implementing this feature requires community agreement as it affects everybody that visits articles in the test category on English Wikipedia using a browser with javascript enabled. Please use the discussion area below for extended comments.

Support this test

 * 1) A test is needed to move forward. Dodoïste (talk) 22:05, 22 March 2011 (UTC)
 * 2) I support this test, we've had a lot of RFC's on this issue without action and this is a sensible and practical way forward. A test is just a test, the pre-conditions don't need to be perfect, because indeed we're trying to discover any problems, but this is an important step to fighting the arch-annoyance of linkrot Jztinfinity (talk) 04:26, 23 March 2011 (UTC)
 * 3) A test should be made. – Allen4names 05:00, 23 March 2011 (UTC)
 * 4) Doing a test on a single article seems limited to the point of uselessness. But if it will lead to a wider usage I support it. Mr.Z-man 06:42, 26 March 2011 (UTC)
 * 5) Support a test.  Hopefully this can move quickly from a single article to one of its encompassing categories, and then can be proposed for the whole site. –SJ +  07:02, 19 April 2011 (UTC)
 * 6) Now that my concerns have been addressed, I think a trial would be helpful; however, I would like some sort of gadget/script to disable this feature if desired. / ƒETCH COMMS  /  16:33, 19 April 2011 (UTC)
 * 7) Support this is worth trying out. Graeme Bartlett (talk) 08:22, 8 May 2011 (UTC)
 * 8) Support - Linkrot is a major issue and we have people wanting to help us fix it, so I support this test. - Hydroxonium (T•C• V ) 15:54, 8 May 2011 (UTC)
 * 9) Support this is a big problem, wikiwix may be the solution. --Cerebellum (talk) 23:46, 8 May 2011 (UTC)

Oppose this test
(Now supporting.) I accessed the example link above to the NBC Augusta site, and the Wikiwix header came up with "Cette page est la version en cache de cette URL. C'est une image de la page telle qu'elle était au moment de l'insertion du lien sur wikipedia . Vous pouvez obtenir plus d'informations et nous donner votre avis sur le cache sur cette page. | Si vous souhaitez bloquer la mise en cache de ce site, merci d'utiliser le formulaire de blocage de site." Now, I know French, so I understand what this is saying. But I couldn't figure out how to turn that to English (while the regular Wikiwix homepage shows up in English), and I oppose this unless/until there is a full English interface to display to readers who do not understand French. This is, after all, the English Wikipedia. / ƒETCH COMMS  /  21:26, 22 March 2011 (UTC)
 * Sure. The Wikiwix archive deployed at en.wikipedia will have an English interface. We'll do the translation, and refine it according to the community's desires during the first trial. That's considered to be normal, and developers at LinterWeb (which produces Wikiwix) have already made several interfaces in English. Here is a quick translation: "This page is the cached version of this URL. It is a copy of the page as it was when the URL was first inserted in Wikipedia. For more information, or to provide feedback, please go to this page. | If you wish to block caching of this site, please follow the corresponding procedure." Cheers, Dodoïste (talk) 21:42, 22 March 2011 (UTC)
 * OK, but how long do you think it will take for this to be up? I'd think it more reasonable if the community could examine an English-only interface first, and then proceed with the trial, or at least have a set date for the English version ready. Do you know what the current relationship between LinterWeb and the WMF is? (I mean, who is supposed to contact them?) Regards, / ƒETCH COMMS  /  23:46, 22 March 2011 (UTC)
 * J'ai demandé à mon équipe de traduire la barre du haut en anglais, elle devrait passer en anglais au plus tard Lundi. Cordialement Pascal Pmartin (talk) 18:23, 24 March 2011 (UTC)
 * Translation by Dodoïste : "I've just asked my employees to translate the interface into English, which should be completed on this Monday (March 28). Yours, Pascal Pmartin (talk)".
 * Done Pmartin (talk) 09:20, 28 March 2011 (UTC)
 * Sorry for responding so late; I forgot to check back here—it looks good and I'm now supporting the trial. / ƒETCH COMMS  /  16:33, 19 April 2011 (UTC)

Discussion

 * find a way to test this using Cite web or similar templates.  test this as a gadget with the default set to 'off'.  (Allen4names)
 * confusion about an incident with Twitter indexing which made it appear as though Wikiwix was running ads on its caches (it does not).
 * "unopposed to this being a gadget"


 * Isn't the issue that we need to archive a page before it's dead? While it's great to link to archives on dead pages, despite the best efforts of the archival services not all pages end up archived. However, by triggering an API of an archival service you can force the page to be archived, just in case. Jztinfinity (talk) 21:27, 23 February 2011 (UTC)


 * use multiple archival sites, like webcitation.org


 * I proposed modifying Citation/core in this discussion, which would update the major templates like cite news and cite web. But there has been some opposition to that suggestion. (Hydroxonium)


 * The code at User:Pmartin/cache.js needs to be better commented so that people can easily understand what it's doing. Kaldari (talk) 21:26, 22 March 2011 (UTC)

Comments on this second RfC
We would like to make some progress and feel a small test would be beneficial. Thanks. -  Hydroxonium  ( H3O+ ) 16:20, 26 February 2011 (UTC)

This is a draft RfC to fix the dead link problem. Everybody is encouraged to edit it. Please refer to the previous RfC and its talk page for related information. Thanks. - Hydroxonium (T•C• V ) 10:23, 16 March 2011 (UTC)

Well, regarding the last RfC, we should try to come to a real, implementable solution this time, as WP:LINKROT is not just one of some minor problems of Wikipedia. Also, how are people generally made aware of an RfC? Since what we are trying to reach a consensus on is something that will most likely affect Wikipedia as a whole, I think we should draw as much attention to this RfC as possible. How are people made aware of currently running RfCs? Toshio Yamaguchi (talk) 10:49, 16 March 2011 (UTC)
 * For RfC's in general, people look at the RfC topics they are interested in. For the last RfC I listed it at the Village pump and on Centralized discussion, but it obviously didn't get much attention. When this RfC is ready I will make a watchlist notice so that people will see it every time they look at their watchlist. - Hydroxonium (T•C• V ) 11:04, 16 March 2011 (UTC)

Try a comment-type RfC?
I'm thinking that we may want to start this off as a comment RfC to define what people are willing to accept for a solution. Then moving to a yes/no type RfC after we have defined what we want to do. Any thoughts? - Hydroxonium (T•C• V ) 11:09, 16 March 2011 (UTC)
 * Sounds good to me. I think this time we should try to focus the discussion on one single place. One problem with the last discussion was that it became dispersed over several places and I personally began to lose the overview on the overall progress of the whole thing. The whole discussion needs to be carried out in one place (on one page) only. Otherwise it will end the same way as the last discussion. Toshio Yamaguchi (talk) 11:31, 16 March 2011 (UTC)
 * I think this RfC should focus on starting a test. We can have a separate comment-style discussion about how to implement a full-scale solution.
 * There was general agreement in the earlier discussion that we could run a test -- either as a gadget or otherwise. So I'd like to see us decide on a specific test and implement it, while proceeding with a parallel discussion about how a full-site implementation could happen.  The results of the test, and input from people who take the time to be testers, can help inform the longer discussion.  I have proposed a slightly modified version of the first proposal, which would try this out on a single article and then a single category.  –SJ +  20:46, 22 March 2011 (UTC)

Comments by UncleDouggie
I'm a bit confused about the proper place to comment on this draft RfC, so I'll just put my thoughts here. —UncleDouggie (talk) 04:37, 24 March 2011 (UTC)
 * 1) The landing page on wikiwix should be translated to English before the trial begins. This isn't necessary to approve the RfC, just a condition of starting the actual trial.
 * 2) The RfC should be clarified to say that MediaWiki:Common.js will be updated to call User:Pmartin/cache.js only when viewing Lady_Antebellum. In addition, users may import the script on their own for other pages in Category:Musical groups established in 2006 to get links for those articles. Users will need to conditionally import the script to avoid getting non-functional archive links on other articles.
 * 3) The reference to fr:Utilisateur:Pmartin/cache.js should be changed to User:Pmartin/cache.js because fr:Utilisateur:Pmartin/cache.js does not function on en.wikipedia.
 * 4) User:Pmartin/cache.js used deprecated code, including a call to getElementsByClassName. Also, the getTextContent function should be replaced with the appropriate jQuery selector and comments should be added. The rewrite should be worked while the RfC is running and be ready before the trial begins, just as for the English translation. We may want to also move the script to the MediaWiki namespace.
 * getElementsByClassName as a global function is deprecated. The script, however, uses document.getElementsByClassName, which is a (relatively new) feature of core JS and unaffected by MediaWiki changes. I think the only usage of deprecated code is the use of wgNamespaceNumber. Mr.Z-man 06:45, 26 March 2011 (UTC)

Comments by Toshio Yamaguchi
It's not at all clear when exactly consensus was reached to use this Wikiwix solution to solve Wikipedia's dead link problems. I still think Wikipedia should use WebCite as one possible solution. Regarding single-point-of-failure concerns, backup possibilities should still be explored. However, one working solution is better than having nothing. Instead of simply saying WebCite is the solution to the Linkrot problem, I will provide some arguments why I think it is. Toshio Yamaguchi (talk) 14:43, 11 April 2011 (UTC)
 * I have used WebCite in my work here as a Wikipedia editor quite a few times now, and my experiences were all positive
 * Using WebCite does not require any changes in the MediaWiki software and any editor can use it, since it is very easy to handle
 * WebCite is capable of archiving most content on the web without problems (I have already archived a variety of PDFs and WebCite doesn't seem to have any problems with reproducing the cached file)
 * We already have a bot for archiving sources with WebCite, which (for whatever reasons) however doesn't seem to be active

What will a trial achieve?
I still don't understand what the purpose of a limited trial is in this case. The worst case scenario seems to be that it doesn't work for some reason and we're back to where we are now. Given the lack of any major potential downsides on our end, what do we have to lose by just implementing it? Further, it's not clear what this trial is actually evaluating. Are we just verifying that the system works? Surely we already know that by the usage on the French Wikipedia. Are we testing to see if it actually saves any links from dying? What if no links in that category actually die during the trial? Is it a failure? Trials seem to be a popular way to get new things here, but if we don't actually care what the results are, there's no point in a trial. Mr.Z-man 18:11, 8 May 2011 (UTC)
 * I see this trial as a simple way to reassure users that are not familiar with Wikiwix. And also as a way to gain interest and input from users. We've only been talking until now, a trial is concrete and may draw attention from users. Dodoïste (talk) 20:04, 8 May 2011 (UTC)
 * I agree with Mr.Z-man. I am now viewing this whole thing as a type of software upgrade. By that logic, I think we should put up a notice that we will be running a site wide test in a week or whatever. The test would be for all users and all articles. Then run the test for 24 hours to look for technical problems and ask for feedback regarding technical problems. My opinion is that we completely disregard issues with uncached pages, problems of look and feel, or anything else and just focus on technical issues. If there are no technical issues, we can have PMartin run through and archive the pages and then just implement this thing. After that, we can deal with non-technical issues. - Hydroxonium (T•C• V ) 11:48, 9 May 2011 (UTC)
 * If we do this, we should be very careful not to reproduce the problems of the Pending changes trial (just see the confusion in this discussion). It must be clear from the beginning
 * What is the goal of this trial?
 * When will the trial start and when will it end? (exact dates should be defined)
 * A clear consensus must be reached prior to the start of this trial supporting the trial.
 * Where to comment on problems and experiences encountered during the trial? (one, easy to find place for discussion)
 * That the trial will be run only in a limited environment and not on Wikipedia as a whole and it must be clear to everybody, which pages are within the scope of this trial
 * That the trial will be stopped after the trial phase has ended, regardless of whether the trial yielded the expected results or not
 * That a separate discussion will take place to reach consensus over whether to implement the changes at all, taking into account the experiences from the trial
 * That consensus to run this trial is not consensus in favor or against the system the trial will test
 * That those who have the ability to implement and run this do not abuse this to force this system into full service by simply continuing the trial indefinitely and that those who do or try to do will face consequences
 * If this is run in accordance with all the above points, then I would support such a trial. Although the harm a continued trial would cause might be minimal (if present at all), the harm this can cause to the community if not handled correctly and with great care could be more serious. Toshio Yamaguchi (talk) 12:50, 9 May 2011 (UTC)
 * These all seem like good points, Toshio. We can have a simple timeline:  Discussion until the end of this month.  A js update affecting readers of a specific small category for a week.  Then a broader RfC about implementing this sitewide, to run for a month.  At the end of the RfC if the result is a move not to implement it, the change would be rolled back for the trial category.
 * We should have a single talk page, and a page to list specific, addressable concerns.  The talk page can include a lot of nonspecific discussion, but the specific concerns should all be addressed by the proposers.  A trial seems useful here to make it clear what the result will look like, to sort out things like interface-language and landing-page in practice, and to get feedback from editors and readers of the affected pages who wouldn't otherwise follow this discussion.  – SJ +  23:03, 24 May 2011 (UTC)
 * Ok, so can somebody create that page? This page should address all the points I mentioned, so everyone sees exactly what we are doing and in which phase of implementation of this system we are, which pages are within the scope of the trial etc. Toshio Yamaguchi (talk) 10:01, 26 May 2011 (UTC)


 * In that case, that's even more reason to just do it, rather than have a trial. If we just implement it, the worst case scenario is that it doesn't work and we revert it, which is as easy as enabling it. With a trial, apparently even if it works, if we make some bureaucratic misstep the community might feel lied to and oppose it anyway. Given those options with their respective potential downsides, it seems pretty clear what the best option is. Mr.Z-man 22:16, 12 May 2011 (UTC)

A way ahead that doesn't need extensive debate
It seems to me that there are several distinct things that can and should function independently before we worry too much about trials of the overall system.

1. Collect, track, sort, and categorize the URLs used in articles in such a way as to know which ones need (and do not already have extant) archived versions. This doesn't affect anything in articlespace directly, and is fully within the licenses in use. Any system that does this effectively and efficiently should be welcomed with open arms. At most there may be some disagreement over which URLs to prioritize. I see zero potential for harm to the project here.

2. Request that archive site(s) add the targets of these URLs. Again, fully permitted by existing licenses. The more sites the merrier, but the sooner we get to have at least one for each cited URL, the better off we'll be. Again, this doesn't directly affect articlespace and should meet no objections except perhaps from overloading the archive sites. Throttling as necessary should solve that. This step requires someone to provide a liaison with the archive administrators.

3. Track the progress of the requested archiving and list those completed or declined. If declined, request again on another archive site with different criteria. Still no effect in article space and no reason for objections.

4. Collect, track, sort, and categorize the URLs which do have archived targets, including those from step 3. Determine which articles cite each of them and yet do not have archive-links. This effectively forms a to-do list for articlespace edits, but still does not directly change the articles. Note that once an URL is on this list/database we need have no fear of a deadlink because we know it can be repaired, yet we still have not changed articlespace.

5. Revise code to take an URL and a deaddate parameter. This way article editors (even bots) can indicate if there's a present need to add the archiveurl. Alternatively, a spider bot could periodically crawl the URLs from step 4 to find which ones have gone dead and stayed dead. In the latter case, we still haven't changed articlespace, while in the former the change is restricted to the wikitext, not the rendered text.

6. Revise to return something like "Dead link, archived at [archiveurl] on [archivedate]". Finally, there's an impact on the rendered text, but so long as it is correct there should still be no objections. It is purely an improvement over having dead links in the articles without archiveurls.

7. Article editors may choose to rework (or not) the citations to incorporate these archiveurl and archivedate results into the form of the existing citations in articles.

Does anyone see any glaring omissions in the above? LeadSongDog come howl!  18:00, 24 May 2011 (UTC)
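
The bookkeeping in steps 1 to 4 above can be illustrated with a small sketch. It is only one possible reading of those steps: the function name `planArchiving`, the record fields `article`, `url` and `hasArchiveLink`, and the idea of passing the known archived URLs as a set are all assumptions made for the example.

```javascript
// citations: array of { article, url, hasArchiveLink } records (assumed shape).
// archivedUrls: Set of URLs already known to have an archived copy.
function planArchiving(citations, archivedUrls) {
  const toRequest = new Set(); // steps 1-2: URLs that still need an archive request
  const toLink = new Map();    // step 4: archived URL -> articles citing it without an archive link

  for (const { article, url, hasArchiveLink } of citations) {
    if (!archivedUrls.has(url)) {
      toRequest.add(url);
    } else if (!hasArchiveLink) {
      if (!toLink.has(url)) {
        toLink.set(url, []);
      }
      toLink.get(url).push(article);
    }
  }
  return { toRequest: [...toRequest], toLink };
}

// Example input: one archived URL cited twice, one unarchived URL cited twice.
const samplePlan = planArchiving(
  [
    { article: "A", url: "http://example.com/1", hasArchiveLink: false },
    { article: "B", url: "http://example.com/1", hasArchiveLink: true },
    { article: "C", url: "http://example.com/2", hasArchiveLink: false },
    { article: "D", url: "http://example.com/2", hasArchiveLink: false },
  ],
  new Set(["http://example.com/1"])
);
```

Neither list changes anything in articlespace; they only record what would need to be requested from an archive and which articles could later gain an archive link.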


 * Comment to point 1: You say:
 * "I see zero potential for harm to the project here."
 * In my opinion there is already extensive potential for harm to the project even at this step. I will show you why. You say:
 * "Collect, track, sort, and categorize the URLs used in articles in such a way as to know which ones need (and do not already have extant) archived versions."
 * From this it is not even clear to me what form this Collect, track, sort, and categorize the URLs would take in practice. Simply running this without knowing what we are actually doing is very dangerous in my opinion and goes against the whole spirit of consensus.


 * As a sidenote: I have already a concern with the heading you chose for this: A way ahead that doesn't need extensive debate. How do you know it doesn't need extensive debate? Currently the above points seem to be merely your opinion on what we should do. Where is the discussion that led to a consensus on what actually to do? Please don't misunderstand me, it is not my intention to hail others with criticism. I only want to ensure that this discussion actually leads somewhere. And the discussion of the Pending changes trial has shown that such a discussion leads nowhere, if it is not clear what we are actually talking about. Toshio Yamaguchi (talk) 19:36, 24 May 2011 (UTC)

Comment by OverlordQ
Isn't this duplicating existing effort? Why not work with them? Q T C 16:01, 3 June 2011 (UTC)


 * Never mind, reading through there and looking above I see similar names. Q  T C 16:03, 3 June 2011 (UTC)

wikiwix action
We started yesterday caching the content of the English Wikipedia external links, so that, while waiting for the decision to be taken, all information sourcing work could be backed up.

Therefore, since yesterday, we're archiving all new links introduced in Wikipedia. For those introduced before yesterday, we'll try to archive them as well, and for those that return an error 404, we'll try to get them from webarchive.org.

Cheers, Pmartin (talk) 21:08, 6 June 2011 (UTC)
 * We have finished the work of storing all links in namespace 0, so we are ready for all external links. --Pmartin (talk) 12:27, 28 March 2012 (UTC)