Talk:Wayback Machine

Wiki Education Foundation-supported course assignment
This article was the subject of a Wiki Education Foundation-supported course assignment, between 28 August 2018 and 11 December 2018. Further details are available on the course page. Student editor(s): Chr09.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 12:46, 17 January 2022 (UTC)

Untitled

 * See Using the Wayback Machine for information on using the Wayback Machine with Wikipedia.

Banned in Russia
Not sure if this is newsworthy enough for the main paragraph, or needs a new section, or nothing at all. Maybe sites are banned and unbanned all the time.

http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html — Preceding unsigned comment added by RonPaul573e (talk • contribs) 08:50, 27 October 2014 (UTC)

Still reading?
Is it still reading pages? Seems not. 82.163.24.100 (talk) 14:40, 8 June 2010 (UTC)
 * Did you see the part of the article that reads: "Snapshots become available 6 to 18 months after they are archived." ? -- Quiddity (talk) 18:11, 8 June 2010 (UTC)

Where in Europe?
Later in the article it talks about how copyright law in 'Europe' could cause certain effects, but it doesn't mention where in Europe! The Continent? If so, where on the continent? Is it the UK? There is no single copyright law within the region... Just curious!

Presumably this refers to the European Union (not all the countries of the European Peninsula/so-called continent), which has a very important governing role. --Eleanor1944 (talk) 02:55, 11 February 2013 (UTC)

Wayback Machine is Amazingly Slow
What surprises me time and time again is how incredibly slow the WayBackMachine is. Check Google for 'waybackmachine slow' and you'll see other people agree; even called "notoriously slow" by some folks. I wonder if there's a reliable source somewhere so we could mention the service's speed in the article. --82.171.70.54 (talk) 06:07, 19 June 2010 (UTC)


 * I believe the Wayback Machine compresses everything because there is too much information for just their servers (we are talking about the entire, or most of, the World Wide Web!), so it takes a long time to decompress all of the related files. - 45.36.173.204 "wellsilver"


 * Although your comment dates back to 2010, it may still be worthwhile to read IA's Jason Scott's explanation on the performance of the Wayback Machine: MichielN (talk) 13:47, 19 May 2023 (UTC)

Still collecting pages?
I was able to see the www.defenselink.mil page from October 22, 2009 http://replay.waybackmachine.org/20091022164418/http://www.defenselink.mil/

171.64.66.13 (talk) 15:31, 9 September 2010 (UTC)

All of Wayback Machine's archived links are shut down!
Why aren't all those archived links in the Wayback Machine working anymore?! Can't someone please fix the Wayback Machine?! --Angeldeb82 (talk) 20:30, 26 January 2012 (UTC)
 * Would you kindly explain to those of us who are not familiar with the term, what are "archived links"? Thanks in advance Ottawahitech (talk) 15:55, 3 March 2012 (UTC)


 * I took "archived links" to mean links to its old, archived pages, its main function. As of June 30 it's still down. ERR: "The New Wayback Machine is having problems. Please try again later." Seeking help in forums etc., I could find no activity in recent months. I hope this historical treasure comes back, as I see evidence that Winston Smith's memory hole is gaining power, and coincidentally the historical treasure of Google's Usenet archive no longer seems cut in stone. --68.127.94.194 (talk) 17:53, 30 June 2012 (UTC) Doug Bashford


 * UPDATE my above: I've since used it, it's seemingly working fine. --68.127.90.135 (talk) 16:15, 27 July 2012 (UTC) Doug Bashford

Netbula v. Chordiant Software ? ...Jargon?
That section makes no sense. The first paragraph, I assume accurate, is meaningless. Probable jargon and/or insider-know presumptions. Suggest repair or deletion. --68.127.94.194 (talk) 16:56, 30 June 2012 (UTC)Doug Bashford

Not reliable anymore
A matter of location of the IP? — Preceding unsigned comment added by 201.10.57.86 (talk) 02:05, 6 September 2012 (UTC)

Reliability in retrieving archived material
It would probably be miraculous if the WM could archive everything on the internet, but as an experienced user I know only too well that pages and images are often unavailable not because of robots.txt or legal reasons, but simply because WM failed to retrieve them properly. There is absolutely no mention of this in the article and there should be. Lee M (talk) 02:42, 1 July 2013 (UTC)

I agree. It only archives 10% to 40% of a site's pages, especially if the site has more than 500 pages, not to mention sites with millions of pages/links, where it stores about 10% at most. --Salem F (talk) 01:12, 7 December 2015 (UTC)

Not well
Section Search engine links:


 * ... began to provide links to other versions of pages archived on the Wayback Machine.

What does that even mean? That they use the Wayback Machine as a caching service? That it is possible to see not only the latest version of a page, but older versions as well? Whatever it is, it ought to be described.

--Mortense (talk) 14:18, 15 February 2014 (UTC)

December 2014
[http://www.theregister.co.uk/2014/12/12/san_franciscos_storm_snafus_show_why_its_losing_its_place_as_a_top_tech_town/ This week it rained in San Francisco and the power immediately blew out. Your tech utopia • The Register]

Internet Archive: The big storm in SF has knocked out power to our main data center, so the site will be down for a while. We'll keep you posted here! 7:59 AM - 11 Dec 2014

unintelligible sentence
Under the heading "Origin, growth and storage", this rather odd sentence appears: "This became a threat of abuse the service for hosting malicious binaries." Can anyone make sense of this? It would seem to be missing a few words. Bricology (talk) 06:40, 23 March 2015 (UTC)


 * ✅. I checked all three references the paragraph cites. I changed the sentence to, "This became a threat of abuse by the service for hosting malicious binaries." The sources support the assertion that potentially malicious executables and PDFs are currently archived at the site. --Aladdin Sane (talk) 19:06, 25 March 2015 (UTC)

Storage capacity
At present this section is mainly a list of historical capacities. Can anyone add anything about the growth rate and future ability to store information? It would also be good to include information in the section on resilience, i.e. security of the data stored. LookingGlass (talk) 10:14, 12 September 2015 (UTC)

External links modified
Hello fellow Wikipedians,

I have just added archive links to one external link on Wayback Machine. Please take a moment to review my edit. You may add after the link to keep me from modifying it, if I keep adding bad data, but formatting bugs should be reported instead. Alternatively, you can add to keep me off the page altogether, but should be used as a last resort. I made the following changes:
 * Attempted to fix sourcing for https://archive.org/

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at ).

Cheers.—cyberbot II  Talk to my owner :Online 13:36, 31 March 2016 (UTC)

Stanford version of Wayback Machine
I was just wondering if the Stanford version of the Wayback Machine is in any way related to the Internet Archive's Wayback Machine. And the Stanford Wayback Machine has a few pages, some dating to late 1991! So if anyone knows, make sure to reply.

Source(s): https://swap.stanford.edu/ — Preceding unsigned comment added by 173.73.242.76 (talk) 01:34, 25 April 2016 (UTC)

An error on Storage capabilities
At the start, we claim that in 2009 the site grew by 100 TB per month.

At the bottom, we claim that in 2014 the site grew by 20 TB per week, which is 80 TB per month - less than in 2009.

Is it possible? רן כהן (talk) 13:53, 11 May 2016 (UTC)
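
For reference, the arithmetic in the question above can be checked with a short sketch. The two growth figures are the ones quoted from the article; the weeks-per-month conversion factors are my own assumption, since the article does not say which one it uses.

```python
# Growth figures quoted in the comment above (in terabytes).
RATE_2009_TB_PER_MONTH = 100
RATE_2014_TB_PER_WEEK = 20

# Rough conversion used in the comment: 4 weeks per month.
rough_2014_per_month = RATE_2014_TB_PER_WEEK * 4

# Slightly more careful conversion (assumption): 52 weeks spread over 12 months.
average_2014_per_month = RATE_2014_TB_PER_WEEK * 52 / 12

print(f"2009: {RATE_2009_TB_PER_MONTH} TB/month")
print(f"2014 (4-week month): {rough_2014_per_month} TB/month")
print(f"2014 (52/12 weeks per month): {average_2014_per_month:.2f} TB/month")
```

Either way the 2014 figure comes out lower than the 2009 one, so the apparent inconsistency does not depend on how a month is counted.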

"Mass deletion of content"

 * "Beginning in 2015, mass deletions of previously archived content caused a number of critics to question the sincerity of this goal."

The cited sources don't support that assertion. The first source is confused and inaccurate. The second source contains an update to the effect the problem was specific to that user and fixed. Both are essentially self-published blogs. -- Green  C  21:45, 23 September 2016 (UTC)


 * Then please go ahead and remove it (and put this info into the edit summary). --Fixuture (talk) 17:02, 29 September 2016 (UTC)
 * I was going to alert people to a seldom covered fact: the archive's own archive of itself claims to have 502 billion pages saved, not the current 278. However, I later saw that it's just a change in their counting definition. I hope the "bug" in the two sources you mention served as a wake-up call for certain people to get their act together. A site this important should be coded in such a way that bugs are likely to make it display more pages than desired. Connor Behan (talk) 02:53, 13 February 2017 (UTC)

Major problem with robots.txt
Hello, I just noticed that since the Wayback Machine won't archive pages covered by robots.txt AND also deletes all previous archives of the webpage from before robots.txt was used, there is a flaw in it:


 * If a website goes defunct and another site later opens at the same URL with a robots.txt, that can delete the archives of the previous, defunct website, even if the latest owner does not technically own the dead website's version of the URL.


 * If a site gets hacked and a robots.txt is applied, the same thing happens: all history is gone.

Check out a citation of an archive of SpySheriff: previously, the Wayback Machine hosted the website, but now that it has a robots.txt, the past archived versions are deleted. I assume hackers adjusted the website under that URL to include that file.

This is another threat to both Wikipedia and the Wayback Machine, as the Wayback Machine does not have any "protection" for its archive. With things that can accidentally vanish through website replacement with robots.txt and hacked sites, it makes archiving virtually pointless in the future. Joeleoj123 (talk) 05:12, 15 April 2017 (UTC)


 * Thank you for bringing this up. Do you have any relevant references? However, from what I can see, there are good reasons to exclude malware-distributing websites which seems to be the case for "SpySheriff". Also it seems that as of last month they are exploring ignoring robots.txt more broadly (see: Wayback Machine). --Fixuture (talk) 14:26, 18 May 2017 (UTC)
 * This is very easily solved, by using whois service to check if the owner changed. 152.62.109.203 (talk) 12:13, 19 December 2022 (UTC)

This is still regularly occurring. As an example someone unrelated to the original site owners has taken over the expired domain name www.xyzzynews.com and redirected it to a casino site so that years of archived material that I need to access is no longer available. What this means is that anybody can delete anything they want from the wayback machine as long as the domain name is available for purchase. There needs to be some mention on this page that the archived material of sites that don't exist anymore is not safe and can disappear at any time. 116.250.163.80 (talk) 01:28, 20 July 2018 (UTC)
 * Try https://archive.is/www.xyzzynews.com -- Green  C  01:40, 20 July 2018 (UTC)
 * Also, when blocked by robots.txt, the original HTML can still be accessed by using a non-JavaScript enabled browser, or simply doing a wget or curl request to retrieve the HTML and view the HTML file locally. The robot blockage mechanism requires JavaScript to work. -- Green  C  13:12, 20 July 2018 (UTC)

Hi, regarding the 'citation needed' on the 2017 policy change mentioned in the main page, I looked into it and found that there indeed was an automated mechanism via robots.txt documented here but the page got removed in 2015. The docs on wbm exclusion since late 2018 just say to write an e-mail. Might have happened even earlier, I did not have time to hunt down the earliest mention across site layout changes. Theultramage (talk) 09:12, 25 July 2020 (UTC)

Problem with only first page of pdf files
I know this is off topic but I don't know a better way to reach Wayback users.

I have an ongoing problem with only the first page of pdf's being supplied:

https://web.archive.org/web/20060912144906/http://www.dbts.edu/journals/1996_1/ACDIXON.PDF

https://web.archive.org/web/20160313082813/http://users.ipfw.edu/jehle/deisenbe/cervantes/bowle.pdf

and many others. I am using Safari on iOS, latest versions. Any remedy? Thanks. deisenbe (talk) 11:37, 5 April 2019 (UTC)
 * I don't have this problem; they both download as complete multipage files. Try a different browser or system. -- Green  C  14:01, 5 April 2019 (UTC)
 * The same problem in Chrome and Dolphin. I was hoping some reader had dealt with this. deisenbe (talk)
 * Maybe clear cache? Download the file and open with a different PDF viewer not attached to the browser? -- Green  C  14:10, 5 April 2019 (UTC)

Self-censorship BY (not of) the Wayback Machine
I can't into Wikipedia, but I believe this case to be notable enough to be included. In August 2016, the Wayback Machine removed an archived page out of their own volition and pro-homosexual anti-Nazi bias. Link.--Adûnâi (talk) 11:47, 6 April 2019 (UTC)

Observation: User agent passthrough.
Hello. I have noticed that when using web.archive.org/save/example.org (initially web.archive.org/record/example.org in October 2013, see http://www.digitaljournal.com/article/360776 ), the Wayback Machine forwards the browser's user agent to the archived page.

This explains why archiving a website from a mobile web browser brings up the mobile version of the webpage.

Whether the Wayback Machine keeps a record of that user agent, is unknown. --Handroid7 (talk) 14:51, 26 August 2019 (UTC)
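
A minimal sketch of the observation above, for anyone who wants to try it: it only builds a request for the web.archive.org/save/ path with a chosen user agent and does not send anything. The https scheme, the helper name `build_save_request` and the sample user agent string are my own assumptions, and the sketch says nothing about what the Wayback Machine does with the header once it receives it.

```python
from urllib.request import Request

# Save endpoint mentioned in the comment above; the https scheme is an assumption.
SAVE_PREFIX = "https://web.archive.org/save/"


def build_save_request(target: str, user_agent: str) -> Request:
    """Build (but do not send) a save request that carries a given User-Agent."""
    return Request(SAVE_PREFIX + target, headers={"User-Agent": user_agent})


# Hypothetical mobile user agent string, for illustration only.
mobile_request = build_save_request("example.org", "ExampleMobileBrowser/1.0")
print(mobile_request.full_url)
print(mobile_request.get_header("User-agent"))
```

Sending the same request with a desktop and a mobile user agent, and comparing the resulting captures, would be one way to test the passthrough described above.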

Copyright claims appear to be spurious
It would appear that copyright claims against an archive service would be spurious given that there exists an explicit limitation against copyright in the United States which allows for archival of content. See Title 17, United States Code. Sec. 108. https://www.law.cornell.edu/uscode/text/17/108. Thus, it would appear that any claim for copyright infringement against an archive service such as wayback would be obviously meritless, making the assertion that cases were filed on such grounds highly suspect at best. I would therefore suggest removal of such references to such matters unless it can be shown that case was filed in PACER. http://www.pacer.gov 66.90.153.184 (talk) 23:10, 3 November 2019 (UTC)


 * I am an Israeli citizen living in Israel. Israeli copyright law says archival for public access is permitted only by specific law, e.g. the law by which the national library of Israel operates, and requires publishers to submit two copies of every book.
 * So when the Internet Archive scraped my website and made copies of it available to the public, it didn't rely on U.S. law. It relied on an Israeli citizen having a really hard time taking a foreign company with no local offices to court. I call it anarchism.
 * As long as they respected my request not to make the archive of my site public, and the robots.txt to not scrape my site, I was quiet. Then they decided to scrape my site regardless. In four days they consumed as much bandwidth as everyone else does in three months, including people browsing the web site, web engines' crawlers, hackers searching for vulnerabilities, and the Library of Congress and French national library coming every two weeks to archive every image hotlinked by U.S. & French sites. Why? Because they want to archive not just the content under sitemap.xml or linked from the root index.html, but also everything under cPanel & co (including graphics and fonts) for all posterity. Who knows? Maybe I tailored my version of cPanel, and in a hundred years some historian would find it interesting. 152.62.109.203 (talk) 12:12, 19 December 2022 (UTC)

Censorship and other threats
Someone who understands this sentence should rewrite it for clarity: There are known rare cases where online access to content which "for nothing" has put people in danger was disabled by the website.

Perhaps a longer quotation would help. 71.14.76.58 (talk) 22:34, 25 March 2020 (UTC)

Wayback Machine is blocked in India ?
I found two news reports saying that the Internet Archive was blocked in India in 2017 (both are in Chinese), but I don't know whether the block has since been lifted. Should this be put into the article?

-- BlackShadowG ★（talk） 03:56, 29 June 2020 (UTC)


 * AFAIK that's old news. The Internet Archive was at times blocked by various authoritarian governments but it usually comes back. Nemo 13:23, 29 June 2020 (UTC)

Site cannot archive pages
For the last few days, the Wayback Machine has not been able to archive web pages. --5.43.102.127 (talk) 15:42, 25 July 2020 (UTC)

Move discussion in progress
There is a move discussion in progress on Talk:WABAC machine which affects this page. Please participate on that page and not in this talk page section. Thank you. —RMCD bot 13:48, 9 October 2020 (UTC)

Wayback Machine blocked in India
The Wayback machine has been blocked in India, possibly due to copyright issues. There will be a message that says "Your requested URL has been blocked as per the directions received from Department of Telecommunications, Government of India. Please contact administrator for more information."

Oldest cached pages
Whilst the oldest cached pages are reported to have been from the 12th of May 1996, I have found a page that predates it (https://web.archive.org/web/19960511013802/http://www.geocities.com/homestead/) on May 11th, 1996. I don't think it's time zones or anything like that in effect. Should it be added that they started archiving on the 11th, or at least the earliest (known?) page is from that date? Markymark101 (talk) 17:01, 3 December 2021 (UTC)
 * Awesome find. Geocities no less. Sure go ahead and change it, there is no official source for the date, just links to captures people have found. You could re-frame it as "the oldest known archive date". --  Green  C  19:22, 3 December 2021 (UTC)
 * It's not just you who thinks it's not time zones. When one archives a page at e.g. 12:00:00 on 25 February, 2025 (UTC), no matter where one is, it gets the "20250225120000" timestamp. Alfa-ketosav (talk) 18:12, 22 February 2024 (UTC)
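
To make the timestamp point concrete, here is a small sketch of the 14-digit format as described in this thread. It assumes, as the comment above says, that the digits are always UTC; the function names are my own and are not part of any Wayback Machine API.

```python
from datetime import datetime, timedelta, timezone

STAMP_FORMAT = "%Y%m%d%H%M%S"  # 14 digits: YYYYMMDDhhmmss


def to_wayback_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a 14-digit UTC timestamp."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime(STAMP_FORMAT)


def from_wayback_timestamp(stamp: str) -> datetime:
    """Parse a 14-digit timestamp, assuming it is expressed in UTC."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


# The same instant seen from UTC and from a UTC+5 zone gives the same digits.
utc_noon = datetime(2025, 2, 25, 12, 0, 0, tzinfo=timezone.utc)
local_5pm = datetime(2025, 2, 25, 17, 0, 0, tzinfo=timezone(timedelta(hours=5)))
print(to_wayback_timestamp(utc_noon))
print(to_wayback_timestamp(local_5pm))

# The capture linked at the top of this section, read as UTC.
print(from_wayback_timestamp("19960511013802").isoformat())
```

Under that assumption, the Geocities capture parses to 11 May 1996, 01:38:02 UTC, so the local time zone of whoever views or triggers the capture does not move it to the 12th.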

Blacklisting of adservers
It looks like they have recently blacklisted advertisement servers such as tpc.googlesyndication.com and 2mdn.

> This URL is in our block list and cannot be captured. Please email us at "[…]" if you would like to discuss this more. — Preceding unsigned comment added by Okoso (talk • contribs) 00:38, 12 January 2022 (UTC)

Usage tip for Internet Archive Digital Library
As important as Internet Archive is in terms of providing working links, it seems like there should be a page for usage tips.

This tip is specific to the Internet Archive Digital Library:


 * When searching for a book title using the default "search metadata" option, you should put the title in quotes to specify an exact match.
 * However, if the title includes a colon, you need to delete the colon or you will get no match.

Fabrickator (talk) 07:30, 23 February 2023 (UTC)
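
The two tips above can be sketched as a small helper that prepares the text to paste into the search box. The function name and the sample title are hypothetical, and the sketch only builds the query string; it does not call any Internet Archive search API.

```python
def metadata_title_query(title: str) -> str:
    """Prepare a book title for the "search metadata" box, per the tips above."""
    # Tip 2: delete any colon, since a colon in the title reportedly gives no match.
    without_colon = title.replace(":", "")
    # Collapse any doubled spaces that removing the colon may leave behind.
    normalized = " ".join(without_colon.split())
    # Tip 1: wrap the title in double quotes to ask for an exact match.
    return f'"{normalized}"'


# Hypothetical title, for illustration only.
print(metadata_title_query("An Example Title: With a Subtitle"))
```

Collapsing whitespace is my own addition, on the guess that a doubled space could matter for an exact match; it can be dropped if that turns out not to be the case.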

Crawler?
Which crawler software and user agent name does the Wayback Machine use, anyone know? I'm looking for reliable and recent sources. By the way, I know Heritrix exists, and that it is a project from the Internet Archive, but that doesn't mean they currently use it for their Wayback Machine. Thanks. The reason I'm asking is I'd like to include this information in this article, and possibly other places (e.g. User-Agent header, maybe Heritrix, etc). --2001:1C06:19CA:D600:2BD8:5934:EB69:C9 (talk) 10:33, 12 September 2023 (UTC)

Define "crawl"

 * I think the article needs to provide a clear definition of the word "crawl" and some of its varied uses. The inexperienced, technically limited reader, like myself, has a glimpse of what it means but a concise definition would be helpful. The source article The Internet Archive Turns 20 contains 84 varied uses of the word.   Buster Seven   Talk  (UTC) 13:12, 15 June 2024 (UTC)

Website number drop?
On January 3, 2024, the Wayback Machine was reported to have over 866 billion archived websites, but as of 08:22, 22 February 2024 (UTC), the Internet Archive's main page (archive.org), web.archive.org and archive.org say 365 billion. Why did this decrease happen? Alfa-ketosav (talk) 20:11, 21 February 2024 (UTC)

Also, as of 08:22, 22 February 2024 (UTC), the dropdown menu appearing on the "Web" part of the menu still says 866 billion archived websites. Alfa-ketosav (talk) 20:20, 21 February 2024 (UTC)