Talk:Content-addressable storage

Is the term CAS too EMC-specific? Some might prefer the expression "disk archiving". Westwind273 00:34, 8 September 2006 (UTC)

This page seems entirely biased towards a particular view of CAS technology, and the number of mentions of "John Canessa" is daunting. There's a lot more in content-based storage than is mentioned in this article; it feels like it was written by one person with a very strong bias about the history of the technology, and lacks any authoritative citations for why that view of history is correct. There's a lot of relevant academic work on content addressability - Venti, which _is_ cited, as well as systems such as the Low-Bandwidth File System, Windows' "Single Instance Storage", and enormous work on disk deduplication in research (Fred Douglis at IBM is a good starting point, and Data Domain, recently acquired by EMC, is a good starting point on the corporate side). 128.2.209.18 (talk) 14:56, 3 November 2009 (UTC) DaveAndersen

It's disgraceful that an article with the title "Content-addressable storage" should suggest that the history of the topic began in 1992. Content-addressable storage was a term that had been around for several decades by then, products providing content-addressed storage had been available for a long time, and the article looks like an attempt to claim an undeserved priority for specific people and products. The coat-hook metaphor is NOT relevant to CAS in general, but only to a particular firm's product, and I guess the use of this is part of the same over-inflated claim. Maybe a disambiguation page would avoid the appearance of commercial puffery instead of an encyclopedia article, with this page NOT carrying the simple title it currently carries (since that would belong to the disambiguation page), but I think an article on content-addressable storage in general is needed as a top-level article rather than just a disambiguation page. Michealt (talk) 14:52, 25 July 2010 (UTC)

No info on hash collisions
Since hashing produces non-unique keys, collisions are always a risk, and although very long keys lower that risk, content-addressable storage doesn't scale safely for massive collections. The issue is both that multiple documents may share the same key and, more problematically, that the hubris of overconfident programmers leads them to skip writing collision-handling code. The article brazenly omits this risk.

For people who say "oh, well these hashes can't collide, they could label every atom in the universe uniquely" - the reality is that this is merely another case of the birthday paradox. And if the hash length *were* enough to be certain, surely tossing one bit wouldn't make it too short, right?...[repeat until interlocutor gets uncomfortable with the shrinking bit count]. Alex North-Keys (talk) 00:18, 28 April 2023 (UTC)
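
For anyone who wants to put rough numbers on the birthday-paradox point above, here is a minimal sketch. It assumes an ideal hash whose outputs are uniformly distributed over 2^bits values and uses the standard approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)). The function name `collision_probability` is made up for illustration; it does not come from the article or from any library.

```python
import math

def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items share a digest.

    Assumption: an ideal hash, uniform over 2**hash_bits outputs.
    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    if num_items < 2:
        return 0.0
    exponent = num_items * (num_items - 1) / (2 * 2**hash_bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)

if __name__ == "__main__":
    # Each dropped bit roughly doubles the probability while it stays small.
    for bits in (256, 160, 128, 64):
        p = collision_probability(2**40, bits)
        print(f"{bits:>3}-bit hash, 2**40 items: p ~= {p:.3e}")
```

With 2^64 items and a 128-bit hash the estimate comes out at roughly 0.39, which is the "shrinking bit count" argument in numeric form: the risk is never zero, and each bit removed approximately doubles it until it is no longer negligible.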


 * Seconded. This needs to be mentioned in this article, and prominently.
 * It is possible to safely use hashes for addressing storage, but each new copy ingested needs to be checked in some additional way.
 * If they match, great; only one copy needs to be retained!
 * If they don't match, however, then some sort of secondary 'collision identifier' needs to be used. As more and more data is encrypted (and therefore is effectively random), the collision risk becomes higher still. Trying to de-duplicate on the block level (or even worse, using a 'rolling window' method)
 * Hashes (cryptographic or otherwise) by themselves can be very useful for message authentication, or even just to guarantee data integrity (i.e. making sure a file wasn't intentionally or inadvertently changed or corrupted); alone, however, they cannot safely be used to add content to a storage system.
 * See also: the Pigeonhole principle, Record linkage, and the Gambler's fallacy.
 * - Jim
 * (I don't have any sources handy at the moment, but I took the time to write this in the hopes that someone else does; I'm sure there are numerous papers in the ACM library, for example.)
 * - Jim Grisham (talk) 05:53, 8 September 2023 (UTC)
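
The check-on-ingest idea in the reply above can be sketched roughly as follows. This is a toy in-memory illustration, not how any particular product works. The class names `VerifyingStore` and `LengthHash`, and the use of a bucket index as the secondary "collision identifier", are all assumptions made for the example. `LengthHash` is a deliberately terrible stand-in hash (the digest is just the input length) so that a collision can be forced on demand.

```python
import hashlib

class LengthHash:
    """Hypothetical, deliberately weak hash: the digest is the input length."""

    def __init__(self, data: bytes):
        self._digest = str(len(data))

    def hexdigest(self) -> str:
        return self._digest

class VerifyingStore:
    """Toy content-addressed store that compares bytes before deduplicating."""

    def __init__(self, hash_func=hashlib.sha256):
        self._hash_func = hash_func
        # Maps digest -> list of distinct blobs that share that digest.
        self._buckets = {}

    def put(self, data: bytes):
        """Store data and return (digest, index) as its key."""
        digest = self._hash_func(data).hexdigest()
        bucket = self._buckets.setdefault(digest, [])
        for index, existing in enumerate(bucket):
            if existing == data:
                # Same digest and same bytes: a true duplicate, keep one copy.
                return digest, index
        # New content, or same digest with different bytes (a collision):
        # the position in the bucket serves as the secondary identifier.
        bucket.append(data)
        return digest, len(bucket) - 1

    def get(self, key):
        digest, index = key
        return self._buckets[digest][index]

if __name__ == "__main__":
    store = VerifyingStore(hash_func=LengthHash)
    first = store.put(b"aa")
    second = store.put(b"bb")  # collides with b"aa" under LengthHash
    again = store.put(b"aa")   # true duplicate, reuses the first key
    print(first, second, again)
```

The trade-off is that every ingest whose digest is already present costs a full byte comparison against the stored copies, which is the "additional check" being asked for.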

External links modified
Hello fellow Wikipedians,

I have just modified one external link on Content-addressable storage. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20071012085111/http://www.opensolaris.org/os/project/honeycomb/ to http://www.opensolaris.org/os/project/honeycomb/

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers. - InternetArchiveBot 05:39, 25 May 2017 (UTC)