Another article on deduplication has come out, and I still have the same concerns I have always had. Deduplication works by analyzing the data, hashing it into a mathematical tag and then normalizing the tags to save storage. And that’s great, in theory. Until you add in the hash collisions (the MD5 one, the SHA-1 one, the RIPEMD one, as well as the ones not yet discovered but inevitable). Basically, all hash functions will inevitably fall to collisions. The usual definition of a hash function is:

A one-way function which takes a variable-length message and produces a fixed-length hash. Given the hash it is computationally infeasible to find a message with that hash; in fact one can’t determine any usable information about a message with that hash, not even a single bit.

It’s the infeasible part that bothers me.

Intel is still clipping right along with Moore’s law, and pretty much everything in my IT experience tells me that computers will only get faster, more powerful and better at doing what they do (and that’s absent any technical magic hitherto unknown). To my mind, that means all hashes (both current and future) will fall to either collisions or simple brute-force attacks.

The security implications of that statement are interesting (and may be addressed at a later date); today’s discussion is about storage. I can foresee situations in which a large institution (say, a library) enables deduplication to save on storage costs. Then, 40 years later, a collision is proven. Since there is no unaltered copy stored (only a normalized array of hash tags, calculated through the same process as the proven collision), a non-trivial potential for data loss arises.
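To make that scenario concrete, here is a minimal sketch in Python of how a tag-based store could lose data on a collision. Everything in it is an assumption for illustration: `DedupStore`, `tag` and `find_colliding_pair` are hypothetical names, and the tag is deliberately truncated to one byte of SHA-256 so that a collision turns up in a few hundred tries rather than decades. It is not how any particular product works, just the general shape of the problem.

```python
import hashlib


def tag(chunk, digest_bytes=1):
    """Return a deliberately tiny hash tag so collisions are easy to provoke."""
    return hashlib.sha256(chunk).hexdigest()[: digest_bytes * 2]


class DedupStore:
    """Toy store: each unique tag is kept once; files are lists of tags."""

    def __init__(self):
        self.chunks = {}  # tag -> chunk bytes
        self.files = {}   # file name -> ordered list of tags

    def write(self, name, data, chunk_size=4):
        tags = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i : i + chunk_size]
            t = tag(chunk)
            # If the tag already exists, the chunk is assumed to be identical
            # and is NOT stored again. A collision slips through right here.
            self.chunks.setdefault(t, chunk)
            tags.append(t)
        self.files[name] = tags

    def read(self, name):
        return b"".join(self.chunks[t] for t in self.files[name])


def find_colliding_pair():
    """Try 4-byte inputs in order until two different ones share a tag."""
    seen = {}
    n = 0
    while True:
        candidate = n.to_bytes(4, "big")
        t = tag(candidate)
        if t in seen:
            return seen[t], candidate
        seen[t] = candidate
        n += 1


a, b = find_colliding_pair()
store = DedupStore()
store.write("first", a)
store.write("second", b)

print(store.read("first") == a)   # the first file reads back intact
print(store.read("second") == b)  # the second file does not: it returns a's bytes
```

The second write raises no error. The store simply believes it has already seen that chunk, which is why the loss would go unnoticed until someone reads the data back.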

Now, this situation may be avoided by periodically upgrading the hash function. However, that solution imposes a rather high cost of maintenance and administration upon the system. The admins (DBA, SA, etc.) will have to expend extensive resources in migrating the legacy data to the new hash while (much more importantly) verifying that the original data from the legacy hash function can still be reproduced from the new hash function with acceptable fidelity. And, since the whole reason this is happening in the first place is to cut down on extremely large volumes of data, the process will be time-consuming, unpleasant and expensive.
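A rough sketch of what such a migration involves, again in Python and again under assumptions of my own: the store is modeled as two dictionaries, MD5 stands in for the legacy hash and SHA-256 for the new one, and `migrate`, `legacy_tag` and `new_tag` are hypothetical names. Note that the verification pass has to reassemble every file twice, which is where the time and expense come from at scale.

```python
import hashlib


def legacy_tag(chunk):
    # Stand-in for the legacy hash (assumption: MD5).
    return hashlib.md5(chunk).hexdigest()


def new_tag(chunk):
    # Stand-in for the replacement hash (assumption: SHA-256).
    return hashlib.sha256(chunk).hexdigest()


def migrate(chunks, files):
    """Re-key a store from legacy tags to new tags, then verify every file.

    chunks: dict mapping legacy tag -> chunk bytes
    files:  dict mapping file name -> ordered list of legacy tags
    Returns (new_chunks, new_files); raises ValueError if verification fails.
    """
    # Every stored chunk must be read and re-hashed.
    translation = {old: new_tag(data) for old, data in chunks.items()}

    new_chunks = {}
    for old, data in chunks.items():
        new = translation[old]
        if new in new_chunks and new_chunks[new] != data:
            raise ValueError(f"new hash collides on tag {new}")
        new_chunks[new] = data

    new_files = {
        name: [translation[t] for t in tags] for name, tags in files.items()
    }

    # Verification: each file must reassemble to the same bytes as before.
    for name, tags in files.items():
        before = b"".join(chunks[t] for t in tags)
        after = b"".join(new_chunks[t] for t in new_files[name])
        if before != after:
            raise ValueError(f"file {name!r} changed during migration")

    return new_chunks, new_files


chunks = {legacy_tag(c): c for c in (b"abcd", b"efgh")}
files = {"doc": [legacy_tag(b"abcd"), legacy_tag(b"efgh"), legacy_tag(b"abcd")]}
new_chunks, new_files = migrate(chunks, files)
```

This check only proves that the store reads back the same bytes before and after the migration. It cannot recover a chunk that was already lost to a collision under the legacy hash, because that chunk was never stored.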

So what’s my solution? Well, deduplication does have its place and uses; I’m not so foolish as to be reflexively dogmatic. But DASD is cheap, and getting more so every day (in a corollary to Moore’s law, I’m sure). In those rare occasions where I have an infinite budget and no external pressures, I would choose to dedup data of less importance: things like lookup tables and transitory data (data with a shelf life of 5-10 years). Any data that must be maintained with fidelity over a very long period of time (40+ years or so), I would recommend for straight storage on inexpensive disks.
