Re: Is EMC's CAS Centera considered "permanent data"?



In article <pbihp1p00n83trrmpo98iricg96f86kt1m@xxxxxxx>,
Faeandar <mr_castalot@xxxxxxxxx> wrote:
>On Thu, 08 Dec 2005 09:10:21 +0000, HVB <devnull@xxxxxxxxx> wrote:
>
>Now, not being a math geek I can't personally verify this but a friend
>who actually is a math geek tells me that using MD5 as the method to
>determine uniqueness is seriously flawed. Something to do with the
>maximum number of unique combinations not being near enough for most
>of today's medium to large storage environments.

AFAIK, the Centera uses a 128-bit hash code. So the probability that
two documents have the same hash code is 2^(-128), which is about
10^(-39). But that's not the relevent question - which is the
birthday paradox. Remember: If you have 23 people in a room, the
probability that two of them have the same birthday is about 1/2 -
even though there are 365 distinct birthdays. That's because with 23
people, there are lots of combination which could give you a match.
For a hash code system, the probability of a hash collision begins to
be significant when the number of hashes is roughly the square root of
the number of distinct hash values, or in our case 2^64. Now let's
assume that the typical object or file stored in a Centera is about
1MB in size (I made up that number completely out of thin air, but it
sounds plausible). To store 2^64 = 1.8 * 10^19 objects, you would
need a Centera that can store about 1.8 * 10^10 PB (or 18 billion
petabytes). I don't know what the largest Centera is you can buy, but
it is less than 1PB in size (in practice, it is probably closer to
1/10 of that). So we have lots and lots of safety margin.

Now obviously, this ignores the possibility of a smart attacker
(already pointed out by Paul Rubin) deliberately gaming the system to
create a hash collision (a bad thing to do). And even with
infinitesimally small probabilities, I would sleep better at night
knowing that in the case of a hash collision (or a hash match during
object lookup), a full-content comparison has been performed. But
then, I store my personal data on crappy consumer grade equipment
myself, so who am I to talk.

--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy _firstname_@xxxxxxxxxxxxxxxxxxxxxxxxxx
.



Relevant Pages

  • Re: Is MD5 outdated ?
    ... ]> will produce a second document that has the same hash as the first. ... ]that Greg Rose described would yield a collision with high probability ... ]In security, a threat cannot be ignored simply because it is not certain ... ]a large amount of computation, but it's not so large as to be ...
    (sci.crypt)
  • Re: [RFC/PATCH] Documentation of kernel messages
    ... Maybe a stupid idea but why do we want to assign these numbers by hand? ... I can imagine it could introduce collisions when merging tons of patches ... But maybe also 4 bytes would be enough, since the hash only has to be ... For 1000 messages the probability is ...
    (Linux-Kernel)
  • Re: Fast 32-bit Hash
    ... I think the probability of an undetected error with this hash is the same as ... the probability that two random vectors produce the same hash value: ... I think this hash would make a good error detection mechanism. ...
    (sci.crypt)
  • Re: compare-by-hash (was Re: sharing /etc/passwd)
    ... hash collision generation with a broken hash function, ... collision probability, particularly with something like sha-1, ... P(undetected TCP error) = 0.000000005 ...
    (FreeBSD-Security)
  • Re: Call to atention about using hash functions as content indexers (SCM saga)
    ... > about the use of hash functions as content indexers. ... > about the low probability of such a collision, ... already in use by git. ...
    (Linux-Kernel)