Re: Is EMC's CAS Centera considered "permanent data"?
- From: _firstname_@xxxxxxxxxxxxxxxxxxxxxxxxxx
- Date: Fri, 09 Dec 2005 06:43:31 -0000
In article <pbihp1p00n83trrmpo98iricg96f86kt1m@xxxxxxx>,
Faeandar <mr_castalot@xxxxxxxxx> wrote:
>On Thu, 08 Dec 2005 09:10:21 +0000, HVB <devnull@xxxxxxxxx> wrote:
>
>Now, not being a math geek I can't personally verify this but a friend
>who actually is a math geek tells me that using MD5 as the method to
>determine uniqueness is seriously flawed. Something to do with the
>maximum number of unique combinations not being near enough for most
>of today's medium to large storage environments.
AFAIK, the Centera uses a 128-bit hash code. So the probability that
two documents have the same hash code is 2^(-128), which is about
10^(-39). But that's not the relevent question - which is the
birthday paradox. Remember: If you have 23 people in a room, the
probability that two of them have the same birthday is about 1/2 -
even though there are 365 distinct birthdays. That's because with 23
people, there are lots of combination which could give you a match.
For a hash code system, the probability of a hash collision begins to
be significant when the number of hashes is roughly the square root of
the number of distinct hash values, or in our case 2^64. Now let's
assume that the typical object or file stored in a Centera is about
1MB in size (I made up that number completely out of thin air, but it
sounds plausible). To store 2^64 = 1.8 * 10^19 objects, you would
need a Centera that can store about 1.8 * 10^10 PB (or 18 billion
petabytes). I don't know what the largest Centera is you can buy, but
it is less than 1PB in size (in practice, it is probably closer to
1/10 of that). So we have lots and lots of safety margin.
Now obviously, this ignores the possibility of a smart attacker
(already pointed out by Paul Rubin) deliberately gaming the system to
create a hash collision (a bad thing to do). And even with
infinitesimally small probabilities, I would sleep better at night
knowing that in the case of a hash collision (or a hash match during
object lookup), a full-content comparison has been performed. But
then, I store my personal data on crappy consumer grade equipment
myself, so who am I to talk.
--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy _firstname_@xxxxxxxxxxxxxxxxxxxxxxxxxx
.
- Follow-Ups:
- Re: Is EMC's CAS Centera considered "permanent data"?
- From: Pavel Ammosov
- Re: Is EMC's CAS Centera considered "permanent data"?
- References:
- Is EMC's CAS Centera considered "permanent data"?
- From: RobertDavid
- Re: Is EMC's CAS Centera considered "permanent data"?
- From: Faeandar
- Re: Is EMC's CAS Centera considered "permanent data"?
- From: Faeandar
- Is EMC's CAS Centera considered "permanent data"?
- Prev by Date: Re: Is EMC's CAS Centera considered "permanent data"?
- Next by Date: Re: Indefinite Storage
- Previous by thread: Re: Is EMC's CAS Centera considered "permanent data"?
- Next by thread: Re: Is EMC's CAS Centera considered "permanent data"?
- Index(es):
Relevant Pages
|
|