Re: Supercomputers and Dawn
- From: nmm1@xxxxxxxxx
- Date: Sun, 29 Nov 2009 16:32:48 +0000 (GMT)
In article <yf38wdps4zm.fsf@xxxxxxxxxxxxxxxxxxx>,
Peter Grandi <pg_nh@xxxxxxxxxxxxxxxxxxx> wrote:
> If you want to understand how subtle the issues can be, consider
> reading this paper on "undetected" errors in storage systems:
> https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257
Er, I would call those the obvious ones :-( There is a much more
evil category, which I know is both more common than people think
and is usually undetected, which is inconsistencies at the next
level up (or two up). That is often caused by race conditions
causing an inconsistent set of blocks to be written.
It can be missed for ages if the pattern of use is such that only
a single 'view' is observed, but a different use of the data can
give incorrect results. Surprisingly few systems have any real
checking at that level, and many don't even have a precise enough
specification to check rigorously. When I wrote my file system
scanner, aimed at a higher level, I was surprised at how many errors
at the lower level showed up. And a lot of the problem is file
systems that automatically recover from failure, because fsck is
rarely run and is usually not very thorough in any case.
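
To make the "next level up" point concrete, here is a minimal sketch.
Everything in it is hypothetical: the two-block layout (an index block
recording the data length) and the names make_block, block_ok and
view_consistent are invented for illustration, and it is not the
scanner mentioned above. Each block carries a valid CRC, so a
block-level check passes, yet the pair on "disk" mixes two versions
of the same update, as a race between writers might leave it.

```python
import zlib

def make_block(payload: bytes) -> dict:
    """Build a block carrying its own CRC-32 (hypothetical layout)."""
    return {"payload": payload, "crc": zlib.crc32(payload)}

def block_ok(block: dict) -> bool:
    """Low-level check: does the payload still match its stored CRC?"""
    return zlib.crc32(block["payload"]) == block["crc"]

def view_consistent(index_block: dict, data_block: dict) -> bool:
    """Higher-level check: the index records the expected data length."""
    expected_len = int(index_block["payload"].decode())
    return expected_len == len(data_block["payload"])

# Two complete, individually valid versions of the same file.
index_v1, data_v1 = make_block(b"5"), make_block(b"hello")
index_v2, data_v2 = make_block(b"11"), make_block(b"hello world")

# Simulated race: the new index block reached the disk, the new data did not.
on_disk = (index_v2, data_v1)

blocks_pass = all(block_ok(b) for b in on_disk)  # every block verifies
view_pass = view_consistent(*on_disk)            # but the set does not agree

print("per-block checks pass:", blocks_pass)
print("cross-block check passes:", view_pass)
```

A reader that only ever looks at the data block never notices; a
reader that trusts the length in the index gets wrong results.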
I agree with the paper's points. Checksums etc. that were designed
for the odd million transactions just don't cut the mustard when
dealing with the odd million million transactions. But that's all
basic probability theory.
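
The arithmetic behind that can be sketched as follows. The miss
probability p_miss is a made-up figure for illustration, not one
taken from the paper, and the transactions are assumed independent,
which real failures often are not. Under those assumptions the chance
of at least one undetected error in n transactions is 1 - (1 - p)^n.

```python
import math

def p_any_undetected(p_miss: float, n: int) -> float:
    """Probability of at least one undetected error in n transactions.

    Computes 1 - (1 - p_miss)**n via log1p/expm1 so that very small
    p_miss values do not lose precision. Assumes independence.
    """
    return -math.expm1(n * math.log1p(-p_miss))

# Hypothetical: one transaction in a thousand million slips past the check.
p_miss = 1e-9

for n in (10**6, 10**12):
    print(f"n = {n:>13}: P(at least one undetected) = "
          f"{p_any_undetected(p_miss, n):.6g}")
```

With these invented numbers the chance is about 0.1% for a million
transactions and indistinguishable from certainty for a million
million.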
> As a guy once said in an interview to Datamation, "as far as I
> know the computing center I manage never from an undetected
> error". Think carefully about that.
Er, yes, but there is a syntax error :-) I think that you meant
to add the word "fail" after "never", but am not sure. If so,
yes, I agree with you - as the American saying goes "It's not
what we don't know that causes the trouble, it's what we know
that ain't so."
Regards,
Nick Maclaren.