Re: Supercomputers and Dawn



In article <yf38wdps4zm.fsf@xxxxxxxxxxxxxxxxxxx>,
Peter Grandi <pg_nh@xxxxxxxxxxxxxxxxxxx> wrote:

> If you want to understand how subtle the issues can be, consider
> reading this paper on "undetected" errors in storage systems:
>
> https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

Er, I would call those the obvious ones :-( There is a much more
evil category, which I know is both more common than people think
and is usually undetected, which is inconsistencies at the next
level up (or two up). That is often caused by race conditions
causing an inconsistent set of blocks to be written.

It can be missed for ages if the pattern of use is such that only
a single 'view' is observed, but a different use of the data can
give incorrect results. Surprisingly few systems have any real
checking at that level, and often don't even have a precise enough
specification to check rigorously. When I wrote my file system
scanner, aimed at a higher level, I was surprised at how many errors
at the lower level showed up. And a lot of the problem is file
systems that automatically recover from failure, because fsck is
rarely run and is usually not very thorough in any case.
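
As a minimal sketch of the kind of higher-level check described above
(this is not the scanner mentioned in the post; the toy data layout and
the names check_link_counts, inodes and directories are hypothetical,
chosen only for illustration), consider comparing two 'views' of the
same data: the link count recorded in each inode against the number of
directory entries that actually refer to it. Each structure can be
valid on its own while the pair is inconsistent.

```python
from collections import Counter


def check_link_counts(inodes, directories):
    """Cross-check recorded link counts against directory references.

    inodes:      dict mapping inode number -> recorded link count
    directories: dict mapping directory path -> {entry name: inode number}

    Returns a sorted list of (inode, recorded, observed) mismatches.
    'recorded' is None when an inode is referenced but not allocated.
    """
    observed = Counter()
    for entries in directories.values():
        for ino in entries.values():
            observed[ino] += 1

    problems = []
    for ino in sorted(set(inodes) | set(observed)):
        recorded = inodes.get(ino)
        seen = observed.get(ino, 0)
        if recorded != seen:
            problems.append((ino, recorded, seen))
    return problems


# Hypothetical state after an interrupted update: inode 2 claims two
# links, but the directory block holding the second link was never written.
inodes = {1: 1, 2: 2, 3: 1}
directories = {
    "/": {"a": 1, "b": 2},
    "/sub": {"c": 3},
}
print(check_link_counts(inodes, directories))  # [(2, 2, 1)]
```

A reader that only ever walks the directories would never notice the
problem; it appears only when the two views are compared.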

I agree with the paper's points. Checksums etc. that were designed
for the odd million transactions just don't cut the mustard when
dealing with the odd million million transactions. But that's all
basic probability theory.
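
The arithmetic behind that remark can be sketched as follows. This is
an idealised model, not a figure from the paper: it assumes that each
transaction is corrupted independently with probability p_corrupt, and
that a corrupted transaction slips past a k-bit checksum with
probability 2^-k. The function name and the numbers used below are
illustrative assumptions only.

```python
import math


def p_any_undetected(n, p_corrupt, checksum_bits):
    """Probability of at least one undetected error in n transactions.

    Assumes independent transactions, each corrupted with probability
    p_corrupt, and a checksum that misses a corruption with probability
    2**-checksum_bits (an idealised model).
    """
    q = p_corrupt * 2.0 ** -checksum_bits  # per-transaction miss rate
    # 1 - (1 - q)**n, computed stably for very small q and very large n.
    return -math.expm1(n * math.log1p(-q))


# Assumed corruption rate of 1 in 10,000 and a 16-bit checksum.
for n in (10**6, 10**12):
    print(n, p_any_undetected(n, 1e-4, 16))
```

Under these assumed figures the chance of any undetected error is
roughly 0.15% for a million transactions, and is indistinguishable
from certainty for a million million.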

> As a guy once said in an interview to Datamation, "as far as I
> know the computing center I manage never from an undetected
> error". Think carefully about that.

Er, yes, but there is a syntax error :-) I think that you meant
to add the word "fail" after "never", but am not sure. If so,
yes, I agree with you - as the American saying goes "It's not
what we don't know that causes the trouble, it's what we know
that ain't so."


Regards,
Nick Maclaren.