Re: Lack of bit field instructions in x86 instruction set because of patents ?



In article <0%avl.14800$D32.8889@xxxxxxxxxxxxxxxxxxxx>,
Stephen Sprunk <stephen@xxxxxxxxxx> wrote:

Absolutely NOT! You do not get genuine reliability by complexity;
you get it by designing the system (a) not to have certain classes
of errors in the first place, (b) to detect all errors of certain
other classes and recover from them reliably and, most interestingly,
(c) to make the remaining errors self-correcting. And, lastly, by
being able to prove (or at least provide very strong evidence for)
the claim that ALL failures (anticipated and otherwise) fall into one
of those classes.

Most large systems I'm familiar with do _not_ follow this model. They
started with a naive design and, as each error is discovered, they hack
on the code to deal with it -- introducing multiple new errors, so the
complexity grows exponentially.

I didn't say they did - I said that it is the way to get genuine
reliability! If you deduce from that I am saying that most large
systems do not provide genuine reliability, you understand my point
perfectly ....

Any system for which "failure must be impossible" falls into this
category in my experience -- and they always eventually collapse
horribly under their own weight.

Indeed. Which is precisely why the approach I described is the only
practicable way of delivering that target.

The only place that the actual programming comes in is in coding up
the design in such a way that you haven't introduced new error modes.

You must work with some amazing programmers if you can say that with a
straight face.

I haven't even heard of it being done very often, I agree :-(

And THAT is where the POSIX/C threading model fails so badly.

Hmm.

Yeah :-(

The Internet works so well in the face of failures because it was
designed with the assumption that failures would be common -- which gave
rise to the mistaken claim that it was designed to survive nuclear
attack. In practice failures are not as common as expected, but they're
common enough to test all the various bits of error-handling code and
expose bugs quickly; in many other systems, the error handling code is
actually the least reliable part because it's never used and thus never
tested (if the error conditions are even known and responses specified).

Sorry, but that's completely wrong. The layers of timeout and retry
are such that certain failure modes are both ubiquitous and largely
unrecognised. Older protocols like FTP were/are notorious, but that
applies even to the TCP layer.

The effect is that those modes cause a mere loss of performance,
wasted bandwidth and failure to diagnose problems correctly, but all
those effects are largely hidden by higher layers and ignorance
about what should happen.
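
A minimal sketch of how stacked retries hide a failure from the layer
above. Everything here is hypothetical (flaky_link, with_retry and the
retry counts are made up for illustration and do not model FTP or TCP
in any detail); it only shows the shape of the effect.

```python
def flaky_link(fail_first):
    """Hypothetical link that times out on its first `fail_first` sends."""
    state = {"calls": 0}

    def send(_payload):
        state["calls"] += 1
        if state["calls"] <= fail_first:
            raise TimeoutError("no reply")
        return "ok"

    return send, state


def with_retry(send, attempts):
    """Wrap `send` so timeouts are retried silently (attempts >= 1)."""
    def wrapped(payload):
        last = None
        for _ in range(attempts):
            try:
                return send(payload)
            except TimeoutError as exc:
                last = exc  # swallowed: the caller never sees this one
        raise last

    return wrapped


link, stats = flaky_link(fail_first=5)
transport = with_retry(link, attempts=3)         # lower layer
application = with_retry(transport, attempts=3)  # higher layer

result = application("hello")
print(result, stats["calls"])
```

The caller gets a plain success, while the link underneath was used
several times for one delivery. Nothing in the return value says that
anything went wrong, which is the diagnosis problem described above.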

You can take down a small part of the Internet, or degrade a large part,
but so far nobody's ever managed to take down the entire thing
completely and all at once. I'm not even sure it's theoretically
possible; even the worst-case scenarios still have _some_ traffic
getting through.

Not really. There would still be some traffic possible in disconnected
sections of it, but it would cease to be a network. All it needs is
the USA Tier-1 DNS to go AWOL (as has happened with some others) ....

More seriously, most systems that are designed to accept small-scale
failures (ATMs, telephones, most power grids, even the Internet) rely
critically on the Law Of Large Numbers - and will fail catastrophically
when that fails. And it can, where a vast number of local failures
are triggered by a common cause.

Hmmm. The closest we've probably come to that is the periodic
virus/worm epidemics like ILOVEYOU, CodeRed, NIMDA, etc. -- but there
are plenty of folks (like me) who were never affected by them.

I can anticipate scenarios that would do that to the Internet, but
the time is not yet ripe - it's coming, though, because software with
the relevant characteristics is becoming more common.
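
To make the Law Of Large Numbers point concrete, here is a rough
simulation sketch. The function name, the unit count and the
probabilities are all assumptions chosen for illustration, not
measurements of any real system. Each unit fails independently with
probability p_fail; with probability p_common a shared cause takes
every unit down in the same trial.

```python
import random


def worst_outage_fraction(n_units, p_fail, p_common, trials, seed=0):
    """Largest fraction of units down in any single trial."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        if rng.random() < p_common:
            # Common cause: every unit fails together.
            down = n_units
        else:
            # Independent failures: the count stays near n_units * p_fail.
            down = sum(rng.random() < p_fail for _ in range(n_units))
        worst = max(worst, down / n_units)
    return worst


independent = worst_outage_fraction(1000, 0.01, 0.0, 200)
correlated = worst_outage_fraction(1000, 0.01, 0.05, 200)
print(independent, correlated)
```

With independent failures the worst trial stays close to the average,
so spare capacity sized for the average copes. Once a common cause is
allowed, a single trial can take out everything, and averaging over
many units no longer helps.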


Regards,
Nick Maclaren.