Re: Lack of bit field instructions in x86 instruction set because of patents ?



nmm1@xxxxxxxxx wrote:
In article <okVul.15712$as4.1365@xxxxxxxxxxxxxxxxxxxx>,
Stephen Sprunk <stephen@xxxxxxxxxx> wrote:
The more reliable a system is designed to be, the more complicated it must be -- and the bigger the problem when it does fail, as everything eventually does.

Absolutely NOT! You do not get genuine reliability by complexity;
you get it by designing the system to (a) not have certain classes
of errors in the first place, (b) being able to detect all errors of
certain other classes and having a reliable recovery method and, most
interestingly, (c) by ensuring that other errors are self-correcting.
And, lastly, by being able to prove (or at least provide very strong
evidence for) the claim that ALL failures (anticipated and otherwise)
fall into one of those classes.

Most large systems I'm familiar with do _not_ follow this model. They start with a naive design and, as each error is discovered, the developers hack on the code to deal with it -- introducing multiple new errors, so the complexity grows exponentially.

Any system for which "failure must be impossible" falls into this category in my experience -- and they always eventually collapse horribly under their own weight.

The only place that the actual programming comes in is in coding up
the design in such a way that you haven't introduced new error modes.

You must work with some amazing programmers if you can say that with a straight face.

And THAT is where the POSIX/C threading model fails so badly.

Hmm.

OTOH, systems designed to accept small-scale failure as a normal practice rarely if ever suffer massive failures (for a variety of reasons). Compare the Internet to the phone system: the Internet has never been entirely down in its entire history, while the phone system has -- but people think the phone system is more reliable because it only has a system-wide outage once a decade or so, while we experience tiny parts of the Internet being down every day...

I don't remember the telephone system ever having failed in toto in
any modernised country except when its government has gone bananas
and used centralised powers to turn the thing off (yes, Thatcher,
Blair etc., I am thinking of you)!

The AT&T crash of 1990 is the closest to a country-wide outage I think we've ever seen: a nine-hour, near-total failure of the US long-distance network (all because of a misplaced "break" statement in some error-handling code that had never been executed before). There have also been a few spectacular local failures; a few years ago, millions of lines in Houston were down for over a week, and of course, the phone network (both wired and wireless) had serious problems in NYC after 9/11 -- but the Internet was fine.

Also, that's not true at all. You are correct that the Internet was
carefully designed to be robust against most small-scale failures
becoming large ones and, by and large, its design works in that
respect.

The Internet works so well in the face of failures because it was designed with the assumption that failures would be common -- which gave rise to the mistaken claim that it was designed to survive nuclear attack. In practice failures are not as common as expected, but they're common enough to test all the various bits of error-handling code and expose bugs quickly; in many other systems, the error handling code is actually the least reliable part because it's never used and thus never tested (if the error conditions are even known and responses specified).

But it also provides an example of where accepting frequent small-scale
failures does NOT lead to reliability of the whole - the DNS!

DNS is shockingly reliable given what it has to work with. It's not the protocol's fault that many DNS admins are so incompetent that there is absolutely nothing that can be done to recover from their mistakes...

We have already seen local failures cause havoc on a Tier-1 server;
with the network probabilities involved, there is only one that can
continue normally without relying on any others - and then with a
fairly severe loss of function. There are actual failure modes
against which the Internet is not resilient.

You can take down a small part of the Internet, or degrade a large part, but so far nobody's ever managed to take down the entire thing completely and all at once. I'm not even sure it's theoretically possible; even the worst-case scenarios still have _some_ traffic getting through.

More seriously, most systems that are designed to accept small-scale
failures (ATMs, telephones, most power grids, even the Internet) rely
critically on the Law Of Large Numbers - and will fail catastrophically
when that fails. And it can, where a vast number of local failures
are triggered by a common cause.

Hmmm. The closest we've probably come to that is the periodic virus/worm epidemics like ILOVEYOU, CodeRed, NIMDA, etc. -- but there are plenty of folks (like me) who were never affected by them.

S

--
Stephen Sprunk   "Stupid people surround themselves with smart
CCIE #3723       people.  Smart people surround themselves with
K5SSS            smart people who disagree with them."  --Isaac Jaffe