Re: Don't Fix It if it is Not Broken (was Looking at Macs...)



In article <COHPe.4629$2_.3400@xxxxxxxxxxxxxxxxxxxxxx>, TheLetterK
<theletterk@xxxxxxxxxxxxxxxxx> wrote: ...very profound observations...

First off, let me state that I really enjoy your postings.

I learn quite a bit from them. You likely don't get much in return,
because you obviously know more than I do.

Hope I do not accidentally do anything to piss you off, that is _not_
my intention.

Everyone do yourselves a favor and don't read this post, because it
wound up too long and says too little of value for everyday Mac users.

All that said, here goes ;-)

> > Don't even try operating the same computer in an aircraft at 60,000
> > feet, because the soft error rate increases to 6,400 per month, or
> > approximately one soft error every 8 minutes of steady operation.
>
> Maybe, maybe not. How much hard research has been done towards this end?

Near as I can tell from reading the article, IBM and several of the
other companies mentioned put the RAM modules in small vacuum chambers
to achieve the 60,000-foot altitude, and "exercised" the RAM modules
continuously. IBM submitted the most pessimistic report; the other
companies said soft failures were fewer, but still significant.



> As for the altitude diversion... how many people run their systems for
> months on end at 60,000 feet? Not very many.

True.



> >>The frequency of errors shouldn't change dramatically with
> >>the amount of load the system is under.
> >
> > Incorrect. Most of the RAM in a lightly loaded Mac is not even being
> > used, so any soft failures in the RAM which is not being used will not
> > cause corruption.
>
> This demonstrates that you don't understand how OS X manages memory.
> It's almost always doing something with RAM.

All I am saying is that if you are running a dual G5, for example, and
the thing is just sitting there with a few of the usual OS X daemons
running, then the only RAM that is being _used_ is perhaps only a
part of the very first RAM module (#1), and the other 7 RAM modules
are not being used at all by the OS.

Even OS X is not smart enough to say to itself, "Hey, I have been using
RAM module #1 for a while, I think I will abandon #1 and start using #8
instead."

So the net effect is that RAM module #1 is used all the time the Mac is
running, whereas RAM module #8 is only used when the Mac is very busy.

As we know, a heavily used RAM module is more likely to go bad than a
RAM module that is just sitting powered up but unused in the Mac.

The reason is that internal voltages in the "active" module are
frantically being switched from zeroes to ones and vice versa.

This constant switching puts strain on the RAM cell dielectric
insulation at the molecular level, resulting in accelerated _aging_
of the tiny bit of insulation in the individual RAM cells.

When that insulation eventually fails (deteriorates) - the RAM cell
itself fails. (a "hard" failure)

When the insulation is _about_ to fail, little things like
temporarily low voltages can cause an intermittant failure.

Easy enough to prove in a test chamber, where two identical RAM modules
are being tested. The "exercised" module will always fail before the
"idling" module does.


A RAM module unopened in its original packing will likely last a
hundred years and still be "good".

The same module placed in a computer running heavily 24/7 will stay
reliable for only about 7 years, according to what IBM says in the
article.


> You seem to be thinking that 'soft errors' are a major concern. They
> aren't. Well, not relatively close to sea-level anyway.

Here is a way to convince yourself otherwise. Take a Mac that is
running perfectly.

Heavily exercise the RAM with 24/7 running of Photoshop rendering
operations for example.

Eliminate any possible disk problems by using redundant RAID, such that
if one disk's media fails, it will not corrupt anything.

The files *WILL* get corrupted on this Mac, it is merely a matter of
time, and _all_ the corruption will be because of "soft" RAM errors.



Now the same Mac placed in a lead box or in an underground location
will not fail nearly as soon, because less radiation reaches the
RAM.

Why do you think that a Mac at 60,000 feet fails 100 times more often?
It is directly because of the increased radiation at high altitudes!

(the 100 times figure directly from the article)
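
For what it is worth, the arithmetic behind those figures is easy to
check. The sketch below assumes a 30-day month and takes the
6,400-per-month and 100-times figures quoted in this thread at face
value; the function and variable names are mine, not from the article.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def minutes_between_errors(errors_per_month):
    """Average gap between soft errors, in minutes, for a steady rate."""
    return MINUTES_PER_MONTH / errors_per_month


high_altitude_rate = 6400   # errors per month at 60,000 feet (quoted figure)
altitude_factor = 100       # the "100 times" figure quoted from the article
sea_level_rate = high_altitude_rate / altitude_factor

# Average gap at altitude, in minutes
print(minutes_between_errors(high_altitude_rate))
# Average gap at sea level, in hours
print(minutes_between_errors(sea_level_rate) / 60)
```

That works out to roughly one error every 7 minutes at altitude, which
is in the same ballpark as the 8 minutes quoted above, and about one
every 11 hours at sea level if the 100-times figure holds for the same
machine.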



So in summary, it is my contention that almost all file system failures
and file corruption are caused by _temporarily_ bad RAM, called
"soft failures".

Soft failures in turn are almost always caused by gamma radiation
and/or high speed nuclei of atoms, mis-named "cosmic rays".

Gamma radiation is very similar to X-rays, only at a higher frequency.

In fact, temporary RAM soft errors _can_ be caused by subjecting the
RAM module to high intensity X-Radiation.

Even low intensity radiation has been implicated as causing RAM soft
errors, stuff like the Alpha and Beta particles emitted by the stray
isotopes of lead and tin found in the solder used on circuit boards.
(all according to that web article)

All this can be easily proved by placing the RAM module in such
unfriendly environments, then observing the failure rate.

(bad environments such as high altitudes, outer space, and also in
man-made devices that are capable of generating either gamma radiation
or high speed atomic nuclei)

The more intense the bombardment, the more often soft failures will
occur.
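
One way to put rough numbers on that is to assume soft errors arrive
independently at a constant average rate (a Poisson model, which is my
assumption and not something the article states). The rates in the
loop are purely illustrative, and the function name is made up for
this sketch.

```python
import math


def prob_at_least_one_error(errors_per_month, hours):
    """Chance of one or more soft errors in `hours`, under a Poisson model."""
    hours_per_month = 30 * 24  # assumption: a 30-day month
    expected = errors_per_month * hours / hours_per_month
    return 1.0 - math.exp(-expected)


# Illustrative rates only (errors per month), from mild to intense bombardment
for rate in (64, 640, 6400):
    print(rate, round(prob_at_least_one_error(rate, hours=1), 3))
```

Under that model the chance of seeing at least one error in any given
hour climbs steadily toward certainty as the rate goes up, which is
all the claim above amounts to.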

A Mac in a mile-deep mine will seldom experience file corruption.

The same Mac in the space station will get its files corrupted often.
(unless shielded in a lead box)

The space shuttle carries five separate redundant computers just for
this reason.


Now I will grant you that our Macs at sea level are essentially
reliable machines; however, file damage occurs often enough to be
worrisome, even at sea level.

I am still trying to make up my mind on the value of "journaling".

Temporarily, in the recent past I have been leaving journaling "on",
but I think I will try it "off" just to see if it has any effect on my
creeping corruption of files.

FWIW, it probably will not make any difference one way or the other,
because I think that journaling is mainly designed to protect against
failures induced by power interruptions, not against the slow insidious
file system corruption caused by soft RAM failures.

To me, that means that journaling is no damn good for PowerBooks,
because they are immune to A.C. power interruption.

'course, I am probably wrong about how journaling works.
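
To make my understanding concrete, here is a toy sketch of journaling
in general (write the intended change to a log first, then apply it,
and replay the log after a crash). It is not how HFS+ actually
implements it; the class, its methods and the "catalog" key are all
made up for illustration.

```python
class ToyJournaledStore:
    """Toy write-ahead journal: log the intended change, then apply it."""

    def __init__(self):
        self.data = {}      # stands in for on-disk file system structures
        self.journal = []   # stands in for the on-disk journal

    def write(self, key, value, crash_before_apply=False):
        self.journal.append((key, value))  # 1. record the intent first
        if crash_before_apply:
            return                         # simulated power loss
        self.data[key] = value             # 2. apply the change
        self.journal.clear()               # 3. mark the journal entry done

    def recover(self):
        """After a crash, replay whatever the journal still holds."""
        for key, value in self.journal:
            self.data[key] = value
        self.journal.clear()


store = ToyJournaledStore()
store.write("catalog", "v1")
store.write("catalog", "v2", crash_before_apply=True)  # interrupted update
store.recover()
print(store.data["catalog"])
```

Note what the journal does not do: if a soft RAM error flips a bit in
`value` before it is logged, the journal faithfully records and
replays the bad value. It guards against interrupted updates, not
against data that was already corrupted in memory.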

Mark-