Re: IBM x345 Server goes black during memory test of Samsung DIMMs
- From: Phil <silicontundra@xxxxxxxxx>
- Date: 1 May 2007 15:31:43 -0700
On May 1, 5:03 pm, Franc Zabkar <fzab...@xxxxxxxxxxxxxxxxx> wrote:
On Mon, 30 Apr 2007 12:43:41 GMT, Robert Redelmeier
<red...@xxxxxxxxxxxxxxx> put finger to keyboard and composed:
Franc Zabkar <fzab...@xxxxxxxxxxxxxxxxx> wrote in part:
I can't imagine that the paltry few bytes of CMOS RAM, most
I wasn't aware of any limit to CMOS RAM. Most systems have
little, but a very low-level designer could put in more,
probably at a different port address. BIOS isn't fixed.
Many (most?) systems now have 256 bytes of CMOS RAM. AFAIK, the first
128 bytes are accessed via ports 70/71h, and the next 128 bytes via
ports 72/73.
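A minimal sketch of the bank split described above, assuming the common layout in which the upper 128 bytes are indexed 0-127 through the second port pair. The function name cmos_port_pair is hypothetical, and the extended-bank indexing is chipset-dependent, so treat it as an illustration rather than a description of the x345's hardware. It only computes which ports would be used; it performs no port I/O.

```python
def cmos_port_pair(offset):
    """Map a CMOS byte offset (0-255) to (index_port, data_port, index).

    Assumption: bytes 0-127 sit behind ports 70h/71h and bytes 128-255
    behind ports 72h/73h, with the second bank indexed from 0.
    """
    if not 0 <= offset <= 0xFF:
        raise ValueError("offset out of range for a 256-byte CMOS RAM")
    if offset < 0x80:
        return 0x70, 0x71, offset        # standard bank
    return 0x72, 0x73, offset - 0x80     # extended bank (assumed indexing)


if __name__ == "__main__":
    for off in (0x0E, 0x7F, 0x80, 0xFF):
        index_port, data_port, index = cmos_port_pair(off)
        print(f"offset {off:#04x}: write {index:#04x} to port {index_port:#x}, "
              f"then read port {data_port:#x}")
```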
I suppose it's possible to have more CMOS RAM, but it could also be
that the Integrated System Management Processor has its own RAM or
EEPROM. FWIW, other IBM server products appear to write their error
logs to "NVRAM", which in PC terms usually refers to an EEPROM.
of which are already in use, would be enough to store more
than a handful of such errors. In any case, what is the point
of ECC if a system dies when its error log becomes full?
Avoiding error! In many business apps, errors are worse
than downtime. Keeping a suspect machine up that could be
propagating errors and enshrining them in a database is a
DB admin's worst nightmare.
Not in my experience. The performance penalty of a faulty memory bit
usually amounted to no more than an extra clock cycle. Taking a
mainframe out of service for a non-fatal error would have meant that
up to a dozen workstations would have been idle. Furthermore, many
servers run 24/7 doing batch jobs.
The whole point of ECC, especially in servers, is to provide a fault
tolerant system. If the error log is full, then the machine should
alert the operator, but that's all. In fact the OP's machine does
indicate when the log is 75% full.
My experience of ECC memory in mainframes is that a computer
can run forever, albeit with a minor performance penalty,
So long as the errors are rare and not localized.
Not true. With ECC you can have a dead bit at *every* address in
*every* memory module and still have a functioning system. It's only
when you have a multi-bit error that the system can break down.
See the references to ChipKill and "memory scrubbing" in IBM's
documentation.
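To illustrate the single-bit versus multi-bit distinction, here is a toy SECDED (single-error-correct, double-error-detect) code: a Hamming(7,4) code plus one overall parity bit. This is only a sketch. Registered ECC DIMMs carry 8 check bits per 64 data bits, and the actual code used by the x345's chipset is not stated in this thread. The names encode and decode are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming parity bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Parity bit p makes every position whose index has bit p set sum to even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(code) % 2 == 0
    if syndrome and overall_ok:
        # Two bits flipped: detectable, but the position cannot be determined.
        return None, "uncorrectable"
    fixed = list(code)
    status = "clean"
    if syndrome:
        fixed[syndrome] ^= 1     # the syndrome is the index of the bad bit
        status = "corrected"
    elif not overall_ok:
        fixed[0] ^= 1            # the overall parity bit itself was hit
        status = "corrected"
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # one stuck bit: still readable
    print(decode(word))
    word[6] ^= 1                 # second bad bit in the same word
    print(decode(word))
```

Every word can carry one bad bit and still be read correctly, which is the sense in which a dead bit at every address is survivable. A second bad bit in the same word is only detected, not repaired.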
It also
depends very much on the calcs. A scientific machine doing
iterative calcs could probably tolerate/heal error much
better than an accounting package running integers.
I don't see it.
if RAM errors are limited to a single data bit per word. I
can't see why PCs would be any different.
It may help to know which chipset is detected by memtest86+. I found
one URL which suggests that the OP's chipset may be the Serverworks
Serverset CNB20-HE.
FWIW, the following URL describes a problem with memtest86+ v1.65:
Support for Serverworks Serverset (CNB20HE)?
http://forum.x86-secret.com/archive/index.php/t-4459.html
The author writes:
"This [test failure] seems to happen only with 2x1GB memory strips.
... If I test with 2x512MB everything works fine."
- Franc Zabkar
Tested with a slightly older version of MemTest86+ v1.65 on the
original IBM x345 Server OEM 256MB DDR SDRAM memory DIMM sticks in the
server's slot 1 and 2. The results were again similar, with the
computer hanging after about 45 minutes of testing and again
requiring an R&R of the CMOS battery before it would boot again.
The memory is mfgr'd by Micron Tech with two different date codes.
Again my conclusion is that there is a thermally related failure mode;
the older 2002 date codes failing first, presumably fabricated with
older process technology that results in higher power consumption.
My conclusion is that gamer-type RAM coolers (convection heat sinks)
are required to reduce memory reliability issues with IBM OEM DIMM
memory in their legacy 2002 xSeries servers, even though the 2U
servers are quite well designed with dual redundant banks of 4 fans
across the cross-section of the chassis (wind-tunnel type design). Any
SEs concur? Regards, Phil
Details follow:
IBM memory P/N 38L4029 FRU 09N4306, 2 sticks of 256MB PC2100 CL2.5
2.5V registered ECC, double sided, organized 32M x 72
The older Micron Tech PC2100A-25330-M1 DIMM with 18 chips 46V32M4-75A
date code late 2002. This pair hung 32 min into testing cycle with
each pass taking 11 min for 512MB, or 2 1/2 passes.
The newer Micron Tech PC2100A-25331-Z DIMM with 18 chips 46V32M4-75B
date code mid 2003. This pair hung 49 min (similar to newer 1GB DIMM)
into testing cycle, or just over 4 passes. Hung at Test5, Block move.
The chips were almost-too-hot-to-touch with the pinky finger.
Thanks guys for your comments; my replies to your questions are:
1) RR: the computer was internally very clean, as if this unit had
been an on-the-shelf backup. Nowhere do any manuals state that a full
BIOS System-Error log prevents the machine from booting.
2) FZ#2: the computer operation is factory stock; none of the LEDs
light up in IBM's Light Path Diagnostics Panel or on the mainboard or
planar. Thanks for the correction; memory operation has a 10ns cycle time
at 100MHz clock. Samsung parts indeed are K4510638D-TB80, 8ns parts,
with sufficient design bandwidth margin. DDR clocking gives the 266MHz
operation to the Xeon processors with their 533MHz FSB. My feeling on
memory chip power consumption is that it is more than the 1.5 watts
spec.
3) DC: not using IBM's ChipKill technology DIMMs, so deallocation of
a block of memory space is not effected.
4) FZ#4: again not using IBM ChipKill DIMMs. I'll again refer you to
IBM's White paper on ChipKill.
5) FZ, RR: the HW Manual references (p34, 94) do not pertain to the
problem at hand. Same with the User Manual, p5, 6.
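The clock figures in reply 2 can be cross-checked with a few lines of arithmetic. This is a rough sketch: the helper names are hypothetical, and the 133.33 MHz value is an assumption based on the nominal clock of PC2100 (DDR-266) modules and a quad-pumped 533MHz FSB, not a measurement from this server.

```python
def cycle_time_ns(clock_mhz):
    """Clock period in nanoseconds for a clock given in MHz."""
    return 1000.0 / clock_mhz


def max_clock_mhz(cycle_ns):
    """Highest clock in MHz that a part rated for the given cycle time supports."""
    return 1000.0 / cycle_ns


def ddr_transfers_per_us(clock_mhz):
    """DDR moves data on both clock edges: two transfers per clock (MT/s)."""
    return 2 * clock_mhz


def peak_bandwidth_mb_s(clock_mhz, data_bytes=8):
    """Peak bandwidth of a 64-bit (8-byte) data path, ECC bits excluded."""
    return ddr_transfers_per_us(clock_mhz) * data_bytes


if __name__ == "__main__":
    print(cycle_time_ns(100.0))                 # period of a 100 MHz clock
    print(max_clock_mhz(8.0))                   # clock limit of an 8 ns part
    nominal = 133.33                            # assumed PC2100 clock, MHz
    print(round(cycle_time_ns(nominal), 2))
    print(round(ddr_transfers_per_us(nominal)))
    print(round(peak_bandwidth_mb_s(nominal)))  # roughly the "2100" in PC2100
    print(round(4 * nominal))                   # quad-pumped FSB rate
```

A 100MHz clock gives a 10ns period and an 8ns part tops out at 125MHz, while a 266MHz data rate implies a clock of about 133MHz with a 7.5ns period.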