Re: 54 Processors?
- From: Anne & Lynn Wheeler <lynn@xxxxxxxxxx>
- Date: Wed, 27 Jul 2005 00:18:03 -0600
edgould@xxxxxxxxxxxx (Ed Gould) writes:
> Now I know I am out of date on this but somewhere in the mists of
> time, I could swear that IBM came out saying that anything above 18(???
> this is a number I am not sure of) was not good, in fact it was bad as
> the interprocessor costs were more overhead than they were
> worth. They cited some physics law (IIRC).
> Did IBM rethink the "law" or are they just throwing 54 processors out
> hoping no one will order it?
> My memory is cloudy but I seem to recall these statements around the
> time of the 168MP.
a big problem was the strong memory consistency model and the cache
invalidation model. two-processor smp 370 cache machines ran at .9
times the cycle of a single processor machine ... to allow for cross-cache
invalidation protocol chatter (any actual invalidates would slow the
machine down even further). this resulted in the basic description that
two-processor 370 hardware was 1.8 times (aka 2*.9) a uniprocessor
.... actual cross-cache invalidation overhead and additional smp
operating system overhead might make actual thruput 1.5 times a
uniprocessor.
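The arithmetic above can be written out as a small sketch. The .9 cycle factor and the 1.5 observed figure are the ones quoted in this post; the function name and the derived "further loss" percentage are only illustrative and are not IBM ratings.

```python
def hardware_thruput(n_processors, cycle_factor):
    """Nominal hardware thruput relative to one uniprocessor."""
    return n_processors * cycle_factor

# Figures quoted in the post for a two-processor 370 cache machine.
nominal = hardware_thruput(2, 0.9)   # the "2*.9" rating
observed = 1.5                       # possible actual thruput

# Share of the nominal rating lost to real invalidates plus smp
# operating system overhead (derived here, not stated in the post).
further_loss = 1 - observed / nominal

print(f"nominal {nominal:.2f}x, observed {observed:.2f}x, "
      f"further loss {further_loss:.1%}")
```

Under these numbers roughly a sixth of the nominal two-processor rating goes to actual invalidation traffic and operating system overhead, on top of the 10 percent cycle slow-down already built into the hardware.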
we actually had a 16-way 370/158 design on the drawing boards (with
some cache consistency sleight of hand) that never shipped ... minor
reference:
http://www.garlic.com/~lynn/2005m.html#48 Code density and performance?
3081 was supposed to be a native two-processor machine ... and there
was never originally going to be a single processor version of the 3081.
eventually a single processor 3083 was produced (in large part because
TPF didn't have smp software support and a lot of TPF installations
were saturating their machines ... some TPF installations had used
vm370 on 3081 with a pair of virtual machines ... each running a TPF
guest). the 3083 processor was rated at something like 1.15 times the
hardware thruput of one 3081 processor (because they could eliminate
the slow-down for cross-cache chatter).
a 4-way 3084 was much worse ... because each cache had to listen for
chatter from three other processors ... rather than just one other
processor.
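A rough way to see why the listening load grows with processor count, assuming a pure broadcast scheme in which every cache listens to every other cache (an assumption for illustration; the 16-way and 54-way rows are just the sizes mentioned in this thread, not a description of how those designs actually handled coherence):

```python
def snoop_sources(n_processors):
    """Other caches whose invalidation chatter one cache must listen to."""
    return n_processors - 1

def listener_pairs(n_processors):
    """Total (listener, source) pairs across the whole configuration."""
    return n_processors * snoop_sources(n_processors)

for n in (2, 4, 16, 54):
    print(f"{n:2d}-way: each cache listens to {snoop_sources(n):2d} others, "
          f"{listener_pairs(n)} listener/source pairs in total")
```

The per-cache load grows linearly and the total grows roughly with the square of the processor count, which is why later designs had to move away from simple broadcast chatter.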
this was the time-frame when vm370 and mvs kernels went thru
restructuring to align kernel dynamic and static data on cache-line
boundaries and multiples of cache-line allocations (minimizing a lot
of cross-cache invalidation thrashing). supposedly this restructuring
got something over five percent increase in total system thruput.
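The effect of that kind of restructuring can be sketched with a toy layout model. The 64-byte line size and the 24-byte per-processor control blocks are hypothetical values picked for illustration (not 370 figures), and the helper names are made up for this sketch.

```python
LINE_SIZE = 64  # bytes; assumed for illustration, not a 370 figure

def align_up(offset, boundary=LINE_SIZE):
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary

def layout(sizes, aligned):
    """Return (start, length) for each field, packed or line-aligned."""
    fields, offset = [], 0
    for size in sizes:
        if aligned:
            offset = align_up(offset)
        fields.append((offset, size))
        offset += size
    return fields

def shared_lines(fields):
    """Count cache lines holding data from more than one field."""
    owners = {}
    for index, (start, length) in enumerate(fields):
        first = start // LINE_SIZE
        last = (start + length - 1) // LINE_SIZE
        for line in range(first, last + 1):
            owners.setdefault(line, set()).add(index)
    return sum(1 for who in owners.values() if len(who) > 1)

per_cpu_sizes = [24, 24, 24, 24]  # hypothetical per-processor blocks
packed = shared_lines(layout(per_cpu_sizes, aligned=False))
padded = shared_lines(layout(per_cpu_sizes, aligned=True))
print("packed layout, lines shared between processors:", packed)
print("aligned layout, lines shared between processors:", padded)
```

In the packed layout a store by one processor to its own block can invalidate a line that also holds another processor's block; aligning each block on a line boundary trades some storage for removing that cross-cache traffic.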
later machines went to things like using a cache cycle time that was
much faster than rest of the processor (for handling all the
cross-cache chatter) and/or using more complex memory consistency
operations ... to relax the cross cache protocol chatter bottleneck.
around 1990, SCI (scalable coherent interface) defined a
memory consistency model that supported 64 memory "ports".
Convex produced the Exemplar using 64 two-processor boards where the
two processors on the same board shared the same L2 cache ... and then
the common L2 cache interfaced to the SCI memory access port. This
provided for shared-memory 128 (HP RISC) processor configuration.
in the same time-frame, both DG and Sequent produced a four processor board
(using intel processors) that had shared L2 cache ... with 64 boards
in a SCI memory system ... supporting shared-memory 256 (intel)
processor configuration. Sequent was subsequently bought by IBM.
part of SCI was a dual-simplex fiber optic asynchronous interface
.... rather than a single, shared synchronous bus .... SCI defined bus
operation with essentially asynchronous (almost message like)
operations being performed (somewhat latency and thruput compensation
compared to a single, shared synchronous bus).
SCI had a definition for asynchronous memory bus operation. SCI also has
definition for I/O bus operation ... doing things like SCSI operations
IBM 9333 from hursley had done something similar with serial copper
.... effectively encapsulating scsi synchronous bus operations into
asynchronous message operations. Fiber channel standard (FCS, started
in the late 80s) also defined something similar for I/O protocols.
we had wanted the 9333 to evolve into an FCS compatible infrastructure
but the 9333 stuff instead evolved into SSA.
ibm mainframe eventually adopted a form of FCS as FICON.
SCI, FCS, and 9333 ... were all looking at pairs of dual-simplex,
unidirectional serial transmission using asynchronous message flows
partially as latency compensation (not requiring end-to-end synchronous
operation).
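The latency compensation argument can be put as a toy timing model. It assumes an unlimited number of outstanding messages and that the two unidirectional links never contend with each other; the function names and the numbers are made up for illustration and do not come from any of the SCI, FCS, or 9333 specifications.

```python
def synchronous_time(n_ops, round_trip, transfer):
    """Each operation waits for its own end-to-end handshake."""
    return n_ops * (round_trip + transfer)

def asynchronous_time(n_ops, round_trip, transfer):
    """Messages are sent back to back; only one round trip is exposed."""
    return round_trip + n_ops * transfer

ops, rtt, xfer = 100, 10.0, 1.0  # made-up units, for illustration only
print("synchronous bus :", synchronous_time(ops, rtt, xfer))
print("async messages  :", asynchronous_time(ops, rtt, xfer))
```

The longer the link (larger round trip relative to transfer time), the more the synchronous form is dominated by waiting, which is the part the message-flow approach hides.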
a few recent postings mentioning 9333/ssa:
http://www.garlic.com/~lynn/2005.html#50 something like a CTC on a PC
http://www.garlic.com/~lynn/2005m.html#35 IBM's mini computers--lack thereof
http://www.garlic.com/~lynn/2005m.html#46 IBM's mini computers--lack thereof
a few recent postings mentioning SCI
http://www.garlic.com/~lynn/2005d.html#20 shared memory programming on distributed memory model?
http://www.garlic.com/~lynn/2005e.html#12 Device and channel
http://www.garlic.com/~lynn/2005e.html#19 Device and channel
http://www.garlic.com/~lynn/2005.html#50 something like a CTC on a PC
http://www.garlic.com/~lynn/2005j.html#13 Performance and Capacity Planning
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/