Re: Superstitious learning in Computer Architecture
- From: jsavard@xxxxxxxxx
- Date: 26 Aug 2006 18:52:34 -0700
Andrew Reilly wrote:
Sure, that's what I'm talking about.
Ah, then I'm sorry for having misunderstood you, thinking you were
talking only about MMX.
There's no practical advantage to
having that managed by one single "instruction" rather than by a bunch of
simpler execution units operating in parallel, in a modern super-scalar or
VLIW CPU.
It's true that short loops can stay in the cache, and so instructions
don't really eat up that much memory bandwidth.
Sure, that's easy, if you want to build a processor with a peak flop rate
limited by memory bandwidth.
No, I'm rather more bold than that. I want a peak flop rate that
matches *cache* bandwidth. One of everything per cycle - even
*divisions*. Yes, three or four Wallace trees, laid out in a row, to do
Goldschmidt division with full pipelining. (And a little extra
circuitry so that the divide unit, when idle, adds three to the
superscalar number of multiplies available...)
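For anyone who hasn't met it, here is a minimal sketch of the Goldschmidt
iteration in Python. The function name and the iteration count are my own
choices for illustration. Real hardware would normally start from a table
lookup so that three or four multiply stages are enough; this sketch starts
from the raw scaled divisor, so it runs six steps to reach double precision.

```python
import math


def goldschmidt_divide(n, d, iterations=6):
    """Approximate n / d for d > 0 using Goldschmidt iteration."""
    if d <= 0:
        raise ValueError("this sketch assumes a positive divisor")
    # Scale so the divisor lies in [0.5, 1); scaling both operands by the
    # same power of two leaves the quotient unchanged.
    mantissa, exponent = math.frexp(d)  # d == mantissa * 2**exponent
    n = math.ldexp(n, -exponent)
    d = mantissa
    for _ in range(iterations):
        f = 2.0 - d  # correction factor that pushes d toward 1
        n *= f       # the two multiplies do not depend on each other,
        d *= f       # which is what makes the scheme easy to pipeline
    return n         # d is now ~1, so n is ~ the original quotient
```

Each pass squares the error in d, and the two multiplies in a pass are
independent, which is why a row of multiplier trees can accept a new
division every cycle.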
I dare say that would take
comparatively few transistors, indeed. The DSP chip that I was using more
than ten years ago could manage what you ask, but today's can't, because
the relationship between core speed and memory speed has changed. I
wouldn't ask to go back to that state, though. Excess in-core performance
is useful.
Oh, yes.
You're talking about heroic memory architecture, not processor
architecture, and even that's been impossible to organize essentially ever
since processor+cache moved onto one chip (for performance reasons,
remember). There isn't enough bandwidth.
Yes, I realize that it is heroic memory architecture. Unless, of
course, your problems don't need more RAM than, say, 8 Megs or so -
which we can provide for you as cache in today's technology.
But heroic memory architecture _is being done_: the NEC SX-6 (and
SX-6r, don't forget!), with 2,048-way interleaved memory and no cache,
proves it. (They are using the space saved on the chip for other
things, though. I would like to have that and a cache _too_, thank you
very much, if I could...)
For a consumer product, I get only semi-heroic. Let's use present or
near-future technology for RAM modules, say 2-way to 8-way interleaved
inside the chip.
I think that we *could* then insist that the modules gate in a
well-behaved way to high impedance, and allow, say, 16-way interleaving
on their bus.
I think we can get enough pins on a chip to have a 256-bit bus, never
mind 64-bit.
Then, I imagine that one could have eight memory buses in parallel,
each with a controller that doubles as an external floating-point (and
integer, just in case) vector unit. Connect those to a particularly
*fast* bus going into the CPU.
64-way interleaving seems rather more bearable than 2,048-way
interleaving. Still going to be pricey compared to the ordinary CPU,
but instead of $180,000, I think it could be pulled off for, say,
$5,000.
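To show why the degree of interleaving matters, here is a toy model in
Python. It is only a sketch: the function names, the one-access-per-cycle
issue rate, and the 32-cycle bank busy time are assumptions of mine, not
figures for any real memory part.

```python
def bank_of(word_address, num_banks):
    """Low-order interleaving: consecutive words go to consecutive banks."""
    return word_address % num_banks


def cycles_to_stream(num_words, num_banks, bank_busy_cycles, stride=1):
    """Cycles needed to issue num_words accesses, one issue per cycle at best.

    A bank that has just been accessed stays busy for bank_busy_cycles
    (a hypothetical figure), and an access to a busy bank stalls.
    """
    free_at = [0] * num_banks  # cycle at which each bank is next available
    cycle = 0
    for i in range(num_words):
        bank = bank_of(i * stride, num_banks)
        cycle = max(cycle, free_at[bank])  # stall until the bank is free
        free_at[bank] = cycle + bank_busy_cycles
        cycle += 1
    return cycle


print(cycles_to_stream(64, 64, 32))  # enough banks to hide the busy time
print(cycles_to_stream(64, 8, 32))   # too few banks, so accesses stall
```

With these made-up numbers, 64 banks deliver a word every cycle on a
unit-stride stream, while 8 banks spend most of their time stalled. A
stride equal to the number of banks hits the same bank every time, which
is the classic bad case for any interleaved memory.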
Occasionally some of the "real vector" guys whinge that no-one's building
them a 3GHz SX-6 or Y-MP or Cyber, without looking closely at why that
might not be happening.
I would have thought that's mainly not because physical limits make it
impossible, but simply because the advantage in speed over your current
common-or-garden Pentium or Opteron just isn't *remotely* worth the price
tag these days.
So what I'd like to do is take the Pentium or Opteron, and make it look
a *little* bit more like an SX-6 or Cray X-1 without launching the
price into the stratosphere.
I think that with a little work, they can make this more reasonable.
After all, there was a style of memory module that only had 16 data
lines, but yet kept up with conventional ones with 64 data lines... and
current conventional memory modules at least do two-way interleaving
these days.
That'd be rdram. The last generation Alpha had (if I remember correctly)
four independent rdram channels, as do most of the high-end GPUs.
That edge has largely been overtaken by beefing up the signalling tech on
the more conventional, wider DRAMs. That's what "DDR" (double data rate)
and other such acronyms are all about.
And so I'm thinking of having signalling tech equivalent to RDRAM, on a
256-bit wide bus. I figure it's doable; current packages have plenty of
pins.
So it's all being pushed as hard as it's possible to go: the limit is how
many pins you can put on the chip, and how fast you can transmit the bits
across those pins. With today's processors, that rate is significantly
lower than the "peak" rate that you can cycle a floating point MAC unit,
when operating from on-chip registers or cache.
No argument there, I know that too. But if one's vectors are not too
terribly long, one might well be able to operate out of cache for some
problems; they're putting a lot of cache on chips these days.
I'm assuming, in my fantasy chip, a 4,096-bit wide bus between cache and
the ALU bank: 64 flops every cycle. (The chip could be implemented in
perhaps two to four years, but it wouldn't be practical, except in a form
stripped of certain *other* excesses, for a long time after that.) And
that isn't so far-out any more. A CELL can do 16 flops a cycle: 8 SPUs,
each working on 128-bit vectors, which is two 64-bit numbers in parallel.
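The arithmetic behind those figures, as a small Python sketch. It counts
one flop per 64-bit operand delivered per cycle, which is the same
simplification used in the text above; the function name is just for
illustration.

```python
def flops_per_cycle(bus_bits, operand_bits=64):
    """64-bit operands a bus of the given width delivers each cycle."""
    return bus_bits // operand_bits


fantasy_chip = flops_per_cycle(4096)   # cache-to-ALU bus: 4096 / 64
cell = 8 * flops_per_cycle(128)        # 8 SPUs x (128 / 64) operands each
external_bus = flops_per_cycle(256)    # words per transfer on a 256-bit bus

print(fantasy_chip, cell, external_bus)
print(fantasy_chip // external_bus)    # how much wider the cache path is
```

So the cache path is sixteen times wider than a single 256-bit external
bus, which is the gap that the parallel memory buses and the interleaving
are supposed to narrow.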
Look at the stream triad
figures. [Actually, I just have, and it's clear that there are some
machines being built that *do* have heroic memory architectures. For
example, there's an Alpha system with a "balance" of only 3.1 that whups
an eight-processor SX-6. The graph on the front page of the stream web
site is still a pretty reasonable description of the trend, though.]
Interesting. I thought the SX-6 was an example of a heroic memory
architecture. The Alpha must have a good interface.
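For reference, a sketch of the triad kernel and of "balance" as I
understand the STREAM site to define it: peak floating-point rate divided
by sustained memory rate in words. The numbers in the usage line are made
up for illustration and do not describe any machine mentioned here.

```python
def triad(b, c, q):
    """STREAM-style triad: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def machine_balance(peak_flops_per_s, sustained_bytes_per_s, word_bytes=8):
    """Peak flops per sustained memory word; lower means better balanced."""
    words_per_s = sustained_bytes_per_s / word_bytes
    return peak_flops_per_s / words_per_s


print(triad([1.0, 2.0], [3.0, 4.0], 2.0))
# Hypothetical machine: 4 Gflop/s peak, 8 GB/s sustained on triad.
print(machine_balance(4e9, 8e9))
```

Each triad element moves three words (two loads and a store) for two
flops, so a machine needs a balance somewhere near one before this kernel
stops being limited by memory.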
It's not architecture so much as physics. BlueGene and the Opteron based
Crays are a response to the realities of physics, not some penny-pinching
desire to use off-the-rack processors because they're cheap.
I thought the two were *the same thing*. Off-the-rack processors get to
have almost the same performance as something fancy because the
realities of physics favor them.
I wasn't castigating the Opteron-based Crays the way I *would* have
castigated them if Cray were Myrias - that is, if they were trying to
build a supercomputer out of a whole pile of 68020 chips, or 386 chips,
for example.
They *are* pretty much doing what I believe is the correct strategy -
first, make a processor that is as big, fast, and powerful as is
practical under current technology, and *then* put a bunch of them in
parallel.
At best, one might squeeze in a *small* improvement - maybe a bit
better than a factor of two, but that sort of thing - by adding a few
vector supercomputer tricks to what the commodity microprocessors are
already doing.
The
catch is that until languages and system management software catch up, you
have to program each of those parallel memory buses separately.
Nine women cannot have a baby together in one month. It isn't
necessarily just a problem of waiting for backwards software writers to
catch up with the modern world.
John Savard