Re: Superstitious learning in Computer Architecture



On Thu, 24 Aug 2006 22:55:47 -0700, jsavard wrote:
No, no, he means *real* vector arithmetic.

Where you have one instruction, and it plows through three arrays in
memory... doing one floating-point multiply per cycle in the pipelines
for just about as long as you want.

Sure, that's what I'm talking about. In a modern super-scalar or VLIW
CPU, there's no practical advantage to having that managed by one single
"instruction" rather than by a bunch of simpler execution units operating
in parallel.

If you have an architecture that can pull *this* off without the vectors
having to be in the cache, you're talking about stuff like the SX-6 from
NEC. And its CPU is said not to require more transistors than a modern
Pentium.

Sure, that's easy, if you want to build a processor with a peak flop rate
limited by memory bandwidth. I dare say that would take
comparatively few transistors, indeed. The DSP chip that I was using more
than ten years ago could manage what you ask, but today's can't, because
the relationship between core speed and memory speed has changed. I
wouldn't ask to go back to that state, though. Excess in-core performance
is useful.

You're talking about heroic memory architecture, not processor
architecture, and even that's been impossible to organize essentially ever
since processor+cache moved onto one chip (for performance reasons,
remember). There isn't enough bandwidth. GPUs are the current kings of
bandwidth relative to processor speed, but that's mainly because mainstream
conventional processors run single threads much faster than GPUs do. The limit at the
moment is how many pins you can put on a chip.

Of course, you need to spend a lot on memory for one of those chips...
2,048-way interleaving means you can't just put in *one* memory stick;
let's see, now, 1,024 memory sticks at about $50 a pop... no wonder a
single-CPU SX-6r costs $180,000 since the memory is probably about half
of that!
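To put numbers on that estimate, here is a rough sketch of the arithmetic using only the figures quoted above. The variable names are made up for illustration, and the split of two interleave ways per stick is an inference from 2,048 ways over 1,024 sticks, not a published spec.

```python
# Figures quoted in the post; nothing here is taken from an NEC price list.
INTERLEAVE_WAYS = 2048
STICKS = 1024
PRICE_PER_STICK = 50      # USD, the "about $50 a pop" guess
SYSTEM_PRICE = 180_000    # USD, quoted single-CPU SX-6 price


def memory_cost(sticks, price_per_stick):
    """Total cost of the memory sticks alone."""
    return sticks * price_per_stick


ways_per_stick = INTERLEAVE_WAYS // STICKS
total = memory_cost(STICKS, PRICE_PER_STICK)
share = total / SYSTEM_PRICE

print(f"{ways_per_stick} ways per stick, ${total:,} of memory, "
      f"{share:.0%} of system price")
```

At $50 a stick the memory works out to $51,200, or roughly 28% of the quoted price, so "about half" would need sticks closer to $88 each.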

And then worry about physical distance, fan-out, buffering etc.

Occasionally some of the "real vector" guys whinge that no-one's building
them a 3GHz SX-6 or Y-MP or Cyber, without looking closely at why that
might not be happening.
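A back-of-envelope sketch of why: one multiply per cycle streaming through three arrays means two loads and one store per cycle, with nothing coming from cache. The function names and the 1 Gbit/s per-pin signalling rate below are assumptions picked for illustration, not figures for any real part.

```python
def required_bandwidth(clock_hz, words_per_cycle=3, bytes_per_word=8):
    """Bytes/s needed to stream c[i] = a[i] * b[i] straight from memory."""
    return clock_hz * words_per_cycle * bytes_per_word


def data_pins_needed(bandwidth_bytes_per_s, bits_per_pin_per_s):
    """Minimum data pins, ignoring address, control and protocol overhead."""
    bits_per_s = bandwidth_bytes_per_s * 8
    return -(-bits_per_s // bits_per_pin_per_s)  # ceiling division


bw = required_bandwidth(3_000_000_000)        # the hypothetical 3 GHz part
pins = data_pins_needed(bw, 1_000_000_000)    # ASSUMED 1 Gbit/s per pin

print(f"{bw / 1e9:.0f} GB/s sustained, at least {pins} data pins")
```

Under those assumptions a single 3GHz pipeline wants 72 GB/s and 576 data pins before counting address or control lines, which is the pin-count limit in concrete terms.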

I think that with a little work, they can make this more reasonable.
After all, there was a style of memory module that only had 16 data
lines, but yet kept up with conventional ones with 64 data lines... and
current conventional memory modules at least do two-way interleaving
these days.
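For reference, a minimal model of what interleaving buys, assuming simple low-order interleaving (bank = address mod number of banks) and a fixed bank busy time. The function names and the 16-bank, 8-cycle figures are hypothetical, chosen only to show the shape of the trade-off.

```python
from math import gcd


def bank_of(word_address, n_banks):
    """Low-order interleaving: consecutive words land in consecutive banks."""
    return word_address % n_banks


def sustained_words_per_cycle(n_banks, bank_busy_cycles, stride=1):
    """Steady-state words/cycle for a constant-stride stream, capped at 1.

    A stride-s stream only ever touches n_banks / gcd(n_banks, s) banks,
    and each bank can deliver one word per bank_busy_cycles.
    """
    banks_touched = n_banks // gcd(n_banks, stride)
    return min(1.0, banks_touched / bank_busy_cycles)


# ASSUMED: 16 banks, each busy for 8 cycles per access.
print(sustained_words_per_cycle(16, 8))             # unit stride
print(sustained_words_per_cycle(16, 8, stride=16))  # every access hits one bank
print(sustained_words_per_cycle(2, 8))              # two-way interleaving only
```

In this model you need at least as many banks as the bank busy time in cycles to sustain one word per cycle, and an unlucky stride can throw the benefit away.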

That'd be RDRAM. The last generation Alpha had (if I remember correctly)
four independent RDRAM channels, as do most of the high-end GPUs.
That edge has largely been overtaken by beefing up the signalling tech on
the more conventional, wider DRAMs. That's what "DDR" (double data rate)
and other such acronyms are all about. Current generation Opterons
and the like can support two DDR channels. Most modern RAM can maintain at
least four, probably more, pages open at once (interleaving in the
mainframe sense), and within a page you've got effectively arbitrary
pipeline-able access for on the order of a disk sector's worth of data.
So it's all being pushed as hard as it's possible to go: the limit is how
many pins you can put on the chip, and how fast you can transmit the bits
across those pins. With today's processors, that rate is significantly
lower than the "peak" rate that you can cycle a floating point MAC unit,
when operating from on-chip registers or cache. Look at the stream triad
figures. [Actually, I just have, and it's clear that there are some
machines being built that *do* have heroic memory architectures. For
example, there's an Alpha system with a "balance" of only 3.1 that whups
an eight-processor SX-6. The graph on the front page of the stream web
site is still a pretty reasonable description of the trend, though.]
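For anyone who hasn't looked at those figures, here is a small sketch of the triad kernel and of "balance" as I understand the STREAM site to use the term: peak floating-point rate divided by sustained memory words per second. The function names and the 6 GFLOP/s and 4 GB/s inputs are invented for illustration, not measurements of any machine mentioned here.

```python
def triad(b, c, q):
    """STREAM-style triad: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def machine_balance(peak_flops_per_s, sustained_bytes_per_s, bytes_per_word=8):
    """Peak flops per sustained memory word; higher means more starved."""
    words_per_s = sustained_bytes_per_s / bytes_per_word
    return peak_flops_per_s / words_per_s


a = triad([1.0, 2.0], [3.0, 4.0], 2.0)

# ASSUMED example machine: 6 GFLOP/s peak, 4 GB/s sustained triad bandwidth.
balance = machine_balance(6e9, 4e9)

print(a, balance)
```

A balance of 12 in this made-up case means the core could do twelve flops in the time memory delivers one word, so a balance of 3.1 is very good by comparison.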

It's not architecture so much as physics. BlueGene and the Opteron-based
Crays are a response to the realities of physics, not some penny-pinching
desire to use off-the-rack processors because they're cheap. In a sense,
they're all about increasing the number of high-speed channels to memory,
and moving the memory closer to the processor (so that it's faster). The
catch is that until languages and system management software catch up, you
have to program each of those parallel memory busses separately.

Cheers,

--
Andrew
