Re: Superstitious learning in Computer Architecture



John,

There's no practical advantage to
having that managed by one single "instruction" or a bunch of simpler
execution units operating in parallel, in a modern super-scalar or VLIW
CPU.

It's true that short loops can stay in the cache, and so instructions
don't really eat up that much memory bandwidth.

Without a LOT of logic or some other better approach (like in the GE/Honeywell 600/6000 systems), re-executing the instructions requires re-decoding (or lots more instructions on a RISC) and it ties up the cache memory bus transferring more data as instructions than the instructions are working on.

Sure, that's easy, if you want to build a processor with a peak flop rate
limited by memory bandwidth.


No, I'm rather more bold than that. I want a peak flop rate that
matches *cache* bandwidth.

I am a LOT more bold than even that! I want a peak flop rate that is limited by many parallel memory buses all running at cache/clock speeds. Only then would I consider replicating these cores.

The concept of cache is fundamentally flawed in that it STILL restricts access to one word per clock cycle, when a single modern ALU can easily use 5, plus whatever is eaten up by instruction accesses. If/when you put several ALUs in there, you need proportionally more buses. Most of an order of magnitude in speed is sacrificed by even HAVING a cache in a single-ALU system, and more than an order of magnitude in multiple-ALU systems!

Most people presume that at worst a cache will simply provide no benefit, but caches are MUCH MUCH worse than that even in normal operation, because they force a choked memory architecture even with a single ALU.

One of everything per cycle - even
*divisions*.

Of course, divisions are among the simplest of operations on a logarithmic ALU.
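
To make the point about logarithmic arithmetic concrete, here is a minimal sketch, assuming a toy logarithmic number system in which a positive value is stored as its base-2 logarithm. The function names are made up for illustration, and sign and zero handling are left out. Division becomes a single subtraction of the stored exponents:

```python
import math

def to_lns(x):
    """Encode a positive real as its base-2 logarithm (toy format, positive values only)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    return math.log2(x)

def from_lns(e):
    """Decode a toy LNS value back to an ordinary float."""
    return 2.0 ** e

def lns_mul(a, b):
    """Multiplication in the log domain is an addition."""
    return a + b

def lns_div(a, b):
    """Division in the log domain is a subtraction."""
    return a - b

quotient = from_lns(lns_div(to_lns(12.0), to_lns(3.0)))
```

The price, not shown here, is that addition and subtraction of values become the hard operations in such a format.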

Yes, three or four Wallace trees, laid out in a row, to do
Goldschmidt division with full pipelining. (And a little extra
circuitry so that the divide unit, when idle, adds three to the
superscalar number of multiplies available...)
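
For readers who have not met it, Goldschmidt division can be sketched in software as below. This is only an illustration under stated assumptions: the divisor is positive, it is scaled into [0.5, 1) first, and no table-lookup seed is used, so six iterations are run to reach double precision. The function name and iteration count are choices made for this sketch.

```python
import math

def goldschmidt_divide(n, d, iterations=6):
    """Approximate n / d for d > 0 by Goldschmidt iteration."""
    if d <= 0:
        raise ValueError("this sketch only handles positive divisors")
    # Scale so the divisor lies in [0.5, 1); scale the numerator identically.
    m, e = math.frexp(d)          # d == m * 2**e with 0.5 <= m < 1
    n = math.ldexp(n, -e)
    d = m
    for _ in range(iterations):
        f = 2.0 - d               # correction factor
        n *= f                    # these two multiplies are independent,
        d *= f                    # so they can run side by side
    return n                      # d has converged to 1, so n is the quotient

approx = goldschmidt_divide(7.0, 3.0)
```

Because the two multiplies in each step do not depend on each other, each step maps naturally onto a multiplier stage, which is presumably what laying the multiplier trees out in a row buys.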

How about Cray-style division, where you first compute 1/denominator, and then multiply the numerator by that value?
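
A minimal sketch of reciprocal-then-multiply division follows. The reciprocal here is refined by Newton-Raphson from a linear first guess; that choice, the function names, and the iteration count are assumptions of this sketch, not a description of any particular machine's hardware.

```python
import math

def reciprocal(d, iterations=4):
    """Approximate 1 / d for d > 0 by Newton-Raphson refinement."""
    if d <= 0:
        raise ValueError("this sketch only handles positive values")
    m, e = math.frexp(d)                 # d == m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - 32.0 / 17.0 * m    # linear first guess for 1/m on [0.5, 1]
    for _ in range(iterations):
        x = x * (2.0 - m * x)            # each step roughly doubles the correct bits
    return math.ldexp(x, -e)             # undo the scaling: 1/d == (1/m) * 2**-e

def divide_via_reciprocal(n, d):
    """Compute n / d as n * (1 / d)."""
    return n * reciprocal(d)

approx = divide_via_reciprocal(7.0, 3.0)
```

One known trade-off of this approach is that the final multiply adds its own rounding, so the result is not guaranteed to be the correctly rounded quotient.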

You're talking about heroic memory architecture, not processor
architecture, and even that's been impossible to organize essentially ever
since processor+cache moved onto one chip (for performance reasons,
remember). There isn't enough bandwidth.

This ONLY applies if you are NOT implementing the system on a wafer or VLSI. On a wafer or VLSI, just implement more memory buses.

Yes, I realize that it is heroic memory architecture. Unless, of
course, your problems don't need more RAM than, say, 8 Megs or so -
which we can provide for you as cache in today's technology.

But, cache is very inefficient in real estate. Why not just use some scratchpad RAM instead?

But heroic memory architecture _is being done_, the NEC SX-6 (and
SX-6r, don't forget!) with 2,048-way interleaved memory (and no
cache... but then, they are using the space saved on the chip for other
things. I would like to have that and a cache _too_, thank you very
much, if I could...

Are you ready to trade multiple simultaneous memory buses for it? Sounds like a bad bargain to me.

) proves it.

For a consumer product, I get only semi-heroic. Let's use present or
near-future technology for RAM modules, say 2-way to 8-way interleaved
inside the chip.

Multiple memory buses sure beat interleaving. Of course, you can use BOTH.

I think that we *could* then insist that the modules gate in a
well-behaved way to high impedance, and allow, say, 16-way interleaving
on their bus.

But still limited to only one word per clock cycle.
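
The interleaving-versus-buses argument can be put into a toy model. This is a sketch under simplifying assumptions that are not from the thread: purely sequential streaming access, each bus carrying at most one word per cycle, and each bank able to deliver one word every `bank_busy_cycles` cycles. All names and numbers are hypothetical.

```python
def sustained_words_per_cycle(buses, banks_per_bus, bank_busy_cycles):
    """Peak streaming rate of a toy memory system, in words per clock cycle."""
    # Interleaving hides bank recovery time, but one bus never exceeds 1 word/cycle.
    per_bus = min(1.0, banks_per_bus / bank_busy_cycles)
    return buses * per_bus

one_bus_interleaved = sustained_words_per_cycle(1, 16, 4)   # 16-way interleave, one bus
many_buses_fast_ram = sustained_words_per_cycle(8, 1, 1)    # 8 buses, memory at clock speed
both_combined = sustained_words_per_cycle(8, 2, 4)          # 8 buses, 2-way interleave each
```

In this model interleaving only lifts a single bus up to its one-word-per-cycle ceiling, while extra buses multiply that ceiling, and the two combine when the banks are slower than the clock.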

I think we can get enough pins on a chip to have a 256-bit bus, never
mind 64-bit.

Sure, but only ONE of them. Remember, you want a design where the pieces can all be built using the same technology, so if you are going to have a wide memory bus, you then need some complex logic to multiplex in and out of it. If this logic is then limited in speed to the same sorts of speeds as memory runs at, it will choke your speed down.

Your statements seem to presume logic that runs much faster than the buses, a condition that does not so much apply on a wafer or VLSI because the buses are MUCH shorter.

Then, I imagine that one could have eight memory buses in parallel,
each with a controller that doubles as an external floating-point (and
integer, just in case) vector unit.

I don't see how this would work. The ALUs would then each need their own parallel buses to feed data. It sounds like you may have something interesting here. How would it work?

Connect those to a particularly
*fast* bus going into the CPU.

On a wafer or VLSI, everything pretty much runs at the same speed.

64-way interleaving seems rather more bearable than 2,048-way
interleaving. Still going to be pricey compared to the ordinary CPU,
but instead of $180,000, I think it could be pulled off for, say,
$5,000.

YES, my Y2K price point.

Occasionally some of the "real vector" guys whinge that no-one's building
them a 3GHz SX-6 or Y-MP or Cyber, without looking closely at why that
might not be happening.

As I see it, this is NOT possible unless it is all on one chip. Once there, you can have a nearly unlimited number of parallel buses, and can use these to keep any number of parallel ALUs busy.

I would have thought it's mainly not because physical limits make it
impossible, but simply because the advantage in speed over your current
common garden Pentium or Opteron just isn't *remotely* worth the price
tag these days.

But, it is the physical limits that force "conventional" architectures that so limit present systems. Once everything is on the same piece of silicon, you can build your CPU to have the thousands of pseudo-pins needed to support many parallel (not through interleaving) memory buses needed to keep a bunch of ALUs turning results out every clock cycle.

So what I'd like to do is take the Pentium or Opteron, and make it look
a *little* bit more like an SX-6 or Cray X-1 without launching the
price into the stratosphere.

It only gets expensive when you start chopping the wafer up into little pieces.

And so I'm thinking of having signalling tech equivalent to RDRAM, on a
256-bit wide bus. I figure it's doable; current packages have plenty of
pins.

Still, this sounds GREAT for systems where the processor is separate from the memory, but not good for wafer/VLSI implementations.

So it's all being pushed as hard as it's possible to go: the limit is how
many pins you can put on the chip, and how fast you can transmit the bits
across those pins. With today's processors, that rate is significantly
lower than the "peak" rate that you can cycle a floating point MAC unit,
when operating from on-chip registers or cache.

This is ONLY a consideration because of the present prohibition against WSI implementations. In WSI, pins are no limitation.

No argument there, I know that too. But if one's vectors are not too
terribly long, one might well be able to operate out of cache for some
problems; they're putting a lot of cache on chips these days.

I'm assuming, in my fantasy chip that could be implemented in perhaps
two to four years (but which wouldn't be practical, except in a form
stripped of certain *other* excesses, for a long time after that) a
4,096-bit wide bus between cache and the ALU bank. 64 flops every
cycle. And that isn't so far-out any more. A CELL can do 16 flops a
cycle, 8 SPUs, 128-bit vectors, two 64-bit numbers in parallel.
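
The flop counts above follow from simple bus-width arithmetic, which can be checked as below. The sketch uses the post's own accounting of one flop per 64-bit operand delivered per cycle; the function name is made up for illustration.

```python
def flops_per_cycle(bus_bits, operand_bits=64):
    """Number of operands (counted as flops) a bus of this width delivers per cycle."""
    return bus_bits // operand_bits

fantasy_chip = flops_per_cycle(4096)        # 4,096-bit bus between cache and ALU bank
cell = 8 * flops_per_cycle(128)             # 8 SPUs, each with 128-bit vectors
```

Here `fantasy_chip` works out to 64 and `cell` to 16, matching the figures quoted.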

This would require a mixed-technology VLSI which no one would ever build. Without mixed technology, every time you do anything with the data (like mux and de-mux it onto the wide bus), you would either have to slow the entire system down to do it, or utilize pipeline registers which would add lots of latency.

I wasn't castigating the Opteron-based Crays the way I *would* have
castigated them if Cray were Myrias - that is, if they were trying to
build a supercomputer out of a whole pile of 68020 chips, or 386 chips,
for example.

They *are* pretty much doing what I believe is the correct strategy -
first, make a processor that is as big, fast, and powerful as is
practical under current technology, and *then* put a bunch of them in
parallel.

At best, one might squeeze in a *small* improvement - maybe a bit
better than a factor of two, but that sort of thing - by adding a few
vector supercomputer tricks to what the commodity microprocessors are
already doing.

Only if you INSIST on putting your memory on different silicon. On the same wafer or VLSI you can have many buses per ALU and all of the speed that it brings.

Like "scalar promotion", where a temporary variable becomes an array, you would probably need "vector promotion", where an array sometimes gets copied from one memory module to another to prevent multiple ALUs from attempting to access the same memory module at the same time.
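
A toy model of what "vector promotion" would buy, assuming each memory module can serve one word per cycle and each ALU wants one operand in the same step. The array names, module numbers, and the `cycles_needed` helper are all hypothetical.

```python
from collections import Counter

def cycles_needed(modules_requested):
    """Cycles for one step when each module serves one word per cycle."""
    if not modules_requested:
        return 0
    # The busiest module sets the pace; everything else waits for it.
    return max(Counter(modules_requested).values())

# Hypothetical layout: arrays a and b share module 0, c lives in module 1.
placement = {"a": 0, "b": 0, "c": 1}
before = cycles_needed([placement[name] for name in ("a", "b", "c")])

# "Vector promotion": copy b into an otherwise idle module 2.
promoted = dict(placement, b=2)
after = cycles_needed([promoted[name] for name in ("a", "b", "c")])
```

With the shared module the step takes two cycles, and after the copy it takes one. The copy itself costs bandwidth, so it only pays off when the array is reused enough times.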

The
catch is that until languages and system management software catch up, you
have to program each of those parallel memory busses separately.

Another pet peeve:

The size of an optimizing compiler is proportional to the SQUARE of the size of the language times the SQUARE of the complexity of the machine, because all interactions must be considered. Without considering this, language standards committees throw in everything including the kitchen sink, and Intel keeps adding new things rather than cleaning up what they already have. The net result is a mess where it takes years to implement a good compiler even with Microsoft's unlimited budgets purchasing hordes of Russian compiler writers.

On other forums I have attempted to talk up a new concept in languages, where the same simple semantics can be stated in two different ways - In English much like COBOL, or in a simple formula form much like Algol. A given program could be edited or listed in either way. The relatively simple internal form could then be targeted to various CPUs with reasonable efforts.

Unfortunately, no one wanted to discuss this, because they all wanted MORE and MORE in their languages, not less and less. Obviously, a lesser language would NOT be compatible with current language specs, and who would ever want THAT?

Note in passing that the first language implemented on many supercomputers is APL, because there it is almost an assembly language!

Perhaps we need some sort of SPL (Supercomputer Programming Language) that would be a stripped-down form of APL that uses a standard character set, and just abandon thoughts of ever running C++. Any thoughts?

Nine women cannot have a baby together in one month. It isn't
necessarily just a problem of waiting for backwards software writers to
catch up with the modern world.

With enough money, you could order a baby today and indeed have it arrive in less than one month. Obviously, the method of manufacture would be proprietary.

Steve Richfie1d