Re: Superstitious learning in Computer Architecture



On Sun, 27 Aug 2006 09:48:20 -0600, Steve Richfield wrote:

John,

There's no practical advantage to
having that managed by one single "instruction" rather than by a bunch of
simpler execution units operating in parallel, in a modern super-scalar or
VLIW CPU.

It's true that short loops can stay in the cache, and so instructions
don't really eat up that much memory bandwidth.

Without a LOT of logic or some other better approach (like in the
GE/Honeywell 600/6000 systems), re-executing the instructions requires
re-decoding (or lots more instructions on a RISC) and it ties up the
cache memory bus transferring more data as instructions than the
instructions are working on.

So? That's what instruction caches and Harvard architecture are for. The
sort of one-instruction-replacement vector operation that we're talking
about is a loop with a 100% hit in the instruction cache, which has
probably 256 bits or more of fetch width, independent of the ALU's access
to memory. The "LOT" of logic is either in the instruction scheduler or
the compiler, depending on whether we're talking about a VLIW like the TI
C6000 series or super scalar out-of-order, like the Opteron, Intel Core or
Power processors.

Sure, that's easy, if you want to build a processor with a peak flop rate
limited by memory bandwidth.


No, I'm rather more bold than that. I want a peak flop rate that
matches *cache* bandwidth.

I am a LOT more bold than even that! I want a peak flop rate that is
limited by many parallel memory buses all running at cache/clock speeds.
Only then would I consider replicating these cores.

The concept of cache is fundamentally flawed in that it STILL restricts
access to one word per clock cycle

No it doesn't, in general. Most modern cache systems have busses that are
wider than a "machine word". Whether they support multiple independent
parallel accesses across several banks, or interleaved wide accesses
through special multi-word loads and stores is very variable by
architecture.

, when a single modern ALU can easily
use 5 plus whatever is eaten up with instruction accesses. If/when you
put several ALU in there, you need proportionally more buses. There is
most of an order of magnitude in speed sacrificed by even HAVING a cache
in a single ALU system, and more than an order of magnitude in
multiple-ALU systems!

/*expletive-deleted*/ Cache is just another chunk of memory that makes
coding simpler. You can do the same with multiple pages of uniquely
addressable memory, (as is done in most DSPs and the auxiliary units on
the IBM Cell), but it's work to code.

Most people presume that at worst a cache will simply provide no
benefit, but they are MUCH MUCH worse than that even in normal
operation, because they force a choked memory architecture even with a
single ALU.

You are working from an incorrect assumption.

You're talking about heroic memory architecture, not processor
architecture, and even that's been impossible to organize essentially ever
since processor+cache moved onto one chip (for performance reasons,
remember). There isn't enough bandwidth.

This ONLY applies if you are NOT implementing the system on a wafer or
VLSI. On a wafer or VLSI, just implement more memory buses.

The argument against this that is generally presented, when I've seen it
discussed, is that the silicon processing optimized for high-density DRAM
is not good at doing processors, and vice-versa. Now, I know that IBM
now have an on-processor (embedded) DRAM process available, but I haven't
seen much being done with that, other than the big caches on their
p-series modules (which are pretty close to wafer-scale gizmos, using very
fancy bare-chip bonding things.) It doesn't seem to be taking the world
by fire, otherwise, so perhaps there are disadvantages, too.

There's also a processor+DRAM chip (Mitsubishi DN10000 series, from
memory) that is/was mostly used in cameras, I think. That was
particularly interesting because some use was made of the 1k-bit wide data
path. But again, it's not taken the world by storm, so there must be
other issues.

Yes, I realize that it is heroic memory architecture. Unless, of
course, your problems don't need more RAM than, say, 8 Megs or so -
which we can provide for you as cache in today's technology.

But, cache is very inefficient in real estate. Why not just use some
scratchpad RAM instead?

Why do you say "very" inefficient? I doubt that the tag infrastructure is
more than a few percent over the cost of the SRAM itself. The reason "why
not" is that programming overlays is a totally brain-melting experience,
and makes it really hard to make the resulting code portable. That kind
of hardware assist is well worth it, IMO.
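
As a rough check on the "few percent" figure, here is a minimal sketch of
the arithmetic. The cache geometry (32 KB, 4-way, 64-byte lines, 32-bit
physical addresses, two state bits per line) is an assumption chosen for
illustration, not a figure from this thread; LRU bits and ECC are ignored.

```python
def tag_overhead(cache_bytes, ways, line_bytes, addr_bits, state_bits=2):
    """Return (tag_bits, overhead_fraction) for a set-associative cache.

    All sizes must be powers of two. state_bits covers e.g. valid + dirty.
    The overhead is tag-plus-state bits divided by data bits, per line.
    """
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = sets.bit_length() - 1          # log2(sets)
    tag_bits = addr_bits - index_bits - offset_bits
    data_bits = line_bytes * 8
    return tag_bits, (tag_bits + state_bits) / data_bits


# Hypothetical geometry: 32 KB, 4-way, 64-byte lines, 32-bit addresses.
tag_bits, overhead = tag_overhead(32 * 1024, 4, 64, 32)
print(f"tag = {tag_bits} bits, overhead = {overhead:.1%}")
# prints: tag = 19 bits, overhead = 4.1%
```

Shorter lines or wider addresses raise the fraction, since the same tag is
then amortized over fewer data bits or simply gets longer.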

Are you ready to trade multiple simultaneous memory buses for it? Sounds
like a bad bargain to me.

What's so incompatible with the notion of cache and multiple memory
busses? Essentially all cache-based processors since the dawn of the RISC
era have had two memory busses on-chip: one for instructions, and another
for data. An increasing number have multiple independent data busses to
at least one level of cache.

Multiple memory buses sure beat interleaving. Of course, you can use
BOTH.

Not necessarily. It all comes down to bandwidth. If you have to fetch
and store data in multi-word lumps, but can do so in parallel with your
ALU operation, then you get the advantage of simplified address decoding
and bus architecture while still keeping your ALU busy. Makes the code
more complicated, perhaps, but most of the complexity lives in your highly
optimized matrix library.
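
A toy timing model of that trade-off, assuming each multi-word block takes
a fixed number of cycles to transfer and a fixed number to process, with
two buffers so one can fill while the other is in use. The function names
and the cycle counts are made up for illustration.

```python
def serial_cycles(blocks, xfer, compute):
    """Load a block, then process it, with no overlap between the two."""
    return blocks * (xfer + compute)


def overlapped_cycles(blocks, xfer, compute):
    """Double buffering: fetch block i+1 while the ALU works on block i.

    The first transfer and the last compute cannot be hidden; in between,
    each step costs whichever of the two is slower.
    """
    if blocks == 0:
        return 0
    return xfer + (blocks - 1) * max(xfer, compute) + compute


# Hypothetical numbers: 100 blocks, 8 cycles to move one, 10 to process one.
print(serial_cycles(100, 8, 10))      # prints: 1800
print(overlapped_cycles(100, 8, 10))  # prints: 1008
```

With these numbers the ALU is busy for 1000 of 1008 cycles. Once the
transfer time exceeds the compute time, the memory side sets the pace
instead, which is the "it all comes down to bandwidth" case.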

I think that we *could* then insist that the modules gate in a
well-behaved way to high impedance, and allow, say, 16-way interleaving
on their bus.

But still limited to only one word per clock cycle.

Says who?
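
For what it's worth, a sketch of low-order interleaving, where consecutive
addresses map to consecutive banks. The 16 banks come from the figure
above; the bank busy time and the one-request-per-cycle issue rule are
assumptions for illustration, and each "address" here stands for one bus
transfer, however wide that transfer is.

```python
def cycles_to_issue(addresses, banks=16, bank_busy=8):
    """Return the number of cycles needed to issue every request.

    One request may issue per cycle, but a bank that has accepted a
    request stays busy for bank_busy cycles. Bank = address mod banks.
    """
    free_at = [0] * banks    # first cycle each bank can accept a request
    cycle = 0
    for addr in addresses:
        bank = addr % banks
        cycle = max(cycle, free_at[bank])   # stall while the bank is busy
        free_at[bank] = cycle + bank_busy
        cycle += 1                          # next request issues a cycle later
    return cycle


print(cycles_to_issue(range(64)))              # stride 1:  prints 64
print(cycles_to_issue(range(0, 64 * 16, 16)))  # stride 16: prints 505
```

In this model a unit-stride stream keeps the bus delivering one transfer
every cycle despite the slow banks, so if each transfer is a multi-word
lump the sustained rate is more than one word per clock. A stride equal to
the bank count hits the same bank every time and exposes the full bank
busy time.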

And so I'm thinking of having signalling tech equivalent to RDRAM, on a
256-bit wide bus. I figure it's doable; current packages have plenty of
pins.

Still, this sounds GREAT for systems where the processor is separate
from the memory, but not good for wafer/VLSI implementations.

The main reason for limiting signalling rate on wide DRAM, and why PCIe
has moved to byte-wide "lanes", and why even wide busses on-chip cause
problems is skew.

So it's all being pushed as hard as it's possible to go: the limit is
how many pins you can put on the chip, and how fast you can transmit
the bits across those pins. With today's processors, that rate is
significantly lower than the "peak" rate that you can cycle a floating
point MAC unit, when operating from on-chip registers or cache.
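
To put rough numbers on that gap, a back-of-envelope sketch. The figures
are illustrative assumptions rather than measurements of any particular
part: 128 data pins at 0.667 Gbit/s each (roughly a dual-channel DDR2-667
interface), against a 3 GHz MAC unit streaming two fresh 64-bit operands
per cycle with no reuse from registers or cache.

```python
def pin_bandwidth_gbytes(pins, gbits_per_pin):
    """Aggregate off-chip bandwidth in GB/s for the given data pins."""
    return pins * gbits_per_pin / 8


def mac_demand_gbytes(clock_ghz, operands_per_cycle, operand_bytes):
    """Operand bandwidth needed to keep one MAC unit busy, in GB/s."""
    return clock_ghz * operands_per_cycle * operand_bytes


supply = pin_bandwidth_gbytes(128, 0.667)   # assumed memory interface
demand = mac_demand_gbytes(3.0, 2, 8)       # assumed streaming MAC unit
print(f"supply {supply:.1f} GB/s, demand {demand:.1f} GB/s, "
      f"ratio {demand / supply:.1f}x")
```

Under these assumptions a single MAC unit can consume operands several
times faster than the pins can deliver them, before counting results
written back or any additional execution units.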

This is ONLY a consideration because of the present prohibition against
WSI implementations. In WSI, pins are no limitation.

In WSI you can't (as far as I know) simultaneously have dense DRAM and
fast processors. Also: you do have to have something that looks like
pins. You can't just expose a whole wafer in one shot at the moment.
Chips like the Itanium are limited in size by the optics used by the
printing process. If they could make those chips bigger, they would. So
even with WSI, you need a tessellation of things that look like chips,
which means that you need to connect them together with wires that are
"long" by on-chip signalling standards, which means that you are still
going to be effectively pin-limited. Mind you, IBM's multi-chip-module
technology and Sun's capacitive pin stuff are all aimed directly at this
issue of increasing the number of signals that you can get on and off a
chip. So it *is* being worked on.

Only if you INSIST on putting your memory on different silicon. On the
same wafer or VLSI you can have many buses per ALU and all of the speed
that it brings.

I don't believe that that's the case, or rather, there seem to be catches
and caveats to doing it that make it not work as well in practice as one
might like. Flip chips and 3D stacking seem like more promising
alternatives. Staying 2D requires wires that are too long, and making too
many concessions at the process level.

Unfortunately, no one wanted to discuss this, because they all wanted
MORE and MORE in their languages, not less and less. Obviously, a lesser
language would NOT be compatible with current language specs, and who
would ever want THAT?

Java is a pretty small language (as was Modula-3 before it). Even
C-the-language is fairly small, if a bit ungainly. Not everyone is
bent on adding features and complexity.

Perhaps we need some sort of SPL (Supercomputer Programming Language)
that would be a stripped down form of APL that uses a standard character
set, and just abandon thoughts of ever running C++. Any thoughts.

There is currently a megabuck DARPA-funded program where (I think) Cray,
Sun and IBM are competing to come up with just such an animal. Will be
interesting to see how that goes. Google for DARPA HPCS, Chapel, X10, and
Fortress (the names of the Cray, IBM and Sun projects, respectively).

Cheers,

--
Andrew
