Re: Superstitious learning in Computer Architecture
- From: Steve Richfie1d <Steve@xxxxxxxxxxxxxxxxxxxxx>
- Date: Sun, 27 Aug 2006 22:29:17 -0600
Andrew,
Hmmm, this discussion IS getting interesting...
There's no practical advantage to
having that managed by one single "instruction" or a bunch of simpler
execution units operating in parallel, in a modern super-scalar or VLIW
CPU.
It's true that short loops can stay in the cache, and so instructions
don't really eat up that much memory bandwidth.
Without a LOT of logic or some other better approach (like in the GE/Honeywell 600/6000 systems), re-executing the instructions requires re-decoding (or lots more instructions on a RISC) and it ties up the cache memory bus transferring more data as instructions than the instructions are working on.
So? That's what instruction caches and Harvard architecture are for. The
sort of one-instruction-replacement vector operation that we're talking
about is a loop with a 100% hit in the instruction cache, which has
probably 256 bits or more of fetch width, independent of the ALU's access
to memory.
Of course, you must fetch and decode this operation during the previous operation to overlap its execution. This just all seems like a lot of complexity for no obvious (to me yet) gain.
The only gain that I can see is if the loops sometimes execute different instructions. I suppose that there might be some application for this sort of flexibility, though no obvious example immediately comes to mind.
The "LOT" of logic is either in the instruction scheduler or
the compiler, depending on whether we're talking about a VLIW like the TI
C6000 series or super scalar out-of-order, like the Opteron, Intel Core or
Power processors.
I would like to avoid building big instruction schedulers because of the awful experience that people have had with them. Being a compiler guy, I have no problem with fancy compilers.
Sure, that's easy, if you want to build a processor with a peak flop rate
limited by memory bandwidth.
No, I'm rather more bold than that. I want a peak flop rate that
matches *cache* bandwidth.
I am a LOT more bold than even that! I want a peak flop rate that is limited by many parallel memory buses all running at cache/clock speeds. Only then would I consider replicating these cores.
The concept of cache is fundamentally flawed in that it STILL restricts access to one word per clock cycle
No it doesn't, in general. Most modern cache systems have busses that are
wider than a "machine word". Whether they support multiple independent
parallel accesses across several banks, or interleaved wide accesses
through special multi-word loads and stores is very variable by
architecture.
These "modern" caches don't appear to be usable in a WSI world.
All of this presumes that the cache and CPU are faster than the memory or memory bus, which in WSI they are NOT. Memory is just a small number of gate delays. Also, after every 3 or 4 gates you need a pipeline register. In the world of fast CPUs and slower memories on long busses, I would agree with you.
In short, you need SIMPLE to avoid astronomical latencies for everything.
, when a single modern ALU can easily use 5 plus whatever is eaten up with instruction accesses. If/when you put several ALU in there, you need proportionally more buses. There is most of an order of magnitude in speed sacrificed by even HAVING a cache in a single ALU system, and more than an order of magnitude in multiple-ALU systems!
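A rough way to put numbers on that claim, as a sketch and not a measurement: assume each bus delivers one word per clock and each ALU operation needs a fixed number of words. The figure of 5 words per operation comes from the text; the function name, the one-op-per-clock ALU limit and the 4-ALU case are assumptions made for illustration.

```python
def ops_per_clock(words_per_op, buses, alus=1):
    """Sustained operations per clock when memory traffic is the only limit.

    Assumes each bus delivers one word per clock and each ALU retires
    at most one operation per clock (both are simplifications).
    """
    bandwidth_limit = buses / words_per_op   # ops/clock the buses can feed
    compute_limit = alus                     # ops/clock the ALUs can retire
    return min(bandwidth_limit, compute_limit)


# One ALU behind a single one-word-per-clock path.
single_bus = ops_per_clock(words_per_op=5, buses=1)

# The same ALU with one bus per operand word.
five_buses = ops_per_clock(words_per_op=5, buses=5)

# Four ALUs sharing the single path: utilisation per ALU drops further.
shared_utilisation = ops_per_clock(words_per_op=5, buses=1, alus=4) / 4

print(single_bus, five_buses, shared_utilisation)
```

Under these assumptions the single-path case runs at one fifth of the ALU's peak, and four ALUs sharing that path each run at one twentieth, which is the "most of an order of magnitude" and "more than an order of magnitude" being argued.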
/*expletive-deleted*/ Cache is just another chunk of memory that makes
coding simpler. You can do the same with multiple pages of uniquely
addressable memory, (as is done in most DSPs and the auxiliary units on
the IBM Cell), but it's work to code.
Yes, but this way you don't have to slow the memory down to deal with all of the gate delays to use wide words, use content addressable (cache) memory, etc.
Most people presume that at worst a cache will simply provide no benefit, but they are MUCH MUCH worse than that even in normal operation, because they force a choked memory architecture even with a single ALU.
You are working from an incorrect assumption.
If so, I STILL haven't seen it.
You're talking about heroic memory architecture, not processor
architecture, and even that's been impossible to organize essentially ever
since processor+cache moved onto one chip (for performance reasons,
remember). There isn't enough bandwidth.
This ONLY applies if you are NOT implementing the system on a wafer or VLSI. On a wafer or VLSI, just implement more memory buses.
The argument against this that is generally presented, when I've seen it
discussed, is that the silicon processing optimized for high-density DRAM
is not good at doing processors, and vice-versa.
I have had some long discussions with both processor and memory fab people about this. The differences are curious:
1. The MOS processes used for DRAM intentionally have HIGH gate capacitance to maximize the leakoff time, while processor processes have LOW gate capacitance for maximum speed. Of course if you go to CMOS then capacitance doesn't help the memory process. Also, if you break memory down into small modules, CMOS may not take any more space than the refresh logic.
2. It used to be that memory was a highly hand-optimized process while processors used lots of standard cells, etc. Now, they are starting to come together somewhat as memories get SO complex that they can't devote the hand labor to every transistor, and outfits like Intel can devote man lifetimes to hand-optimizing their choke points.
3. There is a MAJOR cultural difference. The people here on this forum would get along fine with processor fab people, but would probably immediately dislike and disrespect memory fab people.
The only way (that I see) past this is for processor fab people to come up to speed on memory methods, which appears to be exactly what has happened with the hybrid products that are out there.
Now, I know that IBM
now have an on-processor (embedded) DRAM process available, but I haven't
seen much being done with that, other than the big caches on their
p-series modules (which are pretty close to wafer-scale gizmos, using very
fancy bare-chip bonding things.) It doesn't seem to be taking the world
by fire, otherwise, so perhaps there are disadvantages, too.
There have been microcontroller chips like the 8051 around for a LONG time (like ~20 years), but the memories in these are quite small.
There's also a processor+DRAM chip (Mitsubishi DN10000 series, from
memory) that is/was mostly used in cameras, I think. That was
particularly interesting because some use was made of the 1k-bit wide data
path. But again, it's not taken the world by storm, so there must be
other issues.
I am only interested in products where, by putting the memory with the CPU, they get far more than they could ever get from separate memory and CPU.
Yes, I realize that it is heroic memory architecture. Unless, of
course, your problems don't need more RAM than, say, 8 Megs or so -
which we can provide for you as cache in today's technology.
But, cache is very inefficient in real estate. Why not just use some scratchpad RAM instead?
Why do you say "very" inefficient? I doubt that the tag infrastructure is
more than a few percent over the cost of the SRAM itself. The reason "why
not" is that programming overlays is a totally brain-melting experience,
and makes it really hard to make the resulting code portable. That kind
of hardware assist is well worth it, IMO.
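The "few percent" estimate can be checked with a back-of-the-envelope count of tag bits against data bits. This is only a sketch: the 8 MB size is taken from the discussion above, while the 64-byte line, 8-way associativity, 32-bit address and 2 state bits per line are assumed values, and the count covers storage bits only, not comparators or other control logic.

```python
def tag_overhead(cache_bytes, line_bytes, ways, addr_bits, state_bits=2):
    """Tag storage as a fraction of data storage for a set-associative cache.

    All sizes must be powers of two. state_bits stands in for per-line
    bookkeeping such as valid and dirty; 2 is an assumption, and real
    designs carry more (coherence state, replacement bits).
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return (tag_bits + state_bits) / (line_bytes * 8)


overhead = tag_overhead(cache_bytes=8 * 2**20, line_bytes=64,
                        ways=8, addr_bits=32)
print(f"{overhead:.1%}")
```

With these assumed parameters each 512-bit line carries a 12-bit tag plus 2 state bits, which is a little under 3% extra storage. Wider addresses or shorter lines push the figure up, so the result depends heavily on the parameters chosen.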
OK, I'll make a statement that will hopefully show our disconnect:
Probably the biggest memory-related problem in Multi-ALU, many-memory-bus architectures is where one ALU creates an array that is then needed by another ALU. One common solution is some ALU-to-ALU connection to ship the array over to the other ALU. Another better solution is to simply switch which ALU has access to the block of memory. There isn't any (obvious) way to connect cache up to such a thing because with hundreds of ALUs connected to thousands of memory blocks, could you imagine a cache with literally thousands of independent ports, or thousands of cache memories with all of the logic to be coherent?
As soon as you attempt to reduce these numbers through multiplexing, etc., you end up stealing either time or latency from the system.
Of COURSE this makes a memory allocation nightmare, but what are good compilers good for anyway?!
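To make the difference between the two hand-over schemes concrete, here is a minimal software model. It is purely illustrative: the class, method and function names are hypothetical, and real hardware would do the switch with bus gating, not a Python attribute.

```python
class MemoryBlock:
    """A block of RAM that exactly one ALU may access at a time (hypothetical model)."""

    def __init__(self, size, owner):
        self.words = [0] * size
        self.owner = owner

    def _check(self, alu):
        if alu != self.owner:
            raise PermissionError(f"ALU {alu} does not own this block")

    def write(self, alu, addr, value):
        self._check(alu)
        self.words[addr] = value

    def read(self, alu, addr):
        self._check(alu)
        return self.words[addr]

    def hand_off(self, from_alu, to_alu):
        """Switch which ALU has access to the block; no data moves."""
        self._check(from_alu)
        self.owner = to_alu


def copy_array(src, src_alu, dst, dst_alu):
    """The alternative: ship every word over an ALU-to-ALU link.

    Returns the number of words moved.
    """
    for addr in range(len(src.words)):
        dst.write(dst_alu, addr, src.read(src_alu, addr))
    return len(src.words)


# ALU 0 produces an array, then ALU 1 consumes it after a hand-off.
block = MemoryBlock(size=4, owner=0)
for addr in range(4):
    block.write(0, addr, addr * 10)
block.hand_off(0, 1)
result = [block.read(1, addr) for addr in range(4)]

# The copying scheme moves one word per element instead.
src = MemoryBlock(size=4, owner=0)
dst = MemoryBlock(size=4, owner=1)
words_moved = copy_array(src, 0, dst, 1)
```

The hand-off costs the same regardless of array size, while the copy grows with it. The price is that something, here the compiler, has to decide which block every array lives in.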
Correct me if I am wrong here, but I think that I am envisioning a MUCH more complex system than you have been, probably with at least hundreds of ALUs and at least thousands of memory busses spread over a wafer. Even if you COULD design some super duper cache to manage such a thing, it must also be fault tolerant, which I just don't see.
Hopefully I have just missed a piece of brilliance. Can you lead me from my above viewpoint?
Are you ready to trade multiple simultaneous memory buses for it? Sounds
like a bad bargain to me.
What's so incompatible with the notion of cache and multiple memory
busses? Essentially all cache-based processors since the dawn of the RISC
era have had two memory busses on-chip: one for instructions, and another
for data. An increasing number have multiple independent data busses to
at least one level of cache.
A great idea for just two caches and virtually no coherency problems. Now, how do you keep 10,000 of them coherent? If they are hitting the same addresses, you probably can't. Otherwise, I suppose that some small amount of cache wouldn't hurt, PROVIDED that you give up on coherency right from the start. However, if you design your instructions so that loops are RARE and provide enough registers to keep things in the CPU, it isn't obvious that you would have many cache hits.
Multiple memory buses sure beat interleaving. Of course, you can use
BOTH.
Not necessarily. It all comes down to bandwidth. If you have to fetch
and store data in multi-word lumps, but can do so in parallel with your
ALU operation, then you get the advantage of simplified address decoding
and bus architecture while still keeping your ALU busy. Makes the code
more complicated, perhaps, but most of the complexity lives in your highly
optimized matrix library.
Hmmm, let me restate this to see if I got it right.
You fetch and store several words at a time, feeding several ALUs with the data, so that you logically interleave the data without ever physically interleaving it. That would work for some simple operations like matrix add and subtract, but not for matrix multiply or scatter/gather operations.
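The distinction can be illustrated by counting how many multi-word fetches different access patterns need. This is a sketch under stated assumptions: each fetch returns one aligned group of consecutive words, matrices are stored row-major, and the 4-word fetch width and 16x16 size are arbitrary illustrative values.

```python
def wide_fetches(addresses, words_per_fetch):
    """Count the wide fetches needed to cover a list of word addresses.

    Assumes each fetch returns one aligned group of words_per_fetch
    consecutive words, and that a group is never fetched twice.
    """
    return len({addr // words_per_fetch for addr in addresses})


W = 4    # words per fetch (assumed)
n = 16   # vector length, and side of an n x n row-major matrix

# Operand of a vector or matrix add: consecutive words.
contiguous = list(range(n))

# One column of the matrix, as the inner loop of a matrix multiply walks it.
column = [row * n for row in range(n)]

add_fetches = wide_fetches(contiguous, W)
column_fetches = wide_fetches(column, W)
print(add_fetches, column_fetches)
```

For the contiguous case every fetched word is used, so 16 words take 4 fetches. For the column walk each fetch yields only one useful word, so 16 words take 16 fetches. Scatter/gather with arbitrary indices behaves like the column case or worse.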
I think that we *could* then insist that the modules gate in a
well-behaved way to high impedance, and allow, say, 16-way interleaving
on their bus.
I STILL don't see where this fits in with many data sources and sinks (store operations).
But still limited to only one word per clock cycle.
Says who?
OK, one something per clock cycle, only I STILL don't see how the same (slow) logic used for memory is ever going to deal with mux/de-mux at full speed with lots of pipeline registers.
And so I'm thinking of having signalling tech equivalent to RDRAM, on a
256-bit wide bus. I figure it's doable; current packages have plenty of
pins.
Still, this sounds GREAT for systems where the processor is separate
from the memory, but not good for wafer/VLSI implementations.
The main reason for limiting signalling rate on wide DRAM, and why PCIe
has moved to byte-wide "lanes", and why even wide busses on-chip cause
problems is skew.
The wider the bus is, the slower it can run. Wider buses should reduce skew.
So it's all being pushed as hard as it's possible to go: the limit is
how many pins you can put on the chip, and how fast you can transmit
the bits across those pins. With today's processors, that rate is
significantly lower than the "peak" rate that you can cycle a floating
point MAC unit, when operating from on-chip registers or cache.
YES, which is why it is necessary to move to wafers to blow away the pin limitation, which as I see it is 99% of the motivation for WSI.
This is ONLY a consideration because of the present prohibition against
WSI implementations. In WSI, pins are no limitation.
In WSI you can't (as far as I know) simultaneously have dense DRAM and
fast processors.
More accurately as I understand things: You must choose between DRAM and SRAM.
With DRAM, you get killed by all of the refresh circuitry unless the blocks are large, and even then you must somehow make it work timing-wise when almost anything you do will be timing skewed with other things that are happening across the wafers.
With SRAM, the cells are considerably larger - like 4 times or so depending on the cells you select.
There may be some clever approaches that haven't been fully evaluated, e.g. having coordinated DRAM blocks within a cluster used by a particular ALU, and then adjusting the timing to work with another connected ALU, etc., but this sounds hazardous with anything short of perfect design. I suspect that SRAM may be best until working wafers are made.
Of course, with a given process, pretty much everything runs at the same speed, from gates inside of your ALUs to the decoders inside of your RAM. This produces a different processor:memory speed ratio which some might call a slow processor, only more probably it is a fast RAM.
In short, you don't just throw a processor and a bunch of DRAM onto a wafer and consider it done.
Also, this must work in the presence of a defect for every million transistors or so.
Also: you do have to have something that looks like
pins. You can't just expose a whole wafer on one shot at the moment. Chips like the Itanium are limited in size by the optics used by the
printing process. If they could make those chips bigger, they would.
From what I have heard, the limitation is in the step-and-repeat equipment that makes the masks and not the masks themselves. However, in any case yes, you DO want to be able to chop wafers apart, as sometimes you DO have a catastrophic error, e.g. a short between power and ground.
I had envisioned maybe a dozen or so "cores" with wide interprocessor busses between them that are where the wafer is cut apart if you decide to cut it apart. Cutting the cores apart severs the bus that interconnects them. Within the cores, there will probably be lots of little rows of dots where different cells connect to one another. The cells would overlap at their connections. The point is that you don't need any 0.5 mm pads for wire bonding equipment to attach to - just a few microns to make sure that the cells on both sides successfully interconnect.
The thing that now limits chip size is yield. When the chips get so big that 50% of them are throwaways, they quit. HOWEVER, they are attempting to produce a fixed product with zero defects, instead of a huge processor where you won't know how many ALUs and how much RAM has survived until you turn it on.
This reminds me of the way that vacuum tubes used to be sold. You purchased them in packs of 5, but there were actually 6 in every pack along with a statement by the manufacturer that they did NOT want to hear about any bad tubes!
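The yield argument can be put into rough numbers with the usual Poisson defect model. This is a sketch, not a process model: the defect rate is the "defect for every million transistors or so" figure from above, while the assumption of independent, uniformly scattered defects and the transistor counts are illustrative.

```python
import math

# "a defect for every million transistors or so"
DEFECTS_PER_TRANSISTOR = 1 / 1_000_000


def zero_defect_probability(transistors, defect_rate=DEFECTS_PER_TRANSISTOR):
    """Poisson probability that a region of the given size has no defects."""
    return math.exp(-transistors * defect_rate)


# A fixed product that must be perfect: one 100M-transistor device (assumed size).
monolithic = zero_defect_probability(100_000_000)

# The same silicon split into 1000 independent cells of 100k transistors,
# where a bad cell is simply switched out.
cell_survival = zero_defect_probability(100_000)
expected_good_cells = 1000 * cell_survival

print(monolithic, cell_survival, expected_good_cells)
```

Under these assumptions the perfect monolithic device essentially never yields, while roughly 90% of the small cells survive. That is the gap between demanding zero defects and accepting whatever ALUs and RAM turn out to work.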
So
even with WSI, you need a tessellation of things that look like chips,
which means that you need to connect them together with wires that are
"long" by on-chip signalling standards, which means that you are still
going to be effectively pin-limited.
Yes, there ARE limits, but up in the thousands and not in the low hundreds as at present. Further, these pins are faster and better controlled for skew IF you have an extra layer of metalization for busses.
Mind you, IBM's multi-chip-module
technology and Sun's capacitive pin stuff are all aimed directly at this
issue of increasing the number of signals that you can get on and off a
chip. So it *is* being worked on.
Yes. It is nothing short of amazing to see the lengths that people will go through to avoid WSI, because no one will fund it as I explained in my article.
Only if you INSIST on putting your memory on different silicon. On the
same wafer or VLSI you can have many buses per ALU and all of the speed
that it brings.
I don't believe that that's the case, or rather, there seem to be catches
and caveats to doing it that make it not work as well in practice as one
might like. Flip chips and 3D stacking seem like more promising
alternatives.
I doubt that they will get the yields up high enough in the near future for WSI use of this - but then again, I have been wrong about lots of other things.
Staying 2D requires wires that are too long, and making too
many concessions at the process level.
You can have LOTS of layers of metalization. All it costs is money and some defects.
Unfortunately, no one wanted to discuss this, because they all wanted
MORE and MORE in their languages, not less and less. Obviously, a lesser
language would NOT be compatible with current language specs, and who
would ever want THAT?
Java is a pretty small language (as was Modula-3 before it). Even
C-the-language is fairly small, if a bit ungainly. Not everyone is
bent on adding features and complexity.
Both are more complex than original FORTRAN, which had INTENTIONAL impediments to push programmers to write more optimizable code, like the restriction that expressions within subscripts must be of the form A*X+B. In any case, too high of a language is MUCH easier to deal with than too low of a language, so I am still thinking that APL or matrix BASIC or something like that is the way to go. Java simply has no ability to make the high level statements that translate directly into vector programming.
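As an illustration of why a subscript of the form A*X+B is easy to optimize (a sketch of the general idea, not a description of what any particular FORTRAN compiler did): such a subscript walks memory at a constant stride, so the multiply can be replaced by one add per iteration, or the whole access can be expressed as a single strided vector operation. The function names are hypothetical.

```python
def gather_by_subscript(x, a, b, n):
    """Evaluate the subscript a*i + b from scratch on every iteration."""
    return [x[a * i + b] for i in range(n)]


def gather_by_stride(x, a, b, n):
    """Same elements, stepping the index by a constant stride."""
    out = []
    index = b
    for _ in range(n):
        out.append(x[index])
        index += a          # one add replaces the multiply
    return out


def gather_as_vector_op(x, a, b, n):
    """Same elements as one strided slice (requires a > 0)."""
    return x[b : b + a * n : a]


x = list(range(100))
by_subscript = gather_by_subscript(x, a=3, b=2, n=10)
by_stride = gather_by_stride(x, a=3, b=2, n=10)
as_vector = gather_as_vector_op(x, a=3, b=2, n=10)
```

All three produce the same ten elements. An arbitrary subscript expression gives the compiler no such guarantee, which is the point of the restriction.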
Perhaps we need some sort of SPL (Supercomputer Programming Language)
that would be a stripped down form of APL that uses a standard character
set, and just abandon thoughts of ever running C++. Any thoughts?
There is currently a megabuck DARPA-funded program where (I think) Cray,
Sun and IBM are competing to come up with just such an animal. Will be
interesting to see how that goes. Google for DARPA HPCS, Chapel, X10, and
Fortress (the names of the Cray, IBM and Sun projects, respectively).
Thanks. That should entertain me for a day or so.
Steve Richfie1d