Re: Superstitious learning in Computer Architecture



John,

Now, though, we seem to have reached the limit. We can't make our CPUs
any bigger or more powerful, we can only put more of them on a chip. So
the only change will be how badly we need to make our algorithms
parallel from now on.

Perhaps I didn't make myself 100% clear. There is an easy order of magnitude in speed available by simply switching over to the CDC Cyber 200 series architecture. There, they multiplexed several ALUs so that if you had, say, 4 of them on your system, the first ALU would handle the 1st, 5th, 9th, 13th, etc. operation on an array. They had amazingly complex array operations available. Your array instruction had lots of blanks for the compiler to fill in to get what you needed. When it executed, the entire supercomputer stopped for ~2us while a room full of switched hardware reconfigured everything to execute the operation you specified, much as though it had been wired to do just that instruction!
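To make the interleaving concrete, here is a minimal Python sketch of how element positions would be dealt out across the ALUs. The function name `lane_assignment` and its parameters are made up for illustration, and the sketch models only the round-robin assignment described above, not the actual Cyber 200 hardware.

```python
def lane_assignment(n_elements, n_alus=4):
    """Return, for each ALU, the 1-based element positions it would handle.

    Hypothetical helper: models only the round-robin interleaving,
    nothing about timing or the real instruction format.
    """
    lanes = [[] for _ in range(n_alus)]
    for position in range(1, n_elements + 1):
        # Position 1 goes to ALU 0, position 2 to ALU 1, and so on,
        # wrapping around after the last ALU.
        lanes[(position - 1) % n_alus].append(position)
    return lanes


lanes = lane_assignment(16, n_alus=4)
for alu, positions in enumerate(lanes, start=1):
    print(f"ALU {alu}: {positions}")
```

With 4 ALUs and 16 elements, the first ALU gets positions 1, 5, 9, and 13, as in the description.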

They could even execute DO loops with IF statements in them as a series of array operations with NO LOOPS! They would first execute all of the occurrences of the IF statement as array operations and develop a bit array of the outcomes, then they made each of the following array operations conditional on the bit array.
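The bit-array idea can be sketched in a few lines of Python. This is only an illustration of the technique under assumed names (`scalar_loop`, `masked_array_ops`) and an assumed loop body (double every positive element); it is not the Cyber 200 instruction set, and plain Python lists stand in for the hardware array operations.

```python
def scalar_loop(a, b):
    """Conventional form: a loop with an IF inside it."""
    out = list(b)
    for i in range(len(a)):
        if a[i] > 0.0:
            out[i] = 2.0 * a[i]
    return out


def masked_array_ops(a, b):
    """Same result as whole-array steps controlled by a bit array."""
    # Step 1: evaluate the IF condition for every element at once.
    mask = [x > 0.0 for x in a]
    # Step 2: compute the loop body as a whole-array operation.
    doubled = [2.0 * x for x in a]
    # Step 3: store results only where the mask bit is set.
    return [d if m else old for m, d, old in zip(mask, doubled, b)]


a = [1.0, -2.0, 3.0, 0.0]
b = [9.0, 9.0, 9.0, 9.0]
print(scalar_loop(a, b))
print(masked_array_ops(a, b))
```

Both forms give the same answer. The difference is that the second one has no data-dependent branch per element, which is what lets it run as a sequence of array operations.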

Yes, I should also include in my article my tenure at CDC maintaining the optimizer and vectorizer in that compiler. I got to chase the last of its bugs out just before they scrapped the whole program.

I think there are still a few tricks that a Pentium IV cannot do that
the processors from an SX-6 or a Cray X-1 were able to do,

Back when CDC and Cray were competitors, the Cray won where the loops typically involved ~10 or fewer elements, while the CDC won where the loops were larger. The CDC approach had a high startup overhead (~2us) to run a loop, but produced several real-world results per 12ns clock cycle once it got going.
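A rough way to see why loop length matters is to amortize the startup cost over the number of elements. The sketch below uses the ~2us startup and 12ns clock mentioned above; the figure of 2 results per clock is purely an assumption standing in for "several", and the function names are hypothetical.

```python
STARTUP_NS = 2000.0       # ~2us startup overhead, from the text above
CLOCK_NS = 12.0           # 12ns clock cycle, from the text above
RESULTS_PER_CLOCK = 2.0   # ASSUMPTION: placeholder for "several"


def total_time_ns(n, startup_ns=STARTUP_NS, clock_ns=CLOCK_NS,
                  results_per_clock=RESULTS_PER_CLOCK):
    """Simple model: fixed startup plus streaming time for n results."""
    return startup_ns + n * clock_ns / results_per_clock


def ns_per_result(n, **kwargs):
    """Effective cost per result once startup is spread over n results."""
    return total_time_ns(n, **kwargs) / n


for n in (10, 100, 1000, 10000):
    print(f"n={n:>6}: {ns_per_result(n):8.2f} ns per result")
```

Under these assumptions, a 10-element loop is dominated by the startup cost, while a 10,000-element loop runs at close to the streaming rate. The model says nothing about the Cray side of the comparison, since no figures for it are given here.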

In present computationally limited applications, the loops do indeed tend to be larger than 10 elements, and so the CDC architecture would seem to be preferred.

Note that this very argument is what split Cray from CDC when he made his famous statement about having built his last small computer. Cray computers definitely did win on a certain subset of applications, but I think that their time has passed as computers and the applications that they run have gotten larger - MUCH larger than 10 element arrays.

and thus a
*little* more effort should be made in improving the individual CPU.
The Itanium was Intel's attempt to make (some limited aspects of) that
kind of supercomputing available to the masses; there is much about it
I don't like, but I still admire the fact that they made an attempt.

.... without learning from Project Stretch

No one else seems about to make a try.

Perhaps because they DID read their history books.

Even without going to logarithmic ALUs or wafer scale integration, there is STILL an easy order of magnitude left to be collected by abandoning the scalar-only architecture of the Pentium.

Note that the CDC Cyber 200 had NO CACHE!!! Why? It had little use for it! The need for cache is architecture-dependent - they just dug a cache-dependent hole with the Pentium architecture. The array operations in the Cyber 200 used a pipelining approach similar to the burst mode used in present-day RAM chips, and the CPU had 256 registers.

Steve Richfield