Re: Superstitious learning in Computer Architecture



Andrew,

Steve, I've just realized that I've been making a fundamental error about
your intent: you are not trying to make a general-purpose machine capable
of running existing code, but rather a special-purpose neural network
machine.

Close, but not exactly:

Apparently most of the REALLY BIG number crunching is in simulating various physical phenomena that do NOT have extreme volatility. An example of extreme volatility would be asteroids that may pass near a planet, be deflected, and possibly hit another planet. Such extreme-volatility situations, if single-threaded in nature (e.g. you are only tracking one asteroid), would not benefit at all from my approach. If you are tracking many asteroids, you might gain an order of magnitude from what I am proposing, but certainly not the 10,000:1.

However, for weather simulation, heat transfer, and neural network problems this architecture should truly come into its own.

There are some "borderline" applications like a digital wind tunnel, where most of the computation is the low-precision stuff to track air movement and pressure, but then there are the high-precision boundary conditions at the airfoils. Most of the computation is NOT at the boundaries, but those boundary computations would be at a ~1000:1 disadvantage. These applications will run somewhere between 10:1 and 10,000:1 faster depending on the actual ratio of boundary computations. Given this situation, I expect that users would expand their low-precision computations to just barely keep up with the high-precision computations, e.g. by simulating a larger volume around the airfoils than they now do.
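
A rough sketch of the arithmetic behind that 10:1 to 10,000:1 range, written as an Amdahl-style blend. It assumes that the low-precision work gets the full 10,000:1 gain, that the boundary work is 1000 times worse off (so 10:1), and that `boundary_fraction` is the share of today's run time spent on boundary computations. The function name and parameters are illustrative, not part of the proposal.

```python
def overall_speedup(boundary_fraction, fast_speedup=10_000.0, boundary_penalty=1_000.0):
    """Blend the two speedups according to how today's run time is split.

    boundary_fraction: share of the ORIGINAL run time spent on high-precision
                       boundary computations (0.0 to 1.0). Assumed input.
    fast_speedup:      gain assumed for the low-precision bulk of the work.
    boundary_penalty:  how much worse the boundary work fares (~1000:1).
    """
    if not 0.0 <= boundary_fraction <= 1.0:
        raise ValueError("boundary_fraction must be between 0 and 1")
    boundary_speedup = fast_speedup / boundary_penalty  # 10:1 with the defaults
    # New run time, expressed as a fraction of the original run time.
    new_time = (boundary_fraction / boundary_speedup
                + (1.0 - boundary_fraction) / fast_speedup)
    return 1.0 / new_time


if __name__ == "__main__":
    for f in (0.0, 0.001, 0.01, 0.1, 1.0):
        print(f"boundary fraction {f:>5}: about {overall_speedup(f):,.0f}:1")
```

Under these assumptions, even 1% of boundary work pulls the overall gain down to roughly 910:1, which is why growing the low-precision volume until it balances the boundary work looks attractive.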

In that case, most of my previous arguments are moot.

Not entirely, as you are forcing me to better state what I intend, which avoids such confusion for other readers. This is good.

When you've got a specific problem to solve, and your machine will
effectively only be running one program, then that's obviously a
completely different engineering exercise, and the tradeoffs all become
very different.

I am carving out a specific CLASS of problems, and even if logarithms are completely unusable, this approach STILL should produce a ~10:1 improvement where arrays are involved. Remember, the original purpose of logarithms was to salvage bad memory so that WSI (wafer-scale integration) becomes possible, which then allows for the many-bus full vector implementations without needing thousands of pins. That logarithms are also useful for many applications like neural networks is really an EXTRA.

On Mon, 28 Aug 2006 09:59:53 -0600, Steve Richfie1d wrote:

You want to do RISC. I was contemplating vector operations BECAUSE of the reconfiguration that so often accompanies them. On a wafer, you may have to stop things for many clock cycles while you throw all of the switches to do something. This really throws you into the world of CISC.

This is where out-of-order execution, speculation, and multi-threading
seem to be making a difference: you don't just sit around waiting for a
reconfiguration or data routing; you go on and do something else in the
meantime.

I don't know what else you might be doing during reconfiguration, as you would be halfway through rewiring your CPU to do a specific operation. Remember, your internal clock speed is probably so high that it takes several clock cycles just to get across the wafer, forcing the wafer to be subdivided into a number of cores. Sure, the cores can operate independently, but I don't see what else you could be doing within a particular core while it is being rewired.

Perhaps/hopefully you see an opportunity that I have missed?

[I said:]

The big gain, IMO, is the ability to do loop fusion, potentially (and
frequently) saving vast gobs of precious memory bandwidth that would be
used in naive raw-vector expressions of the algorithm. That requires good
compilers, but a lot of the modern compilers are very much that good.
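
A minimal sketch of what loop fusion buys, using `a*b + c` as a stand-in computation. The function names and the memory-traffic count are illustrative assumptions: the count tallies one load or store per vector element and ignores caches and registers.

```python
def unfused(a, b, c):
    """Naive raw-vector form: two separate vector operations."""
    # Pass 1: read a and b, write a full temporary vector to memory.
    tmp = [x * y for x, y in zip(a, b)]
    # Pass 2: read the temporary back along with c, write the result.
    return [t + z for t, z in zip(tmp, c)]


def fused(a, b, c):
    """Fused form: one pass, the product never leaves the inner loop."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def element_memory_ops(n, fused_loops):
    """Rough count of element loads and stores for vectors of length n.

    Unfused: (read a, b; write tmp) + (read tmp, c; write out) = 6n
    Fused:    read a, b, c; write out                          = 4n
    """
    return 4 * n if fused_loops else 6 * n


if __name__ == "__main__":
    a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
    assert unfused(a, b, c) == fused(a, b, c)
    print(element_memory_ops(1_000_000, fused_loops=False))
    print(element_memory_ops(1_000_000, fused_loops=True))
```

Both forms compute the same result; the fused one simply skips writing and re-reading the temporary vector, and the saving grows with every extra operation folded into the same pass.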

A good compiler may cost you 4:1 in performance!!! Why? Because if it takes you two years to write it, and every year you wait for it the technology marches ahead by 2:1, well, you do the math.
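
Spelled out, the math is just compounding. This small sketch assumes the 2:1-per-year rate stated above and a hypothetical function name:

```python
def performance_forgone(delay_years, yearly_improvement=2.0):
    """Factor by which the technology has moved ahead while you waited."""
    return yearly_improvement ** delay_years


if __name__ == "__main__":
    # Two years spent waiting for the compiler at 2:1 per year.
    print(performance_forgone(2))
```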

The largely incremental change that has been going on in computer
architecture means that there are large chunks of mostly-done compiler
available, so that the compiler work tends not to lag by so much. Then
there are groups like Tensilica who can build a custom compiler to suit a
custom CPU architecture, using the one set of tools.

Yea, they presented a keynote speech at the last MultiConference in Las Vegas. Their approach is to implement arbitrary subsets of a particular superset architecture, which is VERY different from what I am proposing.

The usual approach for retargeting a compiler is to keep an existing "front end" that translates programs into a particular internal representation like digraphs, change the code generator that converts the internal representation to prospective code, and change the optimizer tables to optimize for the new computer. Unfortunately, when you make a LARGE architectural jump, you often have to start over. I suspect that done right it will only require one restart as a more "portable" platform that isn't so tied to present architectures is produced, but this should proceed in parallel with the first WSI/LVP development that will also take at least a couple of years.

Note that in human-sized neural network applications (which was the primary envisioned application as explained in my article), widely-ranging gather operations (for the input synapses) are a big part of the application.

Unless you're doing a lot of synapse reconnection, isn't there a way to
schedule distant values to be sent in advance, rather than chasing them
down as you need them? My limited exposure to neural networks (mostly in
the 1990 time frame (I even had a paper in an Australian NN conference,
in '90!)) involved fairly repetitive simulation-step sorts of processes.
The access pattern doesn't change from one iteration to the next.

Yes. I envisioned sorting them, to do all of the connections of a particular sort at the same time, so that everything would be pipelined. Unfortunately, there ARE limitations as to how well you can pipeline things over large distances, like across a wafer.

The way that latency kills you is that it counts as "setup time" (and sometimes "takedown time") for complex operations that require coordinating widely separated units (e.g. vector operations). THIS is why the Cyber 205 and similar machines just STOP for a couple of microseconds before proceeding when you issue a vector operation. Like the designers of the 205, I really don't see any way around this.

[repeat comment about multi-threading and out-of-order, here]
[repeat comment about what to do while CPU is being rewired here]

How important that sort of exercise will be obviously depends on what
fraction of execution time such stalls represent. Lots of big vectors:
small fraction, no problem. Lots of gnarly, small vectors or code: worth
the effort.
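
A small sketch of that tradeoff, treating the latency as a fixed setup cost per vector operation. The 1000-cycle setup figure and the one-cycle-per-element rate are made-up illustration values, not measurements of any machine.

```python
def stall_fraction(vector_length, setup_cycles, cycles_per_element=1.0):
    """Share of a vector operation's total time spent in setup (stalled).

    setup_cycles:       fixed cost before the first element flows (assumed).
    cycles_per_element: steady-state cost per element once streaming (assumed).
    """
    work = vector_length * cycles_per_element
    return setup_cycles / (setup_cycles + work)


if __name__ == "__main__":
    setup = 1000  # hypothetical setup cost in clock cycles
    for n in (10, 1_000, 100_000):
        print(f"vector length {n:>7}: {stall_fraction(n, setup):.1%} of time stalled")
```

With these numbers a 10-element vector spends nearly all of its time in setup, while a 100,000-element vector spends under 1% there, which matches the "big vectors: no problem; small vectors: worth the effort" split.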

There are some BIG unknowns here, as synapses are NOT as simple as the NN folks would have you believe. Real-world synapses often integrate, differentiate, are nonlinear, etc. Indeed, ~20 years ago I predicted a particular non-linear and discontinuous transfer function for inhibitory synapses if neurons were communicating the logarithms of probabilities of assertions being true - which was then found in the laboratory! Further, a given neuron can probably have more than one type of synapse. How well this can be vectorized once it is fully understood (using the scanning UV fluorescence microscope) remains to be seen. I suspect that the answer will be somewhere in between great and horrible. Consider that CNS (Central Nervous System) neurons have about 50,000 synapses each, of which around 200 are active - the rest are there but are inhibited - like most of the potential connections on a Xilinx chip.

The REALLY BIG unknown is glial cells, which are almost a complete unknown - and comprise about 90% of the cells behind your eyeballs.

Now that I have stated what is needed, do you have any ideas as to how to achieve it other than building an entire RISC machine to accomplish this single task?!!

I don't think that you want a RISC machine at all, but rather a dedicated
neural net simulator.

Close. There are enough unknowns about what neurons REALLY do that tying the architecture down to what we KNOW they do would probably be a disaster. Instead, I think that the best approach is to build as general purpose a computer as possible without significantly sacrificing performance for doing what we know and reasonably suspect that neurons do.

Also, remember that the other intended purpose is to process data from a scanning UV fluorescence microscope first into 3D images and then into functional diagrams. This probably involves lots of low-precision image processing, followed by intensive list processing.

Didn't Marvin Minsky try something like that once? I seem to remember that his was more specifically targeted at vision. (Synthetic Retina, I think he called it.)

Yea, back in the Perceptron days. Just to illustrate how CRAZY things got back then, Lockheed's research labs developed a special-purpose perceptron computer to do large networks efficiently, with clever logic tied directly to a disk much like the early ILLIAC computers! However, after the Perceptron "crash" they absolutely REFUSED to send me a manual that was referenced in an article that they had published - explaining that they considered anything to do with perceptrons to be embarrassing to Lockheed! I suggested that they remove all references to Lockheed in whatever they sent me, but they wouldn't budge!

Have I gone too far the other way, here?

Ever so slightly. We have TWO applications that it must work well for, and we don't fully understand either of those applications. This pushes us into designing a WSI supermicro for a particular class of problems that encompasses both of these applications, and probably includes a number of mainstream applications now run on existing supercomputers. The broader the range of applications that it can run, the safer we are that it will indeed be able to do what is expected of it.

Thanks again for your efforts.

Steve Richfie1d