Re: High-bandwidth computing interest group

On 7/20/2010 11:49 AM, Robert Myers wrote:
On Jul 20, 1:49 pm, "David L. Craig"<dlc....@xxxxxxxxx> wrote:

If we're talking about custom, never-mind-the-cost
designs, then that's the stuff that should make this
a really fun group.

If no one ever goes blue sky and asks: what is even physically
possible without worrying what may or may not be already in the works
at Intel, then we are forever limited, even in the imagination, to
what a marketdroid at Intel believes can be sold at Intel's customary

Coupling this to stuff we said earlier about

a) sequential access patterns, brute force - neither of us considers that interesting

b) random access patterns

c) what you, Robert, said you were most interested in, and rather nicely called "crystalline" access patterns. By the way, I rather like that term: it is much more accurate than saying "stride-N", and encapsulates several sorts of regularity.

Now, I think it can be said that a machine that does random access patterns efficiently also does "crystalline" access patterns. Yes?

I can imagine optimizations specific to the crystalline access patterns, that do not help true random access. But I'd like to kill two birds with one stone.

So, how can we make these access patterns more effective?

Perhaps we should lose the cache line orientation, which ends up transferring data bytes that aren't needed.
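To put a rough number on the waste, here is a small sketch. It assumes 8-byte elements, 64-byte lines, aligned and distinct addresses, and that every touched line is transferred in full; the function name and parameters are made up for illustration.

```python
def useful_fraction(addresses, elem_bytes=8, line_bytes=64):
    """Fraction of transferred bytes that were actually requested,
    assuming every touched cache line is moved in full."""
    touched_lines = {addr // line_bytes for addr in addresses}
    requested = len(addresses) * elem_bytes
    transferred = len(touched_lines) * line_bytes
    return requested / transferred

# Unit stride: eight 8-byte words exactly fill one 64-byte line.
dense = [i * 8 for i in range(8)]
# 64-byte stride: every 8-byte word drags in a whole line of its own.
strided = [i * 64 for i in range(8)]

print(useful_fraction(dense))    # 1.0
print(useful_fraction(strided))  # 0.125
```

In the strided case seven eighths of the bandwidth is spent on bytes nobody asked for, which is the overhead a scatter/gather fabric would try to recover.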

I envision an interconnect fabric that is completely scatter/gather oriented. We don't do away with burst or block operations: we always transfer, say, 64 bytes at a time. But into that 64 bytes we might pack, say, 4 pairs of 64 bit address and 64 bit data, for stores. Or perhaps bursts of 128 bytes, mixing tuples of 64 bit address and 128 bit data. Or maybe... compression, whatever. Stores are the complicated one; reads are relatively simple, vectors of, say, 8 64 bit addresses.
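A minimal sketch of the 64-byte burst formats mentioned above, just to make the arithmetic concrete. The little-endian layout, the zero padding, and all function names are assumptions for illustration; a real format would also need a count field or valid bits to tell padding from data.

```python
import struct

BURST_BYTES = 64

def pack_store_burst(pairs):
    """Pack up to 4 (64-bit address, 64-bit data) pairs into one 64-byte burst."""
    if len(pairs) > 4:
        raise ValueError("at most 4 address/data pairs fit in a 64-byte burst")
    payload = b"".join(struct.pack("<QQ", addr, data) for addr, data in pairs)
    return payload.ljust(BURST_BYTES, b"\x00")  # zero padding is an assumption

def pack_load_burst(addresses):
    """Pack up to 8 64-bit addresses into one 64-byte read request."""
    if len(addresses) > 8:
        raise ValueError("at most 8 addresses fit in a 64-byte burst")
    payload = struct.pack("<%dQ" % len(addresses), *addresses)
    return payload.ljust(BURST_BYTES, b"\x00")

def unpack_store_burst(burst, count):
    """Recover the first `count` (address, data) pairs from a store burst."""
    return [struct.unpack_from("<QQ", burst, 16 * i) for i in range(count)]

stores = [(0x1000, 1), (0x2008, 2), (0x3010, 3), (0x4018, 4)]
burst = pack_store_burst(stores)
print(len(burst))                    # 64
print(unpack_store_burst(burst, 4) == stores)  # True
```

Note that half of every store burst is address overhead in this naive layout, which is where compression starts to look attractive.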

By the way, this is where strided or crystalline access patterns might have some advantages: they may compress better.
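One very simple way such compression could work, as a sketch only: if the addresses in a request form an arithmetic progression, send (base, stride, count) instead of the full list. The descriptor format and function names are hypothetical; real crystalline patterns would want multi-dimensional descriptors as well.

```python
def compress_addresses(addresses):
    """Return a ("strided", base, stride, count) descriptor when the addresses
    form an arithmetic progression, otherwise ("explicit", [addresses])."""
    if len(addresses) >= 2:
        stride = addresses[1] - addresses[0]
        if all(b - a == stride for a, b in zip(addresses, addresses[1:])):
            return ("strided", addresses[0], stride, len(addresses))
    return ("explicit", list(addresses))

def expand_addresses(descriptor):
    """Inverse of compress_addresses."""
    if descriptor[0] == "strided":
        _, base, stride, count = descriptor
        return [base + i * stride for i in range(count)]
    return list(descriptor[1])

regular = [0x1000, 0x1100, 0x1200, 0x1300]
irregular = [0x1000, 0x5008, 0x0040, 0x9990]

print(compress_addresses(regular))       # ('strided', 4096, 256, 4)
print(compress_addresses(irregular)[0])  # explicit
```

The strided request needs three numbers regardless of length, while the random one still needs every address spelled out.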

Your basic processing element produces such scatter/gather load or store requests, particularly if it has scatter/gather vector instructions like Larrabee (per Wikipedia), or if it is a CIMT coherent threaded architecture like the GPUs. The scatter/gather operations emitted by a processor need not be directed at a single target - they may be split and merged as they flow through the fabric.
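A toy illustration of splitting and merging, under assumptions of my own choosing: targets (memory controllers, say) are interleaved on 64-byte boundaries, and merging simply drops duplicate addresses. Neither the interleaving rule nor the function names come from any real fabric.

```python
from collections import defaultdict

def split_by_target(addresses, num_targets=4, interleave_bytes=64):
    """Split one gather request into per-target sub-requests.
    The address-to-target mapping is an assumed simple interleave."""
    per_target = defaultdict(list)
    for addr in addresses:
        target = (addr // interleave_bytes) % num_targets
        per_target[target].append(addr)
    return dict(per_target)

def merge_requests(first, second):
    """Merge two sub-requests bound for the same target,
    keeping first-seen order and dropping duplicate addresses."""
    return list(dict.fromkeys(first + second))

print(split_by_target([0, 64, 128, 192, 256, 8]))
# {0: [0, 256, 8], 1: [64], 2: [128], 3: [192]}
print(merge_requests([0, 8], [8, 16]))
# [0, 8, 16]
```

A switch in the fabric could apply the split on the way out and the merge wherever requests from different processors converge on the same link.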

In order to eliminate unnecessary full-cache line flow, we do not require read-for-ownership. But we don't go the stupid way of write-through. I lean towards having a valid bit per byte, in these scatter-gather requests, and possibly in the caches. As I have discussed in this newsgroup before, this allows us to have writeback caches where multiple processors can write to the same memory location simultaneously. The byte valid bits allow us to live with weak memory ordering, but do away with the bad problem of losing data when people write to different bytes of the same line simultaneously. In fact, depending on the interconnection fabric topology, you might even have processor ordering. But basically it eliminates the biggest source of overhead in cache coherency.
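A small sketch of how per-byte valid bits let two writers share a line without read-for-ownership. The class and function names are hypothetical and this models only the merge on writeback, not a full cache or coherence protocol.

```python
LINE_BYTES = 64

class PartialLine:
    """A cached line that was never read for ownership: only bytes this
    processor actually wrote are marked valid."""
    def __init__(self):
        self.data = bytearray(LINE_BYTES)
        self.valid = [False] * LINE_BYTES

    def write(self, offset, payload):
        for i, byte in enumerate(payload):
            self.data[offset + i] = byte
            self.valid[offset + i] = True

def write_back(memory_line, partial):
    """Merge only the valid bytes into the memory copy of the line."""
    for i in range(LINE_BYTES):
        if partial.valid[i]:
            memory_line[i] = partial.data[i]

memory = bytearray(LINE_BYTES)
p0, p1 = PartialLine(), PartialLine()
p0.write(0, b"AAAA")   # processor 0 writes bytes 0..3
p1.write(4, b"BBBB")   # processor 1 writes bytes 4..7 of the same line

write_back(memory, p1)  # writeback order does not matter for disjoint bytes
write_back(memory, p0)
print(bytes(memory[:8]))  # b'AAAABBBB'
```

Neither store is lost even though neither processor ever held the whole line. If both wrote the same byte, the result would depend on writeback order, which is the weak ordering being tolerated here.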

Of course, you want to handle non-cache friendly memory access patterns. I don't think you can safely get rid of caches; but I think that there should be a full suite of cache control operations, such as is partially listed at's_List_of_101_Cache_Control_Operations

Such a scatter/gather memory subsystem might exist in the fabric. It works best with processor support to generate and handle the scatter/gather requests and replies. (Yes, the main thing is in the interconnect; but some processor support is needed, to get crap out of the way of the fabric).

The scatter/gather interconnect fabric might be interfaced to conventional DRAMs, with their block transfers of 64 or 128 bytes. If so, I would be tempted to create a memory side cache - a cache that is in the memory controller, not the processor - seeking to leverage some of the wasted parts of cache lines. With cache control, of course.

However, if there is any chance of getting DRAM architectures to be more scatter/gather friendly, great. But the people who can really talk about that are Tom Pawlowski at Micron, and his counterpart at Samsung. I've not been at a company that could influence DRAM much, since Motorola in the late 1980s. And I dare say that Mitch didn't make much headway there. I've mentioned Tom Pawlowski's vision, as presented at SC09 and elsewhere, of an abstract DRAM interface for stacked DRAM+logic units. I think the scatter/gather approach I describe above should be a candidate for such an abstract interface.

If there is anyone that thinks that there is a great new memory technology coming down the pike that will make the bandwidth wars easier, I'd love to hear about it. For that matter, the impending integration of non-volatile memory is great - but as I understand things, it will probably make the memory hierarchy even more sequential bandwidth oriented, unfriendly to other access patterns.


On this fabric, also pass messages - probably with instruction set support to directly produce messages, and mechanisms such as TLBs to route them without OS intervention.


I.e. my overall approach is - eliminate unnecessary full cache line transfers, emphasize scatter/gather. Make the most efficient use of what we have.


Now, I remain an unrepentant mass market computer architect. Some people want to design the fastest supercomputer in the world; I want to design the computer my mother uses. But I'm not so far removed from the buildings-full-of-stuff supercomputers that Robert Myers describes. First, I have worked on such. But, second, I'm interested in much of this not just because it is relevant to cost-no-barrier supercomputers, but also because it is relevant to mass markets.
Most specifically, datacenters. Although datacenters tend not to use large scale shared memory, and tend to be unwilling to compromise the memory ordering and cache coherency guidelines in their small scale shared memory nodes, I suspect that PGAS has applications, e.g. to Hadoop-like map/reduce. Moreover, much of this scatter/gather is also what network routers want - that OTHER form of computing system that can occupy large buildings, but which also comes in smaller flavors. Finally, the above applies even to moderate sized, say 16 or 32, multiprocessor systems in manycore chips.

I.e. I am interested in such scatter/gather memory and interconnect, that make the most efficient use of bandwidth, because they apply to the entire spectrum,