Re: POWER7 information - global shared memory, eDRAM for cache



On Aug 13, 1:45 pm, Robert Myers <rbmyers...@xxxxxxxxx> wrote:
On Aug 13, 4:09 pm, David Kanter <dkan...@xxxxxxxxx> wrote:

I just finished up a preview of hot chips, which is by itself pretty
interesting.  However, I also discussed some information about POWER7
that I dug up, on page 2 of my article (near the bottom).

http://www.realworldtech.com/page.cfm?ArticleID=RWT081209143650

The two most interesting facts are that IBM has a feature to enable
clusters of systems to appear as if they have globally shared memory,
and their last level, on-die, cache is eDRAM and probably around 16MB.

So comp.arch is now a forum to advertise your website?

Not per se - however, I think that folks who read comp.arch would be
interested in novel architectural features that aid global shared
memory, and novel cache architectures. Both of those are significant
to computer architects, and one is significant for developers.

"It seems likely that the POWER7's L3 cache will be around 16MB of
eDRAM. This will hopefully reduce the need for external bandwidth, as
the POWER6 systems will be very hard to improve upon; 300GB/s is just
a tremendous amount of I/O period."

I'm a subscriber to the Seymour Cray dictum about *bandwidth* (not
latency): "You can't fake it."

I don't think I'd totally agree. You can't fake pins. You can fake
bandwidth (in some circumstances).

I'm accustomed to doing memory-bound applications (and the customers
that Seymour dealt with most often had memory-bound applications), or,
rather, when I'm not doing memory-bound applications I'm not so much
worried about performance (for reasons that seem obvious enough to
me).  I'm imagining that a larger cache actually does relieve pressure
on bandwidth in some cases (and Intel has apparently been playing that
game for years) but I really have a hard time seeing how.  As Linus
Torvalds so tartly put it here, working sets don't usually fit into
cache, no matter how large.

You don't need the whole working set, only part. Say you are doing an
analytic DB query and you are joining two tables, A and B. Say that A
is about 10MB and B is about 1TB. A good DBMS would move A into cache
(on a chip with 16-24MB of L3 cache) and then do a streaming join
against it.
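
To make the join example concrete, here is a minimal sketch of a streaming
hash join. It is illustrative only: the function name, the row layout (dicts),
and the key parameters are assumptions, not how any particular DBMS does it,
and Python gives no control over what actually sits in L3. The point is the
access pattern: the small table A is indexed once and probed repeatedly, while
each row of the big table B is read once and then dropped.

```python
from collections import defaultdict


def streaming_hash_join(small_rows, big_rows, key_small, key_big):
    """Join a small table against a large stream of rows on a key.

    small_rows: iterable of dicts (table A, assumed small enough to cache)
    big_rows:   iterable of dicts (table B, streamed, never materialized)
    """
    # Build phase: index A once. This index is the only data that is
    # reused, so it is the part one would hope stays cache-resident.
    index = defaultdict(list)
    for row in small_rows:
        index[row[key_small]].append(row)

    # Probe phase: each row of B is touched exactly once and not stored.
    for big_row in big_rows:
        # .get() avoids inserting empty entries into the defaultdict.
        for small_row in index.get(big_row[key_big], ()):
            yield small_row, big_row
```

With this shape, off-chip traffic is roughly one pass over B plus one load of
A, instead of re-fetching parts of A from memory for every row of B.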

Also, remember that the working set of a program might be rather
large, but the dynamic working set may be much, much smaller.
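
One rough way to see the difference is to count distinct cache lines over the
whole run versus within short windows of the access trace. The sketch below
assumes a hypothetical trace given as a list of byte addresses and a 64-byte
line size; both are illustrative choices, not measurements of any real
program.

```python
def working_set_sizes(trace, window, line_bytes=64):
    """Return (total_lines, per_window_lines) for a list of byte addresses.

    total_lines:      distinct cache lines touched over the whole trace
    per_window_lines: distinct cache lines touched in each consecutive
                      window of `window` accesses
    """
    # Map each byte address to the cache line that contains it.
    lines = [addr // line_bytes for addr in trace]
    total = len(set(lines))
    per_window = [
        len(set(lines[i:i + window]))
        for i in range(0, len(lines), window)
    ]
    return total, per_window
```

If the per-window counts are much smaller than the total, the cache only has
to hold what the program is using right now, not everything it will ever
touch.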

Failing that, you can sometimes hold your instruction dynamic working
set in the caches : )

My amateur's guess is that the huge cache is to allow large numbers of
I/O bound operations on the eight-core chip to queue up as they stall
and wait for slow off-chip operations to complete.  I suspect that
Intel's transaction-oriented chips have been doing pretty much the
same.

That's probably true as they are designing for high end DBMS systems
and HPC.

Bottom line: you *still* can't fake it.

If the difference between 16 and 32GB of memory matters (which it
often does), why wouldn't the difference between 16 and 32MB of cache
matter?

David