Re: POWER7 information - global shared memory, eDRAM for cache
- From: David Kanter <dkanter@xxxxxxxxx>
- Date: Thu, 13 Aug 2009 23:51:02 -0700 (PDT)
On Aug 13, 1:45 pm, Robert Myers <rbmyers...@xxxxxxxxx> wrote:
On Aug 13, 4:09 pm, David Kanter <dkan...@xxxxxxxxx> wrote:
I just finished up a preview of Hot Chips, which is by itself pretty
interesting. However, I also discussed some information about POWER7
that I dug up, on page 2 of my article (near the bottom).
http://www.realworldtech.com/page.cfm?ArticleID=RWT081209143650
The two most interesting facts are that IBM has a feature to enable
clusters of systems to appear as if they have globally shared memory,
and their last level, on-die, cache is eDRAM and probably around 16MB.
So comp.arch is now a forum to advertise your website?
Not per se - however, I think that folks who read comp.arch would be
interested in novel architectural features that aid global shared
memory, and novel cache architectures. Both of those are significant
to computer architects, and one is significant for developers.
"It seems likely that the POWER7's L3 cache will be around 16MB of
eDRAM. This will hopefully reduce the need for external bandwidth, as
the POWER6 systems will be very hard to improve upon; 300GB/s is just
a tremendous amount of I/O, period."
I'm a subscriber to the Seymour Cray dictum about *bandwidth* (not
latency): "You can't fake it."
I don't think I'd totally agree. You can't fake pins. You can fake
bandwidth (in some circumstances).
I'm accustomed to doing memory-bound applications (and the customers
that Seymour dealt with most often had memory-bound applications), or,
rather, when I'm not doing memory-bound applications I'm not so much
worried about performance (for reasons that seem obvious enough to
me). I'm imagining that larger cache actually does relieve pressure
on bandwidth in some cases (and Intel has apparently been playing that
game for years) but I really have a hard time seeing how. As Linus
Torvalds so tartly put it here, working sets don't usually fit into
cache, no matter how large.
You don't need the whole working set, only part. Say you are doing an
analytic DB query and you are joining two tables, A and B. Say that A
is about 10MB and B is about 1TB. A good DBMS would move A into cache
(on a chip with 16-24MB of L3 cache) and then do a streaming join
against it.
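
To make the join example concrete, here is a minimal sketch of a streaming
hash join. It assumes rows are simple (key, payload) tuples, and the function
name streaming_hash_join is made up for illustration; it is not how any
particular DBMS implements this. The small table A is loaded into an in-memory
hash table once (the part that would ideally stay cache-resident), and the
large table B is read exactly once as a stream.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Tuple[Hashable, str]  # (join key, payload); a simplifying assumption


def streaming_hash_join(
    small: Iterable[Row], large: Iterable[Row]
) -> Iterator[Tuple[Hashable, str, str]]:
    """Join `small` against `large` on the key, streaming `large` once."""
    # Build phase: hash the small table (A). This is the only structure
    # that is touched repeatedly, so it is what benefits from a big cache.
    table: Dict[Hashable, List[str]] = defaultdict(list)
    for key, payload in small:
        table[key].append(payload)

    # Probe phase: each row of the large table (B) is read once and then
    # discarded, so B itself never needs to fit in cache or memory.
    for key, payload in large:
        for match in table.get(key, ()):
            yield (key, match, payload)


if __name__ == "__main__":
    a = [(1, "a1"), (2, "a2")]
    b = ((k % 4, f"b{k}") for k in range(8))  # generator: consumed as a stream
    for row in streaming_hash_join(a, b):
        print(row)
```

Only A is reused across probes; B contributes pure streaming traffic, which
is why the cache needs to hold A (plus the hash table overhead) and not the
terabyte table.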
Also, remember that the working set of a program might be rather
large, but the dynamic working set may be much, much smaller.
Failing that, you can sometimes hold your instruction dynamic working
set in the caches : )
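
One rough way to see the gap between the total and the dynamic working set
is to count distinct cache lines over a whole address trace and compare that
with the largest number of distinct lines seen in any short window of
accesses. The sketch below is only an illustration: the function name
working_set_sizes, the 64-byte line size, and the synthetic trace are all
assumptions, not measurements of any real program or of POWER7.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Set, Tuple


def working_set_sizes(
    trace: Iterable[int], window: int, line_bytes: int = 64
) -> Tuple[int, int]:
    """Return (distinct lines overall, peak distinct lines in any window).

    `window` is measured in accesses; `line_bytes` is an assumed line size.
    """
    all_lines: Set[int] = set()
    recent: Deque[int] = deque()   # lines touched by the last `window` accesses
    counts: Counter = Counter()    # how often each line appears in `recent`
    peak = 0
    for addr in trace:
        line = addr // line_bytes
        all_lines.add(line)
        recent.append(line)
        counts[line] += 1
        if len(recent) > window:
            old = recent.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]    # line has left the window entirely
        peak = max(peak, len(counts))
    return len(all_lines), peak


if __name__ == "__main__":
    # Synthetic trace: 10 phases, each looping 5 times over its own 4 lines.
    trace = [(p * 4 + i) * 64 for p in range(10) for _ in range(5) for i in range(4)]
    total, peak = working_set_sizes(trace, window=8)
    print(f"total lines: {total}, peak lines per 8-access window: {peak}")
```

On this synthetic trace the program touches 40 lines in total but never more
than 8 in any window of 8 accesses, which is the sense in which a cache much
smaller than the full working set can still capture most of the reuse.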
My amateur's guess is that the huge cache is to allow large numbers of
I/O bound operations on the eight-core chip to queue up as they stall
and wait for slow off-chip operations to complete. I suspect that
Intel's transaction-oriented chips have been doing pretty much the
same.
That's probably true as they are designing for high end DBMS systems
and HPC.
Bottom line: you *still* can't fake it.
If the difference between 16 and 32GB of memory matters (which it
often does), why wouldn't the difference between 16 and 32MB of cache
matter?
David