Re: PowerPC or PARISC?
- From: Bill Todd <billtodd@xxxxxxxxxxxxx>
- Date: Mon, 04 Sep 2006 00:35:31 -0400
davewang202@xxxxxxxxx wrote:
Bill Todd wrote:
Rob Warnock wrote:
[Apologies if I've messed up the attributions... the trail was messy...]
Don't forget that with Opteron even *local* memory accesses require
getting snoop responses back from *all* of the other CPUs.
Since I addressed this elsewhere, it's not likely that I forgot it here.
My observation on that point was that an 8-socket configuration
generates around 3x the coherency traffic *per HT link* that a
quad-socket configuration does, suggesting that the quad-socket
configuration may have rather a lot of bandwidth to spare (given that
the 8-socket configuration manages to function at all).
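To make the per-link comparison concrete, here is a rough sketch that counts how many link traversals broadcast probes need when every socket issues one miss. The topologies are assumptions for illustration only (a 4-socket square and an 8-socket 2x4 ladder), as are the function names; real HT routing and real 8-socket layouts may differ, and the traffic is averaged evenly over all links, which real routing will not do.

```python
from collections import deque

def square4():
    """Assumed 4-socket layout: a square, each socket linked to two neighbors."""
    return {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

def ladder8():
    """Assumed 8-socket layout: a 2x4 ladder (two rows of four, plus rungs)."""
    adj = {n: [] for n in range(8)}

    def link(a, b):
        adj[a].append(b)
        adj[b].append(a)

    for col in range(4):
        link(col, col + 4)       # rung between the two rows
    for col in range(3):
        link(col, col + 1)       # top row
        link(col + 4, col + 5)   # bottom row
    return adj

def hop_counts(adj, src):
    """Breadth-first search: shortest hop count from src to every socket."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def probe_hops_per_link(adj):
    """Link traversals per link when each socket probes every other socket once."""
    total_hops = sum(sum(hop_counts(adj, s).values()) for s in adj)
    links = sum(len(neigh) for neigh in adj.values()) // 2
    return total_hops / links

ratio = probe_hops_per_link(ladder8()) / probe_hops_per_link(square4())
print(f"per-link probe traffic, 8 sockets vs 4: {ratio:.2f}x")
```

Under these assumptions the ratio comes out near 2.8x. Responses travel the same distances back, so counting them doubles both figures and leaves the ratio unchanged.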
Three and a half years ago, I wrote that doing snoop-broadcast and going
from 4 to 8 sockets, Hammer would not scale well without "additional
support".
At the time, you said it sounded like hot air that lacked
quantitative analysis for support.
As indeed it was: just stating that an 8-socket configuration required a lot more snooping was meaningless without also establishing that the HT links did not have the capacity to *support* that additional snooping.
Furthermore, I stated in the interchange which you cite below that
"the degree to which the additional snooping activity will compromise scalability (by exceeding the rather substantial bandwidth capacity of the HT interconnect) will vary according to the nature of the load"
i.e., that even if the additional snooping significantly compromised scalability for memory-bandwidth-intensive workloads, other kinds of workloads would not be similarly encumbered.
http://www.realworldtech.com/forums/index.cfm?action=detail&id=14858&threadid=14827&roomid=11
The interchange to which you refer was this:
[quote]
> just wanted to make it clear that without additional
> support, going up above 4 isn't going to be easy for *hammer, and
> scalability will be poor.
Lacking a quantitative analysis proving that 8-processor Hammer systems will scale poorly, I'm afraid your argument sounds a bit like hot air.
[end quote]
Which, of course, is precisely the observation I made again today, above.
Since you never provided any such quantitative analysis, that statement hardly seems unreasonable. (In fact, David Kanter was still babbling incompetently in a similar vein at RWT about next year's 8-socket configuration changes a few months ago, until I spelled things out for him quantitatively, now that more quantitative data *is* available.)
The HT coherency traffic goes up with the number of CPU cores, not just
the number of sockets.
AMD's presentations are not clear on this point, but they at least
suggest that coherency traffic on the HT links does *not* increase with
the number of cores, just with the number of sockets (which is certainly
at least possible, given the architecture).
If you actually have something useful running on each one of those
cores, each threaded context will generate independent memory requests
that will likely have to leave the socket - unless it hits on a cache
somewhere within the socket. The coherency traffic scales relative to
the number of independent outstanding misses, not to the socket or cpu
per se.
If you had bothered to look at the context in which Rob made the statement to which I replied, you would have found that it was that of snoop *responses* from CPUs. My point was that (if I understand AMD's presentations correctly) each socket gives a single response to a snoop request regardless of the number of cores present there.
And had you finished reading my post (well, you also would have had to have understood it, I guess) before writing your response, you would have noticed that I fully understand that more cores may *generate* more snoops - but only (at least for the NUMA-optimized access being discussed there) up to the point where they have saturated local memory bandwidth.
And even if *all* the CPUs are hitting *only* local memory [perfect NUMA
placement], there will be HT coherency
traffic proportional to the product of the cache miss rate and the
number of CPU cores.
Only up to the point where the local memory bandwidth is saturated:
once that point is reached, it doesn't matter how many more local cores
you add - there won't be any more coherency traffic, because there won't
be any more local accesses.
So if the links can support the coherency traffic generated by
local-only accesses sufficient to saturate the local memory bandwidth on
all sockets (are there STREAMS results for quad-socket Opteron systems
that could shed light on that?), the only question is how much link
bandwidth is left over to satisfy some percentage of remote accesses.
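The saturation argument above can be written as a one-line cap. This is only a sketch: the function name, the 64-byte line size, and the bandwidth and miss-rate figures are illustrative assumptions, not measurements of any Opteron system.

```python
def local_snoop_rate(cores, misses_per_core, mem_bw_bytes, line_bytes=64):
    """Snoop-generating local misses per second for one socket.

    Demand grows with core count, but local memory can only deliver
    mem_bw_bytes / line_bytes cache lines per second, so the rate is capped.
    """
    demand = cores * misses_per_core
    ceiling = mem_bw_bytes / line_bytes
    return min(demand, ceiling)

# Hypothetical numbers: 6.4 GB/s of local memory bandwidth,
# 40 million cache misses per second per core.
for cores in (1, 2, 4, 8):
    rate = local_snoop_rate(cores, 40e6, 6.4e9)
    print(f"{cores} cores: {rate:.3g} snoops/s")
```

With these made-up figures the rate stops growing between two and four cores. Past that point, adding local cores adds no coherency traffic, and the remaining question is how much link bandwidth is left for remote accesses.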
A friendly sparring partner just pointed out via email that Sun's 8-socket/dual-core 2.6 GHz Opteron system scales up from its presumably very similar 4-socket/dual-core configuration by 1.73x for SPECint_rate_base and 1.79x for SPECint_rate_peak: hardly stellar, but indicative that for reasonably computationally-intense workloads scaling to 8 sockets (even using dual-core processors) can be quite useful.
By contrast, in the far more bandwidth-intensive SPECfp_rate scores the scaling is far worse: only 1.14x base and 1.27x peak (that's why I asked about STREAMS results above, though the issue there was whether today's *quad*-socket systems could satisfy the demands of quad-core processors for NUMA-optimized workloads, which it now looks as if they may).
Given the roughly 3:1 increase in per-link snoop activity in the 8-socket system for a given level of per-socket uncached memory access activity (though this may vary noticeably according to its topology), these results suggest that a 4-socket system may currently have something close to twice the HT bandwidth headroom that it needs even for memory-intense workloads. The 8-socket system, meanwhile, varies from fairly acceptable to very disappointing indeed depending on the intensity of memory accesses (though still not going negative, as Nick suggested, unless he was referring to per-processor rather than system throughput, in which case *all* systems that did not scale perfectly linearly would exhibit that behavior, just some a lot more than others).
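The distinction between system throughput and per-processor throughput can be checked with simple arithmetic on the four scaling figures quoted above. The names below are my own; the only inputs are those four ratios and the doubling of socket count from 4 to 8.

```python
# 8-socket vs 4-socket scaling factors as quoted in this post.
SUN_SCALING = {
    "SPECint_rate_base": 1.73,
    "SPECint_rate_peak": 1.79,
    "SPECfp_rate_base": 1.14,
    "SPECfp_rate_peak": 1.27,
}

def per_socket_efficiency(speedup, socket_ratio=2.0):
    """Per-socket throughput of the larger system relative to the smaller one."""
    return speedup / socket_ratio

for name, speedup in SUN_SCALING.items():
    eff = per_socket_efficiency(speedup)
    print(f"{name}: system {speedup:.2f}x, per-socket {eff:.3f}x")
```

Every system-level factor is above 1, so system throughput still rises in all four cases, while per-socket throughput falls to somewhere between roughly 0.57 and 0.90 of the 4-socket figure.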
- bill