Re: PowerPC or PARISC?



davewang202@xxxxxxxxx wrote:
Bill Todd wrote:
Rob Warnock wrote:
[Apologies if I've messed up the attributions... the trail was messy...]

Don't forget that with Opteron even *local* memory accesses require
getting snoop responses back from *all* of the other CPUs.
Since I addressed this elsewhere, it's not likely that I forgot it here.
My observation on that point was that an 8-socket configuration
generates around 3x the coherency traffic *per HT link* that a
quad-socket configuration does, suggesting that the quad-socket
configuration may have rather a lot of bandwidth to spare (given that
the 8-socket configuration manages to function at all).

Three and a half years ago, I wrote that with snoop broadcast, Hammer
would not scale well from 4 to 8 sockets without "additional
support".
At the time, you said that sounded like hot air that lacked
quantitative analysis for support.

As indeed it was: just stating that an 8-socket configuration required a lot more snooping was meaningless without also establishing that the HT links did not have the capacity to *support* that additional snooping.

Furthermore, I stated in the interchange which you cite below that

"the degree to which the additional snooping activity will compromise scalability (by exceeding the rather substantial bandwidth capacity of the HT interconnect) will vary according to the nature of the load"

i.e., that even if the additional snooping significantly compromised scalability for memory-bandwidth-intensive workloads, other kinds of workloads would not be similarly encumbered.


http://www.realworldtech.com/forums/index.cfm?action=detail&id=14858&threadid=14827&roomid=11

The interchange to which you refer was this:

[quote]

> just wanted to make it clear that without additional
> support, going up above 4 isn't going to be easy for *hammer, and
> scalability will be poor.

Lacking a quantitative analysis proving that 8-processor Hammer systems will scale poorly, I'm afraid your argument sounds a bit like hot air.

[end quote]

Which, of course, is precisely the observation I made again today, above.

Since you never provided any such quantitative analysis (in fact, David Kanter was still babbling incompetently in a similar vein at RWT about next year's 8-socket configuration changes a few months ago, until I spelled things out for him quantitatively - now that more quantitative data *is* available), that statement hardly seems unreasonable.


The HT coherency traffic goes up with the number of CPU cores, not just
the number of sockets.
AMD's presentations are not clear on this point, but they at least
suggest that coherency traffic on the HT links does *not* increase with
the number of cores, just with the number of sockets (which is certainly
at least possible, given the architecture).

If you actually have something useful running on each one of those
cores, each threaded context will generate independent memory requests
that will likely have to leave the socket - unless it hits on a cache
somewhere within the socket. The coherency traffic scales relative to
the number of independent outstanding misses, not to the socket or cpu
per se.

If you had bothered to look at the context in which Rob made the statement to which I replied, you would have found that it was that of snoop *responses* from CPUs. My point was that (if I understand AMD's presentations correctly) each socket gives a single response to a snoop request regardless of the number of cores present there.

And had you finished reading my post (well, you also would have had to have understood it, I guess) before writing your response, you would have noticed that I fully understand that more cores may *generate* more snoops - but only (at least for the NUMA-optimized access being discussed there) up to the point where they have saturated local memory bandwidth.


And even if *all* the CPUs are hitting *only*
local memory [perfect NUMA placement], there will be HT coherency
traffic proportional to the product of the cache miss rate and the
number of CPU cores.
Only up to the point where the local memory bandwidth is saturated:
once that point is reached, it doesn't matter how many more local cores
you add - there won't be any more coherency traffic, because there won't
be any more local accesses.

So if the links can support the coherency traffic generated by
local-only accesses sufficient to saturate the local memory bandwidth on
all sockets (are there STREAM results for quad-socket Opteron systems
that could shed light on that?), the only question is how much link
bandwidth is left over to satisfy some percentage of remote accesses.
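
The saturation argument above can be sketched as a toy model. Everything in it is an assumption for illustration rather than a measured Opteron figure: the function names, the 64-byte line size, the sample bandwidth and miss rates, and the simplification that each uncached local access triggers one snoop broadcast and draws one response per remote socket (the reading of AMD's presentations given earlier in the post).

```python
def snoops_per_second(cores, misses_per_core_per_s,
                      local_mem_bw_bytes_per_s, line_bytes=64):
    """Snoop broadcasts per second issued by one socket whose cores
    touch only local memory (perfect NUMA placement).

    Hypothetical model: demand grows with cores * miss rate, but it
    cannot exceed the rate at which local memory can deliver lines.
    """
    demand = cores * misses_per_core_per_s
    ceiling = local_mem_bw_bytes_per_s / line_bytes
    return min(demand, ceiling)


def responses_per_second(sockets, snoops):
    """Snoop responses per second, assuming one response per remote
    socket regardless of how many cores that socket holds."""
    return snoops * (sockets - 1)


if __name__ == "__main__":
    # Illustrative numbers only: 6.4 GB/s local memory, 10M misses/s/core.
    for cores in (2, 4, 16, 32):
        s = snoops_per_second(cores, 10_000_000, 6_400_000_000)
        print(cores, "cores:", s, "snoops/s,",
              responses_per_second(4, s), "responses/s in a 4-socket box")
```

Under these assumptions the snoop rate rises with core count only until the memory ceiling is reached; past that point, adding local cores adds no coherency traffic.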

A friendly sparring partner just pointed out via email that Sun's 8-socket/dual-core 2.6 GHz Opteron system scales up from its presumably very similar 4-socket/dual-core configuration by 1.73x for SPECint_rate_base and 1.79x for SPECint_rate_peak: hardly stellar, but indicative that for reasonably computationally-intense workloads scaling to 8 sockets (even using dual-core processors) can be quite useful.

By contrast, in the far more bandwidth-intensive SPECfp_rate scores the scaling is far worse: only 1.14x base and 1.27x peak (that's why I asked about STREAM results above, though the issue there was whether today's *quad*-socket systems could satisfy the demands of quad-core processors for NUMA-optimized workloads, which it now looks as if they may).

Using the roughly 3:1 increase in per-link snoop activity in the 8-socket system for a given level of per-socket uncached memory access activity (though this may vary noticeably according to its topology), this suggests that a 4-socket system may currently have something close to twice the HT bandwidth headroom that it needs even for memory-intense workloads, while the 8-socket system varies from fairly acceptable to very disappointing indeed depending on the intensity of memory accesses (though still not going negative, as Nick suggested - unless he was referring to per-processor rather than system throughput, in which case *all* systems that did not scale perfectly linearly would exhibit that behavior, just some a lot more than others).
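
The distinction between system throughput and per-processor throughput can be made concrete with the four speedups quoted above. This is only a sketch: the function name and the dictionary are illustrative, and the one assumption is that the 8-socket system has exactly twice the sockets of the 4-socket one with otherwise comparable parts.

```python
def scaling_efficiency(speedup, socket_ratio=2.0):
    """Per-socket throughput of the larger system relative to the
    smaller one: 1.0 would be perfectly linear scaling."""
    return speedup / socket_ratio


# 4-socket -> 8-socket speedups quoted in the post.
speedups = {
    "SPECint_rate_base": 1.73,
    "SPECint_rate_peak": 1.79,
    "SPECfp_rate_base": 1.14,
    "SPECfp_rate_peak": 1.27,
}

if __name__ == "__main__":
    for name, s in speedups.items():
        eff = scaling_efficiency(s)
        print(f"{name}: system throughput {s:.2f}x, "
              f"per-socket throughput {eff:.1%} of the 4-socket figure")
```

Every speedup is above 1.0, so system throughput still rises in all four cases, while every efficiency is below 1.0, so per-socket throughput falls in all four cases, most sharply for the SPECfp_rate results.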

- bill