Re: University rank of Computer Architecture



On Mar 28, 9:14 pm, Bill Todd <billt...@xxxxxxxxxxxxx> wrote:
MitchAlsup wrote:
At this point in time, one can actually build a whole processing core
in a die area smaller than the branch predictor(s) in a modern x86.
Doubling the size of this on-die branch predictor might gain 7%
performance, while adding one of these tiny cores might add 50%. So
there is more bang for the buck in adding smaller cores than in making
already large branch predictors even larger.

Is 7% a worthy gain--yes and no. The bigger branch predictor consumes
more power than the tiny core and delivers less performance. So while
it might be a good decision in the small, it is unlikely to be the way
forward in the longer term.

That logic presupposes that multi-threaded throughput is at least as
important as single-thread throughput - a position rather clearly at
odds with nearly the entire industry until very recently. For that
matter, it's not clear that that stance has changed even now: only when
real limits to continued significant increases in single-thread
performance started to appear did it turn to multi-thread optimizations.

While I can agree that software would prefer design points with fewer,
more powerful cores, HW designers have reached the point where
this is no longer feasible. So, a design team can invent a new
pipeline and spend 3 years working through the myriad issues in order
to exploit the next step in semiconductor processing; OR that same
team can paste 2 current cores on a die and roll it out in 1 year. The
PACE of being able to deliver a capability is simply better by
replication than through complicated design. Software needs to step up
and figure out how to utilize this upcoming capability.

If significant single-core performance advances were still feasible, do
you really think that most of the industry would forsake them for
improved multi-thread throughput, any more than they ever did?

No, Amdahl's law still applies: one big node outperforms several
smaller nodes of the same peak aggregate performance. The problem is
we can no longer build that one big node--it ends up consuming more
silicon real estate, more power, and more design time than
replication. Customers tell us that we cannot consume more than X
amount of power; the FAB guys tell us we can consume no more than Y
amount of nanoacres. So, we live within these means and try to do as
well as we can.
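
A minimal sketch of the Amdahl's law point above, under stated assumptions:
the "one big node" is modeled as a single core 4x as fast as a baseline
core, and the "several smaller nodes" as 4 baseline cores, so both have the
same peak aggregate throughput. The function name, the 4x figure, and the
parallel fractions are illustrative choices, not numbers from this thread.

```python
def amdahl_speedup(parallel_fraction, n_cores, per_core_speed=1.0):
    """Speedup over one baseline core (speed 1.0).

    The serial part runs on a single core; the parallel part is assumed
    to split perfectly across all cores (an optimistic assumption).
    """
    serial_time = (1.0 - parallel_fraction) / per_core_speed
    parallel_time = parallel_fraction / (per_core_speed * n_cores)
    return 1.0 / (serial_time + parallel_time)


# Same peak aggregate throughput (4 baseline-core units), built two ways.
for p in (0.5, 0.9, 0.99, 1.0):
    big = amdahl_speedup(p, n_cores=1, per_core_speed=4.0)
    small = amdahl_speedup(p, n_cores=4, per_core_speed=1.0)
    print(f"parallel fraction {p:.2f}: "
          f"one big node {big:.2f}x, four small nodes {small:.2f}x")
```

In this model the small nodes only match the big node when the work is
entirely parallel; any serial fraction leaves them behind.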

Given the amount of on-chip cache deployed these days, use of chip area
doesn't seem to be a legitimate obstacle to improving branch-prediction
units, though *significant* increases in power consumption might be.

There are many scenarios where a smaller core with more cache
outperforms the bigger core where the silicon acreage is identical. The
future looks set to find (stumble over) many more.

Given the 'memory wall', isn't sub-optimal branch prediction one of the
most important performance-limiters today for a large percentage of
important workloads (i.e., doesn't it contribute a great deal to the
amount of time a processor spends idle with a single-threaded workload)?

No, memory latency is a much bigger limitation than is branch
misprediction. Memory latency as seen at the processor pipeline is
currently on the order of 200 clocks. There is NO AMOUNT OF BUFFERING
that is capable of absorbing this amount of delay. So, in effect,
every cache miss causes a processor stall. This would not be true if
DRAM were only 50 clocks away. There are applications (Business mainly)
where one could assume an infinite speed processor and the current
memory hierarchy and end up with (almost) no gain in the performance
over the current processors!
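
A rough sketch of the "infinite speed processor" argument, assuming a
simple model in which every cache miss stalls the pipeline for the full
memory latency. The core CPI of 0.5 and the 0.02 misses per instruction are
made-up illustrative values; only the 200-clock and 50-clock latencies come
from the text above.

```python
def clocks_per_instruction(core_cpi, misses_per_instr, miss_penalty_clocks):
    """Average clocks per instruction when each miss is a full stall."""
    return core_cpi + misses_per_instr * miss_penalty_clocks


def gain_from_infinite_core(core_cpi, misses_per_instr, miss_penalty_clocks):
    """Speedup if the core itself took zero time and only memory remained."""
    now = clocks_per_instruction(core_cpi, misses_per_instr, miss_penalty_clocks)
    ideal = clocks_per_instruction(0.0, misses_per_instr, miss_penalty_clocks)
    return now / ideal


# Hypothetical workload: core CPI 0.5, 20 misses per 1000 instructions.
for latency in (200, 50):
    gain = gain_from_infinite_core(0.5, 0.02, latency)
    print(f"memory {latency} clocks away: infinite-speed core gains {gain:.3f}x")
```

With these assumed numbers the infinitely fast core buys only a small gain
at 200 clocks, and a noticeably larger one at 50 clocks, because the stall
term dominates the total in the first case.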

Secondly, it is not clear that one can build branch predictors that
are significantly better than the ones we currently use (given power and
frequency constraints). Of the parts I am most familiar with, doubling
the memory associated with the branch predictor would result in a 7%
reduction in the frequency of operation along with a 7% increase in
performance. This is difficult to justify.
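
The arithmetic behind "difficult to justify" can be sketched as follows,
assuming the 7% performance increase is a per-clock (IPC) gain that then
has to be multiplied by the 7% lower clock frequency. That reading is an
assumption; the post does not spell out how the two figures combine.

```python
def net_performance(ipc_gain, freq_loss):
    """Relative performance = (instructions per clock) x (clock frequency)."""
    return (1.0 + ipc_gain) * (1.0 - freq_loss)


net = net_performance(ipc_gain=0.07, freq_loss=0.07)
print(f"net relative performance: {net:.4f}")  # 1.07 * 0.93 = 0.9951
```

Under that assumption the doubled predictor is a slight net loss, before
even counting the extra area and power.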

Until such time as multiple cores have been effectively harnessed to
improve single-threaded execution significantly (scouting being one of
the mechanisms being explored?), I'd think that there would still be
significant value to mine in areas like SMT and branch prediction that
multi-core approaches can't duplicate. How much of today's rush toward
increasing core counts is simply the path of least resistance toward
something the flacks can market, in the same ways that the GHz stampede
was? Sure, maintaining performance with reduced socket counts is great
for server systems anywhere above the low end, but the industry
certainly didn't seem to think it was that important before - why should
it suddenly be now (other than being easier to get to market quickly
than major internal design changes)?

A) it will be found that scout processes consume power faster than
they deliver performance. Thus such a design would end up being faster
only when one is allowed to burn more power. This era ended about 4
years ago.
B) if we, processor designers, had any good ideas as to how to push
the one big node forward, we would use them (and we are).
C) however, with the power wall, the memory wall, and the pin wiggling
wall, much of the path we have been on is no longer viable looking
forward.
D) it is LIKELY that DRAM bandwidth will rise to the point where the
boundary of the coherent domain is the chip to which this DRAM is
attached. Thus SM-MP will be relegated to the single socket, and
additional sockets will (in essence) become a different cluster. This
is still 8 years away, however.

As a system software type I *like* multiple cores: being able to
perform multiple tasks in parallel on behalf of multiple clients is what
I grew up wrestling with. But it's certainly not where the industry has
been concentrating for the last half of my professional life, so its
sudden enthusiasm is difficult for me to understand as anything but
opportunism - taking the path of least resistance after an obstacle has
appeared in their preferred direction.

Software has been screwing around with MP for 35 years; it's time to
get your act together, because more tiny processors are going to have
very significant performance potential above the big node design
points. Exploiting that potential can only be pulled off with SW that
gets its act together.

As a CPU architect, I am personally sorry not to be able to continue
down the path of the one big node. More of this, wider that,... But it
needs to be recognized that this ILP race has been run, and a new race
is ongoing. You can complain or get on board.

Mitch
