Re: Intel publishes Larrabee paper



In article <ggtgp-4EC5AC.17084105042009@xxxxxxxxxxxxxxxxxxx>,
Brett Davis <ggtgp@xxxxxxxxx> wrote:

In article <Vr6dnRcUBPijNUXUnZ2dnUVZ_sLinZ2d@xxxxxxxxxxxx>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

Brett Davis wrote:
In article
<baa19365-e7f8-4c8a-bbca-319671d990dc@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
MitchAlsup <MitchAlsup@xxxxxxx> wrote:
As noted, here, about 2 years ago, I indicated that a company could
build an x86 CPU that would have 50% of an Opteron's performance in

That was a great post!

The natural conclusion that comes out of this is that Intel's next big
uber chip will have four big cores and 64ish Larrabee cores.

This will go up against AMD's four big cores plus 800 ATI vector
processors. The ATI chip will also do graphics, removing a big expensive
chip from the motherboard. Intel will claim the same for Larrabee, and
Intel is happy to sell chips with crappy graphics. ;)

I think your prediction is going to be pretty close.

Have you all read the DDJ article by Mike Abrash where he goes into the
low-level detail of the LRB cores?

http://www.ddj.com/architect/216402188

The Larrabee architecture did not quite make me barf. ;)
With x86 as the base, it is understandable that the third vector operand
can be a memory operand: the opcode bits are there and you have a
limited register set. Make a virtue of necessity, etc.

There is a certain logic to Intel's design decisions for Larrabee if you
think four-way multithreading makes sense due to memory latency stalls
for reads.

32 registers times 4 sets, plus some rename registers, and you are
looking at ~200 actual registers. If you try to do this with 256
visible registers, you need 1024 architectural registers across the
four threads before any renaming. That might be an issue, or not.
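
To put rough numbers on that, here is a back-of-envelope sketch. The
rename count (72) is my own assumption, picked only so the 32-register
case lands at ~200, and the 64-byte register width assumes the 512-bit
vectors described for Larrabee. Treat the output as an illustration,
not as Intel's actual budget.

```python
# Back-of-envelope register file sizing for a 4-way threaded vector core.
# RENAME_REGS and VECTOR_BYTES are assumptions for illustration only.

THREADS = 4
VECTOR_BYTES = 64   # assumed: 512-bit vector registers
RENAME_REGS = 72    # assumed: chosen so the 32-register case is ~200


def physical_registers(visible, threads=THREADS, rename=RENAME_REGS):
    """Visible registers per thread, times threads, plus rename registers."""
    return visible * threads + rename


def register_file_kib(visible, threads=THREADS, rename=RENAME_REGS):
    """Total register file storage in KiB for the given visible set."""
    return physical_registers(visible, threads, rename) * VECTOR_BYTES / 1024


if __name__ == "__main__":
    for visible in (32, 256):
        print(visible, "visible ->",
              physical_registers(visible), "physical,",
              register_file_kib(visible), "KiB")
```

Under those assumptions the 32-register design needs 200 physical
registers (12.5 KiB) and the 256-register design needs 1096 (68.5 KiB).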

The die size tradeoff of 4-way multithreading is mostly about reducing
the cost of x86 decode die space, and other die costs, to bring costs
down near ATI's for the amount of work done. For anyone besides Intel,
if multithreading makes sense, the first thing they would do is dump
the x86 instruction set to free up that wasted space.

The future looks to be RRAM, in which case we will get 8 gigs of
embedded RAM that is only two dozen cycles away or so. Simple prefetch
and a big register set can hide that latency. Multithreading then dies,
except as a marketing gimmick.
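
A rough sketch of the latency-hiding arithmetic behind that claim. The
24-cycle figure is the "two dozen cycles" above. The 8 cycles of work
per loop iteration and the 300-cycle DRAM latency are numbers I made up
for comparison, and the model (a thread runs for a fixed time, then
stalls for the full latency) is the simplest one that works, not a
description of any real core.

```python
import math


def prefetch_distance(latency_cycles, cycles_per_iteration):
    """Loop iterations ahead that a prefetch must be issued so the data
    arrives by the time it is needed."""
    return math.ceil(latency_cycles / cycles_per_iteration)


def threads_to_hide(latency_cycles, cycles_between_misses):
    """Hardware threads needed so that one is always runnable: while one
    thread waits latency_cycles, the others must supply that much work."""
    return 1 + math.ceil(latency_cycles / cycles_between_misses)


if __name__ == "__main__":
    WORK = 8  # assumed cycles of useful work per iteration / between misses
    for name, latency in (("embedded RAM", 24), ("DRAM (assumed)", 300)):
        print(name, "latency", latency,
              "-> prefetch", prefetch_distance(latency, WORK), "iterations ahead,",
              "or", threads_to_hide(latency, WORK), "threads")
```

With these made-up numbers a 24-cycle memory needs prefetching only 3
iterations ahead, which a big register set can hold in flight, while a
300-cycle memory needs 38 iterations ahead or 39 threads.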

Brett