Re: Intel publishes Larrabee paper



In article <ggtgp-4EC5AC.17084105042009@xxxxxxxxxxxxxxxxxxx>,
Brett Davis <ggtgp@xxxxxxxxx> wrote:

In article <Vr6dnRcUBPijNUXUnZ2dnUVZ_sLinZ2d@xxxxxxxxxxxx>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

Brett Davis wrote:
In article
<baa19365-e7f8-4c8a-bbca-319671d990dc@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
MitchAlsup <MitchAlsup@xxxxxxx> wrote:
As noted here about 2 years ago, I indicated that a company could
build an x86 CPU that would have 50% of an Opteron's performance in

That was a great post!

The natural conclusion that comes out of this is that Intel's next big
uber-chip will have four big cores and 64-ish Larrabee cores.

This will go up against AMD's four big cores plus 800 ATI vector
processors. The ATI chip will also do graphics, removing a big, expensive
chip from the motherboard. Intel will claim the same for Larrabee, and
Intel is happy to sell chips with crappy graphics. ;)

I think your prediction is going to be pretty close.

Have you all read the DDJ article by Mike Abrash where he goes into the
low-level detail of the LRB cores?

http://www.ddj.com/architect/216402188

The Larrabee architecture did not quite make me barf. ;)
Given x86 as the base, it is understandable that the third vector operand
can be a memory operand: the opcode bits are there and you have a
limited register set. Make a virtue out of necessity, etc.

There is a certain logic to Intel's design decisions for Larrabee if you
think four-way multithreading makes sense because of memory latency stalls
on reads.
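A rough way to see why four threads might be the chosen number: under an idealized model (an assumption here, ignoring thread-switch overhead), each thread computes for R cycles between misses and then stalls for L cycles, so N threads keep the core busy a fraction min(1, N*R/(R+L)). The 8-cycle run length and 24-cycle stall below are hypothetical numbers for illustration, not published Larrabee figures.

```python
import math

def utilization(threads: int, run_cycles: int, stall_cycles: int) -> float:
    """Idealized core utilization: each thread computes for run_cycles, then
    stalls for stall_cycles while other threads fill the gap. Switch cost ignored."""
    return min(1.0, threads * run_cycles / (run_cycles + stall_cycles))

def threads_to_hide(run_cycles: int, stall_cycles: int) -> int:
    """Smallest thread count that fully covers the stall in this model."""
    return math.ceil((run_cycles + stall_cycles) / run_cycles)

# Hypothetical workload: 8 cycles of work between misses, 24-cycle miss latency.
for n in (1, 2, 4):
    print(n, "threads ->", utilization(n, 8, 24))
print("threads needed:", threads_to_hide(8, 24))
```

With these made-up numbers, one thread keeps the core busy a quarter of the time and four threads cover the stall completely; a longer stall or shorter run length would push the required thread count up.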

32 registers times 4 sets, plus some rename registers, and you are
looking at ~200 physical registers. If you try to do this with 256
visible registers, your physical register set becomes large; this might
be an issue, or not.
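The register arithmetic above can be made concrete. This sketch assumes 512-bit vector registers (the LRBni width) and a hypothetical pool of 72 rename registers, picked only so the 32-register case lands on the ~200 figure; neither the rename count nor the resulting byte totals come from Intel.

```python
def register_file_entries(visible_regs: int, threads: int, rename_regs: int) -> int:
    """Physical registers: one architectural set per hardware thread plus a rename pool."""
    return visible_regs * threads + rename_regs

def register_file_bytes(entries: int, width_bits: int = 512) -> int:
    """Storage for the register file, assuming every entry is width_bits wide."""
    return entries * width_bits // 8

RENAME = 72  # hypothetical rename-pool size, chosen to reach ~200 entries

lrb = register_file_entries(32, 4, RENAME)   # 32 visible registers, 4 threads
big = register_file_entries(256, 4, RENAME)  # same scheme with 256 visible registers
print(lrb, "entries,", register_file_bytes(lrb), "bytes")
print(big, "entries,", register_file_bytes(big), "bytes")
```

Going from 32 to 256 visible registers multiplies the architectural part of the file by eight, which is why the larger register set gets expensive once it is replicated per thread.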

The die-size tradeoffs of four-way multithreading are mostly about
reducing the cost of the die space spent on x86 decode, and other die
costs, to bring costs down near ATI's for the amount of work done. For
anyone besides Intel, if multithreading makes sense, the first thing they
would do is dump the x86 instruction set to free up that wasted space.

The future looks to be RRAM, in which case we will get 8 GB of
embedded RAM that is only two dozen cycles away or so. Simple prefetch
and a big register set can hide that latency. Multithreading dies,
except as a marketing gimmick.
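Why prefetch plus a big register set could cover a latency of about two dozen cycles: a loop must issue each prefetch enough iterations ahead that the data arrives before use, and by Little's law the number of loads in flight (each needing a destination register) is latency times issue rate. This is a minimal sketch; the 4-cycle loop body and one load per cycle are hypothetical values, and only the 24-cycle latency comes from the estimate above.

```python
import math

def prefetch_distance(latency_cycles: int, cycles_per_iteration: int) -> int:
    """Iterations ahead to prefetch so the data arrives before it is needed."""
    return math.ceil(latency_cycles / cycles_per_iteration)

def loads_in_flight(latency_cycles: int, loads_per_cycle: float) -> int:
    """Little's law: outstanding loads = latency x issue rate.
    Each outstanding load needs a register to land in."""
    return math.ceil(latency_cycles * loads_per_cycle)

LATENCY = 24  # "two dozen cycles away or so"
print("prefetch distance:", prefetch_distance(LATENCY, 4))   # hypothetical 4-cycle loop body
print("registers for loads in flight:", loads_in_flight(LATENCY, 1.0))
```

Under these assumptions a single thread needs about two dozen registers just to hold loads in flight, which a large register set can supply without a second hardware thread.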

Brett