Re: IBM 45nm -- new or licensed from Intel?



"Wilco Dijkstra" <Wilco_dot_Dijkstra@xxxxxxxxxxxx> writes:

"Anton Ertl" <anton@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message news:2008Feb14.202346@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
"Wilco Dijkstra" <Wilco_dot_Dijkstra@xxxxxxxxxxxx> writes:
....
But it's not just the encoding that is irregular, it is the odd register
set and instructions that act on them. Think of the weird set of 8, 16, 32
and 64-bit registers

AMD64 has 16 64-bit general-purpose registers, of which the lower 8
bits, 16 bits, or 32 bits can be used, just as in most other register
machines. Why do you think they are weird? Ok, there's also AL BL CL
DL, but there is no requirement to use them. For IA-32, the 8-bit
registers are more restricted, but you do not need to use them (see
below).

(and the complex stalls to avoid if you do use the
narrow registers)

That's not an ISA issue. And if you program like on RISCs (i.e., load
with MOVSX or MOVZX), that's not an issue at all. And it also solves
the IA-32 8-bit register issue.
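
To make the MOVZX/MOVSX point concrete, here is a small Python model of what each kind of write does to a 32-bit register. It is only a sketch: the function names are invented for illustration, and it models the architectural result, not any particular microarchitecture. The point it shows is that a write to an 8-bit register produces a value that depends on the old register contents, which is where the stalls on some implementations come from, whereas a MOVZX/MOVSX load overwrites the whole register.

```python
def write_low8(reg32, value8):
    """Model `mov al, x`: only bits 0..7 change, bits 8..31 are kept."""
    return (reg32 & 0xFFFFFF00) | (value8 & 0xFF)


def movzx8(value8):
    """Model `movzx eax, byte`: whole register overwritten, zero-extended."""
    return value8 & 0xFF


def movsx8(value8):
    """Model `movsx eax, byte`: whole register overwritten, sign-extended."""
    value8 &= 0xFF
    return (value8 | 0xFFFFFF00) if value8 & 0x80 else value8


old_eax = 0x12345678
print(hex(write_low8(old_eax, 0x9A)))  # 0x1234569a (needs the old value)
print(hex(movzx8(0x9A)))               # 0x9a
print(hex(movsx8(0x9A)))               # 0xffffff9a
```

Neither `movzx8` nor `movsx8` takes the old register value as an argument, which is the RISC-like behaviour referred to above.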

the hard coded registers in common instructions etc.

The only ones I can think of are CALL, RET, PUSH, POP, and most code
generators would not use any other register than ESP for these
instructions (especially CALL) even if they had the option.

All other hard-coded things are for not-so-common instructions like
shifts and division.

Interestingly, as a compiler writer I have found IA-32 easier to
generate code for than PPC, and the reason is, of all things, the
encoding. That's because the code generation technique I used
<http://www.complang.tuwien.ac.at/papers/ertl%26gregg04pact.ps.gz>
just concatenated little snippets of binary code generated by a C
compiler, and patched the appropriate constants into these snippets.
Constants are encoded very straightforwardly in IA-32: I just had to
store them at the right offset from the start of the snippet; this was
somewhat more complex for PPC, where large constants are split into
two halves.
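
A rough sketch of the difference, in Python for brevity. This is not the code from the paper: the snippets are hand-written stand-ins (`mov eax, imm32` for IA-32, `lis r3` / `ori r3, r3` for PPC), and the patch offsets are passed in by hand, whereas a real implementation would have to discover them. The helper names are hypothetical.

```python
import struct


def patch_ia32(snippet, imm_offset, value):
    """Store a 32-bit constant at one offset (little-endian immediate)."""
    code = bytearray(snippet)
    struct.pack_into("<I", code, imm_offset, value & 0xFFFFFFFF)
    return bytes(code)


def patch_ppc(snippet, hi_offset, lo_offset, value):
    """Split the constant into 16-bit halves and patch two instructions."""
    code = bytearray(snippet)
    value &= 0xFFFFFFFF
    struct.pack_into(">H", code, hi_offset, value >> 16)
    struct.pack_into(">H", code, lo_offset, value & 0xFFFF)
    return bytes(code)


# mov eax, imm32: opcode B8 followed by the 4-byte immediate
IA32_SNIPPET = bytes.fromhex("b800000000")
# lis r3, hi ; ori r3, r3, lo: immediates sit in the low half of each word
PPC_SNIPPET = bytes.fromhex("3c60000060630000")

print(patch_ia32(IA32_SNIPPET, 1, 0x12345678).hex())   # b878563412
print(patch_ppc(PPC_SNIPPET, 2, 6, 0x12345678).hex())  # 3c60123460635678
```

The PPC case needs two patch locations per constant. If the C compiler had chosen an `addi` instead of the `ori` for the low half, the high half would additionally need a carry adjustment, because `addi` sign-extends its immediate.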

That depends a bit on how you designed your snippets. It would have
been equally possible to create snippets that use a full 32-bit or 64-bit
immediate that is loaded from a constant pool.

The decision on the encoding of constants was made by the C
compiler which generated the snippets. However, the constant-pool
approach would add more complexity than the constant-splitting
approach: my compiler would have to find out that the snippet refers
to a constant pool, where the offset into the constant pool is, set up
its own constant pool, patch the offset, and support dealing with
constant pools that cannot be addressed with a single offset.
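
For illustration, the extra bookkeeping listed above might look roughly like this in Python. Everything here is an assumption made for the sketch: the snippet is a hand-written `lwz r3, disp(r2)`, r2 as the pool base register is arbitrary, the class and function names are invented, and the displacement offset is given rather than discovered. A pool that outgrows the 16-bit displacement is simply reported as an error, where a real code generator would have to handle it.

```python
import struct

MAX_DISP = 0x7FFF  # largest positive signed 16-bit displacement


class ConstantPool:
    def __init__(self):
        self.words = []

    def add(self, value):
        """Append a 32-bit constant; return its byte offset in the pool."""
        offset = 4 * len(self.words)
        if offset > MAX_DISP:
            raise OverflowError("pool not reachable with a single offset")
        self.words.append(value & 0xFFFFFFFF)
        return offset

    def image(self):
        """Big-endian bytes of the pool, to be emitted next to the code."""
        return b"".join(struct.pack(">I", w) for w in self.words)


def patch_pool_load(snippet, disp_offset, pool, value):
    """Allocate a pool slot and patch its offset into the load."""
    code = bytearray(snippet)
    struct.pack_into(">H", code, disp_offset, pool.add(value))
    return bytes(code)


# lwz r3, disp(r2): the base register choice is an assumption of this sketch
LOAD_SNIPPET = bytes.fromhex("80620000")

pool = ConstantPool()
first = patch_pool_load(LOAD_SNIPPET, 2, pool, 0x12345678)
second = patch_pool_load(LOAD_SNIPPET, 2, pool, 0xCAFEBABE)
print(first.hex(), second.hex(), pool.image().hex())
# 80620000 80620004 12345678cafebabe
```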

It is actually more efficient doing a load from a constant pool rather than
encoding a big immediate in an instruction (or several instructions on a
RISC).

On what evidence do you base this claim? Why don't the compiler
writers for the PPC use this knowledge?

And making it go fast
has some significant costs. For example the "low power" Silverthorne is
more power hungry than current embedded RISCs despite having 2
process nodes advantage...

Silverthorne does not appear to be designed as the ultimate in
low power; it looks more like one step below Intel's other offerings. E.g., they
made Silverthorne superscalar and IIRC OoO, whereas I would expect a
very low-power CPU to be single-issue in-order (what microarchitecture
do the RISCs you refer to have?).

Silverthorne is 2-way in-order:
http://arstechnica.com/news.ars/post/20080205-small-wonder-inside-intels-silverthorne-ultramobile-cpu.html

The ARMs that go in today's high-end mobiles are mostly 1-way in-order
running at up to 750MHz using less than 500mW on 90nm. The next
generation has 2-way in-order and out-of-order cores using the same
power budget at 1GHz on 65nm.

That's a more appropriate comparison, but Silverthorne is still quite
a bit faster, so it's not surprising if it uses more power. The
article you cited says:

|Intel claims that the device has a TDP of 2 watts at 2GHz on 1.0V. At
|lower speeds, the device gets down to 0.5 watts, but it's not clear
|how far down Intel will have to ratchet the clockspeed to get there.

So overall it's not at all clear that Silverthorne consumes more power
at the same performance.
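
A back-of-the-envelope calculation with the figures quoted in this thread shows why the numbers alone settle little. It is a sketch under loud assumptions: clock rate is not performance, the ARM figures are upper bounds ("less than 500mW", "up to 750MHz"), and the scaling helper assumes power falls only linearly with clock at fixed voltage, which ignores any voltage reduction. The function names are invented.

```python
def mw_per_mhz(power_watts, clock_mhz):
    """Power per unit of clock rate, in milliwatts per MHz."""
    return power_watts * 1000.0 / clock_mhz


def clock_for_budget(power_watts, clock_mhz, budget_watts):
    """Clock reachable within a power budget, assuming power ~ clock only."""
    return clock_mhz * budget_watts / power_watts


silverthorne = mw_per_mhz(2.0, 2000)  # 2 W TDP at 2 GHz
arm = mw_per_mhz(0.5, 750)            # upper bound: 500 mW at 750 MHz

print(silverthorne)                          # 1.0
print(round(arm, 3))                         # 0.667
print(clock_for_budget(2.0, 2000, 0.5))      # 500.0
```

On these crude numbers the per-MHz gap is about a factor of 1.5, and it says nothing about the work done per clock by a 2-way core versus a 1-way one.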

But we may see Intel go for lower-power markets, and then we may see
how much the ISA costs in power. Once upon a time people claimed that
the IA-32 ISA has costs in terms of speed, but with enough effort this
problem has mostly been overcome. Maybe we will see a similar effort
applied to power-efficiency.

It's called Silverthorne :-) It's unlikely to be their last try, but it shows that
the x86 ISA cost cannot be overcome with a major process advantage.

One example does not prove a universal claim. And I don't even accept
your claim that this example does worse wrt power than a RISC. You
would have to compare Silverthorne with a RISC designed for similar
performance.

Concerning the process advantage, my impression is that nowadays this
is mainly in the die size (i.e., production cost) and/or transistor
count, but hardly in clock rate or power consumption; e.g., consider
the Northwood and the Prescott: the Prescott had a process advantage,
but was hardly any faster and ran hotter.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@xxxxxxxxxxxxxxxxxxxxxxxxxx Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html