Re: IBM 45nm -- new or licensed from Intel?




"Paul A. Clayton" <paaronclayton@xxxxxxxxxxxxx> wrote in message
news:ec93af5b-0922-46c6-bae3-199a35fa171b@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On Feb 18, 7:12 pm, "Wilco Dijkstra" <Wilco_dot_Dijks...@xxxxxxxxxxxx>
wrote:
[snip]
Basic arithmetic. Consider 64-bit constants that are repeatedly used
(such as address constants). It's obvious that it is always better to load
them from a literal pool rather than build them from multiple instructions,
both in terms of codesize and performance.

For repeatedly used constants, a constant table would seem to be a
win for memory size. However, it is not clear that it would be a
performance win. Only part of a cache block in a constant table may
be used while it is in cache, whereas much of the code in an
instruction block is likely to be used.

One would typically group constants used by the same function(s)
to minimise D-cache misses. Although you will get extra D-cache
misses, the reduced codesize also means fewer I-cache misses.
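
As a rough sketch of the code-size arithmetic being argued here, the
following assumes a hypothetical fixed-width ISA with 4-byte instructions
and 16-bit immediates, and assumes the worst case where a 64-bit constant
needs every immediate chunk. The function names and parameter values are
illustrative assumptions, not figures from any ISA named in this thread.

```python
# Hypothetical code-size model; all parameters are illustrative assumptions,
# not measurements of any real ISA.

def inline_build_bytes(uses, insn_bytes=4, const_bits=64, imm_bits=16):
    """Bytes of code when every use site rebuilds the constant inline."""
    insns_per_const = -(-const_bits // imm_bits)  # ceiling division
    return uses * insns_per_const * insn_bytes

def literal_pool_bytes(uses, insn_bytes=4, const_bits=64):
    """Bytes when each use site is one load and the pool holds one copy."""
    return uses * insn_bytes + const_bits // 8

if __name__ == "__main__":
    for uses in (1, 2, 4):
        print(uses, inline_build_bytes(uses), literal_pool_bytes(uses))
```

Under these assumptions the pool is smaller even for a single use (12
bytes against 16) and the gap widens with each further use. The model
covers size only; it says nothing about the D-cache versus I-cache miss
trade-off discussed above.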

In addition, sequential prefetch for instructions is a relatively
simple, obvious performance boost, but prefetching from a constant
table is more challenging. (Perhaps there are compilers smart enough
[possibly even without profiling] to place global values intelligently
to reduce cache misses; but that seems a relatively sophisticated
optimization.)

Prefetching is not an issue here: we're talking about constants, so
there are no data dependencies, and thus such loads can be
scheduled freely (to avoid stalls, to fill unused issue slots etc).

A load immediate instruction might also have a
slight power-use advantage relative to a load from table
instruction (no base address register read and the potential to avoid
an additional cache access [at the cost of a wider instruction
fetch]).

Correct. Though on RISCs you need several sethi/setlo instructions
to create a constant and these likely use more resources than one load
instruction.

Widening instruction fetch to allow for larger immediates is
probably not that complex.

However this increases power consumption for all fetches, while loads
only consume power when you use them...

An immediate provides another advantage over a load from a
constant table: the value is available early in the pipeline (this is
particularly significant for control flow instructions, but could have
an impact on dependent instructions [e.g., load constant $A, load
value at $A + offset to Rx, operate using Rx could be one cycle
shorter with an immediate relative to using a load from table with a
single-cycle DCache]).

Again, this is not at all an issue for constants. They can be loaded as
early as necessary to avoid any stalls. In most cases constant loads
are not performance critical as they are placed in registers and reused
many times (some compilers even place often used constants in global
registers so they are available to every function for free). You often see
a few constant loads at the start of a function, and nowhere else -
certainly not in time critical loops!
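
The two positions can be put side by side in a toy latency model. This is
only a sketch: it assumes an in-order machine, a one-cycle D-cache hit, a
one-cycle ALU operation, and an immediate that costs no extra cycle. The
function names and cycle counts are assumptions chosen for illustration.

```python
# Toy in-order latency model; the one-cycle latencies are assumptions.

LOAD_CYCLES = 1   # assumed single-cycle D-cache hit
OP_CYCLES = 1     # assumed single-cycle ALU operation

def chain_cycles(constant_from_table):
    """Cycles for: get address constant A, load [A + offset], operate."""
    get_constant = LOAD_CYCLES if constant_from_table else 0
    return get_constant + LOAD_CYCLES + OP_CYCLES

def loop_cycles(iterations, constant_from_table, hoisted):
    """Total cycles when the load+operate pair runs in a loop."""
    per_iter = LOAD_CYCLES + OP_CYCLES
    const_cost = LOAD_CYCLES if constant_from_table else 0
    if hoisted:
        # Constant is fetched once before the loop and kept in a register.
        return const_cost + iterations * per_iter
    # Constant is re-fetched on every iteration.
    return iterations * (const_cost + per_iter)

if __name__ == "__main__":
    print(chain_cycles(True), chain_cycles(False))
    print(loop_cycles(100, True, True), loop_cycles(100, False, True))
```

In this model the isolated chain is one cycle longer with a table load,
which is the case described for the immediate. Once the constant load is
hoisted out of a 100-iteration loop, the totals differ by that same single
cycle (201 against 200), which is the case described for constants kept in
registers. The model ignores cache misses and any scheduling of the load
into otherwise idle issue slots.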

[snip]
It's obvious to me that a 2-way in-order CISC is going to lose against a
2-way RISC clock for clock, so it needs to run at a higher frequency for the
same performance. Of course we'll have to wait and see until benchmarks
are run on both.

Huh? That would seem to depend on what one considers 2-way (2 AGUs
and 2 integer FUs--i.e., 2 macro-ops--or one AGU/simple integer FU and
one complete integer FU).

On x86 2-way typically means 2 AGU/ALU (load+operate), on RISCs it
is usually 1 AGU + 2 ALU (due to needing fewer memory operations).

The usual advantages of a RISC are the larger
number of registers (which is not the case between x86-64 and ARM),
the availability of instructions that do not overwrite one of the sources
(avoiding extra moves), simpler decode (some of the extra front-end
pipeline length can be hidden by predecoding at the cost of ICache size
[and access width] and so some power penalty), and simpler pipeline.

And not to forget an advantage on the compiler side due to a more
orthogonal ISA. And while x64 uses a RISC calling convention, x86
uses the stack as much as it possibly could.

(ARM, of course, has load/store multiple, shift-and-operate, and
predication, so the useful work per instruction might be closer to x86-64

Or perhaps a bit higher :-)

[of course, no one would require an ARM to have 16 load and 16 store
ports to support 'true' 2-wide issue, though I suspect that a moderately
optimized higher-performance implementation could support 64-bit
loads and stores])

Indeed, all high-end ARMs support 64-bit accesses.

Anyway, it is not obvious to me that a 2-way
x86-64 would have lower performance than an equally clocked
ARM (MIPS, even with its greater register count, might be at a
disadvantage in that it has less work per instruction). An x86-64
would lose in area, power, and design complexity to an ARM,
but x86-64 might do enough extra work per cycle to equal or
surpass an ARM at the same frequency.

I'm not sure how you see an in-order x86 doing more work - ARM has more
powerful instructions compared to x86, and so needs fewer instructions
to do the same job. Despite being powerful, the instructions are simple
to implement in a shallow pipeline. Contrast this with a deep load+operate
pipeline needed for an in-order x86...

Current x86's are fast because of the out-of-order execution engine.
Take that away and the ISA matters again...

Wilco

