Re: AMD Bulldozer optimization guide



On 1/22/2012 3:40 PM, Joe keane wrote:
In article<11457910.970.1327007257462.JavaMail.geo-discussion-forums@yqcz12>,
MitchAlsup<comp.arch@xxxxxxxxxxxxxxxx> wrote:
One of the BAD points of RISC instruction set design is that the
compiler/assembler/programmer has to invent
constants/displacements/immediates,... Many of these displacements need
more address bits than the instruction carries naturally (16-bits
typically, 13-bits SPARC, 12-bits 360+). The lack of enough displacement
bits creates a lot of instructions that simply paste bits. Decoders are
ALREADY good at pasting bits, arguably better than instruction sets.

On the RT, r0 always points to the function's "constant pool". So if
you need a big constant (e.g. full word), you just load it from there,
normal load/store. It makes no attempt to provide a way to get bigger
immediates.

I think it's fine. You don't really miss the immediates.

Except in locore.

Nearly every OS, on machines that do not switch some registers or run on a special interrupt stack in at least some modes, has some sort of "locore": an area of memory whose address can be constructed via an immediate WITHOUT DESTROYING ANY REGISTERS, so that you can save the registers that you would otherwise have to destroy in order to access any region of memory further afield.

Heck, quite a few have locore even if they switch registers on an event. It's always nice to have something that you can guarantee access to.
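To illustrate what "constructible via an immediate" means, here is a minimal sketch in C. It assumes a hypothetical machine whose loads and stores add a sign-extended displacement to a base that can be hardwired to zero (or an absolute addressing mode of the same width). The function names and the 16-bit width used in the note below are illustrative assumptions, not taken from any particular OS or ISA manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of `value` to 32 bits (1 <= bits <= 32). */
static uint32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t sign = (uint32_t)1 << (bits - 1);
    uint32_t mask = (sign << 1) - 1u;   /* wraps to all-ones when bits == 32 */
    value &= mask;
    return (value ^ sign) - sign;
}

/* Hypothetical model: an address is "locore" if a displacement of
 * `disp_bits` bits, added to a zero base, reproduces it exactly.
 * No scratch register is needed to form such an address. */
static bool reachable_from_zero_base(uint32_t addr, unsigned disp_bits)
{
    return sign_extend(addr, disp_bits) == addr;
}
```

Under these assumptions, a 16-bit displacement reaches the bottom 32 KB of the address space (and the top 32 KB, via negative displacements); anything else needs a register to hold the upper address bits, which is exactly the register you have not saved yet.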

---

"Constant pools" - a frequently used constant in a constant pool occupies DRAM, several places in the cache (full cache line, including neighbours that you should probably arrange to access together), and the register it is currently held in. And load instructions are necessary to pull it in, costing power.

Constants embedded in the instruction stream occupy those bits in the instruction stream, howsoever many times they are used. It is likely that such large immediates can be moved to registers just as often as constant pool memory locations can be moved to registers.

So, the differences between a constant pool and constants in the instruction stream are
(a) the difference between the decoder extracting the bits,
and the load instruction executing to get the bits
(b) the cost of the cache occupancy.
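As a rough sketch of point (a), the two paths can be modeled in C as follows. The function names are hypothetical, and the 16-bit halves are an assumption matching the "16-bits typically" figure quoted earlier; real encodings differ by ISA.

```c
#include <stdint.h>

/* Immediate path: the bits come out of the instruction stream and are
 * pasted together, in the style of a "load upper" + "or immediate" pair
 * (or a single long instruction that the decoder splits up). */
static uint32_t const_from_immediates(uint16_t upper, uint16_t lower)
{
    return ((uint32_t)upper << 16) | lower;
}

/* Constant-pool path: a base register points at the function's pool and
 * an ordinary load fetches the word through the data cache. */
static uint32_t const_from_pool(const uint32_t *pool_base, unsigned index)
{
    return pool_base[index];
}
```

Both produce the same value in a register. The first costs instruction-stream bits at every use; the second costs a load, plus the data-side cache line the pool entry lives in.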

Oh, and there's an additional benefit to immediate constants embedded in the instruction stream: they need never occupy scheduler bandwidth, RF read port bandwidth, etc. So you would probably NOT want to optimize them to registers as often as you need to optimize constant pool values, because even though the code may be bigger due to having replicas of the constant, the power costs are lower.

Hmm...

---

GPUs have "constant register files", and/or allow memory locations in a constant pool to be specified via a register-number-like field in an instruction, which is added to a base register to get the address.
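A minimal sketch of that address formation, in C. The scaling by element size and all names here are assumptions for illustration; actual GPU encodings vary and are often undocumented.

```c
#include <stdint.h>

/* Hypothetical encoding: the instruction carries a small constant index
 * where a register number would otherwise go, and the hardware adds it,
 * scaled by the element size, to a constant-pool base register. */
static uint64_t constant_address(uint64_t const_base, unsigned const_index,
                                 unsigned element_bytes)
{
    return const_base + (uint64_t)const_index * element_bytes;
}
```

For example, with a base of 0x1000 and 16-byte (4 x 32-bit) elements, constant index 3 would refer to address 0x1030.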

Typically there are lots of swizzles and transforms available to be applied to constants, as well as "ordinary" GPU registers.

I suppose that there are some optimizations that you can create for a nearly constant register file.

Definitely, this saves a few bits on the destination register, since you don't write to constant registers (at least not very often, and not via ordinary instructions in the instruction set; you may load and store the constant registers en masse).

You wouldn't have to rename the constants, you could just serialize when they are changed.

Constant registers aren't attached to a bypass network. However, if you are doing register file port reduction, they may participate.