Re: OoO VAX (was: Code density and performance?)
- From: "Eric P." <eric_pattison@xxxxxxxxxxxxxxxxxx>
- Date: Sat, 30 Jul 2005 11:17:15 -0400
Anton Ertl wrote:
>
> "Eric P." <eric_pattison@xxxxxxxxxxxxxxxxxx> writes:
> >I certainly wasn't. It was just the topic of this thread, and in
> >particular, compared to RISC. If an Alpha can parallel decode and
> >schedule 4 instructions per clock, is there any way to do that for
> >the VAX instruction set?
>
> Of course there are ways, just as for the 386 architecture (the K7/K8
> can decode into 6 micro-instructions per cycle), and of course they
> are much more expensive than for the Alpha.
It does seem to be an issue of critical mass of gates.
But the x86 instruction set is considerably simpler than the VAX's.
E.g., for arithmetic/logic instructions an x86 can have 1 memory address
operand; the VAX allows up to 3, which gives many more permutations.
> >The other problems might be:
> >
> >- Strong memory ordering prevents any access reordering
>
> Well, the 386's memory model has also been described as strong (and in
> the meantime there have been postings that refined the notion), and it
> has not prevented fast superscalar implementations. Indeed, I dimly
> remember a posting (by Andy Glew?), where he wrote about the
> performance benefits of that model; I did not understand it at the
> time.
>
> >- Any number of idiosyncrasies wrt the order that data values
> > are read or written vs the order that auto increment/decrement
> > deferred operation are performed will inject pipeline stalls due
> > to potential memory aliases that probably never actually happen.
> > This combines with strong ordering to basically serialize everything.
>
> Not everything, only those cases where this actually happens. One
> could, e.g., speculate that everything can be executed nicely in
> parallel, and if that turns out to be wrong, cancel the instruction
> and reexecute it serially. Good compilers would avoid generating the
> slow cases.
Yeah, it won't take people long to learn what to avoid.
But here I was thinking that instructions can only be sequenced and
converted to internal micro-RISC instructions one at a time, because
they must enter the internal micro-op queue in a specific order.
Once in the internal uOp queue they can be scheduled in parallel.
On reflection, memory ordering shouldn't be a problem.
Once in the uOp queue, the normal OoO rules and a memory model such
as 'processor ordering', as introduced with the P-Pro, should be sufficient.
As long as reads don't bypass writes to the same physical
cache line it should be OK.
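
A minimal sketch of the bypass rule described above: a load may be hoisted past older pending stores only if none of them touches the same physical cache line. The 64-byte line size and the names `PendingStore` and `load_may_bypass` are assumptions made for illustration; they are not from the post or from any real implementation.

```python
from __future__ import annotations

from dataclasses import dataclass

LINE_BYTES = 64  # assumed cache line size (hypothetical, not from the post)


@dataclass
class PendingStore:
    phys_addr: int  # physical address of an older store that has not yet committed


def cache_line(phys_addr: int) -> int:
    """Index of the cache line containing this physical address."""
    return phys_addr // LINE_BYTES


def load_may_bypass(load_addr: int, older_stores: list[PendingStore]) -> bool:
    """True if the load hits no cache line written by an older pending store."""
    line = cache_line(load_addr)
    return all(cache_line(s.phys_addr) != line for s in older_stores)
```

Comparing at cache-line granularity is conservative: it will occasionally hold back a load that touches different bytes of the same line, in exchange for a cheaper comparison.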
> >- Having the program counter in a general register that can be
> > manipulated by auto increment addressing modes probably
> > causes many pipeline problems later to feed the value forward
>
> Possibly. Most of these cases are probably sufficiently rare that
> they need not be optimized; for the frequent idioms special
> optimizations are possible, as is done for some stuff in the 386
> implementations (e.g., parallel PUSHes).
PC-based addressing is used for all immediate operand and address values.
Those would need to be picked off as a special case, which
requires matching both the address mode bits and the specific
register number 15. More decode logic.
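
A rough sketch of the special-case match described above, written as software rather than decode logic. It relies on the VAX operand specifier layout (mode in the high nibble, register in the low nibble) and on the PC being register 15. The function name `classify_specifier` and the returned labels are hypothetical.

```python
PC = 15  # register number of the program counter on the VAX

# Mode nibbles that take on a special meaning when the register field is PC.
PC_MODES = {
    0x8: "immediate",          # autoincrement on PC: operand follows in the stream
    0x9: "absolute",           # autoincrement deferred on PC: address follows
    0xA: "relative",           # byte displacement off PC
    0xC: "relative",           # word displacement off PC
    0xE: "relative",           # longword displacement off PC
    0xB: "relative deferred",  # byte displacement deferred off PC
    0xD: "relative deferred",  # word displacement deferred off PC
    0xF: "relative deferred",  # longword displacement deferred off PC
}


def classify_specifier(spec: int) -> str:
    """Classify one operand specifier byte as a PC special case or 'general'."""
    mode = (spec >> 4) & 0xF
    reg = spec & 0xF
    if reg == PC and mode in PC_MODES:
        return PC_MODES[mode]
    return "general"
```

For example, `classify_specifier(0x8F)` matches both the mode bits and register 15, while `classify_specifier(0x80)` (autoincrement on R0) falls through to the general path.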
> >- 16 integer registers with many having predefined functions
> > is too small and causes lots of register spills.
>
> I would agree, but the AMD64 architects seem to disagree.
>
> Anyway, the 8 integer registers of the 386 architecture hurt, but the
> implementations are still competitive in performance.
>
> >- Having a single integer and float register set means extra
> > time moving float values over to float registers and back.
>
> I don't understand this. Is there a single register set, or separate
> integer and FP registers? In the former case, what problem are you
> discussing? In the latter case, almost all RISCs and the 386
> architecture have the same "problem".
I mean moving float values from the integer reg set over to
the FPU so they can be worked on, then moving them back.
Floats also have to be unpacked and repacked when returned.
A separate FP reg set can store values in a form that allows
direct calculation, and allows the register bank to be positioned
near the FPU and to have its own data paths and controls.
Alternatively, an implementation might have separate integer and float
register banks and a register renamer that tracks which bank the value was in.
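
One way to picture the bank-tracking renamer is the toy model below. It is only a sketch under stated assumptions: the class `BankTrackingRenamer`, the policy of migrating a value on a cross-bank read, and the default of untouched registers living in the integer bank are all invented for illustration.

```python
class BankTrackingRenamer:
    """Toy model: remember which physical bank holds each architectural register."""

    def __init__(self):
        self.bank = {}              # architectural register -> "int" or "fp"
        self.cross_bank_moves = 0   # count of transfers between banks

    def write(self, reg, unit):
        """A result produced by `unit` ('int' or 'fp') lands in that unit's bank."""
        self.bank[reg] = unit

    def read(self, reg, unit):
        """Return True if the value is already in the consuming unit's bank."""
        current = self.bank.get(reg, "int")  # assumed default bank
        if current == unit:
            return True
        # Assumed policy: move the value to the consumer's bank and stay there.
        self.cross_bank_moves += 1
        self.bank[reg] = unit
        return False
```

With this policy, a chain of float operations on the same register pays for at most one transfer, since the value stays in the FP bank until an integer instruction reads it.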
> >- Small combined register set means spilling float values a lot
>
> This could be fixed by extending the architecture with additional FP
> registers, as was done in the 88000 architecture, and in the 386
> architecture (by adding SSE, SSE2).
>
> >- Small page size requires lots of TLB entries, which should be
> > fully assoc. for performance, which means big TLB chip real estate.
>
> Well, if VMS and Unix (and their applications) could be moved to an
> Alpha with 8KB pages, it should be easier to move them to a VAX with
> 8KB pages.
>
> Another option is page clustering.
Or a compatibility mode. It seemed to work well enough for
the PDP-11 to VAX switch.
Eric