Re: Scheduling on Cortex
- From: "Wilco Dijkstra" <Wilco.removethisDijkstra@xxxxxxxxxxxx>
- Date: Mon, 11 May 2009 12:29:22 +0100
"Ben Avison" <usenetspambin@xxxxxxxxxxxx> wrote in message news:op.utoklmhvb3tjgs@xxxxxxxxxxxxxx
I've been trying to get my head around how to schedule instructions,
including all the intricacies of dual issuing, but there just doesn't seem
to be enough information in the Cortex-A8 TRM to work out cycle counts even
for relatively simple real-world ARM code.
TRMs are written by hardware people for hardware people who already have
a very detailed understanding of the core. It is telling, for example, that the
pipeline is not described at all: there is not a single pipeline diagram, nor an
explanation of what E1, E2 etc. mean. So in order to understand the TRM at all,
you'll need to search for presentations and papers on Cortex-A8, e.g.:
http://www.arm.com/pdfs/TigerWhitepaperFinal.pdf
There are various old presentations that you can find on the internet that go
into more detail.
Some examples would be:
1)
LDREQ r0,[r0]
LSLNE r0,r0,#1
You'd hope that this would issue one instruction to each pipeline, because
they're mutually exclusive, but the data output hazard means that they can't
be dual issued. However, nothing is said about whether you still get a
further 2-cycle stall, supposedly between the output to Rt of the LDR and
the input to Rm of the MOV. In other words, does the decision to stall take
the condition codes into account?
Instructions are statically scheduled on the assumption that they will be
executed; the condition codes are not taken into account, so you do indeed
get a 2-cycle stall.
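To make the rule concrete, here is a minimal Python sketch of a hazard check that ignores condition codes, as the answer above describes. The `Insn` record and the `static_hazards` function are hypothetical names invented for illustration; this is a toy model of the stated rule, not a description of the actual Cortex-A8 issue logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical instruction record; field names are illustrative only."""
    text: str
    cond: str          # condition code, e.g. "EQ", "NE", "AL"
    dests: frozenset   # registers written
    srcs: frozenset    # registers read


def static_hazards(first, second):
    """Return the set of hazards between two adjacent instructions.

    The cond fields are deliberately ignored: the model assumes, as the
    reply states, that every instruction is treated as if it executes.
    """
    found = set()
    if first.dests & second.srcs:
        found.add("RAW")   # second reads a register that first writes
    if first.dests & second.dests:
        found.add("WAW")   # both write the same register (output hazard)
    return found


ldreq = Insn("LDREQ r0,[r0]", "EQ", frozenset({"r0"}), frozenset({"r0"}))
lslne = Insn("LSLNE r0,r0,#1", "NE", frozenset({"r0"}), frozenset({"r0"}))
print(sorted(static_hazards(ldreq, lslne)))
```

For the pair in example 1 the model reports both a read-after-write and an output hazard, even though EQ and NE can never both be true.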
2)
LDM r0,{r1-r2} ; loads r1 in 1st cycle, r2 in 2nd cycle
MUL r3,rn,r3
Do the number of stalls before the MUL change depending upon whether rn is
r1 or r2?
Yes. The first cycle is assumed to transfer only the first register, the 2nd
cycle the next 2, and so on, even if the base address is 64-bit aligned. For
STM it reads the first 2 registers in the first cycle but may transfer only
1 register.
In your example the LDM takes 2 cycles to execute, with r1 available in E1
in the 4th cycle, and r2 in E1 in the 5th cycle. So the multiply stalls for 1 cycle
if rn = r1, and 2 if rn = r2.
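The LDM timing above can be written down as a small Python sketch. The function names are hypothetical, and the stall formula is an assumption that simply fits the two cases quoted (1 cycle for r1, 2 cycles for r2); extending it to longer register lists is an extrapolation, not something stated in the TRM or in this thread.

```python
def ldm_transfer_cycle(position):
    """Cycle of the LDM (1-based) in which a register is assumed to load.

    position is the 0-based index of the register within the LDM list.
    Rule from the reply: cycle 1 transfers the first register only, and
    each later cycle transfers the next two.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    return 1 + (position + 1) // 2


def mul_stall_after_ldm(position):
    """Stall cycles for a MUL issued right after the LDM.

    Assumption: the stall equals the transfer cycle, which matches the two
    cases given in the reply and is not verified beyond them.
    """
    return ldm_transfer_cycle(position)


for pos, reg in enumerate(["r1", "r2"]):
    print(reg, ldm_transfer_cycle(pos), mul_stall_after_ldm(pos))
```

If the model holds, placing the MUL operand first in the register list, or moving independent instructions between the LDM and the MUL, would hide the stall.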
3)
MOVS r0,r1 ; sets or clears Z flag
MOVEQ r2,r3
Do these get dual-issued? The TRM doesn't say they don't.
The flags are updated in E2, and conditionally executed ALU instructions
must resolve in E2, so these cannot be dual issued. However, Cortex-A8
can dual-issue 2 flag-setting instructions and merge their flags in E2;
this is essential for Thumb-2 code.
4)
There are a number of examples in the Cortex-A8 TRM where an unconditional
branch instruction is shown as executing in pipeline 0 while the following
instruction is dual-issued in pipeline 1. Is this correct? In other words,
is there some sort of signal from pipeline 0 to pipeline 1 to abort the
instruction being decoded? (Something similar would permit dual-issue in
example 3.)
A branch can be dual-issued with either the previous instruction (even if it is
a compare) or the next predicted instruction.
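The flag and branch pairing rules from answers 3 and 4 can be summarised as a small predicate. This is only a sketch of the rules as stated in this post: the function name and parameters are hypothetical, and register hazards and all other pairing restrictions are deliberately left out.

```python
def flags_allow_pairing(first_sets_flags, second_cond, second_is_branch=False):
    """Return True if the flag rules alone permit dual-issuing a pair.

    first_sets_flags: the older instruction updates the flags (e.g. MOVS, CMP)
    second_cond:      condition code of the younger instruction ("AL" = always)
    second_is_branch: the younger instruction is a branch
    """
    if second_is_branch:
        # A branch can pair with the previous instruction, even a compare.
        return True
    if first_sets_flags and second_cond != "AL":
        # A conditional ALU instruction needs the flags in E2, which is
        # when the older instruction produces them.
        return False
    # This includes two flag-setting instructions, whose flags are merged.
    return True


print(flags_allow_pairing(True, "EQ"))                         # MOVS ; MOVEQ
print(flags_allow_pairing(True, "AL"))                         # ADDS ; SUBS
print(flags_allow_pairing(True, "EQ", second_is_branch=True))  # CMP ; BEQ
```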
I'm sure there are many other cases I've not thought of yet. Perhaps I'm
missing some other document that describes scheduling in more detail?
Not that I know of. One can only guess what ARM has in mind when not
giving software people essential information they need to get the best
out of their cores. What we need is a detailed software optimization manual.
Wilco