Re: Instruction Cache Optimisations
- From: "Paul A. Clayton" <paaronclayton@xxxxxxxxxxxxx>
- Date: Fri, 16 Nov 2007 16:20:21 -0800 (PST)
On Nov 15, 12:27 pm, Tim Frink <plfr...@xxxxxxxx> wrote:
[snip]
http://www.dspdesignline.com/howto/processors_fpga/202804122
I'm little bit confused about the effectiveness of the memory
layout achieved by the described algorithm. Unfortunately,
there's no detailed reasoning about why the found chains
(functions placed continuously in memory) improve instruction
cache performance.
The significant point is that blocks are prefetched sequentially
after an accessed block.
"Modern caches have mechanisms which prefetch memory that
is in the vicinity of the currently executing block. During the
execution
of a function, code placed after it in main memory is brought into
the instruction cache." is the key statement.
The suggested chains are:
1
2~10~21
3~8~12~18
4~5~9~13~14~20
6~7~11~19
15
16~17
Linear execution sequence (Number refers to function
number using call ordering; letter indicates what
part of that function, 'z' indicates the end/tail
of the function):
1a 2a 3a 4a 4z 3b 5a 6a 6z 5b 7a 7z 5z 3z 2b 8a 9a 9z
8z 2z 1b 10a 11a 12a 13a 13z 12b 14a 15a 16a 16z 15b 17a 17z
15z 14z 12z 11b 18a 18z 11z 10b 19a 20a 20z 19z 10z 1c
21a 21z 1z
If the ICache prefetcher fetches forward from a
demand access, at the end of a function (the 'z'
part) one would want to place the next function
start (the 'a' part); so after 2z one would place
10a-z (so when 2z is accessed, the next blocks--
which will be the first of 10--will be prefetched
and only 1b can interfere [1b is the only code
section accessed after 2z and before 10a]).
For the chain you question (3~8~12~18) 2z, 1b,
10a and 11a ARE fetched after 8z and before 12a;
but 10a is already placed at the end of 2 (only
1b intervening, maximizing the probability of
useful prefetching). Then why not place 11a after
8z (only 2z, 1b, and 10a intervening)? The
algorithm has already greedily placed 11a after
7z. (I would tend to think one would just give
up on prefetching the start of a function after
7z. With 10 code sections intervening between
7z and 11a, one might well be better off
placing 11a after 8z [3 intervening code
sections] and perhaps 12a after 9z [moderately
likely for the prefetched 12a to remain after
5 code sections] and give up on 13a. Placing
20a after 14z [7 intervening code sections]
also seems a bit of a stretch.) The algorithm
is rather simple, but given its simplicity it
does seem to do a fair job of procedure
placement.
(Of course, if all the functions had single call
points and are always called [as implied by the
simple example] one might just inline all the
functions and get ideal prefetching. :-)
Paul A. Clayton
just a technophile
reachable as 'paaronclayton'
at "embarqmail.com"
.
- Follow-Ups:
- Re: Instruction Cache Optimisations
- From: Paul A. Clayton
- Re: Instruction Cache Optimisations
- References:
- Instruction Cache Optimisations
- From: Tim Frink
- Instruction Cache Optimisations
- Prev by Date: Re: "Server" processors for numbercrunching?
- Next by Date: Re: Instruction Cache Optimisations
- Previous by thread: Instruction Cache Optimisations
- Next by thread: Re: Instruction Cache Optimisations
- Index(es):
Relevant Pages
|
Loading