Re: Instruction Cache Optimisations



On Nov 15, 12:27 pm, Tim Frink <plfr...@xxxxxxxx> wrote:
[snip]
http://www.dspdesignline.com/howto/processors_fpga/202804122

I'm little bit confused about the effectiveness of the memory
layout achieved by the described algorithm. Unfortunately,
there's no detailed reasoning about why the found chains
(functions placed continuously in memory) improve instruction
cache performance.

The significant point is that blocks are prefetched sequentially
after an accessed block.

"Modern caches have mechanisms which prefetch memory that
is in the vicinity of the currently executing block. During the
execution
of a function, code placed after it in main memory is brought into
the instruction cache." is the key statement.

The suggested chains are:
1
2~10~21
3~8~12~18
4~5~9~13~14~20
6~7~11~19
15
16~17

Linear execution sequence (Number refers to function
number using call ordering; letter indicates what
part of that function, 'z' indicates the end/tail
of the function):

1a 2a 3a 4a 4z 3b 5a 6a 6z 5b 7a 7z 5z 3z 2b 8a 9a 9z

8z 2z 1b 10a 11a 12a 13a 13z 12b 14a 15a 16a 16z 15b 17a 17z

15z 14z 12z 11b 18a 18z 11z 10b 19a 20a 20z 19z 10z 1c

21a 21z 1z


If the ICache prefetcher fetches forward from a
demand access, at the end of a function (the 'z'
part) one would want to place the next function
start (the 'a' part); so after 2z one would place
10a-z (so when 2z is accessed, the next blocks--
which will be the first of 10--will be prefetched
and only 1b can interfere [1b is the only code
section accessed after 2z and before 10a]).

For the chain you question (3~8~12~18) 2z, 1b,
10a and 11a ARE fetched after 8z and before 12a;
but 10a is already placed at the end of 2 (only
1b intervening, maximizing the probability of
useful prefetching). Then why not place 11a after
8z (only 2z, 1b, and 10a intervening)? The
algorithm has already greedily placed 11a after
7z. (I would tend to think one would just give
up on prefetching the start of a function after
7z. With 10 code sections intervening between
7z and 11a, one might well be better off
placing 11a after 8z [3 intervening code
sections] and perhaps 12a after 9z [moderately
likely for the prefetched 12a to remain after
5 code sections] and give up on 13a. Placing
20a after 14z [7 intervening code sections]
also seems a bit of a stretch.) The algorithm
is rather simple, but given its simplicity it
does seem to do a fair job of procedure
placement.

(Of course, if all the functions had single call
points and are always called [as implied by the
simple example] one might just inline all the
functions and get ideal prefetching. :-)


Paul A. Clayton
just a technophile
reachable as 'paaronclayton'
at "embarqmail.com"
.



Relevant Pages

  • Re: cciss update for 2.4.24-pre1, #3
    ... We found a bug in the ASIC used on the 64xx Smart Array ... If this occurs on a memory boundary the machine will crash. ... This patch turns on prefetch for x86 based systems only. ... Has the prefetching been tested for long? ...
    (Linux-Kernel)
  • Re: [PATCH 2/2] cciss: disable dma prefetch for P600
    ... falling off into one the holes on IPF and AMD. ... It doesn't happen on Proliant because the last 4kB of memory is ... prefetching was walking off the end of real mmeory and into the AGP region ... There is a bug in the DMA engine that that may result in prefetching ...
    (Linux-Kernel)
  • Re: Inner loop and out of cache question
    ... > can run concurrent with long memory fetches. ... Prefetching can more than double performance as it effectively turns ... including bus protocol limitations, limitations on numbers of ...
    (comp.lang.asm.x86)
  • [PATCH][4/4] mm: Implement swap prefetching tweaks
    ... and prefetch when free memory is greater ... Check that only background priority tasks are running before prefetching. ... +struct prefetch_stats { ... static void clear_current_prefetch_free ...
    (Linux-Kernel)

Loading