Cache size restrictions obsolete for unrolling?



I've made the experience that for some DSPs it's better to unroll
loops as much as possible without taking care of the instruction
cache. In my experiments, I've written a program that fits exactly
into the cache (i.e. the code of the loops was slightly smaller than
the I-cache capacity). For this test program I've measured the
execution time using a cycle-true simulator. Next, I increased the
unrolling factor stepwise resulting in the unrolled loop that exceeded
the cache capacity.

I expected to get a performance decrease, i.e. the stronger the loop
was unrolled the more capacity cache misses should arise leading to a
decrease in the execution time. However, my measurement showed the
opposite. For a loop with 100 iterations, the increase of the
unrolling factor (with one exception) continuously reduced the program
run time. How is this possible?

My feeling is that modern processors have sophisticated features (like
prefetching, fast memories ...) that heavily help to hide/avoid
instruction cache misses, thus they rarely occur even if a frequently
executed loop exceeds the cache capacity. In contract, aggressive
unrolling reduced the expensive execution of branches (especially
mispredicted) in the loop header and produced more optimization
potential. In total, this pays off even at the cost of some more cache
misses. So my first conclusion is that the commonly found restriction
of unrolling factors to avoid too large loops not fitting in the cache
is obsolete and does not hold for modern processors and compilers.

Do you agree or are my assumptions wrong in some point?

Thank you for your opinion.

Cheers,
Stephan

.



Relevant Pages

  • Re: Spill code and unrolling
    ... which are candidates for loop unrolling contain any spill code. ... compilers will naturally unroll loops when execution time is important ...
    (comp.compilers)
  • Re: Help in optimizing branches [OT]
    ... You could try folding some constants out of the loop but it's possible ... unrolling. ... Finally you need to look at cache. ... why many image algorithms work with "tiles". ...
    (comp.lang.cpp)
  • Re: Spill code and unrolling
    ... I've a question about loop unrolling. ... which are candidates for loop unrolling contain any spill code. ... a loop may help the instruction scheduler (at compilation or execution ...
    (comp.compilers)
  • Re: hot path optimizations in uma_zalloc() & uma_zfree()
    ... > I suppose the reason of first gain lies in increasing of cpu cache hits. ... > separate buckets. ... I ran ministat against your tests with 1000 sockets loop and there isn't a lot ...
    (freebsd-hackers)
  • Re: hot path optimizations in uma_zalloc() & uma_zfree()
    ... > I suppose the reason of first gain lies in increasing of cpu cache hits. ... > separate buckets. ... I ran ministat against your tests with 1000 sockets loop and there isn't a lot ...
    (freebsd-hackers)