Re: A way to speed up level 1 caches



In article <1172508745.608179.72450@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
"David Kanter" <dkanter@xxxxxxxxx> wrote:

(2) A constant complaint about OS code is that it runs at random times
and pulls in random instructions, data, and TLB entries that probably
weren't in the L1 caches and that evict good stuff that was there.

That's true, but I'd be curious how much I$ and D$ the OS code really
displaces. If it kicks a bunch of stuff out of the L1 cache into the L2
cache, then an OOO machine can still get it back pretty quickly. If it
pushes all your application code and data out to memory, then you're
screwed.
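
To put rough numbers on the L2-versus-memory point, here is a
back-of-envelope sketch. The latencies (15 cycles for L2, 300 for
memory), the 64 displaced lines, and the helper name refill_cycles are
all illustrative assumptions, not measurements of any real machine.

```python
def refill_cycles(lines_displaced, latency_cycles, overlap=1):
    """Rough cycles to re-fetch displaced lines.

    `overlap` is how many misses the core keeps in flight at once
    (a crude stand-in for what an OOO machine can hide).
    """
    batches = -(-lines_displaced // overlap)  # ceiling division
    return batches * latency_cycles


# Hypothetical figures, chosen only to show the order of magnitude.
L2_LATENCY = 15
MEMORY_LATENCY = 300
LINES = 64

from_l2 = refill_cycles(LINES, L2_LATENCY, overlap=4)
from_memory = refill_cycles(LINES, MEMORY_LATENCY, overlap=4)
print(from_l2, from_memory)
```

Under these assumed numbers the refill from memory costs 20 times the
refill from L2, simply because that is the assumed latency ratio; the
real question is which of the two cases the OS actually causes.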

The complaints I have heard in this regard tend to involve interrupt
code, and (on processors with SW TLB support) TLB miss code.
It is a fair point that these may be the sorts of complaints that were
both accurate and reasonable in 1999 with an off-chip L2 of maybe
256KiB, and like many other computing complaints, they're now
anachronistic and we should be worrying about something else.

Each L1 cache becomes two identical caches.
Which one is used is gated by whether the system is in user or system
mode. This is a simple extra bit line, so it doesn't hurt your cycle
time, unlike doubling associativity to double the cache size.
Now we get the system material segregated in its world, the user
material segregated in its world, and the two aren't stumbling over each
other.
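
The proposal can be sketched as a toy simulation. This is a minimal
model under loud assumptions: direct-mapped caches of four 64-byte
lines, tags only, and a made-up trace in which an interrupt handler
touches four system lines that alias the user's lines. The class and
function names are hypothetical, and it says nothing about die area or
wire delay.

```python
class DirectMappedCache:
    """Direct-mapped cache model that tracks only tags, not data."""

    def __init__(self, num_lines, line_bytes=64):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines

    def access(self, addr):
        """Return True on a hit; on a miss, install the line, return False."""
        block = addr // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False


class SharedL1:
    """Conventional L1: user and system accesses share one cache."""

    def __init__(self, num_lines):
        self.cache = DirectMappedCache(num_lines)

    def access(self, addr, mode):
        return self.cache.access(addr)


class ModeSplitL1:
    """Two identical caches; the mode bit selects which one is used."""

    def __init__(self, num_lines):
        self.banks = {
            "user": DirectMappedCache(num_lines),
            "system": DirectMappedCache(num_lines),
        }

    def access(self, addr, mode):
        return self.banks[mode].access(addr)


def make_trace():
    """User touches 4 lines, an interrupt touches 4 aliasing lines, repeat."""
    user = [(i * 64, "user") for i in range(4)]
    system = [(0x10000 + i * 64, "system") for i in range(4)]
    return user + system + user


def count_user_misses(l1, trace):
    """Run the whole trace through l1 and count misses seen in user mode."""
    misses = 0
    for addr, mode in trace:
        hit = l1.access(addr, mode)
        if not hit and mode == "user":
            misses += 1
    return misses


shared_misses = count_user_misses(SharedL1(4), make_trace())
split_misses = count_user_misses(ModeSplitL1(4), make_trace())
print(shared_misses, split_misses)
```

In this contrived trace the shared cache takes user misses on both
passes, while the split cache only takes the cold misses on the first
pass. Note that the split version holds twice the total capacity, which
is exactly the area cost raised in the reply.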

Hrmmm. Your idea would address cycle time; what it makes worse,
however, is the scarcity of die area within the 'core' of an MPU. When I
say 'core', I mean the computational logic and the L1 caches. This
area keeps on shrinking with finer process geometries, and it's in
pretty high demand. You need to offer a compelling performance
improvement over spending a lot of die area on other things. Die
space and transistors are plentiful, but not adjacent to the ALUs;
that's still prime real estate.

Thanks for a real world answer on this. Another poster has pointed out a
different aspect of the problem, saying that the performance issue is
now not associativity but wires, and that what I propose will use more
wire or longer wires. I don't (yet) have a good feel for quite what this
answer really means, since I guess I'm still living in the late 90s and
the pre-90nm era of design, but some reading should clarify this.

DK