Re: Avoiding Branch Misprediction in Virtual Machine (VM) Portably




moron451-compiler1@xxxxxxxxx writes:
Hello,

I've read several web sites and papers about coding virtual machines
using a threaded code technique (Anton Ertl's papers for example). He
recommends the use of GNU C's labels-as-values as a somewhat portable
way to do threaded coding. However, I don't want to rely on a non-
portable method like that. I want my VM to be 100% ANSI C.

The classical approach would be to use macros and define them
appropriately for switch dispatch (for ANSI C) or for threaded code.

That seems to only leave me with the "Giant Switch" method.
However... I recently thought about a different method that still
uses the switch statement, but instead of one giant switch, it uses
one for each instruction (see below). I was wondering if anyone can
tell me if this is a good idea or not.

Looks like it is a good idea. I implemented a variant of my
threaded-code microbenchmark using your technique
<http://www.complang.tuwien.ac.at/forth/threading/repl-switch.c>, and
here are the results on a 2.2 GHz Athlon 64 X2 4400+ (times in
seconds):

direct single replic.
threaded switch switch
0.38 1.11 0.51 i386 gcc-2.95
0.73 1.11 0.50 i386 gcc-3.3 -fno-crossjumping
0.75 1.10 0.54 AMD64 gcc-3.3 -fno-crossjumping
0.73 1.10 0.55 AMD64 gcc-4.1

As you can see, thanks to performance regressions in recent gccs your
replicated switch technique even beats direct threaded code on these
compilers, although it is still quite a bit slower than well-compiled
(i.e., gcc-2.95) direct threaded code. I had the impression that the
performance regressions were mostly fixed in gcc-4.x, but apparently
that is not generally the case.

The disadvantage is that you have n^2 jumps, which make compilation
slower, and lower the number of VM instructions you can use (although
probably not dramatically). Also, you have n replicas of the jump
tables, which may increase the data cache miss rate.

This is ironical: gcc's data flow analysis modeled an indirect goto
just as n jumps to all n possible labels (so with n labels and n gotos
it modeled n^2 jumps). They then changed the implementation of
indirect gotos to have one shared indirect jump, and every indirect
goto would perform a jump to that shared indirect jump; this defeats
the branch prediction by the BTB and causes the performance regression
seen above. Now you introduce another technique that makes all these
n^2 control flow edges explicit, and this might become the method of
choice for implementing virtual machines. So we are back where we
started, only in a slightly worse place: compilation is just as slow
(probably a little slower) as before, and execution is a somewhat
slower than before. Well, maybe the gcc guys will work hard to undo
the replication and make the replicated switch technique just as slow
as the ordinary switch dispatch.

I'm a bit worried that I haven't
found anything advocating this in "the literature" however, so maybe I
am missing an important problem with it. Any comments?

AFAIK you are the first to invent this technique. Whom should I
attribute it to? moron451-compiler1@xxxxxxxxx does not sound so
great:-)

Our esteemed moderator writes:
[Ignoring the detail that there's no "label" keyword in C, I don't see
the point. A reasonable compiler will recognize collapse the common code
sequences and jumps to jumps so you'll end up with the same object code
you would have from the giant switch.

As the numbers above show, this is not the case for various gcc
versions, and hopefully that will stay that way. The capabilities of
compilers are frequently overestimated.

- anton
--
M. Anton Ertl
anton@xxxxxxxxxxxxxxxxxxxxxxxxxx
http://www.complang.tuwien.ac.at/anton/
.