Re: defered word options
- From: Alex McDonald <blog@xxxxxxxxxxx>
- Date: Fri, 23 Nov 2007 14:47:33 -0800 (PST)
On Nov 23, 7:48 pm, foxchip <f...@xxxxxxxxxxxxxxxxxxx> wrote:
Let me use a set of logical bitblit words as an example where we
have a half dozen words characterized by outer nested loops
that manage the screen and memory pointers and any compression
or decompress of the bitstream and differ from one another only
in the content of their inner loops and given a situation in which
we are required to make this set of words significantly smaller
to make the application work. The words were originally
written in high level and were then coded in assembler and
optimized individually to their minimal size and maximum
speed. When we make it smaller we expect some penalty.
We would expect some performance penalty for making a
significant reduction in size but the set of routines can easily
be made about six times smaller with an introduction of
relatively little overhead using something in the family of
defered or vectored execution. The question then becomes
which of the techniques to make the code the smaller will be
the fastest?
To reduce program size code can be used for more than one
purpose and the same large nested outer loop structure can
be used once and the inner loop can have more than one
function. The ways one can have multiple different functions
in the inner loop fit into two catagories: those that make
the decision on which code to execute inside of the loop and
those that make the decision outside of the loop.
Forth programmers are aware of the concept that different time
phases have different values. Doing something inside of a loop
that iterates more than once is more expensive in time than
doing it once outside the loop. It is more expensive by a factor
of however many times above one the loop is taken and the
code is executed.
Methods that make the decision of which code to execute
inside of the loop have their overhead multiplied by the number
of loops. These involve such techniques as switch statements,
case statements, nested conditional code, jump tables,
traditional vectored code, and self-modifying and \
non-self-modifying implementations of defer. Of these
many introduce significant overhead inside of the loop.
Nested decsions trees may introduce the greatest overhead.
Traditional Forth code might look like outside the loop:
VARIABLE 'FUNCTION
' MYFUCTION 'FUNCTION !
and inside the loop
'FUNCTION @ PUSH ;
Portable ANSI Forth cannot rely on the existence of an
actual return stack and has added a layer of abstraction:
'FUNCTION @ EXECUTE
(some systems provide an @EXECUTE function similar to
the operation of the machineforth sequence
IF PUSH ; THEN DROP ;
DEFERed words are not part of the standard but are
common practice. Similar to the traditional style
'FUNCTION @EXECUTE
in their implementation they are often used with syntax
' MYFUCTION IS THEFUNCTION
I have argued for many years that the most efficient
implementation of defered words is simply a jump opcode.
It is of course an example of using self-modifying code
to implement DEFER.
In first making the logical bitblit example about six times
smaller the large amount of code is not replicated six
times. And the code in the inner loop can have
multiple functions.
I think it is logical to assume that one can mathematically
prove that of the methods offered above, or by anyone else,
that there is one above with the lowest overhead inside of
the loop, the fastest of the smallest code, and that there
is only one other thing that can possibly be faster.
Loading an address to fetch from it a data value and
then passing that data value to a routine that then
executes that address will require more memory
cycles than a simple branch.
I further assert that a self-modifying defer that uses
only a branch will be the second to smallest and
fastest way to implement this type of code reuse.
None of the other methods listed above, or any
that I can imagine, will be smaller or have less
overhead except for one. And this is something
that should be provable.
The only thing smaller than a jump opcode to the
vectored routine is no jump opcode, simply inline
the vectored routine into the inner loop. This
does introduce an overhead outside the loop,
and in the case of the bitblit implemenation it
addess a couple fetches and store outside
the loop. This is significantly lower overhead
than even a jump opcode multiplied by a
hundred thousand loop cycles.
The first objective was smaller, smallest if possible.
The second objective was fastest of the smallest.
It should be pretty obvious that code sequences
that add an overhead of multiple fetches and stores
to the vectoring operation should be slower than a
single jump.
It should also be pretty obvious that no overhead is less
than the overhead of one jump opcode. The math involved
is that zero is less than one and one is less than more
than one. (no deceptive rounding to zero allowed.)
In the example code size was reduced by about 6x and
the fastest of the ways that could be done introduces
zero overhead inside the inner loop and only a couple
fetches, a store, and sometimes a call outside the loop.
1st place goes to self-modifying native.
2nd place goes to self-modifying defer *1
3rd place to non-self-modifying methings.
1st place has no overhead inside the loop. Self-modifying
code offers that, no other method does. No other method
can achieve the zero overhead inside the loop.
2nd place, self-modifying defer as jump adds the overhead
of a single jump in size and time inside th inner loop.
*1 A tie for 2nd goes to non-selfmodifying code using a
single jump on register contents opcode. It would modify
data like defer or 'vector words not self-modify 'code'.
3rd place goes to all other non-self-modifying
methods that simply cannot produce less
than 0 to tie for first or even 1 to tie for second.
A shortened version of the above logic is that where
self-modifying code is used it will always be the
smallest and fastest option. The only thing that
is smaller than one, that is zero in size and, and
that is faster than a single jump, that introduces
no overhead is inlined native code in the inner loop.
Self-modifying code can manipulate that inlined
native code to be the fastest of the smallest.
Best Wishes,
Jeff Fox
I am no mathematician, so I can't comment on your assertions that such
things are mathematically provable or not, although Marcel Hendrix's
refutation here http://groups.google.com/group/comp.lang.forth/msg/019e265770edcc9e?dmode=source
(that self-modifying code cannot be smallest) seems compelling.
However, in the real world run-time inlining by modifying code has
serious penalties on modern weakly ordered memory architectures, and
the balance between the penalty of management of self-modifying code
and the speed-up of subsequent loops may be of such an order to make
it not worthwhile. For instance, before executing modified code in
certain x86 implementations, the instruction stream must be preceded
by a jump, and the first execution of the code can be hundreds of
times more expensive. Multi-processor systems can suffer unacceptable
delays while shared memory and other processors' cache are made
coherent after such modification. A good discussion of memory ordering
in modern chips can be found here; http://www.linuxjournal.com/article/8211
and http://www.linuxjournal.com/article/8212.
As a side note, self-modifying code is inherently not thread safe
without taking precautions such as locking; this may add yet more
overhead.
Many operating systems do not permit code areas to be modified, nor
data areas to be executed as code. Unusually, Synthesis (http://
www.cs.columbia.edu/~library/TR-repository/reports/reports-1992/cucs-039-92.ps.gz)
was an operating system that heavily used self modifying code. How
reliable such a system would be, and how easy to debug, is another
matter.
You have also omitted a class of architectures that provide
instructions that can modify and execute out of line instructions on
the fly; an example is the EX operation in the IBMs S\360 and later
architectures.
The latter part of your analysis I find confusing, and the conclusion
is wrong as a generalisation. Fastest depends on the loop count, the
hardware it runs on, support for modifiable instructions as in the S
\360, and (where used) the OS. Self-modifying code is not a free
lunch.
--
Regards
Alex McDonald
.
- Follow-Ups:
- Re: defered word options
- From: Elizabeth D Rather
- Re: defered word options
- From: foxchip
- Re: defered word options
- References:
- defered word options
- From: foxchip
- defered word options
- Prev by Date: defered word options
- Next by Date: Re: RPN maps to basic mathematic expressions nicely
- Previous by thread: defered word options
- Next by thread: Re: defered word options
- Index(es):
Relevant Pages
|