Bytecode source
- From: Jonah Thomas <jethomas5@xxxxxxxxx>
- Date: Sun, 14 Jun 2009 07:28:42 -0400
I am looking at a simple project and I am looking for advice about
specification.
This is not a really good Forth approach. If you start out solving a
simple problem that you have, you don't need any help with
specification. When your problem is solved you know it. I want to make
something that might have more general uses, and this could be
considered a mistake in itself. Acknowledging that, I proceed.
Bytecode has various uses. You can compile bytecode and get Forth code
that is particularly compact. It will probably run slower because it
needs an "inner interpreter" that converts bytecode to actual commands.
Possibly the reduced size of the code might make up for that but more
likely not. It will be small but probably slow.
Source code converted to bytecode will be smaller than text source. It
will not be compressed as well as a good compression algorithm could do
it. But it will probably be fast to compress and to decompress, and the
code to do the compressing and decompressing will itself be small. A
compression algorithn that only works on Forth source code and not on
anything else. It can use special knowledge about Forth source code to
discard things that a general-purpose compression routine would have to
compress.
The simplest way to do this compression is as follows:
1. Prepare a numbered list of the Forth words you will compress, at
most 256 of them. Replace each Forth word in your source code with its
number. Replace each linebreak with a linebreak token.
2. Have one token for strings. Replace each string that is not a
Forth word with the string token followed by the length followed by the
string. Some strings (all comments?) can be discarded.
To decompress,
1. Read the same numbered list of Forth words.
2. Replace each number with its word. Replace each linebreak token
with a linebreak.
3. When you find a string token, discard the token itself, and use
the length to choose the string.
So we reconstitute the original file, minus comments and whitespace, and
then can load it at will. The tokeniser handles nothing except the list
of tokens. It doesn't care whether a word is immediate or compile-only
unless it is a parsing word. The tokeniser must know whether the word
parses now or later so it can tell whether to save a string for it to
parse. Apart from that it doesn't matter -- the word can compile or
execute etc and the interpreter will handle it.
I don't like this. In many cases the original file must temporarily go
into mass storage, so the maximum mass storage use is the file plus the
compressed file. And the mass storage is likely to be slow.
Well, you can decompress it a line at a time and then parse that line.
Better. But you still have to track the linebreaks.
Here's where I am now. Compress:
1. Read the list of predefined Forth words, noting whether they parse
and if they parse whether they define new words.
2. Replace each word with its token. If it parses, parse the string
and save it. If it defines a new word, make a token for the new word.
3. Have a token for strings.
Decompress:
1. Read the same list of predefined words.
2. For each token, put the name into a buffer. If the word parses,
put its parsed string into the buffer.
3. If it is a defining word, use the name of the word it creates to
make a new token.
4. EVALUATE that buffer.
This is a little simpler and a little more limited than the simplest way
above. ;) All of the parsing words and defining words must be in the
list of predefined Forth words; it doesn't give a way to make new ones
out of bytecode. Since you never compile a parsing word or defining
word, you don't need to track whether you're compiling or executing.
Apart from that it lets you compress existing Forth source without
changing it at all -- you just provide the list of primitives.
If you want more than 256 words that's OK, at any time you can throw
away whichever bytecodes you like and re-use them. The only limitation
is that the tokeniser and the token compiler must change their bytecode
at the same place in the code.
So you can use the same token compiler for as many different bytecode
systems as you want, just provide different lists of primitives and then
let the system add to the list as it creates new words.
With this limitation the approach is very simple. It adds no portability
problems among Forths -- if the original source code will tokenise and
if the original compiles correctly on another Forth, then the tokenised
version will compile on that Forth too. It's somewhat compressed -- all
comments, whitespace and linebreaks can be removed, each name is
presented once plus one byte per occurrence, but numbers and strings are
represented with a token and a length -- two bytes larger instead of one
for the whitespace. Reasonably fast to compile. What would be better?
With some added complications the limitation could be removed. Even
without that it could work, if you accept modifying the source code to
mark new defining words and new parsing words.
It would be possible for the token compiler to store xt's instead of
names, and compilation would be faster. Provide names only for the
primitives and for routines that the user calls. But that adds more
complication too. We must track which parsing words deal with names, and
those must be made to store tokens instead of strings. We must deal with
the clash between words that parse bytecode source and words that later
parse text. The same command can't easily do both.
I will code the simplest approach since it doesn't look hard. But when I
look at things that ought to be useful instead of solving my problem
today, I can't be sure whether it's actually worth doing. Or whether
something else would be much more useful. Any ideas?
Incidentally, this general problem seems to me to tie into the Forth
library problem. We have talked about a standard format for Forth
libraries, and then the format specification never seems to gel and the
libraries don't get written. When you write a library function you are
doing something you think will be generally useful instead of solving
today's problem. And when you create a standard library format you're
doing the same thing at a further remove. You don't find out how good
the library format is until there are generally-useful libraries that
use it ... and so it doesn't happen. In other languages people do this
on faith with varying results, but Forthers tend to be more
down-to-earth and practical and so it doesn't get done at all.
.
- Follow-Ups:
- Re: Bytecode source
- From: Stephen Pelc
- Re: Bytecode source
- From: BruceMcF
- Re: Bytecode source
- From: P.M.Lawrence
- Re: Bytecode source
- Prev by Date: Re: WHILE
- Next by Date: Re: WHILE
- Previous by thread: ANN: 4tH version 3.5d Release 2 has been released
- Next by thread: Re: Bytecode source
- Index(es):
Relevant Pages
|