Re: A truely BIJECTIVE BWT is here!
- From: Klaus Stengel <Klaus.Stengel@xxxxxxxxxx>
- Date: Sat, 15 Dec 2007 09:46:19 +0100
Phil Carmody wrote:
Mark Nelson <snorkelman@xxxxxxxxx> writes:On Dec 13, 8:43 am, Phil Carmody <thefatphil_demun...@xxxxxxxxxxx>
wrote:
I suspect this is impossible. In order to achieve anI don't think this is quite correct Phil. If you look at the BWT
improvement in compression, BWT needs to map less-
compressible vectors onto more-compressible vectors.
However, there are fewer of the latter.
alone, it is truly a bijective function. It is a one-to-one mapping of
the set of input messages to the set of output messages, for all input
and output.
That surprises me. Let's assume your output is <3,bnnsaaa>, then are
you saying that there is no possible inverse image for <0,bnnsaaa>,
<1,bnnsaaa>, <2,bnnsaaa>, <4,bnnsaaa>, <5,bnnsaaa>, and <6,bnnsaaa>?
For your particular examples inverse images exist and are they
character-wise rotations of the word "bananas" (e.g. "sbanana",
"asbanan", ...). The exact meaning depends on how you define the index
number in your implementation.
But there are indeed values that look valid but don't have an
corresponding input value for BWT. For example <3,snnbaaa> has no
possible inverse image. This stems from the fact that BWT maps each
rotation of a given input to the same output string. So there have to be
some strings that BWT can't produce and that's also why you need to
store that index/rotation number along with the output string if you
want to get your original input back.
The output vectors that we get are more compressible ONLY because the
input data has a specific type of correlation - the chances that
symbol X appears after a given sequence of symbols, {a, b, c, ... } is
larger than would be suggested by mere chance. And of course, most
interesting files have this property. Files that aren't interesting
won't have this property, but they will still be transformed by BWT -
into equally uncompressible output vectors.
The reason that I am suspicious of David's claim is just that I don't
see how the transform can be reversible without additional
information.
The index information is stored by extending the number of possible
output strings. Instead of mapping each possible rotation of an input to
the same output (like BWT without index number would do), David's
algorithm maps different rotations to different output strings.
I haven't even looked at David's code yet but I was hoping he could
give a words-only description of what he did to achieve his claim. But
I'll give David a break, he says his attempts at exposition are rarely
successful, and that's not likely to change overnight. What worries me
is my own personal conviction that there is a bijection between clear
writing in spoken language and clear writing in code.
One I share.
There are no useful comments in it, but at least the variable and
function names make sense, at least sometimes... Well, here's a textual
description of David's algorithm with some ASCII art for visualization
;-) It took me some time to get the details right, but I think it's
accurate. I'll start with a short illustration of the regular BWT
algorithm and then try to explain what David's code does (as I
understand it).
Forward BWT
0123456789 (character position in string, counting from 0)
FOOBAR2000 <-- orignal input string, 10 characters
0FOOBAR200 rotate 10 times by 1 char to get array of strings
00FOOBAR20
000FOOBAR2
2000FOOBAR
R2000FOOBA
AR2000FOOB
BAR2000FOO
OBAR2000FO
OOBAR2000F
... sort ...
,------- BWT output column
.L.
000FOOBAR|2|
00FOOBAR2|0|
0FOOBAR20|0|
2000FOOBA|R|
AR2000FOO|B|
BAR2000FO|O|
FOOBAR200|0|<-- line no. 6 containing original string
OBAR2000F|O| (counting from zero)
OOBAR2000|F|
R2000FOOB|A|
`-´
Last column and index with original string is output of BWT
--> 200RBO0OFA, 6
All rotations of the input (in this example "FOOBAR2000", "0FOOBAR200",
"00FOOBAR20", ...) actually map to the same BWT'ed string. Because of
this property it's necessary to save the original index or some other
kind of "offset" in order to distinguish the possible rotations.
----------------------------------------
Reverse BWT:
0 2 A B F O R <-- list of different characters in BWT string
------------- (sorted alphabetically)
3 1 1 1 1 2 1 Count[]: how often each character appears in the string
0 3 4 5 6 7 9 RunningTotal[]: position where given character starts
in the sorted list
,---- Each character Count[] times in alphabetical order
| ,---- Column with BWT string
| | ,---- RunningTotal[] of character in BWT string
| | | ,---- Number of times the character from the BWT
| | | | string already occured
| | | | ,---- Sum gives line no. containing previous
| | | | | character in original string
| | | | |
Line V________V V V V Reverse mapping of sum (=forward links)
0 0 2 3 + 0 = 3 1
1 0 0 0 + 0 = 0 2
2 0 0 0 + 1 = 1 6
3 2 R 9 + 0 = 9 0
4 A B 5 + 0 = 5 9
5 B O 7 + 0 = 7 4
6 FOOBAR2000 0 + 2 = 2 8 <-- Start decoding here by
7 O O 7 + 1 = 8 5 following line numbers and
8 O F 6 + 0 = 6 7 noting the characters from
9 R A 4 + 0 = 4 3 the 2nd column
This results in the following sequence with the letter in the first
column noted below:
6 -> 8 -> 7 -> 5 -> 4 -> 9 -> 3 -> 0 -> 1 -> 2
F O O B A R 2 0 0 0
Note that following the links in the last column forms one cyclic
sequence, spanning all characters from the input. If we continued past
the end (line 2 in this example) we'd get the same output again and
again. This is always the case if the input came from a BWT. Applying
the reverse BWT to some random or corrupted input usually leads to some
shorter cycle that doesn't include all characters.
David's basic idea is now to modify the algorithm to create an output
containing multiple cycles by design. With multiple cycles available it
is possible to split the input in a way that decoding from index zero
for each part always yields the correct result. The information formerly
stored in the extra index value is now encoded in the size and order of
the cycles.
To make this work we have to determine where to split up the input
first. In David's program the function part_cycle() is responsible for
this. When applying the BWT, line/index zero always contains the
rotation with the lowest value. In case of "FOOBAR2000" this is
"000FOOBAR2" (see forward BWT example from the beginning). As our
original input string starts with "FOOBAR..." we need to split this into
two cycles: "FOOBAR2" and "000". Now if we apply the BWT to the
remaining "FOOBAR2" again, we notice that the index zero contains
"2FOOBAR". This is still not what we want, so we create another part
containing only "2". Now we try again with "FOOBAR" and get "ARFOOB"...
When we continue to chop off the unwanted prefixes and note them as we
go along, we'll finally get the remainder "FOO" which is still looking
the same in index zero of the BWT. The complete partitioning sequence is
as follows:
FOOBAR2000 -> 000 FOOBAR2
FOOBAR2 -> 2 FOOBAR
FOOBAR -> AR FOOB
FOOB -> B FOO
FOO -> FOO <-- equal, we're done
FOO|B|AR|2|000
`-´ U `´ U `-´
So we end up with five cycles. Now we need to create a transformed
string containing these cycles in the given order. This works
essentially the same way as the ordinary BWT, but instead of sorting all
possible rotations of one input string, we use each cycle in each
possible rotated position as the input for the sorting function:
,-- gap indicates cycle len ,--- result column
V .L.
FOO FOOFOO... 00|0|
OFO OFOOFO... 00|0|
OOF OOFOOF... 00|0|
B BBBBBBBB... |2|
AR ARARARA... A|R|
RA RARARAR... --> sort --> |B|
2 22222222... FO|O|
000 000000... OF|O|
000 000000... OO|F|
000 000000... R|A|
`-´
In case we have to compare strings of different lengths while sorting,
we need to keep in mind that we're actually comparing infinite cyclic
character sequences. That's why we need to extend the shorter one by
repeating it until a difference occurs or both were repeated at least
once. For example if we want to compare "AXYA" with "AXY" we have to
expand "AXY" by repeating it: "AXYAXY". Now "AXYA" is again shorter
and equals "AXYAXY" in all available characters, so we have to repeat
it too: "AXYAAXYA". Now we see that "AXYAAXYA" is smaller than
"AXYAXY", as the fifth character "A" < "X". In the sorted list "AXYA"
must be positioned before "AXY".
The resulting transform is the last character of each cycle in the
sorted list, almost like the ordinary BWT. Saving some index is no
longer required.
--> 0002RBOOFA
----------------------------------------
Reverse transformation
The reverse transformation works almost like the reverse BWT. We only
need to introduce an additional column later to track which characters
were already visited so we don't enter a cycle twice.
0 2 A B F O R <-- list of different characters in BWT string
------------- (sorted alphabetically)
3 1 1 1 1 2 1 Count[]: how often each character appears in the string
0 3 4 5 6 7 9 RunningTotal[]: position where given character starts
in the sorted list
,---- Each character Count[] times in alphabetical order
| ,---- Column with transformed string
| | ,---- RunningTotal[] of character in BWT string
| | | ,---- Number of times the character from the BWT
| | | | string already occured
| | | | ,---- Sum gives line no. containing previous
| | | | | character in original string
| | | | |
Line V V V V V Reverse mapping of sum (=forward links)
0 0 0 0 + 0 = 0 0
1 0 0 0 + 1 = 1 1
2 0 0 0 + 2 = 2 2
3 2 2 3 + 0 = 3 3
4 A R 9 + 0 = 9 9
5 B B 5 + 0 = 5 5
6 F O 7 + 0 = 7 8
7 O O 7 + 1 = 8 6
8 O F 6 + 0 = 6 7
9 R A 4 + 0 = 4 4
We start decoding in line zero. Output is generated backwards, as the
string resulting from the forward transformation always starts with the
last character of the original input string. Output is taken from the
column containing the alphabetically sorted characters (2nd column in
the table above). If a cycle consists of more than one character the
first character is skipped and will be sent to output at the end of the
cycle to undo a rotation introduced in the forward transformation. Any
lines processed are subsequently marked as "visited".
Line Character list Forward link Visited
0 0 0 1 <--
1 0 1 0
2 0 2 0
3 2 3 0
4 A 9 0
... ... ... ...
Output so far: _________0
The forward link points to line 0 again, but this is already marked as
"visited". So for the next step we need to find the next line not visited
yet. This is Line 1 in our case, where we have the same situation again.
Line Character list Forward link Visited
0 0 0 1
1 0 1 1 <--
2 0 2 0
... ... ... ...
Output so far: ________00
This continues always the same way until we reach line 4:
Line Character list Forward link Visited
... ... ... ...
3 2 3 1
4 A 9 1 <--
5 B 5 0
... ... ... ...
9 R 4 0
Output so far: ______2000
The forward link now points to line 9, which is not visited yet. As this
indicates a cycle with more than one character we delay the output of
the "A" in line 4 until we jump back. The next step in line 9:
Line Character list Forward link Visited
... ... ... ...
4 A 9 1
5 B 5 0
... ... ... ...
9 R 4 1 <--
Output so far: ____AR2000
First we output the "R" on line 9 and mark the entry as visited.
The forward link now points back to line 4, which has the visited flag
already set. This indicates we're at the end of the cycle and need to
send the delayed start character "A" to the output, too. As line 4 was
already visited we continue in the next line where visited is zero,
in our case line 5:
Line Character list Forward link Visited
... ... ... ...
5 B 5 1 <--
6 F 8 0
... ... ... ...
Output so far: ___BAR2000
Line 5 links to itself again, so we continue with line 6...
Line Character list Forward link Visited
... ... ... ...
6 F 8 1 <--
7 O 6 0
8 O 7 0
Output so far: ___BAR2000
In line 6 we enter a cycle with more than one character again, so we
delay the output of the "F" again and follow the link to line 8.
There we output an "O" and follow the link to line 7. Line 7 links back
to the start of the cycle, so in addition to the "O" on that line we
have to output the delayed "F" from the beginning of the cycle.
Line Character list Forward link Visited
... ... ... ...
6 F 8 1
7 O 6 1 <--
8 O 7 1
Output so far: FOOBAR2000
As there are no more lines left we could "visit", we're done.
-----------------------------------
In my opinion the practical merits of David's approach are questionable.
While I like the general idea of having a variation of the BWT without
an extra number in the output, the additional effort required just isn't
worth it. The algorithm is twice as complicated as the plain BWT and it
might (if at all) save only about 4 bytes for each block of input
(usually some megabytes).
Compression speed is also likely to be worse. In order to determine what
prefixes have to be cut off in the splitting phase, one needs to search
the smallest string repeatedly in the remaining part of the input
buffer. I haven't made any benchmarks but judging from the structure of
the algorithm, this "preprocessing" can become quite time intensive,
especially with large block sizes and some decreasing sequence in the
beginning of the input blocks (e.g. "987654321").
Bye,
Klaus
.
- Follow-Ups:
- Re: A truely BIJECTIVE BWT is here!
- From: biject
- Re: A truely BIJECTIVE BWT is here!
- From: Phil Carmody
- Re: A truely BIJECTIVE BWT is here!
- From: Mark Nelson
- Re: A truely BIJECTIVE BWT is here!
- From: biject
- Re: A truely BIJECTIVE BWT is here!
- From: biject
- Re: A truely BIJECTIVE BWT is here!
- References:
- A truely BIJECTIVE BWT is here!
- From: biject
- Re: A truely BIJECTIVE BWT is here!
- From: Phil Carmody
- Re: A truely BIJECTIVE BWT is here!
- From: Mark Nelson
- Re: A truely BIJECTIVE BWT is here!
- From: Phil Carmody
- A truely BIJECTIVE BWT is here!
- Prev by Date: Re: A truely BIJECTIVE BWT is here!
- Next by Date: Re: A truely BIJECTIVE BWT is here!
- Previous by thread: Re: A truely BIJECTIVE BWT is here!
- Next by thread: Re: A truely BIJECTIVE BWT is here!
- Index(es):
Relevant Pages
|