Re: A truely BIJECTIVE BWT is here!



On Dec 15, 1:46 am, Klaus Stengel <Klaus.Sten...@xxxxxxxxxx> wrote:

Since most like Klaus explanation I thought I would fix the errors or
add a
few points to clarafy or obscurafy its sort of how you look at it. I
have
spent over 3 Hours trying to get this right. I don't see how people
document
its a very hard job for me. When I worked at China Lake I was lucky
in that many people hated coming up with code. I use to do it for
many they just had to do the pretty paper work BS. How you wrote
this so fast is beyond me so again THANK YOU

I left the old BWT stuff since its the way you compare BWTS

Belive it or not when left Chain Lake I did proof reading what a joke
I see what I want in text and seldom write what I want.
English sucks


There are no useful comments in it, but at least the variable and
function names make sense, at least sometimes... Well, here's a textual
description of David's algorithm with some ASCII art for visualization
;-) It took me some time to get the details right, but I think it's
accurate. I'll start with a short illustration of the regular BWT
algorithm and then try to explain what David's code does (as I
understand it).

Forward BWT

0123456789 (character position in string, counting from 0)

FOOBAR2000 <-- orignal input string, 10 characters
0FOOBAR200 rotate 10 times by 1 char to get array of strings
00FOOBAR20
000FOOBAR2
2000FOOBAR
R2000FOOBA
AR2000FOOB
BAR2000FOO
OBAR2000FO
OOBAR2000F

.. sort ...

,------- BWT output column
.L.
000FOOBAR|2|
00FOOBAR2|0|
0FOOBAR20|0|
2000FOOBA|R|
AR2000FOO|B|
BAR2000FO|O|
FOOBAR200|0|<-- line no. 6 containing original string
OBAR2000F|O| (counting from zero)
OOBAR2000|F|
R2000FOOB|A|
`-´

Last column and index with original string is output of BWT
--> 200RBO0OFA, 6

All rotations of the input (in this example "FOOBAR2000", "0FOOBAR200",
"00FOOBAR20", ...) actually map to the same BWT'ed string. Because of
this property it's necessary to save the original index or some other
kind of "offset" in order to distinguish the possible rotations.

----------------------------------------

Reverse BWT:

0 2 A B F O R <-- list of different characters in BWT string
------------- (sorted alphabetically)
3 1 1 1 1 2 1 Count[]: how often each character appears in the string
0 3 4 5 6 7 9 RunningTotal[]: position where given character starts
in the sorted list

,---- Each character Count[] times in alphabetical order
| ,---- Column with BWT string
| | ,---- RunningTotal[] of character in BWT string
| | | ,---- Number of times the character from the BWT
| | | | string already occured
| | | | ,---- Sum gives line no. containing previous
| | | | | character in original string
| | | | |
Line V________V V V V Reverse mapping of sum (=forward links)
0 0 2 3 + 0 = 3 1
1 0 0 0 + 0 = 0 2
2 0 0 0 + 1 = 1 6
3 2 R 9 + 0 = 9 0
4 A B 5 + 0 = 5 9
5 B O 7 + 0 = 7 4
6 FOOBAR2000 0 + 2 = 2 8 <-- Start decoding here by
7 O O 7 + 1 = 8 5 following line numbers and
8 O F 6 + 0 = 6 7 noting the characters from
9 R A 4 + 0 = 4 3 the 2nd column

This results in the following sequence with the letter in the first
column noted below:

6 -> 8 -> 7 -> 5 -> 4 -> 9 -> 3 -> 0 -> 1 -> 2
F O O B A R 2 0 0 0

Note that following the links in the last column forms one cyclic
sequence, spanning all characters from the input. If we continued past
the end (line 2 in this example) we'd get the same output again and
again. This is always the case if the input came from a BWT. Applying
the reverse BWT to some random or corrupted input usually leads to some
shorter cycle that doesn't include all characters.


***** my comments start here when you talk aout BWTS ****

David's basic idea is now to modify the algorithm to create an output
containing multiple cycles by design. With multiple cycles available it
is possible to split the input in a way that decoding from index zero
for each part always yields the correct result. The information formerly
stored in the extra index value is now encoded in the size and order of
the cycles.

To make this work we have to determine where to split up the input
first. In David's program the function part_cycle() is responsible for
this. When applying the BWT, line/index zero always contains the
rotation with the lowest value. In case of "FOOBAR2000" this is
"000FOOBAR2" (see forward BWT example from the beginning). As our

actually the reason for two seperate compare functions is here.
And I think he missed it. If you use the same compare you don't
get a good BTWS and that may be way people missed this
bijective form of BWT

I compare for first partiation with this verstion of code
000|000|000|000...
2000|2000|2000|...
AR2000|AR2000|AR2000|....
BAR2000|BAR2000|....
...

original input string starts with "FOOBAR..." we need to split this into
two cycles: "FOOBAR2" and "000". Now if we apply the BWT to the
remaining "FOOBAR2" again, we notice that the index zero contains

actually we don't do a full here BWT which implies a sort we just
Search
for the lightest string for part left like this

2|2|2|2|2|2|2|...
AR2|AR2|AR2|AR2|AR2|...
FOOBARS|FOOBAR2|...


"2FOOBAR". This is still not what we want, so we create another part

again note that "2FOOBAR" is not even a string tested during the
partitioning of string

containing only "2". Now we try again with "FOOBAR" and get "ARFOOB"...
When we continue to chop off the unwanted prefixes and note them as we
go along, we'll finally get the remainder "FOO" which is still looking
the same in index zero of the BWT. The complete partitioning sequence is
as follows:

FOOBAR2000 -> 000 FOOBAR2
FOOBAR2 -> 2 FOOBAR
FOOBAR -> AR FOOB
FOOB -> B FOO
FOO -> FOO <-- equal, we're done

FOO|B|AR|2|000
`-´ U `´ U `-´


Notice the string of zeros. Actually version one of the code treats
each
Zero as seprate cycle and it actually sorts first than current version
that
matchs comments. I thought I changed code to match comments but
it seems to slow the execusion for certain files the only change in
code is to "return 0;" instead of "return I<j?-1:1;" in last line fo
bounded_compareS(). But let me stress the output of sort
does not change

the line would be FOO|B|AR|2|0|0|0

but in either case they sort to same output

So we end up with five cycles. Now we need to create a transformed
string containing these cycles in the given order. This works
essentially the same way as the ordinary BWT, but instead of sorting all
possible rotations of one input string, we use each cycle in each
possible rotated position as the input for the sorting function:

,-- gap indicates cycle len ,--- result column
V .L.
FOO FOOFOO... 00|0|
OFO OFOOFO... 00|0|
OOF OOFOOF... 00|0|
B BBBBBBBB... |2|
AR ARARARA... A|R|
RA RARARAR... --> sort --> |B|
2 22222222... FO|O|
000 000000... OF|O|
000 000000... OO|F|
000 000000... R|A|
`-´

In case we have to compare strings of different lengths while sorting,
we need to keep in mind that we're actually comparing infinite cyclic
character sequences. That's why we need to extend the shorter one by
repeating it until a difference occurs or both were repeated at least
once. For example if we want to compare "AXYA" with "AXY" we have to
expand "AXY" by repeating it: "AXYAXY". Now "AXYA" is again shorter
and equals "AXYAXY" in all available characters, so we have to repeat
it too: "AXYAAXYA". Now we see that "AXYAAXYA" is smaller than
"AXYAXY", as the fifth character "A" < "X". In the sorted list "AXYA"
must be positioned before "AXY".


For this paticular for of code the "return 0;" repeated cycles like
|0|0|0| mapped to one cycle |000|
EVERY CYCLE UNIQUE so if you write the compare function
(and I did not yet) to take advantage of this fact you will only
compare till a DIFFERENCE they can never be the same.

Now if the compare is from same cycle they can be the same as BWT
so thats exactly the same problem that regular wrap aroung BWT solves

The resulting transform is the last character of each cycle in the
sorted list, almost like the ordinary BWT. Saving some index is no
longer required.

--> 0002RBOOFA

----------------------------------------

Reverse transformation

The reverse transformation works almost like the reverse BWT. We only
need to introduce an additional column later to track which characters
were already visited so we don't enter a cycle twice.

0 2 A B F O R <-- list of different characters in BWT string
------------- (sorted alphabetically)
3 1 1 1 1 2 1 Count[]: how often each character appears in the string
0 3 4 5 6 7 9 RunningTotal[]: position where given character starts
in the sorted list

,---- Each character Count[] times in alphabetical order
| ,---- Column with transformed string
| | ,---- RunningTotal[] of character in BWT string
| | | ,---- Number of times the character from the BWT
| | | | string already occured
| | | | ,---- Sum gives line no. containing previous
| | | | | character in original string
| | | | |
Line V V V V V Reverse mapping of sum (=forward links)
0 0 0 0 + 0 = 0 0
1 0 0 0 + 1 = 1 1
2 0 0 0 + 2 = 2 2
3 2 2 3 + 0 = 3 3
4 A R 9 + 0 = 9 9
5 B B 5 + 0 = 5 5
6 F O 7 + 0 = 7 8
7 O O 7 + 1 = 8 6
8 O F 6 + 0 = 6 7
9 R A 4 + 0 = 4 4

We start decoding in line zero. Output is generated backwards, as the
string resulting from the forward transformation always starts with the
last character of the original input string. Output is taken from the
column containing the alphabetically sorted characters (2nd column in
the table above). If a cycle consists of more than one character the
first character is skipped and will be sent to output at the end of the
cycle to undo a rotation introduced in the forward transformation. Any
lines processed are subsequently marked as "visited".

Line Character list Forward link Visited
0 0 0 1 <--
1 0 1 0
2 0 2 0
3 2 3 0
4 A 9 0
... ... ... ...

Output so far: _________0

The forward link points to line 0 again, but this is already marked as
"visited". So for the next step we need to find the next line not visited
yet. This is Line 1 in our case, where we have the same situation again.

Line Character list Forward link Visited
0 0 0 1
1 0 1 1 <--
2 0 2 0
... ... ... ...

Output so far: ________00

This continues always the same way until we reach line 4:

Line Character list Forward link Visited
... ... ... ...
3 2 3 1
4 A 9 1 <--
5 B 5 0
... ... ... ...
9 R 4 0

Output so far: ______2000

The forward link now points to line 9, which is not visited yet. As this
indicates a cycle with more than one character we delay the output of
the "A" in line 4 until we jump back. The next step in line 9:

Line Character list Forward link Visited
... ... ... ...
4 A 9 1
5 B 5 0
... ... ... ...
9 R 4 1 <--

Output so far: ____AR2000

First we output the "R" on line 9 and mark the entry as visited.
The forward link now points back to line 4, which has the visited flag
already set. This indicates we're at the end of the cycle and need to
send the delayed start character "A" to the output, too. As line 4 was
already visited we continue in the next line where visited is zero,
in our case line 5:

Line Character list Forward link Visited
... ... ... ...
5 B 5 1 <--
6 F 8 0
... ... ... ...

Output so far: ___BAR2000

Line 5 links to itself again, so we continue with line 6...

Line Character list Forward link Visited
... ... ... ...
6 F 8 1 <--
7 O 6 0
8 O 7 0

Output so far: ___BAR2000

In line 6 we enter a cycle with more than one character again, so we
delay the output of the "F" again and follow the link to line 8.
There we output an "O" and follow the link to line 7. Line 7 links back
to the start of the cycle, so in addition to the "O" on that line we
have to output the delayed "F" from the beginning of the cycle.

Line Character list Forward link Visited
... ... ... ...
6 F 8 1
7 O 6 1 <--
8 O 7 1

Output so far: FOOBAR2000

As there are no more lines left we could "visit", we're done.

-----------------------------------

In my opinion the practical merits of David's approach are questionable.
While I like the general idea of having a variation of the BWT without
an extra number in the output, the additional effort required just isn't
worth it. The algorithm is twice as complicated as the plain BWT and it
might (if at all) save only about 4 bytes for each block of input
(usually some megabytes).

Compression speed is also likely to be worse. In order to determine what
prefixes have to be cut off in the splitting phase, one needs to search
the smallest string repeatedly in the remaining part of the input
buffer. I haven't made any benchmarks but judging from the structure of
the algorithm, this "preprocessing" can become quite time intensive,
especially with large block sizes and some decreasing sequence in the
beginning of the input blocks (e.g. "987654321").

Bye,
Klaus

The reverse is pretty much the same as regular BWT you do each cycle
and that codes each partation. I feel the reverse is more obvious and
I thingk he has it

So think you Klaus everybody likes your write up as I only added
a few points but I do like I hope others can see some of my added
points

I like
0002RBOOFA with its repeat of 000 and OO and no index better than
BWT
200RBO0OFA,6 with its extra "index" and only the 00 repeat great
example

Take Care

David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
.



Relevant Pages

  • Re: A truely BIJECTIVE BWT is here!
    ... "000FOOBAR2" (see forward BWT example from the beginning). ... With "B" as the only character left we'd be done. ... An entirely different way of tackling this problem is just start with some empty buffer and add characters from the input until it breaks, i.e. the cycle would no longer end up on the first line of the BWT table. ... In order to find out if we're adding a such a string we need to know how many characters we added recently match the beginning of the string, thus we need a position marker. ...
    (comp.compression)
  • Re: A truely BIJECTIVE BWT is here!
    ... If you look at the BWT ... rotation of a given input to the same output string. ... character sequences. ... Compression speed is also likely to be worse. ...
    (comp.compression)
  • Re: A truely BIJECTIVE BWT is here!
    ... If you look at the BWT ... rotation of a given input to the same output string. ... shorter cycle that doesn't include all characters. ... character sequences. ...
    (comp.compression)
  • Re: replacing substrings in strings
    ... This seems a bit more flexible (you can just use a string for the search ... I would cycle through character by character (IndexOf is ... better to just cycle through the characters once). ... >>// The search string. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: A truely BIJECTIVE BWT is here!
    ... EVERY CYCLE UNIQUE so if you write the compare function ... no partition has the first and last character the same ... 0002RBOOFA with its repeat of 000 and OO and no index better than ...
    (comp.compression)