Re: SMS compression ?



On Sep 7, 2:09 am, CryptoManceR <cryptoman...@xxxxxxxxx> wrote:
Hello all !

I live in Russia, and here SMS messages are limited to only 70 chars
for the Russian language, so I'm thinking about sending them compressed
(translit applets are plentiful here, so why not go further ?).

Wait, you mean a 70-character limit?

I'm a complete rookie in data compression, so please give me advice and
criticism on the methods I'm thinking of.

1. Alphabet reduction + Huffman or arithmetic.

Preprocessing:
Reduce alphabet to {SHIFT}{CAPS}{SPACE}{33 RU letters}{.,?!}
Capital letters, Latin letters, digits and special chars are accessed by
SHIFT and CAPS followed by selector codes corresponding to the most
probable letters. Latin letters are "written under" Russian ones not by
similarity of sound or shape, but by probability order.

That works.
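
For what it's worth, here is a rough Python sketch of how that preprocessing step might look. Everything specific in it is my own assumption, not part of the proposal: I read SHIFT as "next letter is a capital" and CAPS as "next selector is a Latin letter", the two frequency orderings are only approximate, and digits and special chars are left out to keep it short.

```python
# Sketch of the 40-symbol reduced alphabet with SHIFT/CAPS escapes.
# Assumptions (not from the original post): SHIFT = next letter is capital,
# CAPS = next selector names a Latin letter; frequency orders are approximate.
RU_BY_FREQ = "оеаинтсрвлкмдпуяыьгзбчйхжшюцщэфъё"   # 33 Russian letters
EN_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"           # 26 Latin letters
SHIFT, CAPS = "\x01", "\x02"                        # placeholder escape symbols

ALPHABET = [SHIFT, CAPS, " "] + list(RU_BY_FREQ) + list(".,?!")
INDEX = {sym: i for i, sym in enumerate(ALPHABET)}

# Latin letters ride on Russian selectors by probability rank, not by sound.
LATIN_TO_RU = dict(zip(EN_BY_FREQ, RU_BY_FREQ))
RU_TO_LATIN = {ru: lat for lat, ru in LATIN_TO_RU.items()}


def reduce_text(text):
    """Map text to a list of indices into ALPHABET (values 0..39)."""
    out = []
    for ch in text:
        low = ch.lower()
        if ch != low:                 # capital letter: prefix with SHIFT
            out.append(INDEX[SHIFT])
        if low in LATIN_TO_RU:        # Latin letter: prefix with CAPS
            out.append(INDEX[CAPS])
            low = LATIN_TO_RU[low]
        out.append(INDEX[low])        # KeyError for unsupported chars (digits etc.)
    return out


def expand(codes):
    """Inverse of reduce_text."""
    out, upper, latin = [], False, False
    for code in codes:
        sym = ALPHABET[code]
        if sym == SHIFT:
            upper = True
        elif sym == CAPS:
            latin = True
        else:
            if latin:
                sym = RU_TO_LATIN[sym]
            if upper:
                sym = sym.upper()
            out.append(sym)
            upper = latin = False
    return "".join(out)
```

With 40 symbols each code fits in 6 bits even before any entropy coding, and the escapes make rare characters cost two or three symbols instead of widening the alphabet.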

Statistic collection:
Non-adaptive, made once on a big computer. A 1024-cell table including all
symbols and the most probable bigrams and trigrams. For RU only; Latin
letters are stowaways.

Ok.
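
A rough sketch of how such a fixed table could be collected offline, in Python. The function name `build_table` and the rule "keep every single symbol, then fill the remaining cells with the most frequent bigrams and trigrams" are my guesses at what is meant, not something stated in the proposal.

```python
from collections import Counter


def build_table(corpus, cells=1024):
    """Count symbols, bigrams and trigrams in a training text, keep `cells` entries.

    Assumption: every single symbol always gets a cell so any message can be
    coded; whatever room is left goes to the most frequent bigrams/trigrams.
    """
    singles = Counter(corpus)
    ngrams = Counter()
    for n in (2, 3):
        for i in range(len(corpus) - n + 1):
            ngrams[corpus[i:i + n]] += 1

    table = dict(singles)
    room = max(cells - len(table), 0)
    for gram, count in ngrams.most_common(room):
        table[gram] = count
    return table
```

The counts would then be frozen and shipped with both the sender's and the receiver's applet, so nothing about the model has to travel inside the 70 characters.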

Compression:
Either Huffman or arithmetic (I haven't settled on one yet), based on
the pre-collected data.

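Arithmetic coding usually compresses better than Huffman, because it can
spend a fraction of a bit on a likely symbol while Huffman has to spend at
least one whole bit per symbol. Both are driven by the same probabilities;
arithmetic just takes longer to encode because of the extra arithmetic per
symbol.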
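
To put a number on the Huffman vs arithmetic choice, here is a small Python sketch that builds Huffman code lengths from a fixed probability table and compares the average code length with the Shannon entropy, which is roughly what a good arithmetic coder approaches. The two-symbol table in the usage note is a made-up toy, not real Russian statistics.

```python
import heapq
import math


def huffman_lengths(probs):
    """Return {symbol: code length in bits} for a static probability table."""
    # Heap entries are (probability, unique tiebreak, {symbol: depth so far}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def average_length(probs, lengths):
    """Expected bits per symbol under the given code lengths."""
    return sum(p * lengths[sym] for sym, p in probs.items())


def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)
```

For a skewed toy table such as `{"a": 0.9, "b": 0.1}`, Huffman still has to spend 1 bit per symbol while the entropy is a bit under 0.47 bits, so the gap grows as the distribution gets more lopsided. With a 40-symbol alphabet of natural-language letters the gap is usually much smaller.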

2. Bitwise compression with adjusted codepage.

Compress first all 8-th bits, then 7-th and so on. To make the stream
more uniform, adjust the codepage:

This is where you lose me.

8-th: 0-Russian or 1-Latin, space, digits and signs are repeated since
we need no pseudo-gfx
7-th: 0 - letter or space, 1 - non-letter, digit or sign
6-th: 0 - small letter or digit/punctuation, 1 - big letter or rare
sign
5-th: 0 - 16 most probable letters, 1 - 16 least probable
4-1th: same as 5 in smaller scale of subdivisions

So we have a stream of mostly zeroes, with long runs of them at the
beginning.
It may be compressed with an adaptive, statistically "precharged"
arithmetic coder with 5 "symbols": 00, 01, 10, 11, END

Ooh I see now.
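
As far as I understand it, the bit-plane idea comes down to something like the Python sketch below: all the 8th bits go first, then all the 7th bits, and so on, and the resulting stream is cut into 2-bit symbols plus END for the coder. The function names are mine, and the codes in the usage note are arbitrary small numbers standing in for the adjusted codepage, which is not spelled out here.

```python
def to_bit_planes(codes):
    """Emit the top bit of every byte, then the next bit of every byte, etc."""
    return [(c >> bit) & 1 for bit in range(7, -1, -1) for c in codes]


def from_bit_planes(bits, n):
    """Rebuild n byte codes from the 8 * n bits produced by to_bit_planes."""
    codes = [0] * n
    for plane in range(8):
        for i in range(n):
            codes[i] = (codes[i] << 1) | bits[plane * n + i]
    return codes


def pair_symbols(bits):
    """Group the bit stream into the symbols 00, 01, 10, 11, followed by END."""
    pairs = [f"{bits[i]}{bits[i + 1]}" for i in range(0, len(bits), 2)]
    return pairs + ["END"]
```

If the codepage puts the most probable letters at the smallest codes, for example `[0, 1, 3]`, the first six planes come out as all zeroes and only the last two carry any ones, which is the "mostly zeroes at the beginning" stream the coder is meant to exploit.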

What would you say about these methods ? Which is preferable ? Are there
any hidden flaws or possibilities for improvement ?

No way, you've already spent an unnecessary amount of time and planning
on perfecting this. I think this is about as economical as you could
get. What's the point, though? What are you sending over SMS that you
need so much bandwidth for? I don't know my cell's limit, but I think
I've sent msgs over 70 chars long a couple of times; my average is maybe
10-20.

And

Yeah, I like (-d?) straddled checkerboards, one-time pads and bit
twiddling.
But what about some real advice ?

Sorry but I'm a "rookie" as well. I thought I'd leave others to give
you better advice, I just liked the name Cryptomancer so thought I'd
give my 2 cents.
