Re: Representing futuristic English



whheydt@xxxxxxxxxxx (Wilson Heydt) writes:

> In article <MPG.1d7b9fe0c149560d98a4b9@xxxxxxxxxxxxxx>,
> Gerry Quinn <gerryq@xxxxxxxxxxxxxxxxxxx> wrote:
>>In article <ILwMn9.KH4@xxxxxxxxxxx>, whheydt@xxxxxxxxxxx says...
>>> In article <dso1h19dl6km5p3hlrgjdjoain89hgvkvc@xxxxxxx>,
>>> James A. Donald <jamesd@xxxxxxxxxxx> wrote:
>>> >Wilson Heydt
>>
>>> >> If and only if the concept of 8-bit (or less)
>>> >> character encoding is still generally extant.
>>> >
>>> >A glance at the binary image would cause the idea of 8
>>> >bit character encoding to at once occur to him
>>>
>>> If I showed you the raw dump from a file used by an IBM 1620,
>>> how would you decode it? It will consist of a string of decimal
>>> digits.
>>>
>>> With the substitution of hex for decimal, this is the exact
>>> problem you are claiming is so easy.
>>
>>If each pair of digits consistently represents a letter of the alphabet
>>in large parts of the document (analogous to the 8-bit case), it will
>>be easy to see that the data structure involves pairs of digits. A
>>frequency analysis will then allow discovery of the letters.
>
> Assuming the string of digits you have is text and not, say, a compiler.
> Until you have a reasonably clear idea just what you're looking at,
> such an analysis is going to be difficult.
>

Hal, I think if you go hang out with the sci.crypt crowd for a couple
of months, you'll be amazed at the ease of determining the encoding of
electronic records, even when they've been deliberately obscured.

The redundancy that is inherent in English text comes shining through
like the morning sun given even an almost casual frequency analysis.

Stop assuming the archeologist is a dumb-ass, and instead assume
they're just as smart as you. :-) It's 1000 years from now, so
assume they're as far ahead of you at this kind of analysis as you
are ahead of, oh, say, Newton.

Here's an alphabet, here's some electronic text in that alphabet,
encoding unknown. Here's a few thousand words of the language.

It won't be very many hours before you've automated a statistical
analysis program that figures out how many bits per character they
used, will it?

Hmmm, 26 letters, caps, numbers, ... and it's digital, and they were
human.

It's a good bet that they used as few bits as they could ... couldn't
have been fewer than 5 bits or so, and why would they use more than 16?
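
The lower bound is plain arithmetic. A minimal sketch in Python; the
function name min_bits is made up for illustration, and it assumes a
fixed-width code with one distinct bit pattern per symbol:

```python
def min_bits(alphabet_size: int) -> int:
    """Smallest fixed width (in bits) that gives every symbol its own code."""
    # (n - 1).bit_length() is the number of bits needed to count from 0 to n - 1.
    return (alphabet_size - 1).bit_length()


letters_only = min_bits(26)                  # 26 letters
letters_caps_digits = min_bits(26 + 26 + 10)  # lower case, capitals, digits
```

Letters alone fit in 5 bits; adding capitals and digits pushes the
floor to 6, before any punctuation.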

Run a few hundred megabytes of the input through it, and see which bit
width reproduces the expected frequency of letter occurrence.
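
A rough sketch of that experiment in Python. It assumes the record has
already been read out as a string of '0' and '1' characters and that a
sample of the language is at hand. All the function names here are
invented for illustration, and comparing sorted frequency profiles is
only one simple way to score a candidate width, not the way anyone in
particular does it:

```python
from collections import Counter
from itertools import zip_longest


def rank_profile(symbols):
    """Relative frequencies, most common first; the symbol labels are discarded."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sorted((n / total for n in counts.values()), reverse=True)


def split_bits(bits, width):
    """Cut a '0'/'1' string into consecutive width-bit symbols, dropping any tail."""
    usable = len(bits) - len(bits) % width
    return [bits[i:i + width] for i in range(0, usable, width)]


def profile_distance(p, q):
    """Sum of absolute differences between two rank profiles (shorter one zero-padded)."""
    return sum(abs(a - b) for a, b in zip_longest(p, q, fillvalue=0.0))


def guess_symbol_width(bits, reference_text, widths=range(5, 17)):
    """Return the candidate width whose symbol statistics best match the reference."""
    expected = rank_profile(reference_text)
    return min(
        widths,
        key=lambda w: profile_distance(rank_profile(split_bits(bits, w)), expected),
    )
```

Because only the shape of the frequency curve is compared, the code
never needs to know which bit pattern stands for which letter. Wrong
widths cut across character boundaries or lump characters together,
and either way the curve comes out a different shape from the one the
language sample predicts.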

If this sort of thing wasn't dead-easy, our friends in the
three-letter-agencies wouldn't spend the massive amounts of time/money
they do creating and vetting encryption systems.

--
#include <disclaimer.std> /* I don't speak for IBM ... */
/* Heck, I don't even speak for myself */
/* Don't believe me ? Ask my wife :-) */
Richard D. Latham lathamr@xxxxxxxxxx