Re: Average length of words



"glen herrmannsfeldt" <gah@xxxxxxxxxxxxxxxx> wrote in message
news:h75j0l$s5r$2@xxxxxxxxxxxxxxxxxxx
Mok-Kong Shen <mok-kong.shen@xxxxxxxxxxx> wrote:
(snip)

< In such texts each word is mostly followed by a space. So, assuming
< the above value, the effective average space occupied by a word
< on the medium is 6 bytes with ASCII coding. By how much could this
< figure generally be reduced by a good text compression scheme?

The storage format used by the WYLBUR text editing system
on some IBM systems compresses out blanks. The compressed
line length is known. The line contents consist of bytes in which
four bits give the number of blanks until the next non-blank,
and four bits give the number of non-blank characters
until the next such descriptor byte or the end of the line.
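
The descriptor-byte scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not WYLBUR's actual on-disk format: the function names are made up, the blank count is assumed to sit in the high four bits, runs longer than 15 are assumed to be split across several descriptors, and ASCII stands in for whatever character code the IBM systems used.

```python
def compress_line(line: str) -> bytes:
    """Encode a line as descriptor bytes, each followed by its non-blank run.

    Assumed layout (hypothetical): high nibble = blanks before the run,
    low nibble = number of non-blank characters that follow the descriptor.
    """
    out = bytearray()
    i, n = 0, len(line)
    while i < n:
        # Count leading blanks, capped at 15 so the count fits in four bits.
        blanks = 0
        while i < n and line[i] == " " and blanks < 15:
            blanks += 1
            i += 1
        # Collect the following non-blank run, also capped at 15 characters.
        start = i
        while i < n and line[i] != " " and i - start < 15:
            i += 1
        run = line[start:i]
        out.append((blanks << 4) | len(run))
        out.extend(run.encode("ascii"))  # assumption: ASCII-only input
    return bytes(out)


def decompress_line(data: bytes) -> str:
    """Invert compress_line by expanding each descriptor byte."""
    parts = []
    i = 0
    while i < len(data):
        blanks, count = data[i] >> 4, data[i] & 0x0F
        i += 1
        parts.append(" " * blanks)
        parts.append(data[i:i + count].decode("ascii"))
        i += count
    return "".join(parts)


# An 80-column card image padded with blanks, as in the fixed-length format.
card = "      X = Y + 1".ljust(80)
packed = compress_line(card)
```

With this layout, decompress_line(packed) gives back the original 80-character card, and the packed form is much shorter because each run of up to 15 blanks costs only part of one byte.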

Compared to the IBM standard format, 80-byte fixed-length
records padded with blanks, it is very good. This was used
starting in the late 1960s, when CPU time was somewhat more
important than today.

It also has the advantage that in many cases string searches
can be done on the compressed data.

-- glen

http://www.mang.canterbury.ac.nz/writing_guide/writing/flesch.shtml

========

http://able2know.org/topic/114565-1
Average English word length is 5.10 letters. For comparison, Korean
averages 3.05 letters and the German average is 6.26 letters.

Average English sentence length is 14.3 words. Much of this, and also word
length, depends on the subject and audience.

======

http://answers.yahoo.com/question/index?qid=20080526032554AAB28AF
What is the average length of a word in the English language?

Best Answer - Chosen by Voters
Five is a good rule-of-thumb (and the old standard for calculating how many
words one has counting toward a total for an assignment...).

A more precise calculation given at the following link is 5.1
http://blogamundo.net/lab/wordlengths/

One qualification -- this type of calculation is typically based on a chunk
of written text, and includes all those words that are used MULTIPLE times
within a text. Since these most repeated words tend to be shorter, esp.
common things like articles (a, the), pronouns (I, me, he, she, it...), and
conjunctions (and, but, or), the average is DECREASED by these repetitions.

If you took exactly the same text and listed the different words it uses
without repetition (i.e., "a" is only counted once, no matter how many times
the text uses it), the average would be significantly higher.
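
The difference between the two ways of counting can be sketched as below. This is a minimal illustration, not the method used by any of the pages quoted here: the function name, the regular expression used to pick out words, and the sample sentence are all assumptions made for the example.

```python
import re


def average_lengths(text: str) -> tuple:
    """Return (average over every occurrence, average over distinct words)."""
    # Assumption: a "word" is a run of letters or apostrophes, case ignored.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    # Every occurrence counts, so repeated short words pull the average down.
    per_occurrence = sum(len(w) for w in words) / len(words)
    # Each distinct word counts once, however often it is used.
    distinct = set(words)
    per_distinct = sum(len(w) for w in distinct) / len(distinct)
    return per_occurrence, per_distinct


sample = "the cat and the dog and the bird"
per_occurrence, per_distinct = average_lengths(sample)
```

In the sample sentence the repeated "the" and "and" are counted several times in the first figure but only once in the second, so the second average comes out higher.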

Also, note that this is "normal" language prose. Something written in
technical language (e.g., a scientific paper) would include many longer
words, and consequently have a higher average word length.

==========

http://blogamundo.net/lab/wordlengths/
Languages by Average word length
This table shows a listing of languages by average word length, as
calculated from the texts at the UDHR in Unicode.

Caveats:

1. My definition of "word" consists of splitting on space. (Hence screwed
   up counts for Amharic, Thai, etc., which don't use spaces.)
2. I believe there are some incomplete texts in the UDHR collection I
   used, not sure.

Rank -- Length -- Language
#122 -- 5.10   -- English

