Re: unicode in ruby



On 3/11/06, Austin Ziegler <halostatue@xxxxxxxxx> wrote:
On 3/10/06, Michal Suchanek <hramrach@xxxxxxxxxx> wrote:
On 3/10/06, Austin Ziegler <halostatue@xxxxxxxxx> wrote:
On 3/8/06, Richard Gyger <richard@xxxxxxxxxxxxx> wrote:
so, you guys are telling me a language developed since the year 2000
doesn't support unicode strings natively? in my opinion, that's a
pretty glaring problem.
Please note that Ruby itself is ten years old. Unicode has only
*recently* (the last three or four years, with the release of Windows
XP) become a major factor, especially in Japan. Unix support for
Unicode is still in the stone ages because of the nonsense that POSIX
put on Unix ages ago. (When Unix filesystems can write UTF-16 as
their native filename format, then we're going to be much better.
That will, however, break some assumptions by really stupid
programs.)
Why the hell utf-16? It is no longer compatible with ascii, yet 16
bits are far from sufficient to cover current unicode. So you still
get multiword characters. It is not even dword aligned for fast
processing by current cpus. I would like utf-8 for compatibility, and
utf-32 for easy string processing. But I do not see much use for
utf-16.

UTF-16 is actually pretty performant and the implementation of wchar_t
on MacOS X and Windows is (you guessed it!) UTF-16. The filesystems for
both of these operating systems (which have *far* better Unicode
support than anything else) both use UTF-16 as the native filename
encoding (this is true for HFS+, NTFS4, and NTFS5). The only difference
between what MacOS X does and Windows does for this is that Apple chose
to use decomposed characters instead of composed characters (e.g.,
LOWERCASE E + COMBINING ACUTE ACCENT instead of LOWERCASE E ACUTE
ACCENT).
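
For readers who have not met the composed/decomposed distinction, here is a minimal Ruby sketch of it in terms of Unicode normalization forms (NFC and NFD). It assumes a Ruby recent enough to have String#unicode_normalize, and the variable names are purely illustrative. It does not settle which exact decomposition any given filesystem applies.

```ruby
# Composed (NFC) and decomposed (NFD) spellings of the same visible character.
composed   = "\u00E9"                          # LATIN SMALL LETTER E WITH ACUTE
decomposed = composed.unicode_normalize(:nfd)  # e + COMBINING ACUTE ACCENT

composed_points   = composed.codepoints
decomposed_points = decomposed.codepoints

puts composed_points.map { |c| format("U+%04X", c) }.join(" ")    # U+00E9
puts decomposed_points.map { |c| format("U+%04X", c) }.join(" ")  # U+0065 U+0301

# The two spellings are different strings until both are normalized to one form.
same_raw        = composed == decomposed
same_normalized = composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
puts "equal as stored:     #{same_raw}"
puts "equal once NFC both: #{same_normalized}"
```

This is why a name written in one form may not compare equal to the "same" name read back in the other unless the program normalizes before comparing.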

Look at the performance numbers for ICU4C: it's pretty damn good. UTF-32
isn't exactly space conservative (since with UTF-16 *most* of the BMP
can be represented with a single wchar_t, and only a few need surrogates
taking up exactly *two* wchar_ts, whereas *all* characters would take up
four bytes (one uint32_t) under UTF-32). ICU4C uses UTF-16 internally. Exclusively.

I do not care what Windows, OS X, or ICU uses. I care what I want to
use. Even if most characters are encoded with a single word, you still have to
cope with multiword characters. That means that a character is not a
simple type. You cannot have character arrays. And no library can
completely wrap this inconsistency and isolate you from dealing with
it.
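
The "multiword character" point can be illustrated with a small Ruby sketch. The helper name utf16_units_per_char and the sample string are made up for this example; it only counts 16-bit code units and says nothing about how any particular library copes with them.

```ruby
# Number of 16-bit UTF-16 code units needed for each character of a string.
def utf16_units_per_char(str)
  str.each_char.map { |ch| ch.encode("UTF-16BE").bytesize / 2 }
end

# a, e-acute, a CJK ideograph, and MUSICAL SYMBOL G CLEF (outside the BMP)
text = "a\u00E9\u65E5\u{1D11E}"

units       = utf16_units_per_char(text)
total_units = text.encode("UTF-16BE").bytesize / 2
utf32_units = text.encode("UTF-32BE").bytesize / 4

puts "code points:        #{text.length}"      # 4
puts "UTF-16 code units:  #{total_units}"      # 5
puts "units per char:     #{units.inspect}"    # [1, 1, 1, 2]
puts "UTF-32 code units:  #{utf32_units}"      # 4
```

Indexing the UTF-16 form by code unit therefore stops lining up with code points as soon as one character outside the BMP appears. UTF-32 keeps one unit per code point, although a code point is still not always a whole user-perceived character once combining marks are involved.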

Even if the library performs well with multiword characters, it is
complex, and that makes it more prone to errors, both in itself and in the
software that interfaces with it.

You say that utf-16 is more space-conserving for languages like
Japanese. Nice. But I do not care. I guess text consumes a very small
portion of the storage on my system, both RAM and hard drive. I do not care
if that doubles or quadruples. In the very few cases when I want to
save space (e.g., when sending email attachments) I can use gzip. It can
even compress repetitive text which no encoding can.
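
A rough way to check the compression argument is sketched below, using Ruby's bundled zlib binding. Zlib::Deflate is the DEFLATE algorithm that gzip uses, without the gzip file header. The sample text and the helper deflated_size are invented for the illustration, and real documents will compress far less dramatically than a phrase repeated 500 times.

```ruby
require "zlib"

# Hypothetical sample: a short Japanese phrase repeated many times.
sample = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8 " * 500

# Size in bytes of the DEFLATE-compressed form of str in the given encoding.
def deflated_size(str, encoding)
  Zlib::Deflate.deflate(str.encode(encoding).b).bytesize
end

%w[UTF-8 UTF-16LE UTF-32LE].each do |enc|
  raw = sample.encode(enc).bytesize
  puts format("%-8s raw=%6d bytes  deflated=%5d bytes", enc, raw, deflated_size(sample, enc))
end
```

On repetitive input like this, the compressed sizes of the three encodings end up much closer to each other than the raw sizes are.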



On 3/10/06, Anthony DeRobertis <aderobertis@xxxxxxxxxxx> wrote:
Austin Ziegler wrote:
Unix support for Unicode is still in the stone ages because of the
nonsense that POSIX put on Unix ages ago. (When Unix filesystems can
write UTF-16 as their native filename format, then we're going to be
much better. That will, however, break some assumptions by really
stupid programs.)
Ummm, no. UTF-16 filenames would break *every* correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been the
end-of-string marker.
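
The 0x00 point is easy to see from the raw bytes. A small Ruby sketch follows; the helper contains_nul_byte? and the file name are hypothetical.

```ruby
# Does the encoded form of a name contain a zero byte?
def contains_nul_byte?(name, encoding)
  name.encode(encoding).bytes.include?(0)
end

name = "report.txt"   # hypothetical file name

puts "UTF-8:    #{contains_nul_byte?(name, 'UTF-8')}"     # false
puts "UTF-16LE: #{contains_nul_byte?(name, 'UTF-16LE')}"  # true
puts name.encode("UTF-16LE").bytes.first(4).inspect       # [114, 0, 101, 0]
```

Any API that takes a NUL-terminated byte string would see the UTF-16LE form of this name end after its first byte. UTF-8 never produces a zero byte except for U+0000 itself.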

You're right. And I'm saying that I don't care. People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems. One could do
what I *think* that Apple has done and provided two filesystem
interfaces that are synchronized. The native interface -- and the more
efficient one -- will be using UTF-16 because that's what HFS+ speaks.
The secondary interface (that also works on UFS filesystems) would
translate to UTF-8 and/or follow the nonsensical POSIX rules for native
encodings.

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this 'stone age' you refer to?

Change an environment variable and watch the programs that had worked so
well with Unicode break. *That* is the stone age that I refer to.
I'm also guessing that you don't do much with long Japanese filenames or
deep paths that involve *anything* except US-ASCII (a subset of UTF-8).

Hmm, so you call being able to choose your encoding living in the
stone age. I would call it living in reality. There are various
encodings out there.


UTF-8 can take multiple octets to represent a character. So can UTF-16,
UTF-32, and every other variation of Unicode.

This last statement is true only because you use the term "octet." It's
a useless term here, because UTF-8 only has any level of efficiency for
US-ASCII. Even if you step to European content, UTF-8 is no longer
perfectly efficient, and when you step to Asian content, UTF-8 is so
bloody inefficient that most folks who have to deal with it would rather
work in a native encoding (EUC-JP or SJIS, anyone?) which is 1..2 bytes
or do everything in UTF-16.

No, I suspect the reason for using EUC-JP, SJIS, ISO-8859-*, and
other weird encodings is historical.
What do you mean by efficiency? If you want space efficiency use
compression. If you want speed, use utf-32 or similar encoding that
does not have to deal with special cases.


Depending on content, a string in UTF-8 can consume more octets than
the same string in UTF-16, or vice versa.
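
The "depending on content" claim can be checked directly. The following Ruby sketch uses three made-up sample strings and a helper named encoded_sizes, neither of which comes from the thread.

```ruby
# Byte size of a string in each of three Unicode encodings.
def encoded_sizes(str)
  %w[UTF-8 UTF-16LE UTF-32LE].map { |enc| [enc, str.encode(enc).bytesize] }.to_h
end

samples = {
  "ASCII"    => "hello world",
  "European" => "na\u00EFve caf\u00E9",
  "Japanese" => "\u65E5\u672C\u8A9E",
}

samples.each do |label, text|
  puts "#{label}: #{encoded_sizes(text).inspect}"
end
# ASCII:    UTF-8 11, UTF-16LE 22, UTF-32LE 44
# European: UTF-8 12, UTF-16LE 20, UTF-32LE 40
# Japanese: UTF-8  9, UTF-16LE  6, UTF-32LE 12
```

UTF-8 is smaller for the ASCII and mostly-ASCII samples and larger for the Japanese one, which is the trade-off both sides of the argument are describing.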

Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
get to have the fun of picking between big- and little-endian!

Are people always this stupid when it comes to things that they clearly
don't understand? Yes, UTF-16 may have the problem of not knowing if
you're dealing with UTF-16BE or UTF-16LE, but it's my understanding that
this is *only* an issue when you're dealing with both on the same
system. Additionally, most platforms specify a default. It's been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.

iirc there are even byte-order marks. If you insert one in every
string you can get them identified at any time without doubt :)
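
For what it is worth, checking for a byte-order mark takes only a few lines. The Ruby sketch below uses a hypothetical helper named sniff_bom and deliberately leaves out UTF-32, whose little-endian BOM starts with the same two bytes as the UTF-16LE one and would need to be tested first.

```ruby
# Returns the encoding name suggested by a leading BOM, or nil if there is none.
# UTF-32 is intentionally not handled in this sketch.
def sniff_bom(bytes)
  data = bytes.b   # work on a binary copy so the comparison is byte for byte
  return "UTF-8"    if data.start_with?("\xEF\xBB\xBF".b)
  return "UTF-16BE" if data.start_with?("\xFE\xFF".b)
  return "UTF-16LE" if data.start_with?("\xFF\xFE".b)
  nil
end

marked = "\uFEFF" + "hi"   # U+FEFF is the byte-order mark character

puts sniff_bom(marked.encode("UTF-16LE")).inspect  # "UTF-16LE"
puts sniff_bom(marked.encode("UTF-16BE")).inspect  # "UTF-16BE"
puts sniff_bom(marked.encode("UTF-8")).inspect     # "UTF-8"
puts sniff_bom("hi").inspect                       # nil
```

A BOM only helps when the producer actually writes one, so a nil result here means "unknown", not "not Unicode".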

But do not trust me on that. I do not know anything about unicode, and
I want to sidestep the issue by using an encoding that is easy to work
with, even for ignorants :P


Thanks

Michal

