Re: What string encoding to pick as standard for a programming language?
- From: Robbert Haarman <comp.lang.misc@xxxxxxxxxxxxx>
- Date: Thu, 31 May 2007 11:22:52 +0200
First off, my reply to the subject line, without having read anything of
the message: UTF-8. It has pretty much all the advantages: it's Unicode
and, as such, should be able to express all writing (I am not sure it
does, but it does cover all the writing systems I know), it's compatible
with ASCII, and it encodes many common symbols pretty compactly. Also,
unlike the larger UTFs, UTF-8 doesn't suffer from endianness issues.
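To make these points concrete, here is a small Python sketch (the sample text is an arbitrary choice, not taken from the thread) comparing the same string in the three UTF encoding forms:

```python
# Sample text: "na", U+00EF (i with diaeresis), "ve ", U+20AC (euro sign), "5".
text = "na\u00efve \u20ac5"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
utf32_le = text.encode("utf-32-le")

# ASCII compatibility: pure-ASCII text is byte-for-byte identical in UTF-8.
ascii_same = "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")

# Compactness: UTF-8 uses 1 byte per ASCII character, 2 for U+00EF, 3 for U+20AC.
sizes = (len(utf8), len(utf16_le), len(utf32_le))

# Endianness: UTF-16 (and UTF-32) come in two byte orders; UTF-8 has only one.
byte_order_matters = utf16_le != utf16_be

print(ascii_same, sizes, byte_order_matters)
```

For text that is mostly ASCII, UTF-8 is the smallest of the three; for text that is mostly, say, CJK, UTF-16 can come out smaller, so "compact" here depends on the data.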
Now on to your points:
> My programming language (PILS) - which I am currently reimplementing
> - needs a string type.
> I don't want different string types in the language and I don't want them to
> differ across implementations, so I need to define a standard way of
> representing a string.
That would definitely point towards some sort of Unicode at least.
> When reimplementing in 2003, I went for UTF16 for several reasons, of
> which code page independency and easy interfacing with COM and .NET
> were among the more important ones.
UTF-16 has a number of issues, as you seem to have discovered as well.
Besides those you mentioned, 16 bits aren't actually enough anymore to
represent all Unicode characters. Therefore, UTF-16 now has surrogates,
meaning a single character may need two 16-bit words. The advantage of
simple processing that UTF-16 used to offer (one fixed-size unit per
character) is gone. You could go to UTF-32 and have _really_ large
characters, or you could go to UTF-8 and have all its advantages.
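A short Python sketch of the surrogate issue; `utf16_units` is a hypothetical helper written for this illustration, and U+1D11E (musical G clef) is just one example of a character outside the 16-bit range:

```python
def utf16_units(s):
    """Return the 16-bit code units of s when encoded as UTF-16."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_char = "A"              # fits in one 16-bit unit
astral_char = "\U0001D11E"  # needs a surrogate pair in UTF-16

print([hex(u) for u in utf16_units(bmp_char)])
print([hex(u) for u in utf16_units(astral_char)])

# The same character takes 4 bytes in UTF-8 and 4 bytes in UTF-32.
print(len(astral_char.encode("utf-8")), len(astral_char.encode("utf-32-be")))
```

So code that assumes "one 16-bit unit is one character" breaks on such input, just as code that assumes "one byte is one character" breaks on UTF-8.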
> This is fine when I deal with strings as such, but it complicates the
> file handling somewhat. I can't have an odd number of bytes,
In my opinion, there should be separate types for strings and byte
vectors anyway (although strings could, of course, be implemented on top
of byte vectors). Each of the UTFs rules out some byte patterns as
invalid, so using strings as byte vectors will never be completely
general, unless you have a broken string implementation.
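As an illustration of byte patterns that a conforming decoder has to reject, here is a minimal Python sketch. `is_valid` is a hypothetical helper; it simply relies on Python's built-in strict codecs.

```python
def is_valid(data, encoding):
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

# UTF-8: some byte sequences can never occur in valid text.
print(is_valid(b"hello", "utf-8"))     # plain ASCII is fine
print(is_valid(b"\xff\xfe", "utf-8"))  # 0xFF is never a valid UTF-8 byte
print(is_valid(b"\xc3", "utf-8"))      # lead byte with its continuation missing
print(is_valid(b"\xc0\xaf", "utf-8"))  # overlong encoding of "/"

# UTF-16: an odd number of bytes cannot be a whole number of 16-bit units.
print(is_valid(b"A\x00B", "utf-16-le"))
```

A byte vector type has no such restrictions, which is one reason to keep the two types apart.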
> Presently, I'm doing a cross-platform reimplementation. I started out with
> UTF16 but I consider switching to UTF8. The string functions would then
> simply work on bytes, except Upper, Lower, and InitialUpper (a convenience
> function that makes the first character of a string uppercase) which would
> have to "understand" the encoding, and a few search operations would
> have to be refitted to work with sequences of chars rather than single
> chars, but this is a nice generalisation anyway.
Libraries that deal with UTF-8 already exist, so getting this to work
should not be a major issue.
> What will happen if I implement a simple byte-oriented search and
> use it for searching UTF-8 strings? Will I get false hits because a
> part of a multibyte char happens to look like the simple char I am
> searching for, or is UTF-8 designed to avoid this?
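You will not get false hits, provided both the text and the thing you
search for are valid UTF-8: UTF-8 is designed to avoid this. Every byte
of a multibyte char has its high bit set, so it can never look like an
ASCII char, and lead bytes and continuation bytes use disjoint ranges,
so a match cannot start in the middle of a char. Still, as I said, you
can just use the code that others have already written.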
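A quick way to test the behaviour of byte-oriented search on UTF-8 for yourself is sketched below in Python. The helper names and the sample string are made up for this illustration. The check relies on the UTF-8 byte layout: ASCII bytes are 0x00-0x7F, continuation bytes are 0x80-0xBF, and lead bytes of multibyte characters are 0xC2-0xF4.

```python
def is_char_boundary(data, offset):
    """True if offset is at the start of a character (or at the end of data)."""
    return offset == len(data) or (data[offset] & 0xC0) != 0x80

def byte_find_all(haystack, needle):
    """Plain byte-oriented search: all byte offsets where needle occurs."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits

# Sample with 1-, 2-, 3- and 4-byte characters.
sample = "a\u00e9\u20ac\U0001D11Eza"
data = sample.encode("utf-8")

# Search for each whole character; every hit must land on a boundary.
for ch in set(sample):
    for hit in byte_find_all(data, ch.encode("utf-8")):
        assert is_char_boundary(data, hit)

print(byte_find_all(data, b"a"))
```

The guarantee only covers searching for complete, valid UTF-8 sequences; searching for a lone continuation byte, or in data that is not valid UTF-8, can of course match in the middle of a character.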
> With Windows, UTF-8 will be slightly slower in some cases because
> strings have to be marshalled to and from UTF-16 in most system calls
> - which is no big deal, the conversion routines are readily available and
> AFAIK fast, and there might be a slight gain in using UTF-8 for
> program source files. But how is it with other platforms? Is there a
> trend towards UTF-16...
On Unix, at least as far as the Free unices go, the trend is definitely
towards UTF-8. Java, I think, uses its own almost UTF-8 compatible
encoding, at least for interacting with the outside world (something in
the back of my mind whispers that the Sun JVM uses UTF-16 internally).
> I'd like to hear some opinions of what you'd prefer to work with if
> you can't have both. (I really don't want to kludge the language design
> and syntax with dual string types or a special type to handle raw
> byte sequences.)
I would (and will) go with UTF-8 for strings and a separate byte vector
type for raw bytes, for the reasons I already gave you.
Regards,
Bob
--
Reality is that which, when you stop believing in it, doesn't go away.
-- Philip K. Dick
- References:
- What string encoding to pick as standard for a programming language?
- From: Ole Nielsby