Re: Unicode support in Smalltalk
- From: "Paolo Bonzini" <bonzini@xxxxxxx>
- Date: 21 Jul 2006 07:45:48 -0700
2) how do dialects that have Unicode deal with the overloading of
Characters 128-255, as they mean both "the bytes 128-255 used in the
encoding of a String" and "the Unicode Characters whose code points are
128-255"?
I think this is the important point -- more in a later post.
Waiting eagerly for it. Just for the record, my current design is like
this:
- Characters represent the encoding, UnicodeCharacters represent, well,
the unicode characters
- Unicode characters 0..127 can be represented as Characters, so that
you don't waste much memory on them, and because in practice the two
encodings almost always agree in that range. If they don't, as with the
ISO-2022 encodings, you have bigger problems waiting for you.
- Characters 0..127 hence are equal to the corresponding
UnicodeCharacters, but Characters 128..255 are *not* equal to the
corresponding UnicodeCharacters. Note that this is ANSI compliant,
because "Character codePoint: xxx" will never return a Character in the
range 128..255.
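
To make the equality rule concrete, here is a small model of it. It is
written in Python rather than Smalltalk, only so that it can be run
as-is. The class names mirror the ones above, but chars_equal and
code_point are hypothetical helpers, not part of any Smalltalk image.

```python
class Character:
    """One unit of an encoded String: a byte value 0..255."""

    def __init__(self, value):
        if not 0 <= value <= 255:
            raise ValueError("Character value must be in 0..255")
        self.value = value

    def __eq__(self, other):
        return chars_equal(self, other)

    def __hash__(self):
        return hash(self.value)


class UnicodeCharacter:
    """A Unicode code point, 0..0x10FFFF."""

    def __init__(self, value):
        if not 0 <= value <= 0x10FFFF:
            raise ValueError("code point out of range")
        self.value = value

    def __eq__(self, other):
        return chars_equal(self, other)

    def __hash__(self):
        return hash(self.value)


def chars_equal(a, b):
    """Same class: compare values. Mixed classes: equal only below 128."""
    kinds = (Character, UnicodeCharacter)
    if not isinstance(a, kinds) or not isinstance(b, kinds):
        return NotImplemented
    if type(a) is type(b):
        return a.value == b.value
    return a.value == b.value and a.value < 128


def code_point(n):
    """Model of "Character codePoint: n": never a Character in 128..255."""
    return Character(n) if n < 128 else UnicodeCharacter(n)
```

With this model, Character(65) == UnicodeCharacter(65) holds, while
Character(200) == UnicodeCharacter(200) does not, and code_point(200)
answers a UnicodeCharacter.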
Dually, Strings represent the encoding, and UnicodeStrings represent
the real things (stored as UTF-32). The VM was modified (just a
little) so that UnicodeString>>#at: is a primitive constructing
UnicodeCharacters from integers, but that's not necessary at all (it
just fell nicely out of other changes I was making).
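
The point of UTF-32 storage is that indexing is a fixed-offset read. A
rough Python sketch of that idea follows (not the actual primitive; the
function names are made up for illustration, and the index is 0-based
here whereas #at: is 1-based):

```python
import struct


def to_utf32(text):
    """Encode text as little-endian UTF-32, four bytes per code point, no BOM."""
    return text.encode("utf-32-le")


def code_point_at(buf, index):
    """Read the code point at a 0-based index straight out of the buffer."""
    (cp,) = struct.unpack_from("<I", buf, 4 * index)
    return cp


buf = to_utf32("a\u0117\u20ac")   # 'a', e with dot above, euro sign
second = code_point_at(buf, 1)    # the integer a UnicodeCharacter would wrap
```

Here len(buf) is 12 and second is 279, with no scanning of the
preceding characters.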
Obvious weak point: if you have things in multiple encodings, you'd
better convert everything to UnicodeStrings. Maybe later I will add an
EncodedString class that explicitly holds the encoding together with a
String object. That way, equality comparisons between EncodedStrings
will go through Unicode if the two are in different encodings.
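
A possible shape for such an EncodedString, again modelled in Python
and purely as a sketch of the comparison rule (the class does not exist
yet, so every detail here is an assumption):

```python
import codecs


class EncodedString:
    """Raw bytes plus the name of the encoding they are in."""

    def __init__(self, data, encoding):
        self.data = data
        # Normalise aliases such as "latin-1" and "iso-8859-1" to one name.
        self.encoding = codecs.lookup(encoding).name

    def __eq__(self, other):
        if not isinstance(other, EncodedString):
            return NotImplemented
        if self.encoding == other.encoding:
            # Same encoding: comparing the bytes is enough.
            return self.data == other.data
        # Different encodings: go through Unicode.
        return self.data.decode(self.encoding) == other.data.decode(other.encoding)

    # Defining __eq__ alone leaves the class unhashable, which keeps the
    # sketch short; a real class would hash the decoded text.
    __hash__ = None


latin = EncodedString("\u00e9".encode("latin-1"), "latin-1")
utf8 = EncodedString("\u00e9".encode("utf-8"), "utf-8")
```

latin and utf8 hold different bytes (E9 versus C3-A9) but compare
equal, because both decode to the same Unicode character.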
Then, what my "(Character codePoint: 279) printString" example does to
print "$ė" is:
- Build a UnicodeCharacter, because 279 > 127.
- When you ask it to print itself, it creates a "WriteStream on: String
new" and puts $$ there.
- It constructs a UnicodeString with itself, asking it to print itself
on the stream
- It constructs a ReadStream on the UnicodeString
- It asks an EncodedStream to convert the UnicodeString to the
destination stream
- The EncodedStream "somehow" (in practice you can use the system
conversion function iconv, or write converters in Smalltalk) writes the
two bytes of the UTF-8 encoding of the "e with a dot above".
- The contents of the Stream are retrieved, and they are three
Characters 24-C4-97. Note that this is an encoding, so no
UnicodeCharacters are there!
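
The byte values are easy to check outside Smalltalk. This Python
snippet is only a stand-in for the stream pipeline above (print_string
is a made-up name, and Python's own codec plays the role of iconv):

```python
import io


def print_string(code_point, encoding="utf-8"):
    """Write '$' and then the encoded character to a byte stream."""
    out = io.BytesIO()
    out.write(b"$")                                  # the $ prefix, byte 24
    out.write(chr(code_point).encode(encoding))      # encoded character
    return out.getvalue()


result = print_string(279)
```

result is the three bytes 24-C4-97: one for the dollar sign and two for
the UTF-8 encoding of code point 279.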
(I omitted this from my original post because it sounded irrelevant,
but now it's here so you can bash it...)
Rather than break everything by introducing new syntax (which I can't do anyway
since I have no access to Dolphin's parser), I just added a new unary message
#U which is understood by Characters, Integers, and Strings, to answer the
obvious Unicode objects.
Nice idea, too.
On the other hand message sends cannot be used in Array literals
Agreed. That's a problem. I'm not convinced that it's really worth solving,
though, in relation to the compatibility problems.
Well, ##(279U) is about as long as $<279> (even though I really liked
the syntax, sigh). So with a syntax like 279U, the array literal
problem can be considered solved.
5) how many applications would break if Character identity (i.e. "a ==
b" is the same as "a = b") held only for characters 0-127?
Personally I'd extend the range where identity holds to at least 0..255,
In my scheme it does hold for Characters 128..255 (so that we are
backwards compatible: the elements of a String can always be compared
with identity), but not for any UnicodeCharacter. I wrote 0..127 in my
original post, because usually you will not see UnicodeCharacters below
128 (unless someone plays tricks with #changeClassTo: or
#instVarAt:put:).
and probably to 2**16. I don't see a need for identity to hold across the entire
Unicode range.
If you want it to hold for 2^16, it means modifying the VM to encode
Characters specially, as it does for SmallIntegers, unless you want to
"waste" a meg of memory to hold 65280 objects. This is what VW does
for example. At that point, why not have identity hold for all the 1
million-odd characters?
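
To put rough numbers on the two options, here is a Python sketch. The
tag layout and the 16 bytes per object are assumptions picked for
illustration, not what any particular VM does:

```python
TAG_BITS = 2
CHAR_TAG = 0b10   # hypothetical tag marking "this word is a character"


def encode_char(code_point):
    """Pack a code point into a tagged machine word (an immediate value)."""
    return (code_point << TAG_BITS) | CHAR_TAG


def is_immediate_char(word):
    return word & ((1 << TAG_BITS) - 1) == CHAR_TAG


def decode_char(word):
    return word >> TAG_BITS


def preallocated_table_bytes(first, last, bytes_per_object=16):
    """Memory needed to keep one unique object per code point in first..last."""
    return (last - first + 1) * bytes_per_object


table_cost = preallocated_table_bytes(256, 65535)   # 65280 objects
```

With these assumptions table_cost is 1044480 bytes, about a meg, while
the tagged form needs no table at all and still fits the highest code
point, 0x10FFFF, in a 32-bit word. Two equal code points give the same
word, so identity holds for free.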
Thanks very much,
Paolo