Re: Unicode support in Smalltalk
- From: Martin Kobetic <mkobetic@cincom-com>
- Date: Mon, 24 Jul 2006 10:36:21 -0400
Paolo Bonzini wrote:
>(with the exception of a few specific ByteString subclasses that were
introduced to reduce the amount of translation necessary to deal with
text coming from files on specific platforms).
And for these I guess you have special implementations of #at:/#at:put:
that do the mapping between the value stored and the Character
returned.
Yup.
It makes a lot of sense. I assume that you can, more or less on the
fly, tell a Stream to start decoding ISO-8859-4 and return Strings
encoded correspondingly (i.e. 192 will read as Unicode 256)?
Yes, you can.
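
As an illustration of the mapping being asked about, here is a minimal sketch in Python rather than VisualWorks (the variable names are made up for the example, and it does not show the EncodedStream API). It only demonstrates that byte 192 decodes to code point 256 under ISO-8859-4, while the same byte stays at 192 under ISO-8859-1.

```python
# Illustrative only: Python's codecs stand in for a stream decoder.
raw = bytes([192])

# ISO-8859-4 maps byte 0xC0 to LATIN CAPITAL LETTER A WITH MACRON (U+0100).
as_latin4 = raw.decode("iso8859_4")
latin4_code_point = ord(as_latin4)

# ISO-8859-1 maps every byte to the code point with the same value.
as_latin1 = raw.decode("iso8859_1")
latin1_code_point = ord(as_latin1)

print(latin4_code_point, latin1_code_point)
```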
I'm missing a detail, however: for UTF-8, you can implement most string
algorithms the same as you'd do with a 256-character character set or
with a single byte encoding. To some extent, that holds even for
complicated tasks such as regular expression matching. That is the
main reason why UTF-8 is not as bad as the old double-byte encodings
for the Far East. And that's also why I wanted to keep the input of a
text Stream to be still a String (or more generically a CharacterArray)
rather than a ByteArray. In VW, should everything that is read from a
Stream have a known encoding, and go through the decoding process, in
order to use say #match: on it? (Assuming you don't want to cheat and
specify ISO-8859-1, for which encoding 128-255 matches codepoint
128-255).
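
The point about UTF-8 can be sketched as follows, again in Python purely for illustration (the sample strings and names are assumptions, not taken from any Smalltalk code). A plain byte-level substring search over UTF-8 data finds the same match a character-level search would, because a valid UTF-8 sequence cannot begin in the middle of another character's encoding.

```python
# Illustrative only: search the undecoded UTF-8 bytes as if they were
# a single-byte string, then compare with a search on decoded characters.
text = "na\u00efve caf\u00e9"      # contains two non-ASCII characters
needle = "caf\u00e9"

encoded_text = text.encode("utf-8")
encoded_needle = needle.encode("utf-8")

# Byte-oriented search, no decoding involved.
byte_index = encoded_text.find(encoded_needle)

# Character-oriented search on the decoded string.
char_index = text.find(needle)

# The byte offset differs from the character offset, but both point at
# the same place: decoding the prefix recovers the character index.
converted_index = len(encoded_text[:byte_index].decode("utf-8"))
```

Note that only the offsets differ between the two searches; the match itself is the same, which is what makes byte-wise algorithms workable on UTF-8.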
I'm not sure I understand. An EncodedStream in VW doesn't support any special matching operations. Even things like #upToAll: et al. are eventually subject to some variant of #next, i.e. decoding into Characters and then matching those.
On the image side, input to an EncodedStream *is* a Character or String. It does have an additional "binary" pass-through mode, which switches the API to work with a byte or ByteArray, but that's just an additional feature. EncodedStream is a stream wrapper, though, and its underlying stream (internal or external) is expected to be binary.
As far as knowing the encoding goes, it is generally up to the application or the surrounding framework (HTTP headers, etc.) to specify which encoding should be applied. There are some platform defaults, but those are irrelevant in many use cases.
Does that answer your question?
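
The wrapper arrangement can be pictured with a rough analogue in Python (this is not the VisualWorks API; io.TextIOWrapper and io.BytesIO merely play the roles of the encoding wrapper and the binary underlying stream). The wrapper accepts and returns characters, while the stream underneath only ever holds bytes.

```python
import io

# Binary underlying stream: it stores bytes, never characters.
underlying = io.BytesIO()

# Character-level wrapper that encodes on write and decodes on read.
wrapper = io.TextIOWrapper(underlying, encoding="utf-8")

# Write a character through the wrapper and push it down to the bytes.
wrapper.write("\u0100")
wrapper.flush()
stored_bytes = underlying.getvalue()

# Rewind and read back through the wrapper: characters come out again.
wrapper.seek(0)
round_trip = wrapper.read()
```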
Martin