Re: Unicode support in Smalltalk



Paolo Bonzini wrote:
(with the exception of a few specific ByteString subclasses that were
introduced to reduce the amount of translation necessary to deal with
text coming from files on specific platforms).
>
And for these I guess you have special implementations of #at:/#at:put:
that do the mapping between the value stored and the Character
returned.

Yup.
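
As a rough illustration of that idea (in Python rather than Smalltalk, and not the actual VisualWorks implementation), a byte-backed string can translate between the stored byte and the character on every access. The class name `MappedByteString` and the methods `at`/`at_put` are hypothetical stand-ins for a ByteString subclass with #at:/#at:put:, and indexing here is 0-based, unlike Smalltalk.

```python
class MappedByteString:
    """Hypothetical analogue of a platform-specific ByteString subclass.

    Storage is one byte per character; the encoding decides which
    character each stored byte value stands for.
    """

    def __init__(self, raw: bytes, encoding: str):
        self._bytes = bytearray(raw)
        self._encoding = encoding  # assumed to be a single-byte encoding

    def __len__(self) -> int:
        return len(self._bytes)

    def at(self, index: int) -> str:
        # Map the stored byte value to the character it represents.
        return bytes([self._bytes[index]]).decode(self._encoding)

    def at_put(self, index: int, char: str) -> None:
        # Map the character back to the byte value that gets stored.
        encoded = char.encode(self._encoding)
        if len(encoded) != 1:
            raise ValueError("character does not fit in one byte")
        self._bytes[index] = encoded[0]


s = MappedByteString(bytes([192, 98, 99]), "iso8859_4")
first = s.at(0)          # byte 192 read back as a character
s.at_put(1, "\u0101")    # store a non-Latin-1 character as a single byte
second = s.at(1)
```

The point of the design is that the storage stays one byte per element while callers only ever see characters.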


It makes a lot of sense. I assume that you can, more or less on the
fly, tell a Stream to start decoding ISO-8859-4 and return Strings
encoded correspondingly (i.e. 192 will read as Unicode 256)?

Yes, you can.
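
A minimal Python sketch of the same behaviour, assuming a single-byte encoding so that "n characters" and "n bytes" coincide. The helper `read_decoded` is hypothetical; it only shows that the encoding can be chosen per read from the same underlying binary stream.

```python
import io


def read_decoded(binary_stream, count, encoding):
    """Read `count` bytes and decode them with a single-byte encoding."""
    return binary_stream.read(count).decode(encoding)


raw = io.BytesIO(b"abc" + bytes([192]))

head = read_decoded(raw, 3, "ascii")      # plain ASCII prefix
tail = read_decoded(raw, 1, "iso8859_4")  # switch encodings mid-stream
code_point = ord(tail)                    # byte 192 decoded under ISO-8859-4
```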

I'm missing a detail, however: for UTF-8, you can implement most string
algorithms the same as you'd do with a 256-character character set or
with a single byte encoding. To some extent, that holds even for
complicated tasks such as regular expression matching. That is the
main reason why UTF-8 is not as bad as the old double-byte encodings
for the Far East. And that's also why I wanted to keep the input of a
text Stream to be still a String (or more generically a CharacterArray)
rather than a ByteArray. In VW, should everything that is read from a
Stream have a known encoding, and go through the decoding process, in
order to use say #match: on it? (Assuming you don't want to cheat and
specify ISO-8859-1, for which encoding 128-255 matches codepoint
128-255).
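
To make the UTF-8 point concrete, here is a small Python sketch (the sample text is made up). It searches the raw UTF-8 bytes without decoding them and gets the same answer as the character-level search. Only the offset differs, since it counts bytes instead of characters. The last lines show the ISO-8859-1 "cheat": bytes 128-255 decode to code points 128-255.

```python
import re

text = "naïve café au lait"
needle = "café"

utf8_text = text.encode("utf-8")
utf8_needle = needle.encode("utf-8")

# Substring search works directly on the encoded bytes, because a valid
# UTF-8 sequence can only match at a character boundary.
found_in_bytes = utf8_needle in utf8_text
found_in_chars = needle in text

byte_offset = utf8_text.find(utf8_needle)
char_offset = text.find(needle)
# Convert the byte offset to a character offset by decoding the prefix.
converted_offset = len(utf8_text[:byte_offset].decode("utf-8"))

# A byte-level regular expression also works, with the caveat that
# wildcards and classes such as "." or "\w" operate on single bytes.
match = re.search("café \\w+".encode("utf-8"), utf8_text)
matched_text = match.group(0).decode("utf-8")

# ISO-8859-1: each byte value is its own code point.
latin1_points = [ord(c) for c in bytes(range(128, 256)).decode("latin-1")]
```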

I'm not sure I understand. An EncodedStream in VW doesn't support any special matching operations. Even things like #upToAll: et al. eventually go through some variant of #next, i.e. they decode into Characters and then match those.
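
A sketch of that decode-then-match approach in Python, not the VW implementation. The function `up_to_all` is a hypothetical analogue of #upToAll: that pulls one decoded character at a time and compares characters, never bytes. Whether the delimiter is consumed is an assumption of this sketch; here it is.

```python
import io


def up_to_all(char_stream, pattern):
    """Return the characters before `pattern`, consuming the pattern.

    Returns everything that is left if the pattern never occurs.
    """
    if not pattern:
        return ""
    wanted = list(pattern)
    collected = []
    while True:
        ch = char_stream.read(1)  # the "#next" step: one decoded character
        if ch == "":
            break                 # end of stream
        collected.append(ch)
        if collected[-len(wanted):] == wanted:
            del collected[-len(wanted):]
            break
    return "".join(collected)


binary = io.BytesIO("größe::42".encode("utf-8"))
chars = io.TextIOWrapper(binary, encoding="utf-8")

before = up_to_all(chars, "::")
rest = chars.read()
```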

On the image side, input to an EncodedStream *is* a Character or String. It does have an additional, "binary" pass-through mode which switches the API to work with a byte or ByteArray, but that's just an additional feature. EncodedStream is a stream wrapper though, and its underlying stream (internal or external) is expected to be binary.
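
The wrapper arrangement can be sketched in Python as follows. `EncodedReader`, its `binary` flag, and `next` are hypothetical names chosen for the illustration; only the shape (a character API layered over a binary stream, with an optional pass-through mode) comes from the description above.

```python
import codecs
import io


class EncodedReader:
    """Hypothetical wrapper: characters on top, a binary stream underneath."""

    def __init__(self, binary_stream, encoding):
        self._stream = binary_stream
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self.binary = False  # pass-through mode switch

    def next(self):
        """Return the next character, or the next byte in binary mode.

        Returns None at end of stream.
        """
        if self.binary:
            b = self._stream.read(1)
            return b[0] if b else None
        while True:
            b = self._stream.read(1)
            if not b:
                return None
            # Feed bytes one at a time until a whole character is decoded.
            ch = self._decoder.decode(b)
            if ch:
                return ch


reader = EncodedReader(io.BytesIO("é".encode("utf-8") + b"\x00\xff"), "utf-8")
first_char = reader.next()   # two bytes consumed, one character returned
reader.binary = True
raw_a = reader.next()        # bytes are now handed through undecoded
raw_b = reader.next()
at_end = reader.next()
```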

As far as knowing the encoding goes, it is generally up to the application or surrounding frameworks (HTTP headers, etc) to specify which encoding should be applied. There are some platform defaults, but those are irrelevant in many use cases.
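
For instance, an application reading an HTTP response might take the encoding from the Content-Type header and only fall back to a default when none is given. A minimal Python sketch using the standard library's `email.message.Message` for the header parsing. The helper name and the UTF-8 fallback are assumptions made for the example.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Pick the charset named in a Content-Type header, else `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


declared = charset_from_content_type("text/plain; charset=ISO-8859-4")
fallback = charset_from_content_type("text/plain")

# The application, not the stream, decides which encoding is applied.
decoded = bytes([192]).decode(declared)
```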

Does that answer your question?

Martin


