Ruby, Unicode - ever?



Well, as far as I can tell from searching the web, the same question has
come up every once in a while since about 2001 or even earlier: why does
Ruby not support Unicode? Why can't Ruby at least use the ICU libraries?
(The current state of UTF-8 in Ruby, even with regexps, is far from
proper Unicode support. Don't try to tell me it's OK and good enough,
it is not!)

And the usual answer (for years!) is: m17n will come in Ruby 2.0 (Rite),
because Unicode can't handle enough characters and Han unification is
unacceptable.

But...

As I see it, there are two big problems:
1. The Ruby String class in its current state is TOO MUCH OVERLOADED: it
mixes byte-array and character-text behaviour at the same time. That is
definitely and absolutely the wrong design decision. These are different
paradigms, which must never be mixed.

2. My impression of Rite m17n is that each string will be able to carry
its own encoding. I don't get it. For a byte array an encoding is
meaningless - it is a plain bit stream. And for text - how will one
compare/regexp/search across strings in different encodings???
(BTW, the Unicode code point space needs 21 bits and holds 1,114,112
code points, U+0000 to U+10FFFF - but do we really have over a million
*different* characters?) What is the point of writing text-handling
support code for that whole multitude of encodings? (Look at Oniguruma -
each encoding plugin defines its own procedures and character properties
to deal with multibyte encodings.)
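
To make both points concrete, here is a minimal sketch, assuming UTF-8
and ISO-8859-1 input. It uses only String#unpack and Array#pack, so it
does not depend on how a particular Ruby version defines String#length.
The sample word and the variable names are mine, chosen for illustration.

```ruby
# "café" as raw bytes in two encodings (sample data, illustration only).
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*")        # ISO-8859-1
utf8   = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")  # UTF-8

# Point 1: byte count and character count are different questions.
utf8_byte_count = utf8.unpack("C*").length   # counts bytes
utf8_char_count = utf8.unpack("U*").length   # counts Unicode code points

# Point 2: the same text in two encodings does not match byte for byte.
bytes_equal = latin1.unpack("C*") == utf8.unpack("C*")

# Comparison works only after decoding both sides to code points.
# In ISO-8859-1 every byte value equals its Unicode code point.
text_equal = latin1.unpack("C*") == utf8.unpack("U*")

puts "bytes: #{utf8_byte_count}, code points: #{utf8_char_count}"
puts "bytes equal: #{bytes_equal}, text equal: #{text_equal}"
```

The comparison only succeeds once both sides are in one common
representation, which is exactly the work a per-string encoding scheme
has to do somewhere.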

So I think the String class must be REMOVED from Rite.
Instead, two incompatible classes must be introduced: ByteArray and Text,
with well-separated semantics and behaviour. Otherwise this will never
end, and the whole thing will eventually collapse into ruins someday...
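
A rough sketch of what I mean, and nothing more than that: ByteArray and
Text are hypothetical classes, they do not exist in Ruby, and every
method name below (from_utf8, to_utf8 and so on) is an assumption made
for illustration. UTF-8 is assumed as the only external encoding, and
Text stores plain Unicode code points internally.

```ruby
# Hypothetical: raw bytes with no notion of characters or encoding.
class ByteArray
  attr_reader :bytes

  def initialize(bytes)
    @bytes = bytes.map { |b| Integer(b) & 0xFF }
  end

  def length
    @bytes.length          # always a byte count
  end

  def ==(other)
    other.is_a?(ByteArray) && bytes == other.bytes
  end
end

# Hypothetical: text as a sequence of Unicode code points.
class Text
  attr_reader :codepoints

  def initialize(codepoints)
    @codepoints = codepoints
  end

  # Decoding is the only way from bytes to text (UTF-8 assumed here).
  def self.from_utf8(byte_array)
    new(byte_array.bytes.pack("C*").unpack("U*"))
  end

  # Encoding is the only way from text back to bytes.
  def to_utf8
    ByteArray.new(codepoints.pack("U*").unpack("C*"))
  end

  def length
    @codepoints.length     # always a code point count
  end

  def ==(other)
    other.is_a?(Text) && codepoints == other.codepoints
  end
end

raw  = ByteArray.new([0x68, 0xC3, 0xA9])  # "hé" encoded as UTF-8
text = Text.from_utf8(raw)
puts "#{raw.length} bytes, #{text.length} code points"
```

The two classes never compare equal to each other, so crossing from one
to the other always goes through an explicit decode or encode step.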


--
Posted via http://www.ruby-forum.com/.

