Re: double byte string numbers to_int??
- From: Christophe Grandsire <christophe.grandsire@xxxxxxx>
- Date: Sat, 5 Nov 2005 00:38:05 +0900
Quoting Horacio Sanson <hsanson@xxxxxxxxxxxxxxx>:
>
> I did some testing and so far have had no luck getting encoded strings to
> convert to Numeric values.
>
> s = "１７" => "\357\274\221\357\274\227"
> puts s => １７
> s.to_i => 0
>
> I also tried converting the string with Iconv with no results (Illegal
> Sequence errors).
>
That's normal. Those characters are just Unicode characters, without any more
meaning (as far as to_i is concerned) than any other Japanese kanji or whatever
sign you might find in Unicode.
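
To make that concrete, here is a minimal sketch, assuming Ruby 1.9 or later
(where strings carry an encoding and "\u" escapes are available). The variable
names are only for illustration:

```ruby
ascii     = "17"            # ASCII digits U+0031 U+0037
fullwidth = "\uFF11\uFF17"  # fullwidth digits U+FF11 U+FF17

ascii.to_i      # => 17
fullwidth.to_i  # => 0, to_i stops at the first character that is not an ASCII digit

# The UTF-8 bytes in octal, matching the escapes in the quoted output:
fullwidth.bytes.map { |b| b.to_s(8) }
# => ["357", "274", "221", "357", "274", "227"]
```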
>
> Playing a little more I got this little method to convert the utf8 encoded
> string to Fixnum
>
> class String
>   def w_to_i
>     digits = self.size/3
>     res = ""
>     0.upto(digits-1) { |d|
>       res = res + (self[(d*3)+2] - 144).to_s
>     }
>     res.to_i
>   end
> end
>
> # Example usage
> s = "０" => "\357\274\220"
> s.w_to_i => 0
>
> s = "１" => "\357\274\221"
> s.w_to_i => 1
>
> s = "５１" => "\357\274\225\357\274\221"
> s.w_to_i => 51
> s.w_to_i.class => Fixnum
>
>
> This little hack works so far but only for my specific application. Any tips
> on making this better are appreciated. Also, if there exists any easier way
> (and I believe there must be) I will appreciate any directions.
>
I don't believe there is. The problem here probably could not be solved even if
we had a perfectly Unicode-aware language. The big problem is that besides the
ASCII digits, Unicode also has digits for plenty of other languages, which may
not even use the positional system our digits use. At what point should to_i be
aware of those digits? If we decide that to_i should be aware of both ASCII
digits and fullwidth ASCII digits, shouldn't it also be aware of Arabic-Indic
digits (used for instance in Arabic, in the same positional system as ours)? What
about Devanagari digits (for Hindi), Tibetan digits, Mongolian digits, Thai
digits? While we're there, what about the Japanese kanji used as digits? What
about Roman numerals? Where should to_i stop being aware of the numeric nature
of the characters it receives? What happens when Unicode gets updated? And more
important: what do we do with alternative encodings? I'm not only talking about
other Unicode encodings besides UTF-8, but also the non-Unicode encodings,
especially those used for Asian languages.
This problem doesn't have a general solution, I'm afraid. One just can't account
for all the different cases...
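
What an application can do is pick, explicitly, which digit sets it cares
about. The following is only a sketch under assumptions: Ruby 1.9 or later,
UTF-8 strings, purely positional decimal digits, and a hypothetical helper
(digits_to_i) with a hand-picked table (DIGIT_ZEROS). Neither name is part of
Ruby, and the choice of scripts in the table is arbitrary, which is exactly the
point made above.

```ruby
# Code point of the digit zero for each script we choose to support.
# This list is an application-level decision, not a complete one.
DIGIT_ZEROS = [
  0x0030, # ASCII
  0xFF10, # fullwidth
  0x0660, # Arabic-Indic
  0x0966  # Devanagari
].freeze

# Hypothetical helper: map each supported digit to its value, then let
# to_i parse the resulting ASCII string.
def digits_to_i(str)
  ascii = str.each_char.map do |ch|
    cp = ch.ord
    zero = DIGIT_ZEROS.find { |z| cp.between?(z, z + 9) }
    raise ArgumentError, "unsupported digit: #{ch.inspect}" unless zero
    (cp - zero).to_s
  end
  ascii.join.to_i
end

digits_to_i("\uFF11\uFF17")  # fullwidth 1 7        => 17
digits_to_i("\u0663\u0664")  # Arabic-Indic 3 4     => 34
```

Anything outside the table (kanji numerals, Roman numerals, a script added in
a later Unicode version) raises ArgumentError here rather than being silently
read as 0.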
--
Christophe Grandsire.
http://rainbow.conlang.free.fr
It takes a straight mind to create a twisted conlang.