Re: Premature end of regular expression with non-ascii chara



Nick Snels wrote:
> Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
> is that I would like to work in UTF-8, but I have to read in files. And
> these files are often (almost always) in ISO-8859-1. And I haven't found
> a way of converting these strings to Unicode in Ruby. é and è etc. form
> part of ISO-8859-1.

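I have to deal with similar problems when processing the infamous German
umlauts äöü. My solution has been to convert a string from Latin-1 (or
Latin-9, i.e. ISO-8859-15) to UTF-8 with
utf8_string = latin1_string.unpack("C*").pack("U*")

and the other way round with
latin1_string = utf8_string.unpack("U*").pack("C*")

This works because every Latin-1 byte value is also its Unicode code
point. For ISO-8859-15 the same holds except for the eight positions
that differ from Latin-1 (the euro sign, for example). It has worked for
me so far and does not require any changes to the environment.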
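
For anyone who wants to try it, here is a minimal sketch of the two one-liners wrapped in methods. The method names latin1_to_utf8 and utf8_to_latin1 are just illustrative, and the range check in the reverse direction is my own addition, assuming you would rather get an error than a wrong byte when a character has no Latin-1 equivalent.

```ruby
# Hypothetical helper names; only the unpack/pack calls come from the post.

# Latin-1 bytes map one-to-one onto Unicode code points 0..255, so each
# byte ("C") can be re-packed as a UTF-8 encoded character ("U").
def latin1_to_utf8(latin1_string)
  latin1_string.unpack("C*").pack("U*")
end

# Reverse direction: decode UTF-8 into code points, then write each one
# back as a single byte. Code points above 255 have no Latin-1 byte, so
# they are rejected explicitly (an assumption, not part of the one-liner).
def utf8_to_latin1(utf8_string)
  codepoints = utf8_string.unpack("U*")
  if codepoints.any? { |cp| cp > 255 }
    raise ArgumentError, "string contains characters outside ISO-8859-1"
  end
  codepoints.pack("C*")
end

latin1 = [0xE9, 0xE8].pack("C*")    # the bytes for "é" and "è" in ISO-8859-1
utf8   = latin1_to_utf8(latin1)

p utf8.unpack("C*")                 # => [195, 169, 195, 168]
p utf8_to_latin1(utf8) == latin1    # => true
```

Each accented character takes two bytes after conversion, which is what the first output line shows. The second line confirms that the round trip returns the original bytes.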
HTH,
Lars