Re: Unicode in Regex



On Dec 5, 8:31 pm, Daniel DeLorme <dan...@xxxxxxxxx> wrote:
MonkeeSage wrote:
Here is a micro-benchmark on three common string operations (split,
index, length), using bytestrings and unicode regexp, versus native
utf-8 strings in 1.9.0 (release).

That's nice, but split and index do not operate using integer indexing
into the string, so they are rather irrelevant to the topic at hand.

Heh, if the topic at hand is only that indexing into a string is
slower with native utf-8 strings (don't disagree), then I guess it's
irrelevant. ;) Regarding the idea that you can do everything just as
efficiently with regexps that you can do with native utf-8
encoding...it seems relevant. In other words, it goes to show a
general behavior that is benefited by a native implementation (the
same reason we're using native hashes rather than building our own
implementations out of arrays of pairs).

They produce the same results in ruby1.8, i.e. uni_split==reg_split and
uni_index==reg_index.

Yes. My point was to show how a native implementation of unicode
strings affects performance compared to using regular expressions on
bytestrings. The behavior should be the same (hence the asserts).
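For readers without the original benchmark at hand, here is a minimal sketch of the kind of equivalence those asserts check. It is not the benchmark code itself; the variable names and sample text are made up for illustration, and it assumes a Ruby (1.9 or later) where string literals in a UTF-8 source file are UTF-8 strings.

```ruby
# Sketch only: regexp-based character operations next to the native
# string methods. Names and sample text are illustrative assumptions.
text = "日本語 text"

# Length: count matches of "any character" in UTF-8 mode vs. String#length.
reg_length = text.scan(/./mu).size
uni_length = text.length

# Index: count the characters before the first match vs. String#index.
m = /本/u.match(text)
reg_index = m ? m.pre_match.scan(/./mu).size : nil
uni_index = text.index("本")

# Split into characters: empty UTF-8 regexp vs. String#chars.
reg_split = text.split(//u)
uni_split = text.chars.to_a

raise "length differs" unless reg_length == uni_length
raise "index differs"  unless reg_index == uni_index
raise "split differs"  unless reg_split == uni_split
```

Both routes give the same answers; the benchmark question is only which one gets there faster.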

I also stated that the point of regex manipulation is to *obviate* the
need for methods like index and length. So a more accurate benchmark
might be something like:
reg_chars       N/A       N/A       N/A (      N/A)
uni_chars  0.130000  0.000000  0.130000 (  0.193307)
;-)

Someone just posted a question today about how to printf("%20s ...",
a, ...) when "a" contains unicode (it screws up the alignment since
printf only counts byte width, not character width). There is no
*elegant* solution in 1.8, regexps or otherwise. There are hackish
solutions (I provided one in that thread)...but the need was still
there. Another example is GtkTextView widgets from ruby-gtk2. They
deal with utf-8 in their C backend. So all the cursor functions that
deal with characters mean utf-8 characters, not bytestrings. So
without kludges, stuff doesn't always work right.
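To make the printf problem concrete, here is a rough sketch of one such workaround. It is not the solution posted in the other thread; pad_chars is a hypothetical helper name. It counts UTF-8 code points with unpack("U*") instead of bytes and pads by hand. It assumes the input is valid UTF-8 and that every character is one column wide, which is not true for wide East Asian glyphs in most terminals.

```ruby
# Hypothetical helper (not from the thread): right-align +str+ in a field
# of +width+ characters, counting UTF-8 code points rather than bytes.
def pad_chars(str, width)
  char_count = str.unpack("U*").length    # one integer per code point
  padding = [width - char_count, 0].max   # never negative
  (" " * padding) + str
end

# Usage: pad first, then print with a plain %s so printf does no counting.
name = "日本語"
line = "%s|" % pad_chars(name, 6)
puts line
```

The kludge is that every formatted field needs this treatment by hand, which is the kind of thing a native implementation makes unnecessary.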

Ps. BTW, in case there is any confusion, bytestrings aren't going
away; you can, as you see above, specify a magic encoding comment to
ensure that you have bytestrings by default.

Yes, it's still possible to access bytes but it's not possible to run a
utf8 regex on a bytestring if it contains extended characters:

$ ruby1.9 -ve '"abc" =~ /b/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
$ ruby1.9 -ve '"日本語" =~ /本/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
-e:1:in `<main>': character encodings differ (ArgumentError)

And that kinda kills my whole approach.

You can't use mixed encodings (not just in regexps, not anywhere).
You'd have to use a proposed-but-not-implemented-in-1.9.0-release
command line switch to set your encoding to ascii (or whatever), or
else use a magic comment [1] like I did above. That or explicitly
encode both objects in the same encoding.
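A small sketch of that last option, explicitly putting both objects in the same encoding with String#force_encoding. The variable names are illustrative. It assumes a later 1.9-series or newer Ruby, where the mismatch is reported as Encoding::CompatibilityError rather than the ArgumentError shown in the 1.9.0 output above.

```ruby
# Sketch only: a UTF-8 regexp against the same bytes tagged two ways.
pattern = /本/u

# The bytes of a UTF-8 string, retagged as a plain bytestring.
bytes = "日本語".dup.force_encoding(Encoding::BINARY)

mixed_result =
  begin
    bytes =~ pattern              # encodings differ, so this raises
  rescue EncodingError => e
    e.class
  end

# Retag the same bytes as UTF-8 (no transcoding happens) and match again.
same = bytes.dup.force_encoding(Encoding::UTF_8)
same_result = same =~ pattern     # character index of the match

puts "mixed encodings: #{mixed_result.inspect}"
puts "same encoding:   #{same_result.inspect}"
```

Note that force_encoding only changes the label on the bytes, so this is only appropriate when the bytestring really does hold UTF-8 data.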

Daniel

Regards,
Jordan

[1] http://www.ruby-forum.com/topic/127831