Re: Unicode in Regex
- From: MonkeeSage <MonkeeSage@xxxxxxxxx>
- Date: Wed, 5 Dec 2007 19:07:11 -0800 (PST)
On Dec 5, 8:31 pm, Daniel DeLorme <dan...@xxxxxxxxx> wrote:
MonkeeSage wrote:
Here is a micro-benchmark on three common string operations (split,
index, length), using bytestrings and unicode regexp, versus native
utf-8 strings in 1.9.0 (release).
That's nice, but split and index do not operate using integer indexing
into the string, so they are rather irrelevant to the topic at hand.
Heh, if the topic at hand is only that indexing into a string is
slower with native utf-8 strings (don't disagree), then I guess it's
irrelevant. ;) Regarding the idea that you can do everything just as
efficiently with regexps that you can do with native utf-8
encoding...it seems relevant. In other words, it goes to show a
general behavior that benefits from a native implementation (the
same reason we're using native hashes rather than building our own
implementations out of arrays of pairs).
They produce the same results in ruby1.8, i.e. uni_split==reg_split and
uni_index==reg_index.
Yes. My point was to show how a native implementation of unicode
strings affects performance compared to using regular expressions on
bytestrings. The behavior should be the same (hence the asserts).
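As a rough illustration of the kind of comparison meant here (not the original benchmark script, which isn't reproduced in this message), the sketch below checks that regex-driven and native versions agree and then times one pair. The sample text and the variable names are made up for the example, and it assumes a Ruby where string literals are UTF-8, so both sides run on the same UTF-8 string rather than on a true 1.8 bytestring.

```ruby
# encoding: utf-8
require 'benchmark'

# Hypothetical sample data, not taken from the original benchmark.
text = "日本語のテキスト " * 2_000

# Regex-driven versions (the 1.8-style approach).
reg_chars  = text.scan(/./mu)                 # one array element per character
reg_length = reg_chars.size
m          = /本/u.match(text)
reg_index  = m ? m.pre_match.scan(/./mu).size : nil

# Native versions.
uni_chars  = text.chars.to_a
uni_length = text.length
uni_index  = text.index("本")

# The results should be identical (hence the asserts).
raise "chars differ"  unless reg_chars  == uni_chars
raise "length differs" unless reg_length == uni_length
raise "index differs"  unless reg_index  == uni_index

# Time one pair; absolute numbers depend on machine and Ruby version.
reg_time = Benchmark.realtime { 50.times { text.scan(/./mu).size } }
uni_time = Benchmark.realtime { 50.times { text.length } }
printf("reg_length %.6f\nuni_length %.6f\n", reg_time, uni_time)
```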
I also stated that the point of regex manipulation is to *obviate* the
need for methods like index and length. So a more accurate benchmark
might be something like:
reg_chars N/A N/A N/A ( N/A )
uni_chars 0.130000 0.000000 0.130000 ( 0.193307)
;-)
Someone just posted a question today about how to printf("%20s ...",
a, ...) when "a" contains unicode (it screws up the alignment since
printf only counts byte width, not character width). There is no
*elegant* solution in 1.8, regexps or otherwise. There are hackish
solutions (I provided one in that thread)...but the need was still
there. Another example is GtkTextView widgets from ruby-gtk2. They
deal with utf-8 in their C backend. So all the cursor functions that
deal with characters mean utf-8 characters, not bytestrings. So
without kludges, stuff doesn't always work right.
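For the printf case, one workaround of the sort I mean looks roughly like this (a sketch, not necessarily the one from that thread; char_count and rjust_chars are names made up for the example). It pads by hand based on the number of characters a /u regexp sees, instead of letting the format width count bytes. Note it counts characters, not terminal columns, so wide East Asian glyphs may still need extra handling.

```ruby
# encoding: utf-8

# Count characters with a UTF-8 regexp instead of relying on byte length.
def char_count(str)
  str.scan(/./mu).size
end

# Right-justify str in a field of `width` characters, like "%20s" is meant
# to, but computing the padding from the character count.
def rjust_chars(str, width)
  padding = [width - char_count(str), 0].max
  (" " * padding) + str
end

name = "日本語"
puts "[" + rjust_chars(name, 10) + "]"
puts "bytes: #{name.bytesize}, characters: #{char_count(name)}"
```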
Ps. BTW, in case there is any confusion, bytestrings aren't going
away; you can, as you see above, specify a magic encoding comment to
ensure that you have bytestrings by default.
Yes, it's still possible to access bytes but it's not possible to run a
utf8 regex on a bytestring if it contains extended characters:
$ ruby1.9 -ve '"abc" =~ /b/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
$ ruby1.9 -ve '"日本語" =~ /本/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
-e:1:in `<main>': character encodings differ (ArgumentError)
And that kinda kills my whole approach.
You can't use mixed encodings (not just in regexps, not anywhere).
You'd have to use a proposed-but-not-implemented-in-1.9.0-release
command-line switch to set your encoding to ascii (or whatever), or
else use a magic comment [1] like I did above. That or explicitly
encode both objects in the same encoding.
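The last option, putting both objects in the same encoding explicitly, might look something like the sketch below. It assumes String#force_encoding is available and that the mismatch surfaces as an exception; 1.9.0 reports ArgumentError as shown above, while later releases are assumed here to raise Encoding::CompatibilityError, so the rescue covers both. The variable names are made up for the example.

```ruby
# encoding: utf-8

# A bytestring holding UTF-8 data but tagged as binary.
bytes = "日本語".dup.force_encoding("ASCII-8BIT")

# Matching a UTF-8 regexp against it is the mixed-encoding case.
begin
  bytes =~ /本/u
  mixed_ok = true
rescue Encoding::CompatibilityError, ArgumentError
  mixed_ok = false
end

# Re-tag the same bytes as UTF-8 so string and regexp agree.
text = bytes.dup.force_encoding("UTF-8")
match_index = (text =~ /本/u)   # character offset of the match

puts "mixed encodings matched: #{mixed_ok}"
puts "same encoding match index: #{match_index.inspect}"
```

force_encoding only changes the tag on the string, it doesn't transcode anything, so this is only correct when the bytes really are valid UTF-8.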
Daniel
Regards,
Jordan
[1] http://www.ruby-forum.com/topic/127831