Re: Unicode in Regex
- From: MonkeeSage <MonkeeSage@xxxxxxxxx>
- Date: Wed, 5 Dec 2007 17:23:47 -0800 (PST)
On Dec 5, 6:15 pm, Daniel DeLorme <dan...@xxxxxxxxx> wrote:
marc wrote:
Daniel DeLorme said...
MonkeeSage wrote:
Everything in ruby is a bytestring.
YES! And that's exactly how it should be. Who is it that spread the
flawed idea that strings are fundamentally made of characters?
Are you being ironic?
Not at all. By "fundamentally" I mean the fundamental, lowest level of
representation. If strings were fundamentally made of characters then we
wouldn't be able to access individual bytes because that's a lower level
than the fundamental level, which is by definition impossible.
If you are using UCS2 it makes sense to consider strings as arrays of
characters because that's what they are. But UTF8 strings do not follow
the characteristics of arrays at all. Each access into the "array" is
O(n) rather than O(1). So IMHO treating it as an array of characters is
a *very* leaky abstraction.
I agree that 99.9% of the time you want to deal with characters, and I
believe that in 99% of those cases you would be better served with regex
than this pretend "array" disguise.
Daniel
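The O(n) point above can be illustrated with a short sketch. It assumes Ruby 1.9 or later, where a String carries an encoding; `byte_offset_of_char` is a hypothetical helper written for this illustration, not a description of how Ruby implements String#[] internally.

```ruby
# encoding: utf-8
# Sketch only: shows why locating the n-th character of a UTF-8 string
# needs a walk from the start, since characters have variable byte widths.
SAMPLE = "!日本語!"

CHAR_COUNT = SAMPLE.length    # number of characters
BYTE_COUNT = SAMPLE.bytesize  # number of bytes ("!" is 1 byte, each kanji is 3)

# Hypothetical helper: byte offset at which the n-th character starts.
# It visits every preceding character, so the cost grows with n.
def byte_offset_of_char(str, n)
  offset = 0
  str.each_char.with_index do |ch, i|
    return offset if i == n
    offset += ch.bytesize
  end
  nil # n is past the end of the string
end

puts "chars=#{CHAR_COUNT} bytes=#{BYTE_COUNT}"
puts "char 2 starts at byte #{byte_offset_of_char(SAMPLE, 2)}"
```

With a fixed-width encoding the same offset would be a single multiplication, which is the contrast being drawn with UCS2.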
Here is a micro-benchmark of three common string operations (split,
index, length), using bytestrings with a unicode regexp versus native
utf-8 strings in 1.9.0 (release).
$ ruby19 -v
ruby 1.9.0 (2007-10-15 patchlevel 0) [i686-linux]
$ echo && cat bench.rb
#!/usr/bin/ruby19
# -*- coding: ascii -*-
require "benchmark"
require "test/unit/assertions"
include Test::Unit::Assertions
$KCODE = "u"
$target = "!日本語!" * 100
$unichr = "本".force_encoding('utf-8')
$regchr = /[本]/u
def uni_split
  $target.split($unichr)
end
def reg_split
  $target.split($regchr)
end
def uni_index
  $target.index($unichr)
end
def reg_index
  $target =~ $regchr
end
def uni_chars
  $target.length
end
def reg_chars
  $target.unpack("U*").length
  # this is *a lot* slower
  # $target.scan(/./u).length
end
$target.force_encoding("ascii")
a = reg_split
$target.force_encoding("utf-8")
b = uni_split
assert_equal(a.length, b.length)
$target.force_encoding("ascii")
a = reg_index
$target.force_encoding("utf-8")
b = uni_index
assert_equal(a-2, b)
$target.force_encoding("ascii")
a = reg_chars
$target.force_encoding("utf-8")
b = uni_chars
assert_equal(a, b)
n = 10_000
Benchmark.bm(12) { | x |
  $target.force_encoding("ascii")
  x.report("reg_split") { n.times { reg_split } }
  $target.force_encoding("utf-8")
  x.report("uni_split") { n.times { uni_split } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_index") { n.times { reg_index } }
  $target.force_encoding("utf-8")
  x.report("uni_index") { n.times { uni_index } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_chars") { n.times { reg_chars } }
  $target.force_encoding("utf-8")
  x.report("uni_chars") { n.times { uni_chars } }
}
====
With caches initialized, and 5 prior runs, I got these numbers:
$ ruby19 bench.rb
                  user     system      total        real
reg_split     2.550000   0.010000   2.560000 (  2.799292)
uni_split     1.820000   0.020000   1.840000 (  2.026265)
reg_index     0.040000   0.000000   0.040000 (  0.097672)
uni_index     0.150000   0.000000   0.150000 (  0.202700)
reg_chars     0.790000   0.010000   0.800000 (  0.919995)
uni_chars     0.130000   0.000000   0.130000 (  0.193307)
====
So String#=~ with a bytestring and unicode regexp takes roughly half
the real time of String#index (a ratio of ~0.5). In the other two
cases, the opposite is true: the native utf-8 string is faster.
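The index check in the benchmark subtracts 2 from the regexp result because a search on a bytestring reports a byte offset, while String#index on a utf-8 string reports a character offset. Here is a minimal sketch of that relationship, assuming Ruby 2.0 or later for String#b; the constant names are illustrative and not part of bench.rb.

```ruby
# encoding: utf-8
# Sketch only: compares character offsets and byte offsets for the same match.
TARGET = "!日本語!" * 100
NEEDLE = "本"

# Character offset: counts "!" and "日" as one position each.
CHAR_INDEX = TARGET.index(NEEDLE)

# Byte offset: String#b returns a binary (ASCII-8BIT) copy, so index counts bytes.
BYTE_INDEX = TARGET.b.index(NEEDLE.b)

# Convert the byte offset back to a character offset by measuring the prefix.
CONVERTED = TARGET.byteslice(0, BYTE_INDEX).length

puts "char index: #{CHAR_INDEX}, byte index: #{BYTE_INDEX}, converted: #{CONVERTED}"
```

The difference of 2 holds for this particular target because one 3-byte character precedes the match; it is not a general constant.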
P.S. In case there is any confusion, bytestrings aren't going away;
as you can see above, you can specify a magic encoding comment to
ensure that you have bytestrings by default. You can also explicitly
decode from utf-8 back to ascii, and you can get a byte enumerator (or
an array, by calling to_a on the enumerator) from String#bytes, and an
iterator from #each_byte, regardless of the encoding.
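A small sketch of those byte-level accessors, assuming Ruby 1.9 or later. The constant names are illustrative. Calling to_a on the result of String#bytes is used so the code behaves the same whether #bytes returns an enumerator or an array.

```ruby
# encoding: utf-8
# Sketch only: byte access works the same whatever encoding the string is tagged with.
WORD = "日本語"

# String#bytes, normalized to an array of integers.
BYTES = WORD.bytes.to_a

# String#each_byte yields the same integers one at a time.
COLLECTED = []
WORD.each_byte { |b| COLLECTED << b }

# Retagging a copy as binary changes how it is interpreted, not the bytes themselves.
BINARY = WORD.dup.force_encoding("BINARY")

puts "chars=#{WORD.length} bytes=#{BYTES.length} binary length=#{BINARY.length}"
```

Note that force_encoding only changes the encoding tag; it does not transcode the underlying bytes.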
Regards,
Jordan