Re: Unicode in Regex



On Dec 5, 6:15 pm, Daniel DeLorme <dan...@xxxxxxxxx> wrote:
marc wrote:
Daniel DeLorme said...
MonkeeSage wrote:
Everything in ruby is a bytestring.
YES! And that's exactly how it should be. Who is it that spread the
flawed idea that strings are fundamentally made of characters?

Are you being ironic?

Not at all. By "fundamentally" I mean the fundamental, lowest level of
representation. If strings were fundamentally made of characters then we
wouldn't be able to access individual bytes because that's a lower level
than the fundamental level, which is by definition impossible.

If you are using UCS2 it makes sense to consider strings as arrays of
characters because that's what they are. But UTF8 strings do not follow
the characteristics of arrays at all. Each access into the "array" is
O(n) rather than O(1). So IMHO treating it as an array of characters is
a *very* leaky abstraction.

I agree that 99.9% of the time you want to deal with characters, and I
believe that in 99% of those cases you would be better served with regex
than this pretend "array" disguise.

Daniel
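
To make the O(n) point above concrete, here is a minimal sketch of why a character index and a byte offset diverge in UTF-8. It assumes a current Ruby (2.4 or later) with UTF-8 as the default source encoding, not the 1.9.0 build benchmarked in this post; the helper names char_widths and byte_offset are made up for illustration.

```ruby
# Byte width of each character: ASCII takes 1 byte, these kanji take 3.
def char_widths(str)
  str.each_char.map(&:bytesize)
end

# Offset of needle counted in bytes, found by searching binary copies
# of both strings (String#b returns an ASCII-8BIT copy).
def byte_offset(str, needle)
  str.b.index(needle.b)
end

s = "!日本語!"
p char_widths(s)           # width of every character, in bytes
p s.index("本")            # position counted in characters
p byte_offset(s, "本")     # position counted in bytes

# The byte offset is the sum of the widths of the preceding characters,
# which is why reaching character k means scanning everything before it.
p char_widths(s).first(s.index("本")).sum == byte_offset(s, "本")
```

Because the widths vary, there is no way to jump straight to the k-th character without looking at the bytes in front of it.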

Here is a micro-benchmark on three common string operations (split,
index, length), using bytestrings and unicode regexp, versus native
utf-8 strings in 1.9.0 (release).


$ ruby19 -v
ruby 1.9.0 (2007-10-15 patchlevel 0) [i686-linux]

$ echo && cat bench.rb
#!/usr/bin/ruby19
# -*- coding: ascii -*-

require "benchmark"
require "test/unit/assertions"
include Test::Unit::Assertions

$KCODE = "u"

$target = "!日本語!" * 100
$unichr = "本".force_encoding('utf-8')
$regchr = /[本]/u

def uni_split
  $target.split($unichr)
end
def reg_split
  $target.split($regchr)
end

def uni_index
  $target.index($unichr)
end
def reg_index
  $target =~ $regchr
end

def uni_chars
  $target.length
end
def reg_chars
  $target.unpack("U*").length
  # this is *a lot* slower
  # $target.scan(/./u).length
end

$target.force_encoding("ascii")
a = reg_split
$target.force_encoding("utf-8")
b = uni_split
assert_equal(a.length, b.length)

$target.force_encoding("ascii")
a = reg_index
$target.force_encoding("utf-8")
b = uni_index
assert_equal(a-2, b)

$target.force_encoding("ascii")
a = reg_chars
$target.force_encoding("utf-8")
b = uni_chars
assert_equal(a, b)

n = 10_000
Benchmark.bm(12) { | x |
  $target.force_encoding("ascii")
  x.report("reg_split") { n.times { reg_split } }
  $target.force_encoding("utf-8")
  x.report("uni_split") { n.times { uni_split } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_index") { n.times { reg_index } }
  $target.force_encoding("utf-8")
  x.report("uni_index") { n.times { uni_index } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_chars") { n.times { reg_chars } }
  $target.force_encoding("utf-8")
  x.report("uni_chars") { n.times { uni_chars } }
}

====

With caches initialized, and 5 prior runs, I got these numbers:

$ ruby19 bench.rb
                  user     system      total        real
reg_split     2.550000   0.010000   2.560000 (  2.799292)
uni_split     1.820000   0.020000   1.840000 (  2.026265)

reg_index     0.040000   0.000000   0.040000 (  0.097672)
uni_index     0.150000   0.000000   0.150000 (  0.202700)

reg_chars     0.790000   0.010000   0.800000 (  0.919995)
uni_chars     0.130000   0.000000   0.130000 (  0.193307)

====

So String#=~ with a bytestring and unicode regexp is faster than
String#index, taking roughly half the time (a factor of ~0.5). In the
other two cases, the opposite is true.
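
For anyone who wants the ratios spelled out, here is a small sketch that recomputes them from the "real" column above. The numbers are copied from that output; the constant REAL_TIMES and the method reg_over_uni are just labels chosen for this example.

```ruby
# Wall-clock ("real") seconds copied from the benchmark output above.
REAL_TIMES = {
  split: { reg: 2.799292, uni: 2.026265 },
  index: { reg: 0.097672, uni: 0.202700 },
  chars: { reg: 0.919995, uni: 0.193307 },
}

# Ratio of regexp-on-bytestring time to native utf-8 time.
# Below 1.0 means the regexp variant was faster.
def reg_over_uni(op)
  times = REAL_TIMES.fetch(op)
  times[:reg] / times[:uni]
end

REAL_TIMES.each_key do |op|
  puts format("%-6s reg/uni = %.2f", op, reg_over_uni(op))
end
```

Only the index case comes out below 1.0; a single run on one machine is of course not much to generalize from.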

Ps. BTW, in case there is any confusion, bytestrings aren't going
away; you can, as you see above, specify a magic encoding comment to
ensure that you have bytestrings by default. You can also explicitly
decode from utf-8 back to ascii, and you can get a byte enumerator (or
array from calling to_a on the enumerator) from String#bytes, and an
iterator from #each_byte, regardless of the encoding.
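
A rough sketch of those byte-level accessors, assuming a current Ruby (2.0 or later) rather than 1.9.0: there String#bytes returns an Array directly instead of an enumerator, and UTF-8 is the default source encoding. The helper name byte_view is hypothetical.

```ruby
# Return a copy of str tagged as binary (ASCII-8BIT), leaving the
# original string and its encoding untouched.
def byte_view(str)
  str.dup.force_encoding(Encoding::BINARY)
end

s = "日本"
p s.length                    # counted in characters
p s.bytes                     # the raw bytes as integers
p s.each_byte.to_a == s.bytes # same bytes via the #each_byte enumerator
p byte_view(s).length         # counted in bytes once tagged as binary
p s.encoding                  # the original is still UTF-8
```

force_encoding only changes the tag on the string; the underlying bytes are the same either way.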

Regards,
Jordan