Re: String trim (was JavaScript Functions)



Dr J R Stockton wrote:
In comp.lang.javascript message
<XOGdnfc7I6_uoAfUnZ2dnUVZ_g4LAAAA@giganews.com>,
Mon, 16 Feb 2009 23:30:43, kangax <kangax@xxxxxxxxx> posted:
Dr J R Stockton wrote:

On a 3GHz PC, XP sp3, FF3, the following takes perceptible but
insignificant time to list all non-matches to \S : it could perhaps be
done better.


Why not just use Richard's test, posted earlier in this thread? It
tests the client's \s against all of the whitespace characters (including
Unicode "space separators"). Doesn't it clearly demonstrate the
above-mentioned oddities?

Richard's test considers only the characters that he thinks should be
treated by \s as spaces, etc. Mine, much quicker to write, found all

That list seems very logical to me. /\s/ (CharacterClassEscape :: s) is clearly defined in ES3's 15.10.2.12. WhiteSpace (7.2), which /\s/ references, clearly lists all of the character code points. It also mentions Unicode space separators. Those space separators are also clearly defined in Unicode [1] under the White_Space section.
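
A spec-driven check of this kind might look roughly like the sketch below. It tests each code point that ES3 lists under WhiteSpace (7.2) and LineTerminator (7.3), plus the Zs space separators, against the engine's \s. The names `SPEC_WHITESPACE` and `findUnmatched` are invented for this illustration and are not Richard's test. The Zs entries reflect the Unicode data current when this thread was written; U+180E was later moved out of Zs, so a recent engine can be expected to report it.

```javascript
// Code points ES3 expects \s to match: WhiteSpace (7.2), LineTerminator (7.3),
// and the Unicode "Zs" space separators as listed when this thread was written.
// (Assumption: this list is hand-assembled for illustration.)
var SPEC_WHITESPACE = [
  0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x00A0,
  0x1680, 0x180E,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
  0x2007, 0x2008, 0x2009, 0x200A,
  0x2028, 0x2029, 0x202F, 0x205F, 0x3000
];

// Return the listed code points that the engine's \s fails to match.
function findUnmatched(codePoints) {
  var missed = [];
  for (var i = 0; i < codePoints.length; i++) {
    if (!/\s/.test(String.fromCharCode(codePoints[i]))) {
      missed.push(codePoints[i]);
    }
  }
  return missed;
}

// An empty array means \s covers every entry in the list on this engine.
var missed = findUnmatched(SPEC_WHITESPACE);
```

Note that this only catches characters \s fails to match; it says nothing about non-whitespace characters that \s matches by mistake.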

characters that don't match \S in the current browser (it now uses
S.match(/\s/g)). The tests are logically distinct.
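
A minimal sketch of this kind of exhaustive test follows. It is not the code referred to above: instead of running S.match(/\s/g) over one long string, it tests each UTF-16 code unit individually, and the names `listMatchedBySpace` and `toHex` are made up for the example.

```javascript
// Collect every UTF-16 code unit (0x0000-0xFFFF) that this engine's \s matches.
function listMatchedBySpace() {
  var hits = [];
  for (var cu = 0; cu <= 0xFFFF; cu++) {
    if (/\s/.test(String.fromCharCode(cu))) {
      hits.push(cu);
    }
  }
  return hits;
}

// Format a code unit as a four-digit hex string such as "0x00A0".
function toHex(cu) {
  var hex = cu.toString(16).toUpperCase();
  while (hex.length < 4) {
    hex = "0" + hex;
  }
  return "0x" + hex;
}

// Build a printable report; its contents depend on the engine running it.
var hits = listMatchedBySpace();
var report = [];
for (var i = 0; i < hits.length; i++) {
  report.push(toHex(hits[i]));
}
```

Because the result is whatever the current engine does, the list has to be compared against the specification (or against other browsers) by hand.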

If there is a character, such as
cp:"6158", codePoint:"0x180E", character :"\u180E",
name:"MONGOLIAN VOWEL SEPARATOR", group:"Zs"
that NO browser recognises, that's not much of a worry for coders
(unless handling Mongolian) since testing on any browser will give the
same result.

Doesn't it make more sense to base tests on specs, rather than on some vague subset of browsers? We can't really assert that "NO browser recognizes" "MONGOLIAN VOWEL SEPARATOR"; neither can we test "all browsers", can we?


Richard seems to be taking the attitude that whitespace should be all
that, and nothing more than, the Unicode standard says (because that is
what ISO/IEC 16262 says). That attitude is appropriate for writing and
testing JavaScript engines.

Mostly, though, the question should be "What does this application need
to consider as whitespace, and can I use \s for that?". In most cases,
all that is needed is [ \r\n], often with \t added; but having \v & \f
and lacking all others does no harm.
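
For an application that only cares about that small set, one option is to spell the set out rather than rely on \s at all. The sketch below assumes the application wants exactly space, tab, carriage return, and line feed; `trimAscii` is a hypothetical helper name.

```javascript
// Trim only the characters this (hypothetical) application treats as
// whitespace: space, tab, carriage return, line feed.
// The explicit character class avoids any dependence on the engine's \s.
function trimAscii(str) {
  return String(str).replace(/^[ \t\r\n]+|[ \t\r\n]+$/g, "");
}

var cleaned = trimAscii(" \t some input \r\n");
```

Characters outside the class, such as U+00A0, are deliberately left alone, so the behaviour is the same in every browser.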

Absolutely. It's all about context.

[snip]

On a side note, Firefox 3.1 beta 2 now has `String.prototype.trim` (as
well as, iirc, `trimLeft` and `trimRight`). Firefox's internal \s
representation fails to match some of the whitespace characters and also
erroneously matches some non-whitespace characters (as one can
see by running the above-mentioned test). This leads to
`String.prototype.trim` choking on those very same troublesome
characters.

Don't fall into the trap of believing that the set of characters one
needs to trim is equal to the set that is defined as whitespace.


I'm not : ) As it stands now, Firefox's \s is simply not ES3-compliant, and its deficiencies affect native `trim` (as that `trim` relies on \s).
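
One way to see whether a given `trim` is affected is to wrap a marker in each character of interest and check what survives. This is only a sketch: `untrimmedBy`, `nativeTrim`, and `SAMPLE` are names invented here, and the sample list is illustrative, not exhaustive.

```javascript
// Report which of the given code points a trim function leaves in place.
// `trimFn` takes a string and returns the trimmed string.
function untrimmedBy(trimFn, codePoints) {
  var left = [];
  for (var i = 0; i < codePoints.length; i++) {
    var ch = String.fromCharCode(codePoints[i]);
    if (trimFn(ch + "x" + ch) !== "x") {
      left.push(codePoints[i]);
    }
  }
  return left;
}

// Wrap the native method if the engine has one; otherwise there is
// nothing to check.
var nativeTrim = typeof String.prototype.trim === "function"
  ? function (s) { return String.prototype.trim.call(s); }
  : null;

// Sample only; extend with whatever code points matter to the application.
var SAMPLE = [0x0009, 0x0020, 0x00A0, 0x2028, 0x3000];
var leftBehind = nativeTrim ? untrimmedBy(nativeTrim, SAMPLE) : null;
```

A non-empty `leftBehind` would indicate characters that the native `trim` does not strip on the engine under test.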

[1] http://unicode.org/Public/UNIDATA/PropList.txt

--
kangax


