Re: Fast search for all positions in a string



On Jun 4, 5:06 pm, Jorge wrote:
On Jun 4, 11:58 am, czechboy wrote:
<snip>
txt.length= 4031250 : " Lorem ipsum dolor sit amet, c..."

"lorem" : 15625 : 121 mS
"IPSUM" : 31250 : 144 mS
"veni" : 0 : 128 mS
"iMac" : 0 : 125 mS
"CoNsEqUaT" : 15625 : 130 mS
"." : 93750 : 173 mS
" " : 625000 : 648 mS
"r" : 187500 : 258 mS
"iN" : 46875 : 156 mS
" LoReM iPsUm DoLoR " : 15625 : 146 mS
Done.

And this in IE8b/WinXP :

txt.length= 4031250 : " Lorem ipsum dolor sit amet, c..."

"lorem" : 15625 : 190 mS
"IPSUM" : 31250 : 310 mS
"veni" : 0 : 140 mS
"iMac" : 0 : 150 mS
"CoNsEqUaT" : 15625 : 230 mS
"." : 93750 : 491 mS
" " : 625000 : 2804 mS
"r" : 187500 : 530 mS
"iN" : 46875 : 331 mS
" LoReM iPsUm DoLoR " : 15625 : 250 mS
Done.

It addition to running timing tests in different browsers and on
different OSs it is also necessary to know the type of processor being
used for the tests and to perform the tests on different types of
processors before any useful picture of how meaningful the are can be
established. This is because processors are optimised in different
ways and so any tests on a single processor may say more about
artefacts of that processors design than about javascript or its
implementations. For example, using the same browser and the same OS
it was observed that on a P4 a regular expression text was faster than
direct string comparison (which is unintuitive given the relative
complexity of the two operations), while on a P3 the reverse was true.

In any event, given the OPs desire for case insensitive testing the
overheads of converting a very large mixed case string into a upper or
lower case normalised string is important for performance testing case
sensitive searching by string comparison.

The otherwise often problematic characteristic of Regular expression
where when the global flag is set the - lastIndex - property of the
regular expression instance is not re-set following a single match is
likely to exist for precisely this sort of situation, so here is a
regular expression version (which is case insensitive):-

<html>
<head>
<title></title>
</head>
<body>
<pre>
<script type="text/javascript">

var data = "XXXXXXX ... XXXXXXXXX "; // Long string

var testSt = 'xxxxxx'; // The substring to match.

var testStringLength = testSt.length;// To avoid re-resolving the
// property accessor in the loop.

/* Note that the sting used to construct the regular expression
really should be passed through an escaping process that
turns all characters that are significant in regular expressions
(such as '[' or '.') into (one of) their escape sequence
equivalents.
*/
var RX = new RegExp(testSt, 'gi') // Without the global flag
// this will loop forever.


var out = []; // The array that will contain
// the start indexes of all
// matches found
while(RX.test(data)){
/* The - lastIndex - property of the regular expression instance
is the index of the first character _after_ the match, and so
needs to be reduced by the length of the match in order to
find the beginning of the march in the original string. As
this regular expression is matching a fixed sequence of
characters it is valid to assume that the length of that
sequence is also the length of any match found.
*/
out.push((RX.lastIndex - testStringLength));
}

for(var c = 0,len = out.length;c < len;++c){
document.write(
(c+':\t'+out[c]+
'\t'+data.substring(out[c], (out[c]+testStringLength))
)+'\n'
);

}
</script>
</pre>
</body>
<html>
.



Relevant Pages

  • Re: RegExp irregularity in JScript
    ... we believe the VBScript Regular Expression class (version 1.0 through ... It does not however, limit the string minimum 4, maximum 8 characters. ... Obviously the first test should test the length of the string, minimum 4, ...
    (microsoft.public.scripting.jscript)
  • Re: Usename regex
    ... Think of a string, ... Regular expression benchmark ... MS MAX AVG MIN DEV INPUT ... If the textbox in question is limited to say 16 characters you'd ...
    (microsoft.public.dotnet.framework.aspnet)
  • Re: Regular Expression taking excessive CPU
    ... > regular expression adding so much time to the process, ... > ftIndex is a string variable that typically won't exceed 100 characters. ... static string RemoveNonAlpha1 ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Extracting Strings
    ... regular expression and php function that does it. ... I want to extract the data in the following string: ... The characters in the square brackets are the characters to match ...
    (alt.php)
  • Re: Get regular expression
    ... own tree structure. ... Expression compares a string character-by character, ... regular expression solution, which was about as close as one could get to ... the structure of the hierarchy can be inferred by using ...
    (microsoft.public.dotnet.languages.csharp)