any clever vectorising idea?



I want to seach long numeric vectors (taking up much of memory) for exact
subsequence matches. Matlab doesn't have a built-in solution as far as I know.
Is there a better way than my attempts to date (see below)?

Thanks
ross.

==========================

1. STRING: If the vector is a string, then STRFIND makes it easy:
idx = strfind(x, s);
On my PC, this takes 0.31 secs for length(x) = 1e8, length(s) = 20.
Furthermore, large numbers of matches do not affect the execution time much.

Problem: doesn't work directly on real values.

2. NUMERIC: LOOP MATCHING FULL SUBSEQUENCE
The most "obvious" way is a loop that matches the whole subsequence at each
offset:
idx = []; n = length(s)-1;
for i = 1:N-n
if all( x(i:i+n) == s ); idx = [idx i]; break; end
end

Problem: unbearably slow (980 secs) even when the IF clause is never executed
(ie, no matches).

3. NUMERIC: BUILT-IN SEARCH FOR 1ST ELEMENT THEN LOOP TO VALIDATE
How to get rid of the FOR loop? I can use FIND to search for s1, the first
element of s, then check for the full subsequence match wherever s1 is matched:
j = find(x == s1);
if ~isempty(j)
% use loop like #1 to check for s, beginning at each index j(i)
end

Problem: Degrades badly when the number of hits rises. Quite efficient when
s1 is rare (1.8 secs) but becomes unuseably slow when s1 is even moderately
common (eg, 1in 1000). Unfortunately, in many quantised data sets, s1 will be
that common.

4. SCALE x TO A STRING, AND USE STRFIND.
There seems to be no real alternative to doing all phases of the search in
compiled code. I couldn't find a numeric equivalent to STRFIND. Therefore I
have resorted to a routine which converts x to a string and uses STRFIND. This
is very efficient, but it does depend on the assumption that the x values are
evenly spread on their range, so that they do not all map onto a small subset
of chars on string conversion. IF this assumption is met, the loss of
resolution on conversion doesn't matter, because the number of permutations of
a substring is so high that false matches are very unlikely. For searching
pseudorandom number sequences up to the limit of memory, it works very well.

Problems:
(1) conversion to char needed -- memory constraint if x fills most of memory.
(2) loss of resolution a problem if x contains repetitive nearly identical
structures.

5. MULTIPLICATIVE METHODS: FIR filtering, xcorr, etc.

Don't work because of requirement for exact match. Xcorr of a scaled-up,
similar sequence will give a larger output than the correct sequence.

I still feel there might be better ways. Any other suggestions?

Thanks
Ross

.



Relevant Pages

  • Re: Fast string operations
    ... worried about string performance. ... > is why people use unsafe code now and then to use pointer arithmetic to loop ... > I'm aware that looping has to occur one way or the other, but with Trim, ... The customer perceives this as a memory leak. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Fast string operations
    ... > is why people use unsafe code now and then to use pointer arithmetic to ... > loop over arrays without all the unnecessary bounds checking. ... don't return the trimmed string". ... The customer perceives this as a memory leak. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: any clever vectorising idea?
    ... STRING: If the vector is a string, ... NUMERIC: LOOP MATCHING FULL SUBSEQUENCE ... SCALE x TO A STRING, AND USE STRFIND. ...
    (comp.soft-sys.matlab)
  • Re: Why the following code is faster executed when placed in the form?
    ... everything that is not changed inside the loop must not be calculated there. ... Some of the characters you use are predefined as ... It allocates memory for Chr/in your case you can add call to Chr/. ... Now imaging if you build extremely big string, ...
    (microsoft.public.vb.general.discussion)
  • Re: Fast string operations
    ... Looping over characters like that can't slow things down that much. ... iteration through the string. ... a loop is going to start from the beginning of the ... the original issue was the amount of memory that is being consumed ...
    (microsoft.public.dotnet.languages.csharp)