any clever vectorising idea?
- From: <ross@xxxxxxxxxx>
- Date: 5 Jul 2006 16:24:09 +0800
I want to seach long numeric vectors (taking up much of memory) for exact
subsequence matches. Matlab doesn't have a built-in solution as far as I know.
Is there a better way than my attempts to date (see below)?
Thanks
ross.
==========================
1. STRING: If the vector is a string, then STRFIND makes it easy:
idx = strfind(x, s);
On my PC, this takes 0.31 secs for length(x) = 1e8, length(s) = 20.
Furthermore, large numbers of matches do not affect the execution time much.
Problem: doesn't work directly on real values.
2. NUMERIC: LOOP MATCHING FULL SUBSEQUENCE
The most "obvious" way is a loop that matches the whole subsequence at each
offset:
idx = []; n = length(s)-1;
for i = 1:N-n
if all( x(i:i+n) == s ); idx = [idx i]; break; end
end
Problem: unbearably slow (980 secs) even when the IF clause is never executed
(ie, no matches).
3. NUMERIC: BUILT-IN SEARCH FOR 1ST ELEMENT THEN LOOP TO VALIDATE
How to get rid of the FOR loop? I can use FIND to search for s1, the first
element of s, then check for the full subsequence match wherever s1 is matched:
j = find(x == s1);
if ~isempty(j)
% use loop like #1 to check for s, beginning at each index j(i)
end
Problem: Degrades badly when the number of hits rises. Quite efficient when
s1 is rare (1.8 secs) but becomes unuseably slow when s1 is even moderately
common (eg, 1in 1000). Unfortunately, in many quantised data sets, s1 will be
that common.
4. SCALE x TO A STRING, AND USE STRFIND.
There seems to be no real alternative to doing all phases of the search in
compiled code. I couldn't find a numeric equivalent to STRFIND. Therefore I
have resorted to a routine which converts x to a string and uses STRFIND. This
is very efficient, but it does depend on the assumption that the x values are
evenly spread on their range, so that they do not all map onto a small subset
of chars on string conversion. IF this assumption is met, the loss of
resolution on conversion doesn't matter, because the number of permutations of
a substring is so high that false matches are very unlikely. For searching
pseudorandom number sequences up to the limit of memory, it works very well.
Problems:
(1) conversion to char needed -- memory constraint if x fills most of memory.
(2) loss of resolution a problem if x contains repetitive nearly identical
structures.
5. MULTIPLICATIVE METHODS: FIR filtering, xcorr, etc.
Don't work because of requirement for exact match. Xcorr of a scaled-up,
similar sequence will give a larger output than the correct sequence.
I still feel there might be better ways. Any other suggestions?
Thanks
Ross
.
- Follow-Ups:
- Re: any clever vectorising idea?
- From: us
- Re: any clever vectorising idea?
- From: Jos
- Re: any clever vectorising idea?
- From: riccardo
- Re: any clever vectorising idea?
- From: Rune Allnor
- Re: any clever vectorising idea?
- From: Richie Cotton
- Re: any clever vectorising idea?
- From: Raul
- Re: any clever vectorising idea?
- Prev by Date: Re: Generating periodic signal
- Next by Date: Re: view variables
- Previous by thread: 3D array help
- Next by thread: Re: any clever vectorising idea?
- Index(es):
Relevant Pages
|
Loading