Re: Finding similar entries



"John D'Errico" <woodchips@xxxxxxxxxxxxxxxx> wrote in
message <fq17qb$2b2$1@xxxxxxxxxxxxxxxxxx>...
"Daniel " <daniel4738@xxxxxxxxxxx> wrote in message
<fq0ple$qos$1@xxxxxxxxxxxxxxxxxx>...
I have a problem I can't seem to find the solution to.
It's relatively easy.

I have a collection of 25000 observations of 8 variables.

I want to find entries which are similar to each other.
There must be an easy way, can someone perhaps suggest
something?

i.e. each entry is a galaxy with 8 parameters, I want to
find a galaxy which has similar properties to one I select.

The simple solution is to compute an interpoint
distance matrix. There are several such tools on
the file exchange, or use pdist from the stats TB.
But these will fail on a 25000 point set.

I've written a code that allows you to find only
those distances below some limit, or only the
single nearest neighbor. I'd been planning on
putting it on the file exchange when I got a
round tuit. I'll do so today. E-mail me if you
want it sooner.

John

If you only need the similarity to the one entry you have
selected, then you don't need the full interpoint distance
matrix. The distances to the selected entry are enough.
If x is the 25000-by-8 data and myx is the selected entry (1-by-8):

dist = bsxfun(@minus,x,myx);
% Euclidean distance, for example
Ed = sqrt(sum(dist.^2,2));

Then it is up to you to decide what is close enough. Sort with
[sEd,ind] = sort(Ed);
and pick the small ones. If myx is itself a row of x, it comes
first in the sorted list, at distance 0.
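Put together, the steps above could look like the following minimal sketch. The random matrix only stands in for the real galaxy table, and the row number 123 and the neighbour count k = 10 are arbitrary values chosen for illustration, not anything from the original data.

```matlab
% Synthetic stand-in for the real data (assumption: a 25000-by-8 numeric matrix)
rng(0);
x = randn(25000, 8);
sel = 123;                 % hypothetical row number of the selected entry
myx = x(sel, :);           % the selected entry, 1-by-8

% Difference between every row of x and the selected row
d = bsxfun(@minus, x, myx);

% Euclidean distance from each entry to the selected one (25000-by-1)
Ed = sqrt(sum(d.^2, 2));

% Sort ascending; the selected entry itself comes first, at distance 0
[sEd, ind] = sort(Ed);

k = 10;                    % hypothetical number of neighbours to keep
nearestIdx  = ind(2:k+1);  % row numbers of the k most similar entries
nearestDist = sEd(2:k+1);  % their distances, smallest first
```

This needs only one 25000-by-1 vector of distances rather than a full interpoint matrix. If the 8 parameters are on very different scales, it may be worth standardizing the columns first, otherwise the largest-valued parameter dominates the Euclidean distance.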

Another solution is k-means clustering, which does not need
the huge interpoint distance matrix. It is included in the
Statistics Toolbox. If you don't have that, a free k-means is
also available, at least in the SOM Toolbox:
www.cis.hut.fi/projects/somtoolbox/
But even if two points are in the same cluster, they are not
necessarily very similar. You would still need to calculate
the similarity somehow, so I would go for the direct distance
measure.
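For comparison, here is a rough sketch of the clustering route, assuming the Statistics Toolbox kmeans function is available. The data are again synthetic, and the number of clusters (5) and the selected row (123) are made-up values. It also shows why clustering alone does not finish the job: the members of the selected entry's cluster still have to be ranked by a direct distance.

```matlab
% Synthetic stand-in for the real data (assumption: an n-by-8 numeric matrix)
rng(0);
x = randn(2000, 8);
sel = 123;                        % hypothetical row number of the selected entry
nClusters = 5;                    % hypothetical number of clusters

% Cluster label for every row (kmeans is from the Statistics Toolbox)
idx = kmeans(x, nClusters);

% Rows that share a cluster with the selected entry
members = find(idx == idx(sel));

% Same cluster does not mean very similar, so rank the members
% by their Euclidean distance to the selected entry
d = bsxfun(@minus, x(members, :), x(sel, :));
Ed = sqrt(sum(d.^2, 2));
[sEd, order] = sort(Ed);
ranked = members(order);          % ranked(1) is the selected entry itself
```

Note that a true nearest neighbour can sit just across a cluster boundary and be missed this way, which is another reason to prefer the direct distance when only one entry is of interest.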


