Re: Distance between two instances?
- From: russell kym horsell <kym@xxxxxxxxxxxxxxxxxxx>
- Date: Thu, 20 Jul 2006 09:05:13 GMT
Ted Dunning <ted.dunning@xxxxxxxxx> wrote:
[...]
Remember the original question. They stated they had discrete data.[...]
This often leads to problems with the naive application of the
Euclidean metric as proposed here.
It probably doesn't matter because the problem is "hard", anyway.
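To make the discrete-data problem concrete, here is a minimal sketch in
Python. The colour attribute and both integer codings are invented for
illustration; nothing like them appears in the thread. It shows only that
the Euclidean "nearest neighbour" of a nominal value depends on an
arbitrary choice of coding.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two equally arbitrary integer codings of the same nominal attribute
# (hypothetical data, for illustration only).
coding_a = {"red": 0, "green": 1, "blue": 2}
coding_b = {"red": 0, "blue": 1, "green": 2}

def nearest(query, candidates, coding):
    """Return the candidate whose code is closest to the query's code."""
    return min(candidates,
               key=lambda c: euclidean([coding[query]], [coding[c]]))

near_a = nearest("red", ["green", "blue"], coding_a)  # "green"
near_b = nearest("red", ["green", "blue"], coding_b)  # "blue"
```

Nothing about the data changed between the two calls; only the labelling
of the axis did.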
Here are some real-world examples of using utility functions (usually
a "distance" of some type in an abstract N-space where the axes are
nominal data rather than numbers anyway :). Have a giggle about
the implications.
(1).
Quite a few years ago some group tried to analyze some photos.
In the typical manner of the day, the photos were meant to be classified
as "containing a hidden military asset" or "not containing".
A set of photos containing said assets (partly hidden behind vegetation
or cam netting, etc) was data-mined for statistically significant features,
and an "average" model in feature-space extracted. A similar process
was used to extract an "average" not-an-asset dataset.
The idea was that a utility function was to be used -- based on
some vector norm/distance type thing -- to decide whether any new photo
(after relevant feature extraction) was "closer" to "has asset" than
"does not have asset".
Surprisingly, the 2 sets could be separated by a hyperplane and the
automaton created was believed to be very reliable.
However, the customer brought along some new photos, none of which
could be correctly classified by said automaton.
It later turned out the training photos had an unusual feature.
Most of the photos "with asset" had been taken on a particular sunny day;
those "without asset" had been taken on overcast days.
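The failure in (1) is easy to reproduce with a toy nearest-"average"
classifier. The sketch below is not the original system: the two features
(brightness, edge_score) and every number are made up, chosen so that
brightness tracks the weather rather than the asset.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(x, c_asset, c_empty):
    """Label x by whichever class average is nearer."""
    if euclidean(x, c_asset) < euclidean(x, c_empty):
        return "asset"
    return "no asset"

# Hypothetical features: (brightness, edge_score), both in [0, 1].
train_asset = [(0.90, 0.60), (0.80, 0.50), (0.85, 0.55)]  # all sunny
train_empty = [(0.20, 0.40), (0.30, 0.50), (0.25, 0.45)]  # all overcast

c_asset = centroid(train_asset)
c_empty = centroid(train_empty)

# The training set separates perfectly ...
train_ok = (all(classify(p, c_asset, c_empty) == "asset" for p in train_asset)
            and all(classify(p, c_asset, c_empty) == "no asset" for p in train_empty))

# ... but new photos with the weather swapped are both misclassified.
asset_on_overcast_day = (0.25, 0.60)
empty_on_sunny_day = (0.85, 0.40)
label_1 = classify(asset_on_overcast_day, c_asset, c_empty)  # "no asset"
label_2 = classify(empty_on_sunny_day, c_asset, c_empty)     # "asset"
```

The distance is dominated by the axis that happened to differ most in
training, which here is the weather.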
(2).
Quite a few years ago a research project tried to create an automaton
that could mark short-answer questions in economics. The idea was that
a training set of "model answers" would be data-mined, creating data points
on an N-dim feature space. New answers could then be feature-extracted and
matched against the model answers. Any new answer closer than a given
distance from one of the models was then called a "pass"; others were
evaluated "fail".
By this time I knew of the pitfalls of Arrow, and persisted in showing
that even after about a dozen changes to the utility functions/distance
metrics and feature sets involved, there were answers that were obviously
not right, but were marked "pass". After some mention of selling said
answers to interested 3rd parties, the project was abandoned.
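A sketch of the kind of marker described in (2), under loud assumptions:
the post does not say which features were used, so word counts are
assumed here, and the model answer, the candidate answers and the
threshold are all invented.

```python
import math
from collections import Counter

def features(text):
    """Assumed feature extraction: lower-cased word counts."""
    return Counter(text.lower().split())

def distance(a, b):
    """Euclidean distance between two word-count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def mark(answer, models, threshold):
    """'pass' if the answer is closer than threshold to any model answer."""
    fa = features(answer)
    if any(distance(fa, features(m)) < threshold for m in models):
        return "pass"
    return "fail"

# Hypothetical model answer and threshold.
models = ["higher interest rates reduce inflation"]
threshold = 1.0

# Same words, causality reversed: distance 0, so it passes.
wrong = mark("higher inflation rates reduce interest", models, threshold)

# A reasonable paraphrase shares one word with the model, so it fails.
right = mark("inflation falls when the central bank tightens policy",
             models, threshold)
```

Changing the features or the metric moves the hole; the anecdote's point
is that a dozen such changes did not close it.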
(3).
A few years ago a certain company was developing radar processing s/w.
The idea behind airborne radar is to keep track of up to (say) 1 dozen
targets at once. Each target is represented by a datapoint in about 10 dims.
After each pass of the radar beam, each "new" target must be matched up
against the "old" targets from the prev sweep. In the case I have in mind,
a distance metric was used to determine the "goodness of fit", and
the naive algorithm sought to minimise this using a questionable
numerical method that was shoe-horned onto the primitive available hardware.
Unfortunately, there was something wrong with the whole idea. :)
The symptom was that on some random occasion the minimum-distance
metric mis-matched targets in the old and new sweeps, thereby transforming
enemies into friendlies, and vice versa. At the time I posited the idea
of using a stable marriage type algorithm which at least guaranteed something
about performance. I was shouted down.
Safety tip: it may not be safe to fly when fighters with certain radar
equipment are in the air at the same time.
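For reference, a stable-marriage matcher of the sort suggested in (3) can
be sketched as below. This is an illustrative Gale-Shapley implementation,
not the code from that project: it assumes equal numbers of old and new
targets, uses 2-D points where the post mentions about 10 dims, and the
coordinates are made up. It needs Python 3.8+ for math.dist.

```python
import math

def gale_shapley(old, new):
    """Match old tracks to new detections so that no unmatched pair would
    both rather be matched to each other (a stable matching).
    Assumes len(old) == len(new). Returns {old_index: new_index}."""
    n = len(old)
    dist = [[math.dist(o, d) for d in new] for o in old]
    # Each old track ranks the detections from nearest to farthest.
    prefs = [sorted(range(n), key=lambda j: dist[i][j]) for i in range(n)]
    next_choice = [0] * n   # position in prefs[i] of i's next proposal
    holder = [None] * n     # holder[j] = track currently holding detection j
    free = list(range(n))   # tracks that still need a detection
    while free:
        i = free.pop()
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        k = holder[j]
        if k is None:
            holder[j] = i               # detection j was unclaimed
        elif dist[i][j] < dist[k][j]:
            holder[j] = i               # j prefers the nearer track i
            free.append(k)
        else:
            free.append(i)              # rejected; i tries its next choice
    return {holder[j]: j for j in range(n)}

# Hypothetical 2-D positions from two consecutive sweeps.
old_tracks = [(0, 0), (10, 0), (0, 10)]
new_detections = [(9, 1), (1, 1), (1, 9)]
match = gale_shapley(old_tracks, new_detections)
# track 0 -> detection 1, track 1 -> detection 0, track 2 -> detection 2
```

The only guarantee is stability: the loop always terminates with a
complete matching that has no blocking pair. It says nothing about
whether the matching is the physically correct one.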
(4).
A couple of years ago someone wanted to match up "buyers" with
"sellers". Each buyer and seller had nominated a set of features,
each on a scale of 1 through 5. The idea was to find all those matches
that were better than "65%". Said customer was informed that "65%"
was a bit of a fuzzy concept. The response was shock. Weren't we a professional
outfit? Hadn't we gone to kindergarten?
But given a small sample of prospective data, it was shown that
suitably-chosen distance metrics -- all of them "obvious" in some sense --
could order the data from closest to furthest in any way whatever.
Which of the data points was "65% closer" was completely arbitrary.
Perhaps they could just display a number of random data points and charge
the customer anyway -- it would be easier than burning more research budget.
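A small sketch of the effect in (4), with made-up data: one buyer and
three sellers, six features each on the 1 through 5 scale. The three
metrics are the usual L1, L2 and L-infinity norms; turning a distance
into a "percent match" by dividing by the largest distance possible on
the scale is an assumption of this sketch, not something from the post.

```python
import math

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linf(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical buyer and sellers, each feature scored 1..5.
buyer = (2, 2, 2, 2, 2, 2)
sellers = {
    "A": (5, 3, 2, 2, 2, 2),
    "B": (4, 3, 2, 2, 2, 2),
    "C": (3, 3, 3, 3, 3, 3),
}
LOW, HIGH = (1,) * 6, (5,) * 6   # extremes of the scale

def ranking(metric):
    """Sellers ordered from closest to furthest under the given metric."""
    return sorted(sellers, key=lambda s: metric(buyer, sellers[s]))

def percent_match(metric, name):
    """Assumed normalisation: 100% at distance 0, 0% at the scale maximum."""
    return 100 * (1 - metric(buyer, sellers[name]) / metric(LOW, HIGH))

def better_than(metric, cutoff=65):
    return [s for s in sorted(sellers) if percent_match(metric, s) >= cutoff]

order_l1 = ranking(l1)      # ['B', 'A', 'C']
order_l2 = ranking(l2)      # ['B', 'C', 'A']
order_linf = ranking(linf)  # ['C', 'B', 'A']
```

With these numbers all three sellers clear 65% under L1 and L2, while
only C does under L-infinity. Three metrics give three orderings here;
weighted variants would give still more.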
[ comp.ai is moderated ... your article may take a while to appear. ]