Re: Distance between two instances?



Ted Dunning <ted.dunning@xxxxxxxxx> wrote:
[...]
Remember the original question. They stated they had discrete data.
This often leads to problems with the naive application of the
Euclidean metric as proposed here.
[...]

It probably doesn't matter because the problem is "hard", anyway.


Here are some real-world examples of using utility functions (usually
a "distance" of some type in an abstract N-space where the axes are
nominal data rather than numbers anyway :). Have a giggle about
the implications.


(1).
Quite a few years ago some group tried to analyze some photos.
In the typical manner of the day, the photos were meant to be classified
as "containing a hidden military asset" or "not containing".

A set of photos containing said assets (partly hidden behind vegetation
or cam netting, etc) was data-mined for statistically significant features,
and an "average" model in feature-space extracted. A similar process
was used to extract an "average" not-an-asset dataset.
The idea was that a utility function was to be used -- based on
some vector norm/distance type thing -- to decide whether any new photo
(after relevant feature extraction) was "closer" to "has asset" than
"does not have asset".
Surprisingly, the 2 sets could be separated by a hyperplane and the
automaton created was believed to be very reliable.

However, the customer brought along some new photos, none of which
could be correctly classified by said automaton.

It later turned out the training photos had an unusual feature.
Most of the "with asset" photos had been taken on a particular sunny day;
those "without asset" had been taken on overcast days.
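
The failure is easy to reproduce in miniature. Below is a minimal sketch
of a nearest-"average" classifier of the kind described above. The two
features (mean brightness, edge density) and every number are invented
for illustration; the original feature set is not known.

```python
import math

def centroid(rows):
    """Component-wise mean of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def classify(x, centroids):
    """Return the label whose centroid is nearest to x (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

# Hypothetical training data: (mean brightness, edge density).
# "asset" photos were all taken on a sunny day, so they are bright.
asset = [(0.90, 0.40), (0.80, 0.50), (0.85, 0.45)]
no_asset = [(0.30, 0.42), (0.20, 0.50), (0.25, 0.46)]
centroids = {"asset": centroid(asset), "no asset": centroid(no_asset)}

# A new photo that does contain an asset, but taken under cloud:
overcast_with_asset = (0.25, 0.45)
print(classify(overcast_with_asset, centroids))
```

The two training sets separate cleanly, but only along the brightness
axis, so the sketch labels by weather rather than by content.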


(2).
Quite a few years ago a research project tried to create an automaton
that could mark short-answer questions in economics. The idea was that
a training set of "model answers" would be data-mined, creating data points
on an N-dim feature space. New answers could then be feature-extracted and
matched against the model answers. Any new answer closer than a given
distance from one of the models was then called a "pass"; others were
evaluated "fail".

By this time I knew of the pitfalls of Arrow, and persisted in showing
that even after about a dozen changes to the utility functions/distance
metrics and feature sets involved, there were answers that were obviously
not right, but were marked "pass". After some mention of selling said
answers to interested 3rd parties, the project was abandoned.
answers to interested 3rd parties, the project was abandoned.
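
A minimal sketch of the kind of marker described above, assuming a
bag-of-words feature extractor. The vocabulary, the model answer and the
threshold are all hypothetical; the project's real features are not
known.

```python
import math

# Hypothetical feature set: counts of a few economics terms.
VOCAB = ["demand", "supply", "price", "rises", "falls"]

def features(answer):
    """Bag-of-words counts over VOCAB; word order is thrown away."""
    words = answer.lower().split()
    return [words.count(term) for term in VOCAB]

def mark(answer, models, threshold):
    """'pass' if the answer lies within threshold of any model answer."""
    x = features(answer)
    nearest = min(math.dist(x, features(m)) for m in models)
    return "pass" if nearest <= threshold else "fail"

models = ["price rises when demand rises and supply falls"]

# Same words as the model answer, opposite meaning:
wrong = "price falls when demand rises and supply rises"
print(mark(wrong, models, threshold=1.0))
```

The wrong answer lands on exactly the same point in feature space as
the model answer, so no choice of threshold can fail one and pass the
other. Richer features move the problem around rather than remove it.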


(3).
A few years ago a certain company was developing radar processing s/w.
The idea behind airborne radar is to keep track of up to (say) a dozen
targets at once. Each target is represented by a datapoint in about 10 dims.
After each pass of the radar beam, each "new" target must be matched up
against the "old" targets from the prev sweep. In the case I have in mind,
a distance metric was used to determine the "goodness of fit", and
the naive algorithm sought to minimise this using a questionable
numerical method that was shoe-horned onto the primitive available hardware.

Unfortunately, there was something wrong with the whole idea. :)
The symptom was that on some random occasion the minimum-distance
metric mis-matched targets in the old and new sweeps, thereby transforming
enemies into friendlies, and vice versa. At the time I posited the idea
of using a stable marriage type algorithm which at least guaranteed something
about performance. I was shouted down.

Safety tip: it may not be safe to fly when fighters with certain radar
equipment are in the air at the same time.
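
For the curious, here is a minimal sketch of what a stable-marriage
matching of old tracks to new detections might look like. It is an
illustration only, not the proposal as made at the time. It assumes
equal numbers of old and new targets and that both sides rank each
other by Euclidean distance; the function name and the 2-D points are
hypothetical.

```python
import math

def stable_match(old, new):
    """Gale-Shapley: old tracks propose to new detections, nearest first.

    Assumes len(old) == len(new) and no exact distance ties.
    Returns a dict mapping old index -> new index.
    """
    n = len(old)
    d = [[math.dist(o, p) for p in new] for o in old]
    # Each track's preference list: detections sorted by distance.
    prefs = [sorted(range(n), key=lambda j: d[i][j]) for i in range(n)]
    next_choice = [0] * n   # next entry of prefs[i] that track i will try
    holder = {}             # detection index -> track index it currently holds
    free = list(range(n))
    while free:
        i = free.pop()
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in holder:
            holder[j] = i                 # detection was unclaimed
        elif d[i][j] < d[holder[j]][j]:
            free.append(holder[j])        # detection prefers the closer track
            holder[j] = i
        else:
            free.append(i)                # rejected; try the next detection
    return {i: j for j, i in holder.items()}

# Two tracks that both "want" the same nearest detection.
old = [(0.0, 0.0), (2.0, 0.0)]
new = [(1.2, 0.0), (5.0, 0.0)]
print(stable_match(old, new))
```

The guarantee is that no track and detection would both prefer each
other to their assigned partners. It does not promise the smallest
total distance, but the result is well defined and the loop terminates
after at most n*n proposals.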


(4).
A couple of years ago someone wanted to match up "buyers" with
"sellers". Each buyer and seller had nominated a set of features,
each on a scale of 1 through 5. The idea was to find all those matches
that were better than "65%". Said customer was informed that "65%"
was a bit of a fuzzy concept. The response was shock. Weren't we a professional
outfit? Hadn't we gone to kindergarten?

But given a small sample of prospective data, it was shown that
suitably-chosen distance metrics -- all of them "obvious" in some sense --
could order the data from closest to furthest in any way whatever.
Which of the data points was "65% closer" was completely arbitrary.
Perhaps they could just display a number of random data points and charge
the customer anyway -- it would be easier than burning more research budget.
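
A small sketch of the effect, using made-up data: one buyer and three
sellers scored on five features from 1 through 5, ranked under three
"obvious" metrics (city-block, Euclidean and maximum-difference). The
scores are invented for illustration and are not the customer's data.

```python
import math

def minkowski(u, v, p):
    """Minkowski distance of order p; p = math.inf gives the max norm."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

# Hypothetical scores on five features, each on a scale of 1 through 5.
buyer = (1, 1, 1, 1, 1)
sellers = {
    "A": (5, 1, 1, 1, 1),   # one big disagreement
    "B": (4, 3, 1, 1, 1),   # two medium disagreements
    "C": (3, 3, 3, 3, 3),   # many small disagreements
}

def ranking(p):
    """Seller names ordered from closest to furthest under order p."""
    return sorted(sellers, key=lambda s: minkowski(buyer, sellers[s], p))

for p in (1, 2, math.inf):
    print(p, ranking(p))
# 1   ['A', 'B', 'C']
# 2   ['B', 'A', 'C']
# inf ['C', 'B', 'A']
```

Each metric names a different "best" seller, and turning any of these
distances into a percentage needs a further arbitrary choice of what
counts as 0% and 100%.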

[ comp.ai is moderated ... your article may take a while to appear. ]