Re: Analysis of repeated measurements across different methods



In a followup post, artitj@xxxxxxxxx wrote:
I originally thought about comparing against the average, but then I
was concerned that the deviations between the humans would be as large
as the difference between the automated method and the human average,
in which case, the automated method could be reasonably considered to
be as good as a human. (For example, if for one particular frame the
humans measured 5,7,9 and the automated method measured a 9, the
automated method could be good. Or it could be a fluke...).
This gets back to the question of what you are really interested in
testing when you say that you want to find out whether the methods are
different from humans, given that the humans are different from one
another. I don't know the answer--it depends on the details of what you
are trying to do.

Just as a kind of thought experiment, consider this radically different
approach to your data. Suppose that for Method A, you simply tabulate
the percentage of frames for which its score was _within the range of
the human scores_. That is, for each frame, identify the highest and
lowest of the three human scores. Now method A gets 1 point for that
frame if its score is between the lowest and highest human scores for
that frame, and it gets 0 points for that frame if its score is greater
than the highest human score or less than the lowest one. The average
of all these 1's and 0's is simply the proportion of frames for which
method A gave a score within the "human bounds".
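
As a rough sketch of that tabulation (the data and the function name are made up for illustration, and I am counting a score equal to the highest or lowest human score as "within bounds"):

```python
def proportion_within_bounds(human_scores, method_scores):
    """Proportion of frames where the method's score lies within the
    range of the human scores for that frame (bounds inclusive)."""
    points = []
    for humans, m in zip(human_scores, method_scores):
        lowest, highest = min(humans), max(humans)
        points.append(1 if lowest <= m <= highest else 0)
    return sum(points) / len(points)


# Hypothetical data: three human scores per frame, one method score per frame.
humans = [(5, 7, 9), (4, 5, 6), (10, 12, 11), (3, 3, 4)]
method_a = [9, 7, 11, 2]

print(proportion_within_bounds(humans, method_a))  # 0.5
```

Running the same function on method B's scores would give the second proportion to compare against.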

Now, I have no idea what proportion is good enough to say that method A
is "not significantly different from humans", but I really had no idea
what that meant in the first place anyway. On the other hand, I think
it is pretty clear that comparing these proportions for A and B would
help you decide which one was more human-like.

You might also want to augment this procedure to take into account how
far out of bounds the method was, on trials when it was out.
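
One possible way to do that, just as a sketch: score each frame by how far the method lies outside the human range (0 when it is inside) and average those distances. The data and function names below are invented, and averaging the distances is only one of several reasonable choices.

```python
def distance_out_of_bounds(humans, m):
    """How far m lies outside [min(humans), max(humans)]; 0 if inside."""
    lowest, highest = min(humans), max(humans)
    if m < lowest:
        return lowest - m
    if m > highest:
        return m - highest
    return 0


def mean_distance_out_of_bounds(human_scores, method_scores):
    distances = [distance_out_of_bounds(h, m)
                 for h, m in zip(human_scores, method_scores)]
    return sum(distances) / len(distances)


# Hypothetical data: three human scores per frame, one method score per frame.
humans = [(5, 7, 9), (4, 5, 6), (10, 12, 11), (3, 3, 4)]
method_a = [9, 8, 11, 2]

print(mean_distance_out_of_bounds(humans, method_a))  # 0.75
```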

I have one more comment below about this "radically different
approach".

What would be a good way to incorporate this accuracy of the gold
standard? Would I just do a correlation between each human score and
the mean? (But since the mean is derived from the humans' score, I
guess that would be invalid...)
You could do that. It wouldn't be "invalid", really. You just have to
keep in mind when interpreting these correlations that they are
inflated to some extent by the fact that the individual scores go into
the mean, as you have noted.
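
If you want to see how large that inflation is, one option (my own suggestion here, not something from the earlier posts) is to also correlate each human with the mean of the *other* humans and compare the two numbers. A minimal sketch with invented data:

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def frame_means(raters):
    """Per-frame mean across a list of raters (one score list per rater)."""
    return [sum(scores) / len(scores) for scores in zip(*raters)]


def compare_with_means(humans):
    """For each human: (r with mean of all humans, r with mean of the others)."""
    results = []
    for i, scores in enumerate(humans):
        others = [h for j, h in enumerate(humans) if j != i]
        results.append((pearson(scores, frame_means(humans)),
                        pearson(scores, frame_means(others))))
    return results


# Hypothetical data: one list of per-frame scores for each of three humans.
humans = [[5, 4, 10, 3, 8, 6],
          [7, 5, 12, 3, 9, 6],
          [9, 6, 11, 4, 8, 7]]

for i, (r_all, r_others) in enumerate(compare_with_means(humans), start=1):
    print(f"human {i}: r with full mean = {r_all:.3f}, "
          f"r with mean of others = {r_others:.3f}")
```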

Or maybe correlation between each pair of humans?
This is also reasonable. In that case, I would also compute the
correlation of method A with each human. The basic question here is
whether the humans correlate better with each other than with method A.
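
A sketch of that comparison, with made-up scores and names (these would all be frames from a single video):

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-frame scores.
humans = {
    "human1": [5, 4, 10, 3, 8, 6],
    "human2": [7, 5, 12, 3, 9, 6],
    "human3": [9, 6, 11, 4, 8, 7],
}
method_a = [9, 7, 11, 2, 8, 5]

# Correlation between each pair of humans.
human_human = {(a, b): pearson(humans[a], humans[b])
               for a, b in combinations(humans, 2)}

# Correlation of method A with each human.
method_human = {name: pearson(method_a, scores)
                for name, scores in humans.items()}

for pair, r in human_human.items():
    print(pair, round(r, 3))
for name, r in method_human.items():
    print("method A vs", name, round(r, 3))
```

If the method-vs-human correlations fall in the same range as the human-vs-human ones, that is at least informal evidence that the method behaves like another human rater.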

There are substantial differences between the videos, but they are
fairly representative of the range of videos I'd see in practice. So I
should probably then just calculate the correlation across the frames
in each video, and have a separate correlation for each method and
video?
Yes, if there are differences between videos, you need to compute the
correlations separately for each video, as Rich Ulrich emphasized in
his post.
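
In code, that just means grouping the frames by video before correlating. A minimal sketch, assuming one row per frame with a video label, a human score (or human mean), and a method score; the layout and the numbers are invented:

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def per_video_correlation(rows):
    """rows: iterable of (video_id, human_score, method_score)."""
    by_video = defaultdict(lambda: ([], []))
    for video, human, method in rows:
        by_video[video][0].append(human)
        by_video[video][1].append(method)
    return {video: pearson(h, m) for video, (h, m) in by_video.items()}


# Hypothetical data: two videos whose overall score levels differ a lot.
rows = [
    ("v1", 5, 6), ("v1", 7, 7), ("v1", 9, 10), ("v1", 4, 4),
    ("v2", 20, 15), ("v2", 22, 18), ("v2", 25, 19), ("v2", 21, 17),
]

for video, r in per_video_correlation(rows).items():
    print(video, round(r, 3))
```

Pooling all the frames into one correlation would let the difference in level between v1 and v2 drive much of the result, which is the reason for keeping the videos apart.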

Perhaps you could clear this up for me (I think this may be a case
where a little bit of knowledge is a bad thing), but I read a paper by
Bland & Altman regarding measuring agreement in method comparison
studies, and they seem to suggest that the typically used Pearson
product-moment correlation coefficient is not valid since it tends to
have high correlations as long as the data are linearly related somehow
and not necessarily equal (if for example one method was always twice
the other, it would be highly correlated but not really correct). I'm
not entirely sure if their criticism extends to other correlation
measures as well though.
In principle, their criticism would extend to the measures of
correlation that we are discussing here, yes. But I would not think
that these kinds of criticisms would be too important in practice,
because it is easy enough to linearly transform the scale to remove a
systematic error. For example, if you found that method A always
overestimated the width by two units (relative to the humans), you
could simply modify the method A estimate to get rid of the extra 2. I
think this is
similar to the idea that Rich Ulrich was expressing when he said that
unwanted differences could be "fixed by tuning".
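
To make both halves of this concrete, here is a small sketch with invented numbers. A method that is always 2 units too high, or always twice the human value, correlates perfectly with the humans even though it does not agree with them; the constant offset is then removed by subtracting the mean difference.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical human widths and two hypothetical biased methods.
human = [5, 7, 9, 4, 6]
shifted = [h + 2 for h in human]   # always 2 units too high
doubled = [2 * h for h in human]   # always twice the human value

print(pearson(human, shifted))  # essentially 1, despite the offset
print(pearson(human, doubled))  # essentially 1, despite the scaling

# "Tuning": estimate the constant offset as the mean difference, then remove it.
bias = sum(m - h for m, h in zip(shifted, human)) / len(human)
corrected = [m - bias for m in shifted]
print(bias, corrected)
```

The doubled method would need a rescaling rather than a subtraction, but the idea is the same.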

And, herein lies what I see as the biggest problem with the "radically
different approach" that I considered in the thought experiment above:
as stated, it completely ignores the possibility that the method can be
fixed by tuning. To take an extreme example, method A might have a
proportion of 0.0 within the human bounds if it is always (say) 10
units too high, which would make it look terrible. On the other hand,
with the correction of subtracting 10 units, it might have a proportion
of 1.0, which would mean that after correction it was really very good.
But you might not notice the possibility of that correction when doing
the tabulation that I suggested.
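
The extreme example can be reproduced in a few lines. The data are invented, and estimating the offset as the mean difference from the human average is just one simple way to do the tuning; estimating the correction from the same frames you then score will also flatter the method somewhat.

```python
def proportion_within_bounds(human_scores, method_scores):
    """Proportion of frames where the method's score lies within the
    range of the human scores for that frame (bounds inclusive)."""
    points = [1 if min(h) <= m <= max(h) else 0
              for h, m in zip(human_scores, method_scores)]
    return sum(points) / len(points)


# Hypothetical data: method A is always 10 units above the human average.
humans = [(5, 7, 9), (4, 5, 6), (10, 12, 11), (2, 3, 4)]
method_a = [17, 15, 21, 13]

before = proportion_within_bounds(humans, method_a)

# Estimate the constant offset relative to the per-frame human mean.
human_means = [sum(h) / len(h) for h in humans]
offset = sum(m - hm for m, hm in zip(method_a, human_means)) / len(method_a)
tuned = [m - offset for m in method_a]

after = proportion_within_bounds(humans, tuned)

print(before, offset, after)  # 0.0 10.0 1.0
```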

Well, sorry that I am leaving you without a clear recommendation again,
but I hope some of the ideas are helpful anyway.
