Re: Variance of an index of dispersion



Michael.Lacy.junk@xxxxxxxxxxxxx wrote:
> [...]
> Another issue, though, is that any hypothesis test for
> the Simpson Index or related measures is hampered by a
> nuisance parameter problem; at least this is true of
> the asymptotic approaches. The difficulty is that
> the same value of the dispersion measure can be
> achieved by many different arrangements of the probability
> vector for the i = 1, ..., k categories.
> Therefore, there is no unique probability vector to
> serve as the foundation for the null distribution
> of the nominal dispersion statistic.

I have approached the single-sample inference problem from the point
of view of confidence sets: Take the set of all probability vectors that
the usual one-way Pearson chi-square test would not reject, at a
specified alpha level, given the observed frequencies. The maximum
and minimum dispersion indices over that set can be taken as a
1-alpha confidence interval for the dispersion.

For instance, if the observed frequencies are [10 20 30 40] and alpha
= .05, then the p that gives the upper bound is [.191 .231 .270 .308],
and the p that gives the lower bound is [.070 .149 .245 .537].
My preferred dispersion index is k' = 1 / sum p[i]^2, which answers
the question "how many categories are the observations effectively
spread over?". The observed k' is 3.33; the 95% CI is (2.67, 3.88).
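
The arithmetic behind these figures can be checked directly. The sketch
below is plain Python; the helper names (effective_categories,
pearson_chi2) are made up for illustration. It assumes the Pearson
statistic uses n*p[i] as the expected counts, and it hard-codes 7.815,
the .95 quantile of chi-square on k - 1 = 3 degrees of freedom.

```python
# Observed counts and the two boundary vectors quoted in the post.
observed = [10, 20, 30, 40]
p_upper = [0.191, 0.231, 0.270, 0.308]  # quoted p for the upper bound
p_lower = [0.070, 0.149, 0.245, 0.537]  # quoted p for the lower bound

# Assumption: .95 quantile of chi-square with 3 degrees of freedom.
CHI2_CRIT = 7.815


def effective_categories(p):
    """Dispersion index k' = 1 / sum p[i]^2."""
    return 1.0 / sum(x * x for x in p)


def pearson_chi2(counts, p):
    """One-way Pearson chi-square of the counts against hypothesized p."""
    n = sum(counts)
    return sum((o - n * pi) ** 2 / (n * pi) for o, pi in zip(counts, p))


n = sum(observed)
p_hat = [o / n for o in observed]

k_obs = effective_categories(p_hat)
k_hi = effective_categories(p_upper)
k_lo = effective_categories(p_lower)
chi2_upper = pearson_chi2(observed, p_upper)
chi2_lower = pearson_chi2(observed, p_lower)

print(f"observed k' = {k_obs:.2f}")
print(f"k' at quoted bounds = ({k_lo:.2f}, {k_hi:.2f})")
print(f"chi-square at quoted bounds = ({chi2_lower:.2f}, {chi2_upper:.2f})"
      f" vs critical value {CHI2_CRIT}")
```

Because the quoted vectors are rounded to three decimals, their
chi-square values should land close to the critical value rather than
exactly on it.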

However, finding the extreme p-vectors can be tricky numerically --
I don't have a fully automated, fire-and-forget procedure -- and the
p-vector that gives the lower bound necessarily tends to have at least
one small p[i], so the lower bound may be less believable than the
upper bound if the sample size is small (i.e., if n*p[i] < 5 or so).
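
A crude cross-check on such an interval, not the procedure used for the
numbers above, is a random search over the confidence set. The sketch
below is an assumption-laden illustration in plain Python: it draws
candidate p-vectors from a Dirichlet distribution centred on the
observed proportions (the halved concentration is an arbitrary choice
to put more draws near the boundary), keeps those the Pearson test
would not reject, and records the smallest and largest k' seen. The
function name random_search_interval and its parameters are
hypothetical, and all counts are assumed to be positive.

```python
import random


def effective_categories(p):
    """Dispersion index k' = 1 / sum p[i]^2."""
    return 1.0 / sum(x * x for x in p)


def pearson_chi2(counts, p):
    """One-way Pearson chi-square of the counts against hypothesized p."""
    n = sum(counts)
    return sum((o - n * pi) ** 2 / (n * pi) for o, pi in zip(counts, p))


def random_search_interval(counts, chi2_crit, draws=100_000, seed=0):
    """Approximate (min k', max k') over non-rejected p by random search.

    Only accepted draws are used, so the result can only be narrower
    than the exact interval, never wider.
    """
    rng = random.Random(seed)
    n = sum(counts)
    lo = hi = effective_categories([c / n for c in counts])
    for _ in range(draws):
        # Normalized independent gamma draws give a Dirichlet sample.
        g = [rng.gammavariate(c / 2.0, 1.0) for c in counts]
        total = sum(g)
        p = [x / total for x in g]
        if pearson_chi2(counts, p) <= chi2_crit:
            k = effective_categories(p)
            lo = min(lo, k)
            hi = max(hi, k)
    return lo, hi


# Assumption: 7.815 is the .95 quantile of chi-square on 3 df.
lo, hi = random_search_interval([10, 20, 30, 40], 7.815)
print(f"random-search interval for k': ({lo:.2f}, {hi:.2f})")
```

A search like this scales poorly with the number of categories, so it
is useful mainly as a sanity check on a constrained optimizer's answer
when k is small.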

I'm not sure how best to extend this approach to multi-sample
inference problems, or if such extensions are feasible.
