Re: Distribution Fitting and GoF Tests



Running kstest (or chi2gof) on the whole data set would result in a
rejection, which is why we're using a random subset of the data.

Regarding the p-value, I'm assuming the higher the p-value, the
better the fit; am I right? Can I just look at the total number of
rejects vs. total number of iterations instead?

Thanks

Tom Lane wrote:


Ziad, with that much data you're very likely to find any distribution
test to be significant if you use the entire data set. Even minor
deviations from the theoretical distribution might be statistically
significant for such sample sizes. I'm not sure of the motivation for
using a smaller subset; maybe it's a roundabout way to force small K-S
distance measures to be insignificant.
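
To put rough numbers on the sample-size effect, here is a small Python sketch (not MATLAB, and not anything kstest does internally) that uses the standard asymptotic approximation for the one-sample K-S critical distance with a fully specified null. The function name ks_critical_distance is made up for this illustration.

```python
import math

def ks_critical_distance(n, alpha=0.05):
    """Approximate K-S distance above which a fully specified null is rejected."""
    # Asymptotic one-sample formula: D_crit ~ sqrt(-0.5 * ln(alpha / 2)) / sqrt(n)
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)

# Sample sizes taken from the thread: a subset of 100 and the two full data sets.
for n in (100, 500_000, 1_800_000):
    print(f"n = {n:>9,}: reject at the 5% level when D > {ks_critical_distance(n):.4f}")
```

For a subset of 100 points the threshold is roughly 0.136, while for 500,000 points it is roughly 0.002, so a gap of a fraction of a percentage point between the empirical and fitted cdf is already enough to reject on the full data.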

It's true that the K-S test and some others require that the
parameters be specified in advance. Estimating them from the data
being tested tends to make the p-value too big. Your idea of
estimating the parameters from the whole data set probably reduces
this problem somewhat, but I don't know exactly how to quantify that.
Your results seem to indicate that the p-value is fairly accurate.
People might use things like the K-S distance as a criterion for
selecting a distribution even with estimated parameters, but it's not
wise to put much faith in the p-value in those cases.
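
The effect of estimating parameters from the very sample being tested can be seen in a small simulation. The following is a Python sketch under stated assumptions, not the MATLAB code from this thread: the data are standard normal, the 5% threshold is the asymptotic approximation 1.358/sqrt(n), and all function names are invented for the example.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest gap between the empirical cdf of sample and the given cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical cdf jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def count_rejections(trials=500, n=100, seed=0):
    """Count 5%-level rejections with known vs. same-sample estimated parameters."""
    rng = random.Random(seed)
    crit = 1.358 / math.sqrt(n)  # approximate threshold for a fully specified null
    rej_known = rej_estimated = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = sum(x) / n
        s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
        if ks_statistic(x, lambda v: normal_cdf(v, 0.0, 1.0)) > crit:
            rej_known += 1
        if ks_statistic(x, lambda v: normal_cdf(v, m, s)) > crit:
            rej_estimated += 1
    return rej_known, rej_estimated

known, estimated = count_rejections()
print("rejections with known parameters:    ", known)
print("rejections with estimated parameters:", estimated)
```

With the parameters fitted to each sample, the fitted cdf hugs the data more closely than the true one, so the unadjusted test rejects far less often than the nominal 5%. That is the sense in which the p-value comes out too big.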

Do you really need to fit a distribution? You could make a histogram
or empirical cdf. You could use the ksdensity function to get a
kernel-smoothing estimate of the density. You might also want to look
at some of the Statistics Toolbox demos listed under "fitting
distributions to data." There are examples of fitting mixtures of
distributions, and of separately modeling the tails of the data.
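
For readers who want to see what the two nonparametric summaries involve, here is a minimal Python sketch of an empirical cdf and a Gaussian kernel density estimate. It is only analogous to what ecdf and ksdensity produce in MATLAB and is not their implementation. The bandwidth rule (1.06 * sd * n^(-1/5), a common normal-reference rule of thumb) and the exponential test data are assumptions made for the example.

```python
import math
import random

def empirical_cdf(data):
    """Return sorted values and the fraction of points <= each value."""
    xs = sorted(data)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def gaussian_kde(data, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each point of grid."""
    n = len(data)
    if bandwidth is None:
        mean = sum(data) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-0.2)  # assumed rule-of-thumb bandwidth
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in data)
        for g in grid
    ]

# Hypothetical stand-in data with an exponential-looking decay.
rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(500)]

xs, fs = empirical_cdf(data)
grid = [0.1 * k for k in range(50)]
density = gaussian_kde(data, grid)
print("empirical cdf at the sample median:", fs[len(fs) // 2])
print("estimated density near 0.5:", round(density[5], 3))
```

Note that a plain Gaussian kernel smears some mass below zero for data bounded at zero, so the estimate near the boundary should be read with care.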

-- Tom

"Ziad Hatahet" <hatahetRemoveThis@xxxxxxxxx> wrote in message
news:ef3819b.-1@xxxxxxxxxxxxxxxxxxxxxxxxxx
Hello all,

I have a couple of large collections of data (the first contains
approximately 0.5 million points and the other about 1.8 million, the
former being a subset of the latter), and I want to find the
distribution(s) that best describe them.

The distribution (histogram) of the data decays in an exponential
manner with a "heavy" tail, so I'm currently looking at distributions
such as Weibull, Rayleigh, lognormal, Pareto, etc.

I came across a method in a paper, and they were doing the following
(in addition to comparing the fit visually): estimate the parameters
using MLE for each of the distributions they wanted to test against;
take a random subset consisting of 100 elements (from the whole data
set); conduct the Kolmogorov-Smirnov and Anderson-Darling
goodness-of-fit tests on the random subset; repeat 1000 times (i.e.
for 1000 random subsets) and observe the average p-value, based on
which they determined the better fit.
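
The procedure described above can be sketched as follows. This is a hedged Python illustration, not the paper's code and not MATLAB: it covers only the K-S part, uses an asymptotic p-value approximation with a small-sample correction, and compares two stand-in candidates (exponential and lognormal) whose MLEs have closed forms. The synthetic lognormal data and every function name are assumptions made for the example.

```python
import math
import random

def ks_statistic(sample, cdf):
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n):
    """Asymptotic Kolmogorov p-value with a small-sample correction factor."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series is numerically 1 here
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))

def exponential_cdf(x, rate):
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def lognormal_cdf(x, mu, sigma):
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def subsample_gof(data, cdf, reps=1000, size=100, alpha=0.05, seed=0):
    """Average K-S p-value and rejection count over random subsets of data."""
    rng = random.Random(seed)
    p_sum, rejects = 0.0, 0
    for _ in range(reps):
        subset = rng.sample(data, size)
        p = ks_pvalue(ks_statistic(subset, cdf), size)
        p_sum += p
        if p < alpha:
            rejects += 1
    return p_sum / reps, rejects

# Hypothetical heavy-tailed stand-in for the real data set.
rng = random.Random(1)
data = [rng.lognormvariate(0.0, 1.5) for _ in range(20_000)]

# Closed-form MLEs computed from the whole data set.
rate = len(data) / sum(data)
logs = [math.log(v) for v in data]
mu = sum(logs) / len(logs)
sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))

results = {
    "exponential": subsample_gof(data, lambda x: exponential_cdf(x, rate)),
    "lognormal": subsample_gof(data, lambda x: lognormal_cdf(x, mu, sigma)),
}
for name, (mean_p, rejects) in results.items():
    print(f"{name:12s} mean p = {mean_p:.3f}, rejects = {rejects} of 1000")
```

Both candidates are scored on the same random subsets (same seed), which makes the comparison of average p-values and rejection counts a paired one.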

Now I'm thinking of following the same method for my data. MATLAB
does not have the A-D test; however, it does have both the K-S and
Chi-square GoF tests. I might code the A-D test myself.

Is this a good method to follow? When should I consider the p-values:
when h = 0, when h = 1, or both (where h is the result of the kstest
or chi2gof test)? What intrigues me is that the K-S test should not
give accurate results when the parameters are estimated from the data
(Mr. Peter Perkins mentioned this on several occasions); however, the
authors of the paper I read did use the test, and I have also seen it
used in distribution-fitting software (e.g. BestFit and EasyFit).

Hence, I wrote a script in MATLAB to see the results the K-S test
gives. I generated a set of 500000 numbers using randn() (mean = 0,
standard deviation = 1); then, for 1000 iterations, I took a subset
of 100 numbers from the main set and ran kstest() twice: the first
time using a mean and standard deviation of 0 and 1 respectively, and
the second time using the mean and standard deviation estimated from
the original set (500000 points). I then looked at the total number
of rejects and the average p-values. The end results were extremely
close, and sometimes even equal (e.g. 46 rejects for the first vs. 45
rejects for the second out of 1000 trials). The average p-values were
very close as well. Is using the K-S test in this context correct,
then?
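
For anyone who wants to repeat this experiment outside MATLAB, here is a rough Python re-creation. It is not the original script: the K-S p-value uses an asymptotic approximation rather than whatever kstest computes, and the function names and the seed are made up for the sketch.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n):
    """Asymptotic Kolmogorov p-value with a small-sample correction factor."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))

def run_experiment(pop_size=500_000, reps=1000, size=100, alpha=0.05, seed=0):
    """Return {label: (rejects, mean p)} for known and whole-set parameters."""
    rng = random.Random(seed)
    population = [rng.gauss(0.0, 1.0) for _ in range(pop_size)]
    # Parameters estimated once, from the whole population.
    m = sum(population) / pop_size
    s = math.sqrt(sum((v - m) ** 2 for v in population) / (pop_size - 1))
    settings = {"known": (0.0, 1.0), "estimated": (m, s)}
    rejects = {label: 0 for label in settings}
    p_sums = {label: 0.0 for label in settings}
    for _ in range(reps):
        subset = rng.sample(population, size)
        for label, (mu, sigma) in settings.items():
            d = ks_statistic(subset, lambda x: normal_cdf(x, mu, sigma))
            p = ks_pvalue(d, size)
            p_sums[label] += p
            if p < alpha:
                rejects[label] += 1
    return {label: (rejects[label], p_sums[label] / reps) for label in settings}

for label, (n_rej, mean_p) in run_experiment().items():
    print(f"{label:9s} rejects = {n_rej} of 1000, mean p = {mean_p:.3f}")
```

Because the parameters estimated from 500000 points are almost exactly 0 and 1, the two tests see nearly the same null distribution on every subset, which would explain why the reject counts and average p-values come out so close.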

Furthermore, we (my group and I) are thinking that if there is no
"best fit", would it be possible to separate the empirical
distribution into two (or more) parts and fit each one into a
different distribution model?

Thanks in advance,

-- Ziad Hatahet


