Distribution Fitting and GoF Tests
- From: "Ziad Hatahet" <hatahetRemoveThis@xxxxxxxxx>
- Date: Mon, 29 May 2006 12:26:31 -0400
Hello all,
I have a couple of large collections of data (the first contains
approximately 0.5 million points and the other about 1.8 million,
the former being a subset of the latter), and I want to find the
distribution or distributions that best describe them.
The distribution (histogram) of the data decays in an exponential
manner with a "heavy" tail, so I'm currently looking at distributions
such as Weibull, Rayleigh, lognormal, Pareto, etc.
I came across a method in a paper whose authors did the following
(in addition to comparing the fits visually): estimate the parameters
of each candidate distribution by MLE; take a random subset of 100
elements from the whole data set; run the Kolmogorov-Smirnov and
Anderson-Darling goodness-of-fit tests on that subset; repeat 1000
times (i.e. for 1000 random subsets); and take the average p-value,
which they used to decide which distribution fit better.
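
For illustration, the procedure described above might be sketched as follows. This is a rough Python sketch using only the standard library, not the paper's code and not MATLAB. The stand-in data, the two candidate models, and every function name are assumptions made up for the example. Only the K-S test is shown (no A-D), and the p-value comes from the asymptotic Kolmogorov series with Stephens' small-sample correction, so it may differ slightly from what kstest() reports.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Two-sided K-S distance between a sample and a fully specified CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov p-value with Stephens' small-sample correction."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here and p is essentially 1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def subset_ks_summary(data, cdf, subset_size=100, repeats=1000,
                      alpha=0.05, seed=0):
    """Average p-value and rejection count over random subsets of the data."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(repeats):
        subset = rng.sample(data, subset_size)
        pvals.append(ks_pvalue(ks_statistic(subset, cdf), subset_size))
    return sum(pvals) / repeats, sum(p < alpha for p in pvals)


# Stand-in data (assumption): positive exponential draws, not the real data set.
gen = random.Random(1)
data = [x for x in (gen.expovariate(2.0) for _ in range(50000)) if x > 0.0]

# Closed-form MLEs computed on the full data set for two candidate models.
rate = len(data) / sum(data)                                     # exponential
logs = [math.log(x) for x in data]
mu = sum(logs) / len(logs)                                       # lognormal
sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))  # lognormal

candidates = {
    "exponential": lambda x: 1.0 - math.exp(-rate * x),
    "lognormal": lambda x: 0.5 * (1.0 + math.erf(
        (math.log(x) - mu) / (sigma * math.sqrt(2.0)))),
}

# The same seed is used for every candidate, so all see the same subsets.
results = {name: subset_ks_summary(data, cdf)
           for name, cdf in candidates.items()}
for name, (mean_p, rejects) in results.items():
    print(f"{name}: mean p = {mean_p:.3f}, rejections = {rejects} of 1000")
```

Weibull, Pareto, or other candidates would be added the same way, by supplying a CDF built from parameters fitted to the full data set.
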
Now I'm thinking of following the same method for my data. MATLAB
does not have the A-D test; however, it does have both the K-S and
Chi-square GoF tests. I might code the A-D test myself.
Is this a good method to follow? Should I consider the p-values only
when h = 0, only when h = 1, or in both cases (where h is the decision
returned by kstest or chi2gof)? What intrigues me is that the K-S test
is not supposed to give accurate results when the parameters are
estimated from the data (Mr. Peter Perkins mentioned this on several
occasions); however, the authors of the paper I read did use the test,
and I have also seen it used in distribution fitting software (e.g.
BestFit and EasyFit).
Hence, I wrote a script in MATLAB to see what results the K-S test
gives. I generated a set of 500000 numbers using randn() (mean = 0,
standard deviation = 1); then, for 1000 iterations, I took a subset of
100 numbers from the main set and ran kstest() twice: the first time
using a mean and standard deviation of 0 and 1 respectively, and the
second time using the mean and standard deviation estimated from the
original set (500000 points). I then looked at the total number of
rejections and the average p-values. The end results were extremely
close, and sometimes even equal (e.g. 46 rejections for the first vs.
45 for the second out of 1000 trials). The average p-values were very
close as well. Is using the K-S test in this context correct, then?
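
The experiment just described can be approximated with the sketch below. It is a stand-in written in Python with the standard library only, not the original MATLAB script. The seed, the function names, and the 5% significance level are assumptions. The p-value uses the asymptotic Kolmogorov series with Stephens' small-sample correction, so the counts need not match kstest() exactly.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution with the given mean and std. deviation."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_test_pvalue(sample, cdf, terms=100):
    """Two-sided K-S test against a fully specified CDF (asymptotic p-value)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the CDF with the empirical CDF just before and at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here and p is essentially 1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


rng = random.Random(2006)  # arbitrary seed (assumption), for repeatability
population = [rng.gauss(0.0, 1.0) for _ in range(500000)]

# Mean and standard deviation estimated from the whole 500000-point set.
n_pop = len(population)
mu_hat = sum(population) / n_pop
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in population) / (n_pop - 1))

alpha, repeats, subset_size = 0.05, 1000, 100
p_fixed, p_estimated = [], []
for _ in range(repeats):
    subset = rng.sample(population, subset_size)
    # Same subset, tested once against N(0, 1) and once against the estimates.
    p_fixed.append(ks_test_pvalue(subset, normal_cdf))
    p_estimated.append(
        ks_test_pvalue(subset, lambda x: normal_cdf(x, mu_hat, sigma_hat)))

summary = {
    "fixed": (sum(p_fixed) / repeats, sum(p < alpha for p in p_fixed)),
    "estimated": (sum(p_estimated) / repeats,
                  sum(p < alpha for p in p_estimated)),
}
print(f"mu_hat = {mu_hat:.5f}, sigma_hat = {sigma_hat:.5f}")
for label, (mean_p, rejects) in summary.items():
    print(f"{label}: mean p = {mean_p:.3f}, rejections = {rejects} of {repeats}")
```

With 500000 points, the standard error of the estimated mean is about 1/sqrt(500000), roughly 0.0014, so the estimated parameters are nearly identical to 0 and 1 and the two runs test almost the same hypothesis. The sketch does not cover the case where the parameters are estimated from each 100-point subset itself.
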
Furthermore, we (my group and I) are thinking that if there is no
"best fit", would it be possible to separate the empirical
distribution into two (or more) parts and fit each one into a
different distribution model?
Thanks in advance,
-- Ziad Hatahet