Distribution Fitting and GoF Tests



Hello all,

I have two large data sets (the first contains approximately 0.5
million points and the other about 1.8 million, the former being a
subset of the latter), and I want to find the distribution(s) that
best describe them.

The distribution (histogram) of the data decays in an exponential
manner with a "heavy" tail, so I'm currently looking at distributions
such as Weibull, Rayleigh, lognormal, Pareto, etc.

I came across a method in a paper, and they were doing the following
(in addition to comparing the fit visually): estimate the parameters
using MLE for each of the distributions they wanted to test against;
take a random subset consisting of 100 elements (from the whole data
set); conduct the Kolmogorov-Smirnov and Anderson-Darling
goodness-of-fit tests on the random subset; repeat 1000 times (i.e.
for 1000 random subsets) and observe the average p-value, based on
which they determined the better fit.
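
To make the procedure concrete, here is a rough sketch of it in plain Python (standard library only). It is not the paper's code: the data are synthetic exponential draws standing in for the real set, only two candidate models (exponential and Rayleigh, both with closed-form MLEs) are compared, and `ks_statistic`, `ks_pvalue` and `mean_subsample_pvalue` are hypothetical helper names. The p-value uses the asymptotic Kolmogorov series with a common small-sample correction, which is an assumption and not necessarily what any particular package does.

```python
import math
import random

def ks_statistic(sample, cdf):
    """Two-sided K-S statistic D_n against a fully specified CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Largest gap between the empirical step function and the model CDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    t = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if t < 0.2:
        return 1.0  # the series converges slowly here; p is essentially 1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
            for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * s))

def mean_subsample_pvalue(data, cdf, subset_size=100, repeats=1000, seed=0):
    """Average K-S p-value over random subsets drawn from the full data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        subset = rng.sample(data, subset_size)
        total += ks_pvalue(ks_statistic(subset, cdf), subset_size)
    return total / repeats

# Synthetic stand-in for the real data (assumption: exponential, rate 2).
gen = random.Random(1)
data = [gen.expovariate(2.0) for _ in range(20000)]

# MLEs computed on the WHOLE data set, as in the procedure described above.
rate_hat = len(data) / sum(data)                          # exponential rate
sigma2_hat = sum(x * x for x in data) / (2 * len(data))   # Rayleigh sigma^2

def exp_cdf(x):
    return 1.0 - math.exp(-rate_hat * x)

def rayleigh_cdf(x):
    return 1.0 - math.exp(-x * x / (2.0 * sigma2_hat))

p_exp = mean_subsample_pvalue(data, exp_cdf)
p_ray = mean_subsample_pvalue(data, rayleigh_cdf)
print("mean p-value, exponential:", p_exp)
print("mean p-value, Rayleigh:   ", p_ray)
```

The candidate with the higher average p-value would be taken as the better fit under this scheme. Weibull, lognormal or Pareto could be added the same way once their parameters are fitted.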

Now I'm thinking of following the same method for my data. MATLAB
does not have the A-D test; however, it does have both the K-S and
Chi-square GoF tests. I might code the A-D test myself.
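
As a starting point for coding the A-D test by hand, the statistic itself is short. The sketch below is in plain Python and is only an illustration: `anderson_darling` is a hypothetical name, the sample is synthetic, and it computes only the A^2 statistic against a fully specified CDF (no p-value). The 5% critical value of 2.492 in the comment is the commonly tabulated one for the case where no parameters are estimated from the tested sample; treat it as an assumption to check against a table.

```python
import math
import random

def anderson_darling(sample, cdf):
    """A^2 statistic of `sample` against a fully specified continuous CDF."""
    xs = sorted(sample)
    n = len(xs)
    eps = 1e-12  # clamp so log() never receives exactly 0
    f = [min(max(cdf(x), eps), 1.0 - eps) for x in xs]
    # A^2 = -n - (1/n) * sum_{i=1..n} (2i-1) [ln F(x_i) + ln(1 - F(x_{n+1-i}))]
    # (written here with 0-based indices)
    s = sum((2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
            for i in range(n))
    return -n - s / n

# Synthetic sample (assumption: exponential with rate 2).
gen = random.Random(3)
sample = [gen.expovariate(2.0) for _ in range(100)]

def exp_cdf(x):
    return 1.0 - math.exp(-2.0 * x)

def rayleigh_cdf(x):
    # Rayleigh with sigma^2 = 0.25, which matches E[x^2] of the sample model.
    return 1.0 - math.exp(-x * x / 0.5)

a2_true = anderson_darling(sample, exp_cdf)
a2_wrong = anderson_darling(sample, rayleigh_cdf)
# Commonly tabulated 5% critical value for a fully specified CDF: 2.492
print("A^2 vs. exponential:", a2_true)
print("A^2 vs. Rayleigh:   ", a2_wrong)
```

Compared with K-S, the 1/(F(1-F)) weighting makes this statistic more sensitive to the tails, which is presumably why the paper used both.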

Is this a good method to follow? When should I consider the p-values:
only when h = 0, only when h = 1, or in both cases (where h is the
reject/accept flag returned by kstest or chi2gof)? What puzzles me is
that the K-S test should not give accurate results when the
parameters are estimated from the data (Mr. Peter Perkins mentioned
this on several occasions); however, the authors of the paper I read
did use the test, and I have also seen it used in distribution
fitting software (e.g. BestFit and EasyFit).

Hence, I wrote a script in MATLAB to see what results the K-S test
gives. I generated a set of 500000 numbers using randn() (mean = 0,
standard deviation = 1); then, for 1000 iterations, I took a subset
of 100 numbers from the main set and ran kstest() twice: first with a
mean and standard deviation of 0 and 1 respectively, and then with
the mean and standard deviation estimated from the original set
(500000 points). I then looked at the total number of rejections and
the average p-values. The results were extremely close, and sometimes
even equal (e.g. 46 rejections for the first vs. 45 for the second
out of 1000 trials), and the average p-values were very close as
well. Is using the K-S test in this context correct, then?
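
For anyone without MATLAB, the experiment above can be approximated in plain Python. This is only a sketch: `ks_statistic`, `ks_pvalue` and `normal_cdf` are hypothetical helpers, not MATLAB's kstest(), and the p-value uses the asymptotic Kolmogorov series with a common small-sample correction. The third variant, where the parameters are estimated from each 100-point subset itself, is an addition for comparison and was not part of the experiment described above.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Two-sided K-S statistic D_n against a fully specified CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    t = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if t < 0.2:
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
            for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * s))

def mean_std(values):
    """Sample mean and (n-1) standard deviation."""
    n = len(values)
    m = math.fsum(values) / n
    var = math.fsum((v - m) ** 2 for v in values) / (n - 1)
    return m, math.sqrt(var)

gen = random.Random(42)
population = [gen.gauss(0.0, 1.0) for _ in range(500_000)]
mu_full, sd_full = mean_std(population)

trials, subset_size, alpha = 1000, 100, 0.05
rejects = {"known": 0, "full": 0, "subset": 0}
p_sums = {"known": 0.0, "full": 0.0, "subset": 0.0}

for _ in range(trials):
    subset = gen.sample(population, subset_size)
    mu_sub, sd_sub = mean_std(subset)
    params = {
        "known": (0.0, 1.0),           # true parameters
        "full": (mu_full, sd_full),    # estimated from all 500000 points
        "subset": (mu_sub, sd_sub),    # estimated from the tested subset
    }
    for name, (mu, sd) in params.items():
        d = ks_statistic(subset, lambda x: normal_cdf(x, mu, sd))
        p = ks_pvalue(d, subset_size)
        p_sums[name] += p
        rejects[name] += p < alpha

rejects_known = rejects["known"]
rejects_full = rejects["full"]
rejects_subset = rejects["subset"]
for name in rejects:
    print(name, "rejects:", rejects[name], "mean p:", p_sums[name] / trials)
```

The first two variants should stay close, since estimates from 500000 points are almost exactly the true values. If the caveat about estimated parameters applies as usually described, the third variant should reject far less often than the nominal 5%, which would suggest the problem arises only when the parameters come from the same sample that is being tested.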

Furthermore, we (my group and I) are thinking that if there is no
"best fit", would it be possible to separate the empirical
distribution into two (or more) parts and fit each one into a
different distribution model?

Thanks in advance,

-- Ziad Hatahet