Re: Distribution Fitting and GoF Tests



Ziad, with that much data, almost any goodness-of-fit test is likely to
come out significant if you use the entire data set. Even minor deviations
from the theoretical distribution can be statistically significant at
such sample sizes. I'm not sure of the motivation for using a smaller
subset -- maybe it's a roundabout way to force small K-S distance measures
to be non-significant.
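
To see the sample-size effect concretely, here is a minimal sketch. The data
are made up for illustration (a normal sample whose standard deviation is 2%
too large), and the seed and variable names are my own assumptions.

```matlab
% Sketch: a tiny misfit is flagged at n = 500000 but usually not at n = 100.
rng(0);                              % fixed seed, only for repeatability
n = 500000;
x = 1.02 * randn(n, 1);              % nearly, but not exactly, N(0,1)

[hBig, pBig] = kstest(x);            % whole sample vs. standard normal
sub = x(randperm(n, 100));           % random subset of 100 points
[hSub, pSub] = kstest(sub);          % same test on the subset

fprintf('n = %d: h = %d, p = %.3g\n', n, hBig, pBig);
fprintf('n = 100: h = %d, p = %.3g\n', hSub, pSub);
```

The full-sample test rejects because the K-S critical distance shrinks like
1/sqrt(n), while the actual misfit stays the same size.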

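It's true that the K-S test and some others require that the parameters be
specified in advance. Estimating them from the same data pulls the fitted
distribution closer to the sample than the true one would be, which tends
to make the p-value too big.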
Your idea of estimating the parameters from the whole data set probably
reduces this problem somewhat, but I don't know exactly how to quantify
that. Your results seem to indicate that the p-value is fairly accurate.
People might use things like the K-S distance as a criterion for selecting a
distribution even with estimated parameters, but it's not wise to put much
faith in the p-value in those cases.
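
A rough way to see the size of the effect is to simulate it. The sketch below
is an illustration under my own assumptions (normal data, 100 points per
trial, parameters estimated from the very sample being tested), not a general
result.

```matlab
% Sketch: K-S rejection counts with known vs. estimated parameters.
rng(1);                               % fixed seed, only for repeatability
nTrials = 1000;
n = 100;
rejKnown  = 0;
rejFitted = 0;
for k = 1:nTrials
    x = randn(n, 1);
    rejKnown = rejKnown + kstest(x);          % mean 0, std 1 fixed in advance
    z = (x - mean(x)) / std(x);               % mean, std estimated from x itself
    rejFitted = rejFitted + kstest(z);
end
fprintf('rejections, known parameters:     %d of %d\n', rejKnown, nTrials);
fprintf('rejections, estimated parameters: %d of %d\n', rejFitted, nTrials);
```

With known parameters the count should be near the nominal 5%. With
parameters estimated from the tested sample it should be far lower, which is
the "p-value too big" problem. For the normal case, lillietest is the
variant designed for estimated parameters.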

Do you really need to fit a distribution? You could make a histogram or
empirical cdf. You could use the ksdensity function to get a
kernel-smoothing estimate of the density. You might also want to look at
some of the Statistics Toolbox demos listed under "fitting distributions to
data." There are examples of fitting mixtures of distributions, and of
separately modeling the tails of the data.
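
As a minimal sketch of the nonparametric route, assuming positive-valued data
(the Weibull sample below is only a hypothetical stand-in for yours):

```matlab
% Sketch: empirical cdf and kernel density estimate, no distribution fitted.
rng(2);                                   % fixed seed, only for repeatability
x = wblrnd(1, 0.7, 5000, 1);              % stand-in data: scale 1, shape 0.7

[F, xe] = ecdf(x);                        % empirical cdf evaluated at xe
[f, xi] = ksdensity(x, 'support', 'positive');   % density restricted to x > 0

subplot(2, 1, 1); stairs(xe, F); title('Empirical cdf');
subplot(2, 1, 2); plot(xi, f);   title('Kernel density estimate');
```

The 'support' option keeps the smoothed density from leaking below zero,
which matters for data that pile up near the origin.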

-- Tom

"Ziad Hatahet" <hatahetRemoveThis@xxxxxxxxx> wrote in message
news:ef3819b.-1@xxxxxxxxxxxxxxxxxxxxxxxxxx
Hello all,

I have a couple of large collections of data (first one contains
approximately 0.5 million points and the other about 1.8 million -
the former being a subset of the latter), and I want to find a
distribution(s) that best describe(s) them.

The distribution (histogram) of the data decays in an exponential
manner with a "heavy" tail, so I'm currently looking at distributions
such as Weibull, Rayleigh, lognormal, Pareto, etc.

I came across a method in a paper, and they were doing the following
(in addition to comparing the fit visually): estimate the parameters
using MLE for each of the distributions they wanted to test against;
take a random subset consisting of 100 elements (from the whole data
set); conduct the Kolmogorov-Smirnov and Anderson-Darling
goodness-of-fit tests on the random subset; repeat 1000 times (i.e.
for 1000 random subsets) and observe the average p-value, based on
which they determined the better fit.
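
A minimal sketch of the procedure just described, K-S part only. The data,
the candidate list, and all variable names are assumptions made for
illustration; the paper's actual code is not available here. It assumes a
MATLAB release in which kstest accepts a probability distribution object
through the 'CDF' option.

```matlab
% Sketch: fit on the full data, then average K-S p-values over small subsets.
rng(3);                                    % fixed seed, only for repeatability
data = wblrnd(2, 0.8, 200000, 1);          % hypothetical stand-in data
names = {'Weibull', 'Lognormal', 'Rayleigh'};
nRep = 1000;                               % number of random subsets
nSub = 100;                                % points per subset
meanP = zeros(1, numel(names));
for j = 1:numel(names)
    pd = fitdist(data, names{j});          % MLE on the whole data set
    p = zeros(nRep, 1);
    for k = 1:nRep
        sub = data(randperm(numel(data), nSub));
        [~, p(k)] = kstest(sub, 'CDF', pd);    % subset vs. fitted distribution
    end
    meanP(j) = mean(p);
    fprintf('%-10s mean p = %.3f\n', names{j}, meanP(j));
end
```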

Now I'm thinking of following the same method for my data. MATLAB
does not have the A-D test; however, it does have both the K-S and
Chi-square GoF tests. I might code the A-D test myself.

Is this a good method to follow? When should I consider the p-values,
when h = 0 or 1 or both (where h is the result of the kstest or
chi2gof test)? What intrigues me is that the K-S test does not /
should not give accurate results when estimating the parameters from
the data (Mr. Peter Perkins mentioned this on several occasions);
however, the authors of the paper I read did use the test, and I have
also seen it used in distribution fitting software (e.g. BestFit and
EasyFit).

Hence, I wrote a script in MATLAB to see the results the K-S test
gives. I generated a set of 500000 numbers using randn() (mean = 0,
std. dev. = 1), then for 1000 iterations, I took a subset of 100
numbers from the main set, and ran kstest() twice; the first was
using a mean and std. dev. of 0 and 1 respectively, and the second
using the mean and std. dev. estimated from the original set (500000
points). I then looked at the total number of rejects and the average
p-values. The end results were extremely close, and sometimes even
equal (e.g. 46 rejects for the first vs. 45 rejects for the second
out of 1000 trials). And the average p-values were very close as
well. Is using the K-S test in this context correct then?
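
A minimal sketch of the experiment described above. It is a reconstruction
from the description, not the original script, so the seed and variable names
are assumptions.

```matlab
% Sketch: K-S on 100-point subsets, true vs. full-sample parameter estimates.
rng(4);                                   % fixed seed, only for repeatability
N = 500000;
nTrials = 1000;
nSub = 100;
x = randn(N, 1);
muHat  = mean(x);                         % estimated from all N points
sigHat = std(x);
rej  = [0 0];                             % [true parameters, estimated]
pSum = [0 0];
for k = 1:nTrials
    s = x(randperm(N, nSub));
    [h1, p1] = kstest(s);                       % mean 0, std 1
    [h2, p2] = kstest((s - muHat) / sigHat);    % full-sample estimates
    rej  = rej  + [h1 h2];
    pSum = pSum + [p1 p2];
end
fprintf('rejections: %d vs. %d\n', rej(1), rej(2));
fprintf('mean p:     %.4f vs. %.4f\n', pSum(1) / nTrials, pSum(2) / nTrials);
```

The two columns are expected to agree closely because estimates from 500000
points are nearly the true values; estimating from each 100-point subset
instead would be a different situation.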

Furthermore, we (my group and I) are thinking that if there is no
"best fit", would it be possible to separate the empirical
distribution into two (or more) parts and fit each one into a
different distribution model?

Thanks in advance,

-- Ziad Hatahet

