Re: Distribution Fitting and GoF Tests
- From: "Ziad Hatahet" <hatahetRemoveThis@xxxxxxxxx>
- Date: Fri, 2 Jun 2006 13:46:33 -0400
Running kstest (or chi2gof) on the whole data set would result in a
rejection, which is why we're using a random subset of the data.
Regarding the p-value, I'm assuming the higher the p-value, the
better the fit; am I right? Can I just look at the total number of
rejects vs. total number of iterations instead?
Thanks
Tom Lane wrote:
Ziad, with that much data you're very likely to find any distribution
test to be significant if you use the entire data set. Even minor
deviations from the theoretical distribution might be statistically
significant for such sample sizes. I'm not sure of the motivation for
using a smaller subset -- maybe it's a roundabout way to force small
K-S distance measures to be insignificant.
It's true that the K-S test and some others require that the
parameters be specified in advance. Estimating them tends to make the
p-value too big. Your idea of estimating the parameters from the whole
data set probably reduces this problem somewhat, but I don't know
exactly how to quantify that. Your results seem to indicate that the
p-value is fairly accurate. People might use things like the K-S
distance as a criterion for selecting a distribution even with
estimated parameters, but it's not wise to put much faith in the
p-value in those cases.
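To see how estimated parameters can inflate the p-value, here is a
minimal simulation sketch. It is written in Python (standard library
only) rather than MATLAB, and it covers the classical case where the
mean and standard deviation are estimated from the same 100-point
sample that is then tested, not the whole-data-set variant discussed
above. The helper names (ks_statistic, ks_pvalue, compare) are made up
for this illustration, and the p-value uses the asymptotic Kolmogorov
series with a common small-sample correction, so it is approximate.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:  # the p-value is essentially 1 in this range
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def compare(trials=1000, n=100, alpha=0.05, seed=0):
    """Test N(0,1) samples with known and with same-sample parameters."""
    rng = random.Random(seed)
    rejects = {"known": 0, "estimated": 0}
    p_sum = {"known": 0.0, "estimated": 0.0}
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mu = sum(x) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / (n - 1))
        fits = {"known": NormalDist(0.0, 1.0),
                "estimated": NormalDist(mu, sigma)}
        for name, dist in fits.items():
            p = ks_pvalue(ks_statistic(x, dist.cdf), n)
            p_sum[name] += p
            rejects[name] += p < alpha
    return rejects, {k: v / trials for k, v in p_sum.items()}


rejects, mean_p = compare()
print("rejects:", rejects)
print("mean p-value:", mean_p)
```

With known parameters the reject count should sit near alpha times the
number of trials; with same-sample estimates the fitted curve hugs the
data, so rejections become rare and the average p-value rises.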
Do you really need to fit a distribution? You could make a histogram
or empirical cdf. You could use the ksdensity function to get a
kernel-smoothing estimate of the density. You might also want to look
at some of the Statistics Toolbox demos listed under "fitting
distributions to data." There are examples of fitting mixtures of
distributions, and of separately modeling the tails of the data.
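As a small illustration of the empirical cdf idea, here is a sketch in
Python (standard library only, not MATLAB). The function name
make_ecdf and the sample values are invented for the example; it only
shows that tail probabilities can be read off the data directly,
without committing to a parametric family.

```python
from bisect import bisect_right


def make_ecdf(data):
    """Return a function giving the fraction of `data` that is <= x."""
    xs = sorted(data)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


# Hypothetical sample, just to exercise the function.
ecdf = make_ecdf([0.2, 1.5, 0.7, 3.1, 0.4])
tail = 1.0 - ecdf(1.0)  # empirical estimate of P(X > 1)
print(ecdf(0.7), tail)
```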
-- Tom
"Ziad Hatahet" <hatahetRemoveThis@xxxxxxxxx> wrote in message
news:ef3819b.-1@xxxxxxxxxxxxxxxxxxxxxxxxxx
Hello all,
I have a couple of large collections of data (first one approximately
0.5 million points and the other about 1.8 million, the former being
a subset of the latter), and I want to find a distribution(s) that
best describe(s) them.
The distribution (histogram) of the data decays in an exponential
manner with a "heavy" tail, so I'm currently looking at distributions
such as Weibull, Rayleigh, lognormal, Pareto, etc.
I came across a method in a paper, and they were doing the following
(in addition to comparing the fit visually): estimate the parameters
using MLE for each of the distributions they wanted to test against;
take a random subset consisting of 100 elements (from the whole data
set); conduct the Kolmogorov-Smirnov and Anderson-Darling
goodness-of-fit tests on the random subset; repeat 1000 times (i.e.
for 1000 random subsets) and observe the average p-value, based on
which they determined the better fit.
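The procedure described above can be sketched roughly as follows. This
is an illustration in Python (standard library only), not the paper's
code and not MATLAB: the data are synthetic lognormal draws, only two
candidate families (exponential and lognormal) are tried because their
MLEs have closed forms, only the K-S test is applied, and the p-value
is the approximate asymptotic one. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def fit_exponential(data):
    """MLE for the exponential: the scale is the sample mean."""
    mean = math.fsum(data) / len(data)
    return lambda x: 1.0 - math.exp(-x / mean)


def fit_lognormal(data):
    """MLE for the lognormal: mean and std of the log-data."""
    logs = [math.log(v) for v in data]
    mu = math.fsum(logs) / len(logs)
    sigma = math.sqrt(math.fsum((v - mu) ** 2 for v in logs) / len(logs))
    norm = NormalDist(mu, sigma)
    return lambda x: norm.cdf(math.log(x))


def mean_subset_pvalue(data, cdf, trials=1000, n=100, seed=0):
    """Average K-S p-value over `trials` random subsets of size `n`."""
    rng = random.Random(seed)  # same seed -> same subsets per candidate
    total = 0.0
    for _ in range(trials):
        subset = rng.sample(data, n)
        total += ks_pvalue(ks_statistic(subset, cdf), n)
    return total / trials


gen = random.Random(1)
data = [gen.lognormvariate(0.0, 1.5) for _ in range(50_000)]  # synthetic
for name, fit in (("exponential", fit_exponential),
                  ("lognormal", fit_lognormal)):
    # Parameters are estimated once, from the whole data set.
    print(name, mean_subset_pvalue(data, fit(data)))
```

On this synthetic data the lognormal candidate should come out with
the higher average p-value, which is the ranking criterion the method
relies on; whether that average is a sound criterion is a separate
question.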
Now I'm thinking of following the same method for my data. MATLAB
does not have the A-D test; however, it does have both the K-S and
Chi-square GoF tests. I might code the A-D test myself.
Is this a good method to follow? When should I consider the p-values,
when h = 0 or 1 or both (where h is the result of the kstest or
chi2gof test)? What intrigues me is that the K-S test should not give
accurate results when estimating the parameters from the data (Mr.
Peter Perkins mentioned this on several occasions); however, the
authors of the paper I read did use the test, and I have also seen it
used in distribution fitting software (e.g. BestFit and EasyFit).
Hence, I wrote a script in MATLAB to see the results the K-S test
gives. I generated a set of 500000 numbers using randn() (mean = 0,
standard deviation = 1), then for 1000 iterations, I took a subset of
100 numbers from the main set, and ran kstest() twice; the first
using a mean and standard deviation of 0 and 1 respectively, and the
second using the mean and standard deviation estimated from the
original set (500000 points). I then looked at the total number of
rejects and the average p-values. The end results were extremely
close, and sometimes even equal (e.g. 46 rejects for the first vs. 45
rejects for the second out of 1000 trials). And the average p-values
were very close as well. Is using the K-S test in this context
correct then?
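The experiment described above can be reproduced along these lines. It
is a sketch in Python (standard library only) and not the original
MATLAB script: random.gauss stands in for randn(), and the helper
ks_pvalue uses an approximate asymptotic p-value rather than whatever
kstest() computes, so counts will not match the figures quoted above
exactly. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def subset_experiment(pop_size=500_000, trials=1000, n=100,
                      alpha=0.05, seed=0):
    """Return ([rejects_true, rejects_est], [mean_p_true, mean_p_est])."""
    rng = random.Random(seed)
    population = [rng.gauss(0.0, 1.0) for _ in range(pop_size)]
    # Parameters estimated from the full population, not from the subset.
    mu = math.fsum(population) / pop_size
    sigma = math.sqrt(
        math.fsum((v - mu) ** 2 for v in population) / (pop_size - 1))
    cdfs = (NormalDist(0.0, 1.0).cdf, NormalDist(mu, sigma).cdf)
    rejects = [0, 0]
    p_sum = [0.0, 0.0]
    for _ in range(trials):
        subset = rng.sample(population, n)  # 100 points, no replacement
        for k, cdf in enumerate(cdfs):
            p = ks_pvalue(ks_statistic(subset, cdf), n)
            p_sum[k] += p
            rejects[k] += p < alpha
    return rejects, [s / trials for s in p_sum]


rejects, mean_p = subset_experiment()
print("rejects (true params, estimated params):", rejects)
print("mean p  (true params, estimated params):", mean_p)
```

The two columns should be nearly identical for a simple reason: with
500000 points the estimated mean and standard deviation are almost
exactly 0 and 1, so both runs test against practically the same
distribution. The agreement therefore says little about what happens
when the fitted family does not match the data.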
Furthermore, we (my group and I) are thinking that if there is no
"best fit", would it be possible to separate the empirical
distribution into two (or more) parts and fit each one into a
different distribution model?
Thanks in advance,
-- Ziad Hatahet