Re: distribution comparison



Peter Perkins wrote:


Viktor Martyanov wrote:

I am using Distribution Fitting Tool to fit my data to a particular
distribution. I think I can make some conclusions on the basis of
visual results. But how can I quantify differences between actual
data and theoretical distributions (e.g. by comparing their
variances)? Not sure how to do that while evaluating the fits within
Distribution Fitting Tool.

Viktor, there are measures such as the log-likelihood (which the tool
gives you), AIC/BIC (which are simple to compute given the LL), or
Kolmogorov statistics (which you can get from KSTEST). But none of
these scalar statistics are going to tell you as much as just
looking -- there's no way, for example, that the AIC can tell you
that the fit near the mode is right on, but in the tails it drops off
too fast. You presumably care _how_ a fit fails to capture what your
data say, so that you can decide if that aspect is important or not.
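
As a rough illustration of the scalar measures mentioned above, here is a minimal sketch in plain Python (not the MATLAB toolbox functions). The function names are made up for the example, and the log-likelihood is assumed to be the maximized value reported for each fitted distribution. The Kolmogorov statistic is computed against any fully specified CDF passed in as a function, not only the normal.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion, 2k - 2*LL. Smaller is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion, k*ln(n) - 2*LL. Smaller is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF of `data`
    and a fully specified theoretical CDF given as a function."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def exponential_cdf(rate):
    """Return the CDF of an exponential distribution with the given rate."""
    def cdf(x):
        return 1.0 - math.exp(-rate * x) if x > 0 else 0.0
    return cdf
```

For example, `ks_statistic(sample, exponential_cdf(2.0))` gives the distance between a sample and an exponential distribution with rate 2. Like AIC and BIC, this is a single number per fit, so it can rank many fits but cannot say where a fit goes wrong.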

It's unfortunate that there is little general theory to allow you to
test hypotheses of particular families of distributions -- you can
test against specific distributions, but not, for example, things
like, "do my data come from _some_ gamma distribution?" You can do
simulations to try to get at that kind of question, but it's hard to
get the kind of hard p-values you might be looking for.
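
One common way to set up the kind of simulation mentioned above is a parametric bootstrap: fit the family to the data, compute the Kolmogorov statistic against the fitted distribution, then repeat the same fit-and-measure step on many samples simulated from the fitted distribution. The sketch below is an assumption-laden illustration in plain Python, not the method of any particular toolbox. It uses the exponential family instead of the gamma only because its maximum-likelihood fit is one line; all function names are hypothetical.

```python
import math
import random


def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF and `cdf`."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def fit_exponential(data):
    """Maximum-likelihood rate of an exponential: 1 / sample mean."""
    return len(data) / sum(data)


def ks_to_fitted_exponential(data):
    """KS distance between the data and the exponential fitted to them."""
    rate = fit_exponential(data)
    return ks_statistic(data, lambda x: 1.0 - math.exp(-rate * x))


def bootstrap_p_value(data, n_sims=500, seed=0):
    """Approximate p-value for 'the data come from SOME exponential'.

    The parameter is re-estimated on every simulated sample, so the
    reference distribution accounts for having fitted the rate.
    """
    rng = random.Random(seed)
    observed = ks_to_fitted_exponential(data)
    rate = fit_exponential(data)
    n = len(data)
    exceed = 0
    for _ in range(n_sims):
        sim = [rng.expovariate(rate) for _ in range(n)]
        if ks_to_fitted_exponential(sim) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (exceed + 1) / (n_sims + 1)
```

The result is only as fine-grained as the number of simulations allows, and it is an approximation, which is in line with the caution above about hard p-values.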

Hope this helps.

- Peter Perkins
The MathWorks, Inc.


Peter,

Thanks a lot for your detailed answer. I guess the problem with
visual analysis in my case is the size of the dataset. Right now I am
looking at the distribution of 256 4-mer DNA motifs in front of each
gene. Therefore, I end up having to visually check 256 sets of
distribution fits if I am using DFITTOOL. If I proceed with 5-mers or
6-mers it becomes impossible to test all fits within reasonable time.

Besides, I am interested in some number that would be indicative of
goodness-of-fit between the actual data and the theoretical
distribution. Having specific numbers would allow me to select, say,
the 10% of n-mers that are most different from the default
distribution. I do not think such a selection can be made on the
basis of visual inspection.

I also have a couple of questions regarding the statistics you are
talking about. As for KSTEST, I can use it only to find out whether
the data come from a normal distribution or not, is that correct? As
for AIC/BIC, do I use as log(V) (according to Matlab help) the actual
log-likelihood calculated for each distribution?

Thanks a lot in advance.

Viktor