Re: On The Scandal of Induction



Landau Plotken wrote:

"Michael Olea" <oleaj@xxxxxxxxxxxxx> wrote in message
news:KRtrh.35356$Gr2.6562@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
PeskyBee wrote:

<...>

An Empiricist Criterion of Meaning
- Hempel

<...>

Not only that, but who's going to be afraid of Karl Popper?
Although his work on falsifiable propositions is seminal,
his criticisms of induction were too far.

I have not read Popper, so I don't know, though, of course, I have seen
many references to his work, and I did read Hempel's essay above (it's
actually titled The Empiricist Criterion of Meaning), which points out
problems with various criteria, including falsifiability, for a sentence
to have empirical meaning. For example, is the statement "there exists at
least one unicorn" meaningful? It could, in principle, be verified, but
it cannot be falsified.

The question "what tends to confirm an induction?" is at the heart of
pattern recognition and classification algorithms - neural nets, support
vector machines, Bayesian networks, etc, - and so not just a "word game",
but an issue of practical concern. There is, of course, a vast technical
literature on the topic:

"One of the central problems in learning is to balance "goodness of fit"
criteria against the complexity of models. An important development in
the Bayesian approach was thus the realization that there does not need
to be any extra penalty for model complexity. If we compute the total
probability that data are generated by a model, there is a factor from
the volume in parameter space -- the "Occam factor" -- that discriminates
against models with more parameters or, more specifically, against models
that are more complex in a precise information theoretic sense. ..."

http://www.princeton.edu/~wbialek/our_papers/nemenman+bialek_02.pdf

Should read Satosi Watanabe. Ugly Duckling Theorem. Pattern Recognition.

Actually, I did that already, so I have a modest (very modest) familiarity
with the main ideas.

http://www.menem.com/ilya/digital_library/inf_thy/watanabe-60a.pdf

(maybe not the paper you had in mind)

This idea compact description make subjective judgements. Change terms
of decription change size of theory. Concept of "more parameters" depend
you design langauge. Information depends your langauge.

While that is true, there are some nuances.

First, as I am sure you know, if you change the description language the MDL
(Minimum Description Length) changes only by an additive constant. So,
while the "information" (by this measure) depends on the language,
different languages result in the same ranking of the Kolmogorov complexity
of theories.

But there is a more subtle point - there is a better way, an
information-theoretic way, to quantify the complexity of a model than by
counting parameters. Lets see, how can I explain this? The issue is not the
number of coordinate axes in parameter space, but one of volumes (measure).
The number of parameters depends on the chosen coordinate system (i.e.
description language), but the geometry does not. Think, for example, of
PCA (Principle Component Analysis), where you find an eigensystem that
depends only on the geometry of covariance, not the initial choice of
coordinate axes. You get a canonical and minimal coordinate system that
depends only on the data, not on the initial choice of axes. That is
roughly analogous to what I am trying to describe.

The goal here is to eliminate the issue of the (subjective) choice of
parameterization, and arrive at an intrinsic, universal, measure of model
(or data) complexity that quantifies the inherent difficulty in learning a
distribution in terms of prediction accuracy as a function of number of
observations. This can be done - has been done. There are a number of ways
to get at this, but they all involve the rate of change of the
Kullback-Leibler divergence (KLD) between the "true" distribution and
estimated distributions - if that rate is high, the gradient steep, small
changes in parameters leading to large changes KLD, then the distribution
is complex - the steeper the gradient the greater the complexity. Note, and
this is crucial, that this does not depend at all on the particular
parameterization (choice of description language) of the space - it depends
only on its intrinsic geometry (or, if you prefer to think in terms of
level curves, its "topographic map").

Of course that is just a toe-hold, a tiny little beach-head on the vast
landscape of universal complexity measure country. The next step, if I were
up to it this chilly wednesday night in California, would be to relate the
rate of change of the KLD to the divergence of the mutual information
between the past and the future of a stochastic process. I am not upto it
right now, but you can find it all layed out right here:

http://www.princeton.edu/~wbialek/our_papers/bnt_01a.pdf

A third point is that the paper I cited earlier:

http://www.princeton.edu/~wbialek/our_papers/nemenman+bialek_02.pdf

is not about parametric models -- it is about non-parametric (i.e. infinite
dimensional) models of distributions induced from experience (a finite, but
growing, list of exemplars). Without constraints generalization, and hence
learning, is impossible. The space of learnable non-parametric models is
infinite-dimensional, but constrained. Choose your contraints - say
smootheness criteria. Nature does not make jumps (well, ok, maybe it does
make jumps, but the amplitide of the wave function is smooth, and thus
constrained). The same info-theory works: rate of change of KLD, divergence
of mutual information between past and future. Correlation horizons.
Complexity. Quantifiable, Universal. The key insight is that estimation of
regularized non-parametric distributions amounts to a QFT (quantum field
theory), and so the mathematics already exists. But, just as the almosat 60
year old result that information can be quantified has not yet sunk in
(see, for example, just about any post by Ray Scanlion), so will the notion
that complexity can be quantified take some time to sink in.

Ockham useful. Not strict.

Well, yeah, true - which is why it is only one term in optimal (most
probable) generative model selection expressions.

You read more. You understand.

Amen to that.

-- Michael
.