Re: An Inflationary Account - #4 - Why THAT definition - Part B



On Sat, 14 Jul 2007 22:05:47 -0400, "Perplexed in Peoria"
<jimmenegay@xxxxxxxxxxxxx> wrote:

Shannon defined the information content of an answer as (- lg p)
and the entropy of a set of answers as (- Summation p lg p)
where p is a probability and lg represents the logarithm to base 2.
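
As a minimal sketch of these two definitions in Python (the function names surprisal and entropy are labels chosen for this illustration, not Shannon's, and the convention that zero-probability answers contribute nothing is an assumption of the sketch):

```python
import math

def surprisal(p):
    """Information content, in bits, of an answer that has probability p."""
    return -math.log2(p)

def entropy(probs):
    """Entropy, in bits, of a set of answers: the probability-weighted
    average of their surprisals. Zero-probability answers are skipped."""
    return sum(p * surprisal(p) for p in probs if p > 0)

print(surprisal(0.5))        # one fair coin flip: 1.0
print(entropy([0.5, 0.5]))   # fair coin: 1.0
print(entropy([0.25] * 4))   # four equally likely answers: 2.0
```

Writing the sum as p * surprisal(p) keeps the two roles of p visibly separate: one p is the weight, the other feeds the logarithm.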

These definitions are good ones because they enabled Shannon to
build a theory, prove theorems, etc. But what are the mathematical
reasons why the definitions proved to be useful? A Q&A format may
be best.

Q: Well, first of all, isn't it a bit odd to have 'p' appearing
twice in that formula for entropy? In most scientific definitions,
variables only appear once, or at most once per term.

A: Yes, you may be right about 'most scientific definitions', but
there are a few in which a variable appears twice. Look at some of
the black-body radiation formulas, both classical and quantum. It
does seem to be mostly in statistical mechanics, though.

Q: Ok, but WHY does it appear twice in the entropy formula?

A: The two appearances of 'p' have different purposes. The first one
works with the summation to produce a standard 'weighted average'.
Weighted by likelihood of occurrence. The second 'p' is there to
provide the 'surprisal' metric.

Q: Ok, I can see that (1/p) gives a metric of 'surprisal'. But why
take the logarithm?

A: Doing so makes entropy additive. Adding logarithms is like
multiplying before taking a log. You multiply two independent
probabilities to get the compound probability. So you add two
independent entropies to get a compound entropy. If your surprisal
is 3 bits for the answer to one question and 2 bits for the answer
to a second independent question, then your total surprisal for the
answer to both questions is 5 bits.
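
The 3 + 2 = 5 arithmetic can be checked with a short Python sketch. The probabilities 1/8 and 1/4 and the two small distributions are made-up values chosen only to give round numbers; the helper name entropy is likewise just an illustration.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

p_first = 1 / 8    # answer to the first question: 3 bits of surprisal
p_second = 1 / 4   # answer to an independent second question: 2 bits

# Independence means the compound probability is the product.
p_both = p_first * p_second

bits_first = -math.log2(p_first)
bits_second = -math.log2(p_second)
bits_both = -math.log2(p_both)
print(bits_first, bits_second, bits_both)   # 3.0 2.0 5.0

# The same holds for entropies of two independent questions.
first = [0.5, 0.25, 0.25]
second = [0.9, 0.1]
joint = [a * b for a in first for b in second]
print(math.isclose(entropy(joint), entropy(first) + entropy(second)))  # True
```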

Q: But why take the logarithm to base 2?

A: Convenience, mostly. Two-letter 'alphabets' are the rule in
electronics. And choices with only two possibilities are common
everywhere. If you have trouble imagining how much surprisal is
represented by 13 bits, just think of flipping a coin and calling
it successfully 13 straight times. Pretty simple.
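
For a feel of the size involved, here is a small Python check of the coin-flip picture, assuming a fair coin and independent flips:

```python
import math

# Probability of calling a fair coin correctly 13 times in a row.
p_streak = 0.5 ** 13

print(1 / p_streak)           # 8192.0, i.e. one chance in 8192
print(-math.log2(p_streak))   # 13.0 bits of surprisal
```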

Q: I guess the most surprising thing is that a central formula for
information theory would be so simple. Only one variable to plug
in - a probability. Well, actually, it is a set of probabilities,
but still ... You would think that there would be more factors to
consider. Factors having more to do with the problem domain - 'information'.

A: Excellent final question. And just the segue I was looking for.
Thank you.

In fact, it does appear, looking just at the formulas, that they are
not necessarily about information at all. The formulas are really
about probability theory and statistics, rather than information.
In fact, it could be said that 'information theory' itself is simply
an alternative viewpoint on the ancient disciplines of probability
theory and statistics. A viewpoint which prefers to measure things
in bits rather than something else.

I said in my first posting that information theory should be defined
as the discipline that measures things in bits. And lots of things
can be measured in bits - some of them having very little to do with
'information' as it is usually understood. For instance, I am currently
rereading a monograph by John Avery - "Information Theory and Evolution" -
in which he explains photosynthesis by talking about the number of
bits provided to the leaf by each photon from the sun! It is not a
great book, and I wouldn't recommend it. But this is not the only
treatment of biological thermodynamics I have seen that uses this
language. I'll discuss the reasoning and the math behind it in a later
posting. But I will admit that this kind of language can be jarring
and confusing. "The sun provides bits? Ok, but bits of what?" If
Avery had said "Bits of negentropy" or even "bits of information
capacity", I could swallow this terminology, but Avery wants to call
them bits of 'information'. And ATP is the cell's way of storing
'information'. I mention this only to show that Wilkins is faced
with information enthusiasts even crazier than me.

My position is that 'information theory' is about more than just
information. It is about probabilities. And, as such, it can be
applied to lots of things in biology and other sciences. While I
will have to make this case by providing lots of examples of how
the bit-centric viewpoint can be an improvement over the traditional
viewpoint on these things, I don't really see why there should be
opposition to this program of research simply because it is
non-traditional. Though I can see opposition simply because it is
likely to create confusion ... unless I carefully make sure that there
is no information-oriented language there unless it really belongs.
I can see the reasonableness of that ... But ...

... And Then a Miracle Happens ...

It turns out (I claim ... and I know I haven't yet given any evidence
for this claim) that when you do analyze some traditional biological
problems - evolution, for example - according to this program; when
you think of things in terms of 'bits' but you are careful not to
think of the bits in terms of 'information'...; when you are all done,
the temptation is to go back and ask about all those variables which
you had denominated in bits, "What if those 'bits' really did represent
'information'? What kind of 'information' is it, and where does it
reside? And what does it 'mean'?" And very often you can come up with
something pretty reasonable. 'Bits of something-or-other' added to
genomes by natural selection can be interpreted as bits of information
about the environment. And so on.

What I am saying is that I am going to try to find and point out some
uses of the ought-to-be-uncontroversial 'probability bits' in biology.
But then when I am done, I may turn around and say something like
"Now if you re-interpret those probability bits as information bits,
... well ... isn't THAT interesting." Just thought I should warn you.
If you check the bit in the mouth of that horse I am dragging in, you
may see that the horse is a Greek gift. Or something like that.

Next few postings are about information and thermodynamic entropy.
Other than explaining where Avery is coming from, they should be
quite traditional and non-controversial. Then, after that, I'll
start getting into the intended meat of this series - an
information-oriented (or better, 'bit-oriented') look at natural selection
and population genetics. Then, if I still have enough motivation to
continue, I'll try to say something about applications in molecular
biology and perhaps abiogenesis.


Shannon discusses explicitly many of the issues raised here in his
1948 paper in Bell System Tech J, available on-line at
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

Here is what Shannon says about meaning:

"The fundamental problem of communication is that of reproducing at
one point either exactly or approximately a message selected at
another point. Frequently the messages have meaning; that is they
refer to or are correlated according to some system with certain
physical or conceptual entities. These semantic aspects of
communication are irrelevant to the engineering problem. The
significant aspect is that the actual message is one selected from a
set of possible messages."

Here is what Shannon says about the logarithmic measure:

"The logarithmic measure is more convenient for various reasons:
1. It is practically more useful. ...
2. It is nearer to our intuitive feeling as to the proper measure.
3. It is mathematically more suitable."

Each of these points is expanded in the original.

Here is what Shannon says about summing p log p:

"If there is such a measure, say H(p1, p2, ..., pn), it is
reasonable to require of it the following properties:
1. H should be continuous in the pi.
2. If all the pi are equal, pi = 1/n, then H should be a monotonic
increasing function of n. With equally likely events there is more
choice, or uncertainty, when there are more possible events.
3. If a choice be broken down into two successive choices, the
original H should be the weighted sum of the individual values of H."

He then goes on to prove (Appendix 2) that H = - K sum pi log pi,
with K a positive constant, is the only measure H that satisfies
these three properties.
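
Properties 2 and 3 can be spot-checked numerically. The Python sketch below takes K = 1 and base-2 logarithms (both assumptions made for the illustration) and uses the probabilities 1/2, 1/3, 1/6, which are the ones Shannon uses in the paper to illustrate property 3: choosing among the three directly should give the same H as first choosing between two halves and then, half the time, choosing between 2/3 and 1/3. This checks one instance; it is not a proof of uniqueness.

```python
import math

def H(*probs):
    """Entropy in bits (K = 1, log base 2) of the given probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Property 3: a choice broken into two successive choices.
direct = H(1/2, 1/3, 1/6)
two_stage = H(1/2, 1/2) + (1/2) * H(2/3, 1/3)
print(math.isclose(direct, two_stage))   # True

# Property 2: with n equally likely events, H increases with n.
uniform = [H(*([1 / n] * n)) for n in range(1, 6)]
print(all(a < b for a, b in zip(uniform, uniform[1:])))   # True
```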

Here is what Shannon says about entropy:

"The form of H will be recognized as that of entropy as defined in
certain formulations of statistical mechanics where pi is
the probability of a system being in cell i of its phase space. H is
then, for example, the H in Boltzmann's famous H theorem. We shall
call H = - sum pi log pi the entropy of the set of probabilities
p1, ..., pn."

Incidentally, as pointed out by Shannon, taking logarithms to base 2
merely establishes a unit for measuring this 'entropy'. You can use
different bases and use different units. Using base 2 gives
information measured in units called 'bits'.
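
A short Python sketch of the change of units, using a made-up three-answer distribution. The base parameter and the variable names are choices made for this illustration; "nats" (base e) is the usual name for the natural-log unit, and base 10 gives decimal digits.

```python
import math

def entropy(probs, base=2):
    """Entropy of a distribution, in units set by the logarithm base."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]

bits = entropy(probs, 2)         # base 2: bits
nats = entropy(probs, math.e)    # base e: nats
digits = entropy(probs, 10)      # base 10: decimal digits

# Changing the base only rescales the result by a constant factor.
print(math.isclose(nats, bits * math.log(2)))       # True
print(math.isclose(digits, bits * math.log10(2)))   # True
```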

And it is indeed true that this information theory is simply a theory about
probabilities and probability distributions. Shannon's entropy is, as
he stated, simply a measure on a probability distribution.

