Re: Book-able view of ID as speculative science



Lilith (Deanne Taylor) wrote:
> There are two good books on recognizing and classifying patterns --
> there are others, but I think two good ones are a place to start; one
> is called "Pattern Recognition" by Theodoridis and Koutroumbas,
> published by Academic Press. The other is "Pattern Classification" by
> Duda, Hart, and Stork published by Wiley Interscience. I'll call the
> former PR, the latter PC.
>
> PR's chapters include classifiers based on Bayes decision theory,
> linear classifiers, nonlinear classifiers, feature selection (a hot
> topic), feature generation, several chapters on clustering. PC goes
> into the same kinds of topics but does go a bit into genetic
> programming, machine learning, and a few exotic methods. If you're
> interested in how pattern classification or pattern recognition is
> done, there are plenty of these kinds of works out there in the
> engineering and image processing disciplines.
>
> That said, there are many different methods used to do searching for
> patterns. In biology, "function" (however you wish to define it here)
> is important in context. Many methods applied to the genome are
> supervised methods, which search for features with some kind of
> knowledge of the thing they are looking for. This knowledge can be scant
> or based on a calculated probability based on how well a sequence
> matches a known profile. There are also unsupervised models, that
> look for patterns in data without having a priori knowledge of the
> thing they are looking for, though there is knowledge inherent in the
> parameters of the model. You basically either have to know what you're
> looking for (albeit loosely) in supervised models, or have a strict
> method of searching for something and have a good definition of what
> constitutes a "pattern" so you can evaluate your results (unsupervised
> method).


This seems to imply that they are looking for something related to
biology, not just "patterns" per se. They have in mind up front what
they are looking for.
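
To make "how well a sequence matches a known profile" concrete, here is
a minimal Python sketch of the supervised idea. The example sites, the
uniform background, the pseudocount, and the function names are all my
own assumptions for illustration; none of it is taken from a real
gene-finding program.

```python
import math

# Hypothetical aligned examples of a known motif (made up for illustration).
KNOWN_SITES = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
BACKGROUND = 0.25  # assumption: all four bases equally likely in the background


def build_profile(sites, pseudocount=1.0):
    """Per-position base probabilities estimated from the known sites."""
    profile = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudocount
        profile.append(
            {b: (column.count(b) + pseudocount) / total for b in "ACGT"}
        )
    return profile


def log_odds(window, profile):
    """Log2 odds that the window came from the profile rather than background."""
    return sum(math.log2(col[b] / BACKGROUND) for b, col in zip(window, profile))


def best_hit(sequence, profile):
    """Slide the profile along the sequence; return (score, offset) of the best window."""
    width = len(profile)
    return max(
        (log_odds(sequence[i:i + width], profile), i)
        for i in range(len(sequence) - width + 1)
    )


profile = build_profile(KNOWN_SITES)
score, offset = best_hit("GGCGTATAATGCCG", profile)
print(f"best window starts at {offset} with log-odds {score:.2f}")
```

Note that the search can only ever report how closely a window resembles
the examples it was given up front. It has no way to flag something it
was not told to look for.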


> The biggest sandtrap is when you throw any random sequence
> into a bunch of unsupervised methods. You'll always get something out
> that looks interesting, but interpreting it so that it means something
> is another matter.
>
> That said, I'm surprised "ID" people haven't gone into the human
> genome and thrown every pattern matching algorithm they could against
> it, try to find some random signature that is as likely as any other,
> but be obscure about it and insist it indicates special design for some
> theological reason unconnected to the actual genomics. It might sound
> like fruitcake on a plate, but it can't be any worse than the whole
> irreducible complexity argument.
>
> As far as supervised methods go, there are many successful ones. One
> example is a Bayesian method for gene prediction, as in programs like
> GenScan, which use pre-existing knowledge of gene structure to predict
> whether or not a span of DNA is likely to contain a certain feature of
> a gene. There are programs like RepeatMasker, which give the likelihood
> that a gene sequence contains a signature of a retroelement (like
> viral sequences). There are many other algorithms that do pattern
> searching/matching on known characteristic signatures. There are
> several other supervised methods that try to find characteristics based
> on structures or features that are not well-defined sequence features,
> like helix searching, protein motif searching, etc.
>
> See for example papers in the list here:
>
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Display&dopt=pubmed_pubmed&from_uid=15852508
>
>
> In unsupervised methods, the model assumes no knowledge a priori and
> goes out "mining" for interesting results. Those are most difficult
> because the information isn't very valuable without biological context.
> There are some papers that are successful in showing some of these
> methods, here are some:
>
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Display&dopt=pubmed_pubmed&from_uid=14571370
>

Look at the abstract from #3 here:

"Novel tools are needed for comprehensive comparisons of interspecies
characteristics of massive amounts of genomic sequences currently
available. An unsupervised neural network algorithm, Self-Organizing
Map (SOM), is an effective tool for clustering and visualizing
high-dimensional complex data on a single map. We modified the
conventional SOM, on the basis of batch-learning SOM, for genome
informatics making the learning process and resulting map independent
of the order of data input. We generated the SOMs for tri- and
tetranucleotide frequencies in 10- and 100-kb sequence fragments from
38 eukaryotes for which almost complete genome sequences are available.
SOM recognized species-specific characteristics (key combinations of
oligonucleotide frequencies) in the genomic sequences, permitting
species-specific classification of the sequences without any
information regarding the species. We also generated the SOM for
tetranucleotide frequencies in 1-kb sequence fragments from the human
genome and found sequences for four functional categories (5' and 3'
UTRs, CDSs and introns) were classified primarily according to the
categories. Because the classification and visualization power is very
high, SOM is an efficient and powerful tool for extracting a wide range
of genome information."

(end quote)

They are basically measuring similarities and graphing the result as
a presentation. Something like this won't find, say, an image of the
Mona Lisa or a formula for geometric buildings hidden in there.

While these methods may identify some "patterns" per se, they are mostly
tuned for biological research purposes, not for finding intelligent encoding.
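
Here is a rough Python sketch of the kind of similarity measure
involved. It is not the SOM from the abstract; it only computes the
tetranucleotide frequency vectors that such a map would take as input,
plus a plain Euclidean distance between them. The fragments and the
function names are made up for illustration.

```python
from itertools import product
import math

K = 4  # tetranucleotides, as in the abstract
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_frequencies(fragment):
    """Return the fragment's 256-element tetranucleotide frequency vector."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(fragment) - K + 1):
        word = fragment[i:i + K]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid dividing by zero on short input
    return [counts[w] / total for w in KMERS]


def distance(u, v):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Made-up fragments: two AT-rich, one GC-rich.
frag_a = "ATATTATAATATTTAATATA" * 5
frag_b = "TATAATATTTATAATTATAT" * 5
frag_c = "GCGCCGGCGGCCGCGGCGCC" * 5

va, vb, vc = (kmer_frequencies(f) for f in (frag_a, frag_b, frag_c))
print("a vs b:", round(distance(va, vb), 3))
print("a vs c:", round(distance(va, vc), 3))
```

The two AT-rich fragments come out closer to each other than either is
to the GC-rich one. That is all the vector captures: composition. The
order and arrangement of the words along the fragment, which is where
any encoded message would live, is thrown away before the comparison
even starts.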

A good many of them seem devoted to pattern matching itself, not really
to the nature of the patterns. For example, one may find 8 occurrences
of a given pattern, but it says nothing about that pattern itself (other
than maybe matching it against a library of patterns). If the 8-repeat
were an image of the Mona Lisa, probably nobody would catch that,
because they are not looking for such. They are mostly looking for
similarities within sequences, among similar species, different species, etc.
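
As a toy illustration in Python (my own sketch, with a made-up sequence
and a hypothetical function name, not any published tool), a repeat
finder boils down to something like this:

```python
from collections import Counter


def repeated_words(sequence, k, min_count=2):
    """Count every length-k substring; keep those seen at least min_count times."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {word: n for word, n in counts.items() if n >= min_count}


# Made-up sequence with the word GATTACA planted 8 times.
sequence = "CCGT".join(["GATTACA"] * 8)
hits = repeated_words(sequence, k=7, min_count=8)
print(hits)
```

The result is a word and a count. Nothing in it says what the repeated
word is or whether it means anything; that judgment has to come from
somewhere else.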

They are essentially cross-reference engines. While such tools may be
useful in searching for intelligent patterns, cross-referencing is a
fairly narrow technique and should not be considered the only or best
approach.


>
> Enjoy --
> Deanne

-T-
