Re: Optimizing large-scale Markov chain handler
- From: "Axel Etzold" <AEtzold@xxxxxx>
- Date: Tue, 31 Jul 2007 01:15:51 +0900
Dear Giuseppe,
Although this works remarkably well, there are huge memory issues:
learning my test
text (210K+ words, with 20K+ unique words) is unfeasible for orders >
4 (at order 4
it takes about half a gigabyte of memory).
If you have 20000 unique words and want to represent the probabilities
of an n-chain of them, you'll need (20000)^n values, i.e.,
1.6*10^17 for n=4, so the sparsity of your data is such that there is
maybe one in a billion 4-chains actually represented in your probability hash ( a hash entry will actually need more than a byte, and that increases the sparsity).
So I don't think you'll be able to reduce the need in memory a great
deal ...
I could think of two things (not necessarily disjunct).
1.) Group individual words together to get classes of words,
and calculate the probabilities for those fewer classes. I'm not
quite sure how well this will work for you, but I'd try to use some
kind of Huffman coding concept on the n-chains:
http://en.wikipedia.org/wiki/Huffman_coding#n-ary_Huffman_coding
I'd then break off building the tree somewhere and put the remaining
words into a class of seldom words ... This would introduce
some overhead of getting back to words from the classes, though,
and there will be many words that have different meanings that are
nevertheless equally seldom :(
2.) Be less strict with the probability values -- you could say that
the probability of any particular 6-chain is the product of the probabilities of the five successive two-chains involved in it
(independence hypothesis), unless you can reject that hypothesis by a
statistical test (I'd use the chi-square test - the Ruby code can be
found in the package statistics2 by Shin-ichiro Hara).
So keep only the two-chain probabilities and those longer chain probabilities which deviate considerably from the independence hypothesis,
leading to rejection in the chi-square test.
You can influence the amount of data thus created by changing the
alpha value of the test.
There is a book by Jeremy Spinrad about "Efficient Graph Representations"
where you can find further ideas about how to represent a graph .
And, something else:
What is the probability of an n-chain of words that wasn't actually
encountered in the training set ? It's not zero, even though the
Hash created from the training data says so .....
If you think of the incidence matrix A=(a_ij) representing the probabilities of finding a 2-chain of words i-j, all entries with
no such Markov chain are zero. Now, in a technique called "Latent Semantic
Analysis"
http://en.wikipedia.org/wiki/Latent_semantic_indexing
this matrix is reconstructed from a product of matrices of lower
rank, introducing non-zero entries in the previously zero entries.
Thus all of a sudden, there is a small probability of finding that
particular chain now.
I join Ed Borasky in suggesting some number-crunching computer language
+ Ruby interface for dealing with big amounts of data, but maybe
you can reduce it all a bit ?
Best regards,
Axel
Now, the final application (this is a plugin for the ircbot rbot) will
store the data in a bdb,
which will enormously ease the memory load, but I was wondering if it
was possible to
optimize memory usage in a way that would allow the code to be used
'live' without
some on-disk backend, while still being able to go up to order 6 (at
least).
Any suggestions?
--
GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail
.
- Follow-Ups:
- Re: Optimizing large-scale Markov chain handler
- From: Oblomov
- Re: Optimizing large-scale Markov chain handler
- References:
- Optimizing large-scale Markov chain handler
- From: Oblomov
- Optimizing large-scale Markov chain handler
- Prev by Date: BUG(?): 1.8.6-p36 broke compiling C++ extensions
- Next by Date: array.include?
- Previous by thread: Re: Optimizing large-scale Markov chain handler
- Next by thread: Re: Optimizing large-scale Markov chain handler
- Index(es):
Relevant Pages
|