Searching Google n-gram corpus
- From: bobterwillinger@xxxxxxxxx
- Date: Sat, 08 Sep 2007 14:36:16 -0000
(Also posted in the SQL group but got no replies; apologies if that's bad
etiquette.)
Hi,
Google released a corpus of n-grams collected from the Web.
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-...
It contains all 1- to 5-grams that occur more than 40 times in their web
crawl. It comes as 5 folders, each folder containing around 120 files.
Each file contains 10,000,000 (10^7) lines. A line looks like:
"this is a four gram 65"
where the last number is the frequency of that exact phrase.
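A minimal sketch of reading one such line in Python, assuming only that the count is the last whitespace-separated field (the name parse_ngram_line is made up for illustration and is not part of the corpus distribution):

```python
from typing import List, Tuple


def parse_ngram_line(line: str) -> Tuple[List[str], int]:
    """Split one corpus line into its words and its frequency."""
    # Assumption: the frequency is the final whitespace-separated field,
    # and everything before it is the n-gram itself.
    text, count = line.rstrip("\n").rsplit(None, 1)
    return text.split(), int(count)


words, freq = parse_ngram_line("this is a four gram 65")
print(words, freq)
```

Splitting from the right keeps the parser independent of n, so the same function would serve all five folders.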
The total unzipped size of the 3-grams alone is 19GB, each individual
file around 200MB.
All the unzipped data is around 100GB.
I would like to be able to search through all this and return all
lines that contain a particular word or phrase.
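The simplest baseline for this is a streaming scan over the unzipped files, much like grep. The sketch below assumes the files are plain UTF-8 text with the count as the last field; search_files and the glob pattern passed to it are hypothetical names, not something the corpus provides. It reads every line on every search, so it says nothing about how fast an indexed lookup could be.

```python
import glob
from typing import Iterator, List


def search_files(pattern: str, phrase: str) -> Iterator[str]:
    """Yield every line whose n-gram contains `phrase` as a run of whole words."""
    target: List[str] = phrase.split()
    k = len(target)
    if k == 0:
        return
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Drop the trailing count so it can never match a query word.
                words = line.split()[:-1]
                # Slide a window of k words across the n-gram.
                if any(words[i:i + k] == target
                       for i in range(len(words) - k + 1)):
                    yield line.rstrip("\n")
```

A call such as search_files("5gms/*", "four gram") would then stream matching lines one at a time without loading a whole file into memory; the directory name here is only an example.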
I have no idea where to start with this, but I was wondering whether an
SQL database would be feasible. For the 5-grams I would need about a
billion rows of 6 columns. What sort of hard disk space would I need, and
what kind of time would I be looking at per search on an ordinary machine?
I would like to be able to find every line where a particular word
occurs, no matter which position it occurs in, and ideally I would
like to be able to find particular bigrams as well.
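One possible shape for the SQL idea, sketched with Python's built-in sqlite3 module on a few in-memory rows: instead of one column per word, each n-gram gets one row in an ngram table plus one row per word in a posting table, so a word can be found at any position through a single index. The table, column, and function names are all hypothetical, and this is an illustration of the layout only, not a measured answer to the disk space or timing questions. At a billion 5-grams the posting table would hold about five billion rows.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with an n-gram table and a word index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ngram (id INTEGER PRIMARY KEY,
                            text TEXT NOT NULL, freq INTEGER NOT NULL);
        CREATE TABLE posting (word TEXT NOT NULL, pos INTEGER NOT NULL,
                              ngram_id INTEGER NOT NULL);
        CREATE INDEX posting_word ON posting (word, ngram_id, pos);
    """)
    return conn


def add_line(conn: sqlite3.Connection, line: str) -> None:
    """Store one corpus line and one posting row per word position."""
    text, freq = line.rstrip("\n").rsplit(None, 1)
    cur = conn.execute("INSERT INTO ngram (text, freq) VALUES (?, ?)",
                       (text, int(freq)))
    conn.executemany(
        "INSERT INTO posting (word, pos, ngram_id) VALUES (?, ?, ?)",
        [(w, i, cur.lastrowid) for i, w in enumerate(text.split())])


def find_word(conn: sqlite3.Connection, word: str) -> list:
    """Return (text, freq) for every n-gram containing `word` anywhere."""
    return conn.execute(
        "SELECT text, freq FROM ngram WHERE id IN "
        "(SELECT ngram_id FROM posting WHERE word = ?) ORDER BY id",
        (word,)).fetchall()


def find_bigram(conn: sqlite3.Connection, first: str, second: str) -> list:
    """Return (text, freq) for every n-gram where `second` directly follows `first`."""
    return conn.execute(
        "SELECT text, freq FROM ngram WHERE id IN "
        "(SELECT a.ngram_id FROM posting a JOIN posting b "
        " ON b.ngram_id = a.ngram_id AND b.pos = a.pos + 1 "
        " WHERE a.word = ? AND b.word = ?) ORDER BY id",
        (first, second)).fetchall()


demo = build_db()
for sample in ["this is a four gram 65", "four score and seven years 120"]:
    add_line(demo, sample)
print(find_word(demo, "four"))
print(find_bigram(demo, "four", "gram"))
```

The bigram query joins the posting table to itself on adjacent positions, which is why the position is stored alongside each word.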
Thanks.