Re: Boolean query algorithms
- From: "Sherrie Laraurens" <sherrielaraurens@xxxxxxxxxxx>
- Date: 12 Jul 2006 03:04:32 -0700
Does that mean the query is broken up into 2 processes that are
delegated to unique "chunks" and executed in parallel say like so:
1. find all the documents that have the word bat in them -> v1
2. find all the documents that have the word cat in them -> v2
if thats the case how do they reduce the complexity of:
find the intersection between v1 and v2 where the words (text
patterns) "Bat" and "Cat" are consecutive (and please don't think by
consecutive I mean "Bat" word word word "Cat" or any other form -
just plain old simple consecutive)
Best case join complexity is O(nm) and from there you still have to
make sure that Bat and Cat are consecutive, which is another O(n)
scan.
I'm well aware of the breaking down and distribution of the problem
over large clusters etc, its just I'm unsure of the details of the
final steps mainly the join and the final filtering process. also
another problem would be partial match queries...
Sherrie
Brian Wakem wrote:
Sherrie Laraurens wrote:
Hi Brain,
I'm not interested in the variations of logical predicates you
presented, also my question relates to how said queries be them sql
or something else are actually executed in a timely fashion.
The phrase "I have written a boolean search engine myself" followed
by "It simply generates the SQL queries" basically means you didn't
write anything of importance just a DB front end and sql generator
for an actual db which is doing all the dirty work for you, I guess
anyone with a couple of hours on their hands and an SQL compatible db
can say that they too have written a search engine of sorts, but alas
this was not the answer(s) I was looking for.
Then your question relates to scaling and is nothing to do with boolean
searches at all.
Google use 'chunck servers'. The data is split into say 10,000 parts, each
part stored on 1 server. Each of those servers is then replicated by say
10 others. When you a query need answering it is sent to the relevant
chunk servers which process in parallel and then feed the results back
where they are combined.
For a single query this is slightly less than 10,000 times faster than
storing everything on a single machine with a large disk array.
For multiple parellel queries it is even faster as the query can be sent to
any of the 10 chunck servers that hold each chunk of data.
That would be just 1 datacentre, google then copy the whole lot over dozens
of datacentres (not a copy in reality as they all seem to hold different
data).
Note: The numbers I have quoted are guesses.
--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png
.
- Follow-Ups:
- Re: Boolean query algorithms
- From: Brian Wakem
- Re: Boolean query algorithms
- References:
- Boolean query algorithms
- From: Sherrie Laraurens
- Re: Boolean query algorithms
- From: Brian Wakem
- Re: Boolean query algorithms
- From: Sherrie Laraurens
- Re: Boolean query algorithms
- From: Brian Wakem
- Boolean query algorithms
- Prev by Date: Re: Boolean query algorithms
- Next by Date: Re: Boolean query algorithms
- Previous by thread: Re: Boolean query algorithms
- Next by thread: Re: Boolean query algorithms
- Index(es):
Relevant Pages
|