Re: If he bring his action
- From: msb@xxxxxxx (Mark Brader)
- Date: Tue, 01 Dec 2009 14:36:43 -0600
James Hogg:
The figure 643 seems very low in comparison to the 36,800,000 that
Google finds initially. Can there be so many repetitions?
Donna Richoux:
No, that's not it, as you will see if you take up their offer to show
the "omitted results."
Yes, Google estimates there are 36,000,000 whatevers in its database,
but it only *shows* you, say, 643. Often it will show you up to 999
Actually the limit is 1,000.
(we don't know why it doesn't go up that high sometimes), but nothing can
coax it to go beyond these limits. The hits are "out there" somewhere, and
some *other* search might display those pages, but its routine for
forming lists of hits will not go beyond these low ceilings.
Here's my guess. And it is only a guess; we can be pretty sure we're
talking here about algorithms that are protected as trade secrets.
1. The index that they use to find entries in their database includes
data not only about where to find the entries, but also how many
there are and some sort of contextual information that can be used
for the (clearly less reliable) estimates for phrase searches.
2. When you start a search, the first thing Google does is to
*use this data* to estimate how much of the whole database it
will need to examine in order to find the maximum 1,000 hits,
where "how much" is measured in some sort of units internal to
the database storage system.
3. Google then constructs a cache *of that much of the database*
and saves it, indexed by a key derived by hashing your specific
search details. Also cached is some information about which
server responded to your query, presumably based on your IP
address or something derived from it.
4. To construct the results page served to you, it scans the *cached
database entries*. If you then ask for additional pages of hits,
it returns to the cache to construct them.
5. When it returns any result page, if it did not find enough hits
to fill the page, it corrects the estimated number to match the
actual one, dropping the word "about". This is when you see
"Results 501-600 of about 108,000,000" followed on the next page
by "Results 601-643 of 643". And if you repeat the search later,
you get the same results, because the cache persists for hours
if not days. There is no way to ask it to search *more* of the
main database.
6. If you ask for "repeats included", it still returns to the same
cache. So if the estimate in step 2 was low (and in my experience
when I've done this, it *usually* is), then you still don't get
1,000 hits. But if it was high, and you do step through and get
to 1,000, then you never find out how many it would have served
if you kept going, before the cache was exhausted.
7. They think this is okay because they assume people are using their
searches to quickly get to the pages they most want to see, and
nobody really wants to step through as many as 1,000 pages, let
alone millions. (Or in other words, "320K is as much memory as
anyone could ever want". But they're right -- if you search for
something you think is on the web, how many hits do you look at
before deciding you need to try a different search?)
So they aren't considering people who are mainly interested in using
the results pages themselves to compile statistics -- or, at least,
they aren't considering such users *to be commercially important*.
I repeat, all of this is just my conjecture. But it makes sense to me.
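For anyone who wants to play with the idea, here is a toy Python model of
steps 2 through 6. It is only a sketch of the conjecture above, not anything
Google is known to do. Every name in it (ConjecturedSearch, estimate_total,
estimate_slice, the 100-hit page size) is made up for illustration, and the
"database" is just a list of strings scanned front to back.

```python
import hashlib

MAX_HITS = 1000   # display ceiling mentioned in the post
PAGE_SIZE = 100   # hits per page, as in "Results 501-600" (assumed setting)


class ConjecturedSearch:
    """Toy model of the conjectured behaviour; all names are hypothetical."""

    def __init__(self, database, estimate_total, estimate_slice):
        self.database = database              # list of document strings
        self.estimate_total = estimate_total  # step 1: index-based hit estimate
        self.estimate_slice = estimate_slice  # step 2: how much to examine
        self.cache = {}                       # step 3: keyed by hashed query

    def _cached_slice(self, query):
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key not in self.cache:
            # Step 3: copy "that much of the database" into the cache.
            self.cache[key] = self.database[: self.estimate_slice(query)]
        return self.cache[key]

    def page(self, query, page_no):
        # Step 4: every page is built from the cached slice only.
        hits = [d for d in self._cached_slice(query) if query in d][:MAX_HITS]
        start = (page_no - 1) * PAGE_SIZE
        shown = hits[start : start + PAGE_SIZE]
        if not shown:
            return f"No results beyond {len(hits)}", shown
        first, last = start + 1, start + len(shown)
        if len(shown) < PAGE_SIZE:
            # Step 5: a short page reveals the real count; "about" is dropped.
            header = f"Results {first}-{last} of {len(hits)}"
        else:
            header = (f"Results {first}-{last} of about "
                      f"{self.estimate_total(query):,}")
        return header, shown


# Demo: 643 matches fall inside the examined slice, more lie beyond it.
database = ([f"foo {i}" for i in range(643)]
            + ["bar"] * 100
            + ["foo again"] * 500)
engine = ConjecturedSearch(
    database,
    estimate_total=lambda q: 108_000_000,  # made-up estimate
    estimate_slice=lambda q: 743,          # step 2 guessed too low
)
print(engine.page("foo", 6)[0])
print(engine.page("foo", 7)[0])
```

In this model the 500 later matches are never reachable, because the slice
size chosen in step 2 is fixed once the cache entry exists. And when a full
1,000 hits are found, the last page is a full one, so the "about" never goes
away, which is the situation in step 6.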
--
Mark Brader | "It is only a guess, of course.
msb@xxxxxxx | I hope none of you ever finds out for certain."
Toronto | -- Insp. Grandpierre (Peter Stone, "Charade")
My text in this article is in the public domain.