OT: Google search counts are just estimates



Here's part of the Numbers Guy column from the Wall Street Journal online.
===
[...]
Perhaps a recent study will help sink the journalistic shorthand. Prompted
by Yahoo's assertion last month on its blog that its search index contains
about 19.2 billion Web pages -- far more than Google, which said on its home
page this week that it searches fewer than 8.2 billion -- two recent college
graduates set out to investigate the claim.

Their intent was to test whether Yahoo really did consistently return more
results than Google. But in the process, they discovered something else: The
total number of "hits" estimated by Yahoo and Google for searches was often
significantly higher than the actual number of pages returned by the search
engines. (The study was mentioned in a Wall Street Journal article that
pointed out that both companies self-report the size of their indexes,
making it difficult to check their claims.)

The study, by Matt Cheney and Mike Perry, who attended the University of
Illinois together, used automated software to click through and count
individual results for searches that returned fewer than 1,000 Web pages
(both Yahoo and Google only allow users to click on the first 1,000 results
for a particular search).

Their first test used searches of two words randomly paired from a
dictionary, and found that Google was overstating its results count by about
9%, and Yahoo by nearly 300%. Prompted by criticism that the study was
finding many "word list" Web sites, which include long lists of words to
fool search engines and lure users to ad-laden pages, the researchers then
ran a second study, pairing two random words and excluding a third. (For
example, "bind experientialism -repelling" or "hydrothorax
gallstones -spiteful.") That found inflation of nearly 100% by Google and of
more than 300% by Yahoo.

[...]

It's no surprise that the search engines' hit counts aren't exact, but the
researchers found that in nearly every case, the estimates were too high; a
properly designed estimation technique should be too low half the time, and
too high the other half. "Those numbers struck us as really quite
misleading," Mr. Cheney told me. He adds, "It seems silly to put the
estimates up there [on search-results pages] without the recognition that
these are really silly results."

Yahoo and Google both acknowledge that their estimates are just that, using
the word "about" before providing the count on their results pages. MSN and
Ask Jeeves don't, and offer precise counts: For a recent search on "Wall
Street Journal," MSN said it found 12,247,832 pages, and Ask Jeeves found
4.575 million.

Of the four sites, only Google agreed to talk to me in any detail about how
it arrives at its estimates. Peter Norvig, the company's director of search
quality, explained that the company's search index ranks Web pages by what
it considers "quality" (for instance, how many non-spam pages link to it).
In order to return results quickly, Google's computers start by scanning the
top-quality pages, and then move down the list, stopping before the end --
usually well before half-way -- so that the site can return the top 10
results in less than a second. Each time a user clicks through to the next
page of search results, Google digs deeper into its index, yielding more
good hits and recomputing the estimate.

For a search like "Wall Street Journal," with three different terms (I
didn't put quotes around the phrase), Google compiles a list for each term
of pages that match it. Then it cross-matches the lists to find pages that
contain all terms. The percentage of all pages that contain "wall," and also
"street" and "journal," is extrapolated to the entire index, and that's how
the estimate is computed. That means the estimate for a multi-term search is
likely to be less accurate than for a single-term search, because each term
adds to the uncertainty.

Occasionally, Google inserts "fudge factors," Mr. Norvig said, to improve
estimate accuracy. He wasn't sure when the company last did this, but said
it was before last November, when Google announced a doubling of its search
index to more than eight billion pages. "I don't think we've looked at it
since then, so I think we probably should," he said. As for the study by
Messrs. Cheney and Perry, he said, "I don't have any reason to doubt their
methodology. For the queries they did, I would believe their accurate
reporting."

The bottom line, said Mr. Norvig, is that getting an accurate estimate isn't
that important for most of Google's users, so the company hasn't invested
much time and computing power. "It's only reporters and computational
linguists who care if it's really precise," he said.

[...]
===
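
To make the percentages in the column concrete: I read "overstating by X%" as (estimated - actual) / actual. That reading is my assumption, since the excerpt doesn't spell out the formula. The counts below are made up for illustration and are not taken from the study.

```python
def inflation_pct(estimated: int, actual: int) -> float:
    """Percent by which an estimated hit count exceeds the counted results.

    Assumed definition (not stated in the column):
    100 * (estimated - actual) / actual
    """
    if actual <= 0:
        raise ValueError("actual count must be positive")
    return 100.0 * (estimated - actual) / actual


# Hypothetical counts, not from the study.
print(inflation_pct(400, 100))  # 300.0 -> the estimate is four times the real count
print(inflation_pct(109, 100))  # 9.0
```

Under that reading, "nearly 300%" would mean the reported estimate is roughly four times the number of pages you can actually click through to.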

Any comments from the computational linguists out there?
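
For anyone who wants something to poke at, here is a toy sketch of the extrapolation as I understand it from Norvig's description: scan only the top part of a quality-ranked index, cross-match the per-term lists there, and scale the match rate up to the full index. The function name, the set-of-page-ids data structure and the `scan_limit` parameter are all my own inventions for illustration. Nothing here is meant to describe Google's actual code.

```python
def estimate_hits(postings, index_size, scan_limit):
    """Estimate how many pages contain every query term.

    postings   -- one set of page ids per query term (hypothetical structure;
                  ids are assumed to be quality ranks, 0 = best page)
    index_size -- total number of pages in the index
    scan_limit -- only pages with id < scan_limit are examined
    """
    if not postings:
        return 0
    # Cross-match the per-term lists, but only within the scanned prefix.
    scanned = [{p for p in plist if p < scan_limit} for plist in postings]
    matches = len(set.intersection(*scanned))
    # Extrapolate the match rate in the prefix to the whole index.
    return round(matches / scan_limit * index_size)


# Toy index of 1,000 pages: term A is on every 2nd page, term B on every 3rd.
term_a = set(range(0, 1000, 2))
term_b = set(range(0, 1000, 3))
print(estimate_hits([term_a, term_b], index_size=1000, scan_limit=300))  # 167
```

In this toy case the estimate matches the true count (167 pages contain both terms) because the terms are spread evenly across the ranking. If matching pages cluster among the top-ranked pages, the same extrapolation would overshoot, which at least points in the same direction as the study's findings.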

