Re: How do search engines index multilingual content?



On Mon, 30 Jan 2006, Philip Ronan wrote:

> Andreas Prilop wrote:
>
> > Google Groups still ignore the charset parameter of Usenet
> > articles. Instead they use the group name and I-don't-know-
> > what-else to select an encoding for an article.
>
> That's an inevitable problem caused by putting multiple articles
> (with different charsets) in a single web page.

I can't agree.

Mozilla's Bugzilla made the same mistake, and some of the
charset-related bug reports are simply incomprehensible as a
consequence - they contain a mish-mash of Chinese, Cyrillic and
whatever else, in their different encodings, served out as raw bytes.
But the mistake was made many years back...

At least their discussion shows that they have recognised their
mistake, and understand how to correct it - mapping the various
encodings into Unicode, and serving out the results accordingly -
probably in utf-8.
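
To make that concrete, here is a minimal Python sketch of the idea: decode each article using its own declared charset, then emit the whole page in a single encoding. The sample articles, the function names and the fallback behaviour are my own assumptions for illustration; this is not Bugzilla's or Google's actual code.

```python
# Hypothetical sample data: (declared charset, raw body bytes) per article.
articles = [
    ("koi8-r", "Привет".encode("koi8-r")),
    ("windows-1251", "Привет".encode("windows-1251")),
    ("big5", "中文".encode("big5")),
]

def to_unicode(raw, charset):
    """Decode one article body using the charset it declared."""
    try:
        return raw.decode(charset)
    except LookupError:
        # Unknown charset label: an assumed fallback, not a standard rule.
        return raw.decode("utf-8", errors="replace")
    except UnicodeDecodeError:
        # Label is known but the bytes don't fit: mark damage with U+FFFD.
        return raw.decode(charset, errors="replace")

def render_page(items):
    """Join all articles as Unicode text, then serialise once as UTF-8."""
    texts = [to_unicode(raw, charset) for charset, raw in items]
    return "\n\n".join(texts).encode("utf-8")

page = render_page(articles)
```

The point is that the bytes of the first two articles differ on input, but after decoding they are the same text, so one page in one encoding can carry all of them.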

(This might cause problems for people who are discussing the finer
details of Han unification, but that can't be helped now.)

Google have already, in effect, implemented something like that for
indexing web content. Otherwise it wouldn't be possible to find texts
in koi8-r and Windows-1251 when searching with a utf-8-encoded query.
That was the kind of problem Andreas was reporting some years back
with various search engines: they (to put it briefly) matched a query
in one encoding only against pages which used that same encoding.
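
As a rough illustration of why normalising to Unicode at indexing time solves this, the following Python sketch builds a toy index from documents stored in different encodings and then runs a utf-8 query against it. The documents, their ids and the whitespace tokeniser are all made up for the example; a real search engine is obviously far more involved.

```python
from collections import defaultdict

# Hypothetical documents: id -> (declared charset, raw bytes).
docs = {
    "doc-koi8": ("koi8-r", "поиск по сети".encode("koi8-r")),
    "doc-1251": ("windows-1251", "быстрый поиск".encode("windows-1251")),
}

# Index on decoded (Unicode) words, not on the raw bytes.
index = defaultdict(set)
for doc_id, (charset, raw) in docs.items():
    for word in raw.decode(charset).lower().split():
        index[word].add(doc_id)

def search(query_bytes, query_charset="utf-8"):
    """Decode the query the same way, then look it up in the index."""
    word = query_bytes.decode(query_charset).lower()
    return sorted(index.get(word, set()))

hits = search("поиск".encode("utf-8"))
```

Here `hits` contains both documents, although neither was stored in utf-8. An index keyed on the raw bytes would find nothing, because the same word is a different byte sequence in each encoding.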

They just need to apply the same principle to what their ggroups
thingy is serving out. Admittedly, ggroups have *other*, *serious*,
problems to attend to first, such as encouraging their users to follow
netiquette - to at least the extent needed to get them out of the
widespread killfiling that they've already earned. But I digress.
