Re: What string encoding to pick as standard for a programming language?
- From: James Harris <james.harris.1@xxxxxxxxxxxxxx>
- Date: Thu, 21 Jun 2007 15:43:41 -0700
On 21 Jun, 15:25, Marcin 'Qrczak' Kowalczyk <qrc...@xxxxxxxxxx> wrote:
....
> Searching of human texts is more complex than choosing a normalization
> (case insensitive searches, ignoring diacritics, ignoring differences
> between similar but distinct glyphs like «'» and «’», or «-» and «–»).
> Choosing a different string representation would not be enough to reduce
> it to string comparison.
> OTOH there is no point in considering glyph similarity when parsing
> keywords from a configuration file, or when comparing filenames.
Agreed.
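To make the two cases concrete, here is a minimal Python sketch, assuming the standard `unicodedata` module. The function names are invented for illustration, and the folding shown covers only case and diacritics; it does nothing about look-alike glyphs such as different apostrophes or dashes.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Build a loose search key: case and combining marks are discarded."""
    # NFKD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents) and keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is the Unicode-aware, more aggressive form of lower().
    return stripped.casefold()


def loose_find(haystack: str, needle: str) -> bool:
    """Search suited to human text: ignores case and diacritics."""
    return fold_for_search(needle) in fold_for_search(haystack)


def exact_keyword(token: str, keyword: str) -> bool:
    """Comparison suited to config keywords or filenames: no folding at all."""
    return token == keyword
```

The point of keeping two separate functions is that the caller decides which kind of comparison applies; the string type itself does not guess.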
> It's not obvious however how to design the details, there are lots
> of possibilities and lots of needs. Let's consider: different rules
> for different locales, specifying the Unicode version explicitly or
> leaving it unspecified, letting the end user override collation rules,
> obtaining behavior which is compatible with particular external tools
> (databases, filesystems, programming languages), etc.
> And because this is hard and choices are rarely obvious, the core
> string handling rules should be simple, so complex algorithms can be
> implemented upon them. It's easier to apply some normalization than
> to unapply one which has been applied automatically before.
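A small illustration of why an automatically applied normalization cannot be undone, assuming Python's standard `unicodedata` module and NFC as the example form (the thread does not name a particular form):

```python
import unicodedata

angstrom_sign = "\u212B"  # ANGSTROM SIGN
a_with_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

# Two distinct code points as they might arrive from a file or a database.
raw = [angstrom_sign, a_with_ring]

# NFC maps both onto the same code point, so the original distinction is gone
# and no function of the normalized text can bring it back.
normalized = [unicodedata.normalize("NFC", s) for s in raw]

distinct_before = raw[0] != raw[1]
identical_after = normalized[0] == normalized[1]
```

If the core string type stores what it was given, a program that wants NFC can still ask for it; a program that needs the original code points (say, to round-trip a filename) is not locked out.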
Yes, this seems a very hard thing to design. At the moment I am
thinking of hiding the implementation, whatever it turns out to be,
behind a standard interface, because I cannot think of any form of
direct access that could possibly apply to all character sets.
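One way such an interface might look, as a rough Python sketch. The names `Text`, `char_at`, `Utf8Text` and `Latin1Text` are hypothetical, chosen only for this example, and the two backends are deliberately naive.

```python
from abc import ABC, abstractmethod


class Text(ABC):
    """Hypothetical encoding-neutral string interface."""

    @abstractmethod
    def __len__(self) -> int:
        """Number of characters (not bytes)."""

    @abstractmethod
    def char_at(self, index: int) -> str:
        """The character at position index, as a one-character str."""

    def find(self, needle: "Text") -> int:
        """Index of the first occurrence of needle, or -1 if absent."""
        n, m = len(self), len(needle)
        for start in range(n - m + 1):
            if all(self.char_at(start + i) == needle.char_at(i) for i in range(m)):
                return start
        return -1


class Utf8Text(Text):
    """Backend holding UTF-8 bytes (decoded eagerly here, as a simplification)."""

    def __init__(self, data: bytes):
        self._chars = data.decode("utf-8")

    def __len__(self) -> int:
        return len(self._chars)

    def char_at(self, index: int) -> str:
        return self._chars[index]


class Latin1Text(Text):
    """Backend holding Latin-1 bytes; each byte maps directly to one character."""

    def __init__(self, data: bytes):
        self._data = data

    def __len__(self) -> int:
        return len(self._data)

    def char_at(self, index: int) -> str:
        return chr(self._data[index])
```

Because `find` is written only in terms of the interface, a UTF-8 haystack can be searched for a Latin-1 needle without either side exposing its bytes.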
As for the actual representation, perhaps, as a first attempt, there is
a parallel with the motivation for VLIW processors: different bit
fields represent different characteristics of the characters, e.g. so
many bits for the class of character (punct, letter, numeral, space),
so many for the base character, so many for meaning-changing accents,
so many for accents which do not change the meaning and perhaps affect
only how the text is read, so many for variants of the character: case
etc. Sure, each 'character' would be much wider than normal. However,
awkward searches, such as for a given accent, for a word that starts
with a given letter, or for a piece of text regardless of accents,
would be easier to carry out.
Note, characters don't need to be stored in the above form either in
memory or on disk as long as there is a (fast) mapping to the long
form. Hmmm.....
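A rough Python sketch of the wide form and of the mapping to it. The field widths and layout are assumptions made up for this example, and it simplifies the idea in two ways: it keeps a single accent field (telling meaning-changing accents from the others depends on the language) and records only the first combining mark. Text stays an ordinary `str` and is widened on demand, with a cache standing in for the "fast mapping".

```python
import unicodedata
from functools import lru_cache

# Hypothetical layout, low bits first: base letter, accent, case flag, class.
BASE_BITS, ACCENT_BITS, CASE_BITS = 21, 16, 1
ACCENT_SHIFT = BASE_BITS
CASE_SHIFT = ACCENT_SHIFT + ACCENT_BITS
CLASS_SHIFT = CASE_SHIFT + CASE_BITS
BASE_MASK = (1 << BASE_BITS) - 1
ACCENT_MASK = (1 << ACCENT_BITS) - 1

# First letter of the Unicode general category -> class code (0 = other).
CLASSES = {"L": 1, "N": 2, "P": 3, "Z": 4}  # letter, numeral, punct, space


@lru_cache(maxsize=None)
def widen(ch: str) -> int:
    """Map one character to its wide, bit-field form."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = [c for c in decomposed[1:] if unicodedata.combining(c)]
    # Only the first mark is kept, and only if it fits the accent field.
    accent = ord(marks[0]) if marks and ord(marks[0]) <= ACCENT_MASK else 0
    is_upper = 1 if base.isupper() else 0
    lowered = base.lower()
    base_code = ord(lowered) if len(lowered) == 1 else ord(base)
    char_class = CLASSES.get(unicodedata.category(ch)[0], 0)
    return (char_class << CLASS_SHIFT) | (is_upper << CASE_SHIFT) \
        | (accent << ACCENT_SHIFT) | base_code


def widen_text(text: str) -> list:
    """Widen a whole string; NFC first so letter plus mark become one unit where possible."""
    return [widen(ch) for ch in unicodedata.normalize("NFC", text)]


def class_of(wide: int) -> int:
    return wide >> CLASS_SHIFT


def starts_with_letter(word: str, letter: str) -> bool:
    """True if the word begins with this base letter, whatever its case or accent."""
    wide = widen_text(word)
    return bool(wide) and (wide[0] & BASE_MASK) == ord(letter)


def same_ignoring_accents_and_case(a: str, b: str) -> bool:
    return [w & BASE_MASK for w in widen_text(a)] == [w & BASE_MASK for w in widen_text(b)]


def has_accent(text: str, mark: str) -> bool:
    """True if any character carries the given combining mark."""
    return any((w >> ACCENT_SHIFT) & ACCENT_MASK == ord(mark) for w in widen_text(text))
```

Each of the awkward searches becomes a mask and a comparison on the wide values, at the cost of roughly 41 bits per character in this particular layout.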