Re: case-sensitivity



Nils M Holm wrote:

> For the record:
>
> I think that Unicode identifiers make things worse for the reasons
> I tried to explain, and I think that case-sensitivity is a bad
> thing for similar reasons: it requires additional software in order
> to emphasize parts of programs and it makes writing about Scheme
> harder. See also: <d5304q$2qp$1@xxxxxxxxx>

I think that within the elephantine bulk of Unicode there is
a good character set standard waiting to be uncovered. But
uncovering it will require removal (or ignoring) of well over
three quarters of the currently specified code points.

A good character set standard has properties designed into
it that make it easy to work with, parse, and process with
simple tools. Unicode was accumulated rather than designed,
and lacks these properties.

From time to time I consider an RFC to try to extract the
"good" character set that lurks within Unicode.

1) Range checks and bitmasks should suffice to determine
almost all character properties, using a minimal number
of boolean operations. In pursuit of this I'd recommend
completely reordering the codepoints, paying particular
attention to three things: mirroring the uppercases and
lowercases at an offset which is a power of two, gathering
all left-to-right characters into a different range from
right-to-left characters, and putting control characters,
accents, and variant selectors together in their own ranges.

2) There should be exactly one sequence of codepoints that
corresponds to one linguistic string. I'd recommend
dumping every character that has a canonical decomposition
and using the more general combining codepoints for accents,
ligation, and variant selectors.

3) Case operations should be reversible and not change string
lengths if at all possible. Once you regularize accents
and ligations as compositions of codepoints, this is
mostly achieved. The outstanding exception of course is
eszett, and the benefits are so compelling that I'd recommend
a "forced regularization" by introducing a capital-eszett
whether or not current languages use it. It could be rendered
as a pair of capital S's. Let it be the capital in the "default"
locale, and then let the eszett-using languages each define
whatever locale exceptions they're willing to work with.

4) Expertise in handling basic operations should not have to
be repeated across different fonts for mathematical (etc)
characters, nor repeated for handling accented (etc)
characters. Once accent codepoints are separated from
base codepoints, a "default" for rendering, case operations,
etc, is established by simply applying transformations to
the base character. I say that the same benefit should
extend to all font variants, and therefore instead of
repeating selected alphabets a dozen times in different
fonts for mathematics I'd just have a dozen combining
variant selectors for different fonts.

5) Writing systems that are not alphabetic in the sense of there
being a closed set of characters should probably develop an
independent method of representing language within the standard
that suits their needs better than the alphabet-driven concept
of a "character set."

We do not try to standardize dictionaries into our character
sets, and in Chinese, etc, the writing system is more like a
dictionary than an alphabet in the western sense. If Sinograms
are really "words", then we need to step back and figure out
what the basic units of spelling those words are and create an
extensible system that allows someone to spell them and won't
always have gaps in it where someone needs a proper noun that
the dictionary compiler didn't think of, nor need constant updates
as the language itself evolves and new words come into use.

I'd recommend dedicating a set of 768 codepoints to this use:
256 "stroke starts", 256 "stroke turning points", and 256 "stroke
endings" corresponding to different mapped locations in the
character cell. Variants in glyphs (where the Korean version
has a rising stroke and the Chinese version a falling stroke,
for example) would then have the basic properties of the variants
in spelling between US and British usage in English. Note that
maybe this should be 1024 of each instead of 256 of each; but in
that case the standardization of "spelling" implied will be
more difficult.
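
A rough sketch of how such a 768-codepoint block might be laid out, assuming the 256 locations form a 16x16 grid over the character cell. The base address STROKE_BASE, the grid shape, and every function name here are made up for illustration; the proposal above fixes none of them.

```python
# Hypothetical layout: three consecutive ranges of 256 codepoints each.
# STROKE_BASE and the 16x16 grid are assumptions, not part of any standard.
STROKE_BASE = 0xE000          # arbitrary; sits in Unicode's private-use area
GRID = 16                     # 16 * 16 = 256 locations in the character cell
START, TURN, END = 0, 1, 2    # which of the three ranges a codepoint is in


def encode_point(kind, x, y):
    """Map a stroke event at grid cell (x, y) to a codepoint."""
    if not (0 <= x < GRID and 0 <= y < GRID):
        raise ValueError("location outside the character cell")
    return STROKE_BASE + kind * GRID * GRID + y * GRID + x


def decode_point(cp):
    """Recover (kind, x, y) using one range check and two divisions."""
    offset = cp - STROKE_BASE
    if not (0 <= offset < 3 * GRID * GRID):
        raise ValueError("not a stroke codepoint")
    kind, cell = divmod(offset, GRID * GRID)
    y, x = divmod(cell, GRID)
    return kind, x, y


def encode_stroke(points):
    """Spell one stroke: a start, zero or more turning points, an end."""
    if len(points) < 2:
        raise ValueError("a stroke needs at least a start and an end")
    kinds = [START] + [TURN] * (len(points) - 2) + [END]
    return [encode_point(k, x, y) for k, (x, y) in zip(kinds, points)]


# A horizontal bar across the middle of the cell, then an L-shaped stroke.
bar = encode_stroke([(2, 8), (13, 8)])
hook = encode_stroke([(4, 2), (4, 12), (12, 12)])
```

With 1024 locations per range instead of 256, only GRID would change (to 32), which is part of why the coarser grid makes agreeing on one "spelling" per glyph easier.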

A side benefit of this is that "smart" editors could recognize
a word and substitute a proper Sinogram glyph for it, but even
"dumb" editors could tell enough about the glyph to render it
poorly.

A downside of the above design changes is that text files, especially
text files of non-alphabetic texts, grow larger in codepoints; but we
already have good file compressors, and this additional bulk would
squeeze out of a gzipped archive or any other standard format instantly.
I think it's well worth it for being able to more easily work with
the expanded character set.
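
To get a feel for the size trade-off, here is a small Python sketch that uses today's Unicode as a stand-in: canonical decomposition (NFD) plays the role of the "base plus combining codepoint" spelling. The sample text is arbitrary and the savings depend entirely on the input, so the printed numbers are an illustration, not a measurement of the proposed character set.

```python
import unicodedata
import zlib

# Arbitrary sample with several precomposed accented letters, repeated so
# the compressor has something to work with.
sample = "Über die Größe der Zeichensätze läßt sich streiten. " * 200

composed = unicodedata.normalize("NFC", sample)
decomposed = unicodedata.normalize("NFD", sample)


def sizes(text):
    """Return (codepoints, UTF-8 bytes, zlib-compressed bytes)."""
    raw = text.encode("utf-8")
    return len(text), len(raw), len(zlib.compress(raw))


for label, text in (("composed", composed), ("decomposed", decomposed)):
    codepoints, raw_bytes, packed_bytes = sizes(text)
    print(f"{label:>10}: {codepoints} codepoints, "
          f"{raw_bytes} bytes raw, {packed_bytes} bytes compressed")
```

The decomposed form always has more codepoints for text like this; how much of that bulk the compressor removes is something to check on real data.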

The resulting character set is slim and trim again and well under
16 bits, does not require tables in memory or convoluted code to work
with, is mostly suitable for small devices with limited memory and
compute resources (especially if file compression is used) but is
also capable of rendering any character that Unicode can render.
I think there are compelling reasons to prefer it to Unicode.
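
ASCII already has the layout asked for in point 1, which makes it a convenient illustration of "no tables, no convoluted code": upper and lower case sit at an offset of 0x20, so one range check plus one bit operation covers classification and case conversion. The function names below are invented for this sketch, and the last lines show the eszett irregularity from point 3 as Python's own str.upper() handles it.

```python
CASE_BIT = 0x20  # 'A' is 0x41 and 'a' is 0x61: they differ in one bit


def is_ascii_letter(cp):
    # Force the case bit on, then one range check covers both cases.
    return 0x61 <= (cp | CASE_BIT) <= 0x7A


def to_upper(cp):
    return cp & ~CASE_BIT if is_ascii_letter(cp) else cp


def to_lower(cp):
    return cp | CASE_BIT if is_ascii_letter(cp) else cp


def swap_case(cp):
    return cp ^ CASE_BIT if is_ascii_letter(cp) else cp


word = "Scheme"
shouted = "".join(chr(to_upper(ord(c))) for c in word)

# Current Unicode case mapping is not always this tidy: upper-casing
# eszett changes the string length and cannot be undone by lower-casing.
eszett_upper = "ß".upper()
eszett_round_trip = eszett_upper.lower()
```

Here to_upper and to_lower are exact inverses on letters and never change length, which is the property point 3 asks for across the whole character set.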

Bear
