Re: UCS Identifiers and compilers



wclodius@xxxxxxxxxxxxxx (William Clodius) writes:

As a hobby I have started work on a language design and one of the
issues that has come to concern me is the impact on the usefulness and
complexity of implementation of incorporating UCS/Unicode into
the language, particularly in identifiers.

1. Do many of your users make use of letters outside the ASCII/Latin-1
sets?

We have one major Yacc++ customer that has a series of languages that
support Unicode identifiers. Some of their languages have both case
sensitive and case insensitive features in the same language. My
experience relates primarily to supporting them.

3. Visually, how well do alternative character sets mesh with a language
with ASCII keywords and left to right, up and down display, typical of
most programming languages? E.g., how well do scripts with ideographs,
context dependent glyphs for the same character, and alternative spatial
ordering work, or character sets with characters with glyphs similar to
those used for ASCII (the l vs. 1 and O vs. 0 problem multiplied)?

The glyphs that look like ASCII are a definite problem and that is
made worse if the glyphs that look like ASCII characters have
different properties. In particular, a fair amount of effort went
into dealing with the Turkish character that is an i without the dot.
Its capital form is shared with another letter (in Turkish it
capitalizes to the plain ASCII I, while the dotted i capitalizes to a
capital I with a dot), and the system toupper/tolower routines did not
deal consistently with it across locales. As a result, we had to use
a consistent approach when calling those routines and make certain we
had not changed our locale between calls. The difficulty is that some
tables were built at the time the compiler was built (and thus under
one locale), which may not be the same as the locale the user has
specified when running the compiler.
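
A minimal sketch of the dotless-i problem, in Python rather than the C
toupper/tolower routines discussed above. Python's str.upper() and
str.lower() use fixed Unicode mappings and ignore the process locale, so
the asymmetry shows up without any locale setup. The names round_trips,
TURKISH_LOWER and fold_identifier are hypothetical, chosen for
illustration only; this is not the Yacc++ approach, just one way to pin
the folding rule explicitly instead of inheriting it from whatever locale
happens to be active.

```python
DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE


def round_trips(ch):
    """True if upper-casing then lower-casing gives back the same character."""
    return ch.upper().lower() == ch


# Hypothetical fixed table for Turkish-style lower-casing. It is applied
# explicitly, so the result does not depend on the ambient locale.
TURKISH_LOWER = {"I": DOTLESS_I, DOTTED_CAP_I: "i"}


def fold_identifier(name, turkish=False):
    """Lower-case an identifier under an explicitly chosen rule."""
    if turkish:
        name = "".join(TURKISH_LOWER.get(c, c) for c in name)
    return name.lower()


if __name__ == "__main__":
    # Dotless i upper-cases to ASCII I, which lower-cases to dotted i.
    print(round_trips(DOTLESS_I))
    print(fold_identifier("TITLE"))
    print(fold_identifier("TITLE", turkish=True))
```

The point of passing the rule as an argument is that a table built when
the compiler is built and a lookup done when the compiler runs would both
use the same mapping, whatever locale the user has set.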

Hope this helps,
-Chris

******************************************************************************
Chris Clark Internet: christopher.f.clark@xxxxxxxxxxxxxxxxxxxxxx
Compiler Resources, Inc. or: compres@xxxxxxxxxxxxx
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
------------------------------------------------------------------------------