Re: case-sensitivity



Nils M Holm <before-2006-03-01@xxxxxxxxx> writes:

> Alexander Schmolck <a.schmolck@xxxxxxxxx> wrote:
>> It's bad enough that at the beginning of the 21st century our conception of
>> programming largely boils down to pushing around monospaced characters in
>> text-files with 30 year old editors, so let's at least not insist on drawing
>> these characters from a set that is inadequate for just about any imaginable
>> task.
>
> I think we "push around mono-spaced characters in text-files" because
> this method is a local optimum. If it was not, better methods would have
> taken over.

According to this logic, Indo-Arabic numerals are at best marginally better
than Roman ones. I'm wary of spending centuries in such "optima".

> Using ASCII exclusively in program text[1] is a very, very good idea,
> too, because it allows people to swap code over national boundaries and
> language boundaries.

This is a popular claim in such debates, but I've never seen it backed up with
any vaguely plausible arguments or empirical evidence (not even anecdotes from
personal experience).

So let me ask: how much code have you actually been able to swap across
national and language boundaries, because it contained, say, romanized Chinese
or Japanese, rather than unicode encoded kanji? And why did ASCII make the
difference?

Hasn't Java always allowed Unicode identifiers? If so, why haven't I heard any
horror stories, given Java's enormous international popularity, including in
Japan (IIRC Sun even have Japanese versions of just about everything on their
official Java website, but not Spanish, French or German ones), and the many
globe-spanning open source projects that use it?

I have no relevant experiences with non-English code myself, but my bet is
that almost all the code that people swapped successfully over national and
language boundaries was successfully swapped because it was written in some
approximation to English and not because of the virtues of the ASCII charset
(which evidently sucks even for English-only prose and programming).

In all likelihood boiling everything down to ASCII even exacerbates some of
the difficulties involved in international code exchange, for example because
it is a very lossy translation (see footnote).

> Imagine you wanted to read a Scheme program written by someone writing
> kanji.

Maybe not the best example, as I'd actually prefer that to reading the
same Scheme program in transliterated Japanese. And I'd not be surprised if
you came to make the same choice [1], but I'm sure some actual code to
look at would help.

Since there are some Chinese and Japanese on this newsgroup, maybe one of them
(or a kanji-illiterate who actually had to deal with both romanized and kanji
code) can chime in and offer some example code or their experiences and
opinions.

> Even if you could read kanji, now try editing the program with a US or
> European keyboard.

Japanese keyboards aren't magically different, you know. I have certainly
written Japanese (with kanji and everything) on a US/European keyboard. All it
takes as an Emacs user is to pick a Japanese input method to your liking with
M-x set-input-method -- no additional software installations required. But of
course editing code in a language with an unfamiliar char set is going to be
more difficult.

However, in the absence of evidence to the contrary I'd claim that this
difficulty is absolutely in the noise (compared to the problems you'll have
making some sense of foreign language code in the first place and the overall
effort required to integrate it with English code), is unlikely to affect many
people anyway, and it in no way makes up for the enormous disadvantages
resulting from forcing all programming activity into a crippled char set
(making many tasks impossible and many more unnecessarily awkward).

I really have enormous difficulty seeing how code written in several languages
by people who were incompetent in most of them is going to be usable. And if
you have to translate the identifiers and comments into a common language
anyway in order to create code of adequate quality, then I don't see how the
presence of kanji would present much of a hurdle either. Anyway, a simple
program to transliterate Unicode-encoded source code into ASCII should be
trivial to write (unlike its inverse), maybe even less than a page of code --
and if the problems caused by kanji etc. identifiers were serious enough to
even contemplate crippling some languages to accept only ASCII, one should
almost certainly be able to find high-quality off-the-shelf solutions anyway.
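
As a rough illustration of how short such a transliterator can be, here is a
minimal Python sketch. It is an assumption-laden stand-in, not a real
romanizer: accented Latin letters are reduced to their base letter, and any
character without an ASCII base (kanji, kana, ...) is replaced by a
placeholder built from its code point. Producing actual readings would need a
dictionary, which the standard library does not provide. The function name
asciify and the placeholder format are made up for this example.

```python
import unicodedata


def asciify(source: str) -> str:
    """Rewrite every non-ASCII character in `source` as ASCII text."""
    out = []
    for ch in source:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Try to drop accents first: "é" decomposes into "e" plus a
        # combining acute accent, and we keep only the base letter.
        base = "".join(
            c
            for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        if base and all(ord(c) < 128 for c in base):
            out.append(base)
        else:
            # No ASCII base letter: fall back to a code point
            # placeholder (hypothetical format, e.g. U4E0B).
            out.append("U%04X" % ord(ch))
    return "".join(out)


print(asciify("(define (café-下) 1)"))
```

The accent stripping is lossy, while the code point fallback is unreadable
for a human, which is roughly the trade-off the footnote discusses.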


'as

Footnotes:
[1] Finding out the meaning of identifiers will be much easier in kanji.
Let's say the identifier consists of four kanji -- then either it is a
single word or some multiword identifier or neologism, but in either case
you can look up the meaning of everything or the relevant parts
trivially, with only a few keypresses (using Emacs and edict.el;
otherwise you'd use an online dictionary such as wwwjdic, or some quickly
written helper script).

Also, recognizing one of the constituents in another identifier will be
straightforward (although it may be a burden on your memory) and likely
informative, even if you haven't looked up the constituent.

On the other hand, if it were a multiword *romanized* Japanese identifier
you might already run into some difficulties in your attempts to break up
the identifier into chunks that are listed in your dictionary (good luck
with an inflection-rich language you don't speak -- you'll likely
experience at least as much fun as a Japanese would with the jocular
donaudampferschiffahrtskapitaen -- which doesn't even contain any
declensions, sound shifts or abbreviations).

But of course that's not the main problem. Neither the mapping from kanji
to Latin characters nor its reverse is *remotely* one-to-one. Ignoring
the existence of different transliteration schemes, a single kanji can
easily have half a dozen completely unrelated readings (e.g. looking up
the kanji "down" in a small kanji dictionary gives the following
context-dependent readings: ka, ge, shita, moto, shimo, sa(geru), o(rosu),
kuda(su), sa(garu), kuda(ru), kuda(saru)). This doesn't even include
regular sound changes yet (e.g. the constituents of the name of the popular
video game Tekken really are TETSU and KEN, but euphonic rules result in
that particular, context-dependent sound change. N also frequently changes
to M, H to B or P, shi to ji, su to zu etc. etc.).

In the other direction, looking up e.g. ki in a Japanese dictionary will
give you a dozen or so different kanji (tree, spirit, table, season
etc.).

So not only would you have greater difficulties looking stuff up, you'll
run into far more trouble recognizing common components in identifiers
because there will be countless false positives and false negatives.
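
The many-to-many mapping can be made concrete with a toy Python table. The
readings are the ones listed above; the specific characters chosen for
"down", "tree", "spirit", "table" and "season" are my assumption about which
kanji are meant, and only the reading relevant here is listed for each.

```python
from collections import defaultdict

# Toy data: kanji -> readings (deliberately incomplete).
READINGS = {
    "下": ["ka", "ge", "shita", "moto", "shimo"],  # "down"
    "木": ["ki"],  # tree
    "気": ["ki"],  # spirit
    "机": ["ki"],  # table
    "季": ["ki"],  # season
}


def invert(readings):
    """Build the reverse table: romanized reading -> candidate kanji."""
    kanji_for = defaultdict(list)
    for kanji, rs in readings.items():
        for r in rs:
            kanji_for[r].append(kanji)
    return dict(kanji_for)


KANJI_FOR = invert(READINGS)

print(READINGS["下"])   # one kanji, several unrelated readings
print(KANJI_FOR["ki"])  # one reading, several unrelated kanji
```

Romanizing an identifier picks one entry from the first table, and reading
it back requires guessing one entry from the second, so neither direction
can be done by simple table lookup.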

Therefore it's not a priori clear to me that you'd be better off with the
ASCII version, whereas Koreans and Chinese would certainly be
significantly worse off (the languages are quite different, but a Chinese
or Korean could still correctly interpret the meaning of a lot of
Japanese identifiers in kanji).

Speaking of Chinese -- for that language the situation is probably
even worse -- you can completely forget about transcribing Chinese
sensibly if your identifiers look like [a-z_]+, because it is a tonal
language (so you'd need accents or at least numbers). I also suspect a
programmer in Hong Kong wouldn't transcribe it into anything vaguely
similar to a programmer in Beijing anyway, because only the written
language happens to be vaguely uniform across China (from my very
limited experience even an educated Cantonese who also speaks good
Mandarin and English and has a postgraduate degree from the UK found it
difficult to transliterate kanji into romanized Mandarin).
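
The tone problem is easy to demonstrate. The following Python sketch (the
helper name strip_tones and the sample syllables are chosen for illustration)
removes the tone marks from four pinyin syllables so that they fit the
[a-z_]+ pattern, and shows that they all collapse into the same string.

```python
import re
import unicodedata

# The identifier pattern mentioned in the text.
ASCII_IDENT = re.compile(r"[a-z_]+")


def strip_tones(syllable: str) -> str:
    """Drop combining marks (here: tone marks) from a pinyin syllable."""
    decomposed = unicodedata.normalize("NFD", syllable)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Four different Mandarin syllables that differ only in tone.
syllables = ["mā", "má", "mǎ", "mà"]

# With tone marks, none of them is a legal [a-z_]+ identifier.
print([bool(ASCII_IDENT.fullmatch(s)) for s in syllables])

# Without tone marks they are legal, but indistinguishable.
collapsed = {strip_tones(s) for s in syllables}
print(collapsed)
```

Appending tone numbers instead (ma1, ma2, ...) would keep the distinction,
but such names no longer match [a-z_]+ either.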

So now with your brilliant forced ASCII-fication scheme you have
likely turned a piece of software with identifiers that a billion Chinese
can understand into one that only a few million speaking the relevant
dialect can grok with difficulty.

Finally, what would you say if some Chinese told you that you have to
express even the stuff in your Germany-specific business or educational
software that doesn't make sense outside a German cultural context (say
solidaritaetszuschlag) in Chinese characters?

Even if (and that's a big if) allowing Unicode identifiers would
complicate international code exchange, would that justify millions of
people having to put up with similar crap?