Re: Integrating a new language in TeX



earifi@xxxxxxxxx wrote:

Hi Zibo,

: 1. What steps should I take to integrate a new language in Tex system??

Making TeX and/or LaTeX ready to deal with a new language covers the following
issues (this is not an exhaustive list; some languages require more attention
to detail and typographical convention than others---strictly speaking, this
is a cultural, not a linguistic issue, but we have to consider these aspects
nonetheless):

1. Character set, output side.

It sounds strange to start with the output of characters, but first of all
you must be able to print all characters and character combinations (ligatures)
as desired and expected by your readers. Check your fonts. As long as
characters are missing, all the work described below remains invisible. How
you tell TeX/LaTeX to produce that output is a separate matter.
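
A quick way to check a font is to print its complete glyph table. This is
only a minimal sketch: it assumes the fonttable package is installed, and
cmr10 is just a sample font name to be replaced by the font you want to test.

```latex
\documentclass{article}
\usepackage{fonttable} % assumption: the fonttable package is available
\begin{document}
% Print every slot of one font; empty cells are characters the font lacks.
% "cmr10" is only a sample TFM name, substitute the font you intend to use.
\fonttable{cmr10}
\end{document}
```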

2. Character set, input side.

You must be able to tell TeX/LaTeX to generate the desired character output.
This can be done in a number of ways:
- direct character input from the keyboard to your source file, which is the
most transparent manner - however it may become necessary to tell TeX/LaTeX
how to treat this input (inputenc.sty is helpful).
- commands which generate characters, like \l, or \"a for a-umlaut, or similar.
These suit rarely used characters as long as typing is not automated
(you can instruct your text editor to convert your keyboard input
automatically to the desired control sequences).
- so-called active characters; they look and are typed like normal characters,
but your language package treats them as commands that produce another
character. E.g. in Classical Mongolian there is no upper case G but a Greek
gamma is used. So it is possible to abuse "G" by making it active and have
it produce the gamma character.
- ligatures can be used to generate e.g. Cyrillic semi-vowels, like typing
"yu" in order to produce the character that vaguely looks like "I-O". This
can only be done in the font itself, either with a Metafont ligature
instruction or in a so-called virtual font. Without the Metafont source,
you first generate a property list from the font metrics (tftopl), then
you insert your ligature commands, and finally you convert this property
list to a new font metrics description (pltotf) or to a virtual font
(vptovf); the differences in the properties seen by the TeX engine are
minuscule.

All methods have their specific usages, and none of them covers all
circumstances equally well. In language style files, you can usually find a
mix of all of them.
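
The first three input methods can be sketched in one small LaTeX file. This
is an illustration under assumptions, not a recipe from any existing language
package: it assumes pdfLaTeX with a UTF-8 encoded source, and the group that
makes "G" active is only a toy version of the Mongolian example above.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc} % direct input: tells LaTeX how the file is encoded

% Map a typed character to a command (U+0142 is the l-stroke).
% inputenc already knows this one; the line only shows the mechanism.
\DeclareUnicodeCharacter{0142}{\l}

% Active-character sketch: define what an active "G" should print.
% The \lowercase trick creates the active G token while G itself is still
% an ordinary letter, so that \Gamma is read correctly.
\begingroup
\lccode`\~=`\G
\lowercase{\endgroup\def~}{\ensuremath{\Gamma}}

\begin{document}
Direct input: ł. Command input: \l, \"a.

% Inside the group G is active and prints a gamma; outside it is normal.
% No command or environment name containing G may be used inside the group.
{\catcode`\G=\active Gobi} versus Gobi.
\end{document}
```

Keeping the catcode change inside a group limits the damage: once "G" is
active, every control sequence with a G in its name stops working, which is
one reason real language packages pick their active characters carefully.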


3. Typographical tradition.

The choice of punctuation marks, choice of upper case letters in headings etc.
is different for every language, and sometimes even different for the same
language in different regions. Compare French quotation marks (guillemets)
with English or German quotation marks, and you'll see the difference. Or
have a look at Spanish questions and exclamations.

Another issue governed by typographical convention is the minimum word length
to be hyphenated (the technicalities are implemented in TeX's hyphenation
system; see \lefthyphenmin and \righthyphenmin). Besides the allowable
syllable and word lengths for hyphenation, the document spacing properties
(inter-word, after punctuation, etc.; see \frenchspacing!) have to be defined.
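
As a minimal sketch, these parameters can be set directly in a document. The
values are examples only (2 and 3 happen to be TeX's defaults for English);
language packages such as babel normally set them per language and may reset
them when the language is switched.

```latex
\documentclass{article}
\begin{document}
% Allow break points no closer than 2 letters to the start of a word
% and no closer than 3 letters to its end.
\lefthyphenmin=2
\righthyphenmin=3
% No extra space after sentence-ending punctuation.
\frenchspacing
Some text. More text follows here.
\end{document}
```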


4. Names and shapes of entities (table of contents, etc.)

A complete style knows how to say "Table of Contents", "List of Figures"
and others, but it also produces a year/month/day date in the correct order,
with the appropriate mixture of numbers and words (June 14th, 2006).
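
In LaTeX's standard classes these strings live in macros that a language
style redefines. The sketch below uses German wording purely as sample
values; a real language package wraps such definitions so that they are
switched together with the language.

```latex
\documentclass{article}
% German strings serve only as sample values; substitute your own language.
\renewcommand{\contentsname}{Inhaltsverzeichnis}
\renewcommand{\listfigurename}{Abbildungsverzeichnis}
\renewcommand{\figurename}{Abbildung}
% Day, month name, year, in that order.
\renewcommand{\today}{\number\day.~\ifcase\month\or
  Januar\or Februar\or M\"arz\or April\or Mai\or Juni\or
  Juli\or August\or September\or Oktober\or November\or Dezember\fi
  ~\number\year}
\begin{document}
\tableofcontents
\section{Einleitung}
\today
\end{document}
```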

5. Hyphenation.

Hyphenation pattern generation is a slow process.

It is possible to start with a useful definition for another language of
similar type, but you can also build a hyphenation file from scratch.

1. Get as much word material as possible (these days I'd suggest downloading
as much diverse text material from web sites as you can; formerly I
harvested newspapers and a more-or-less complete monolingual dictionary
with rich definitions).

2. Convert your text data into one huge word list, preferably sorted and with
duplicates removed (sort -u is your friend).

3. Verify the spelling of all entries.

4. Get patgen, the Pattern Generation utility for TeX. Get the documentation
for patgen, which is called something like "A patgen-2 Tutorial" (iirc, by
Yannis Haralambous). It says Patgen-2, but you don't need to worry: modern
patgen _is_ patgen-2, so there is no need to search for a file named patgen2
or similar.

5. Create initial hyphenation patterns and check them with TeX. There is
a special \showhyphens command which lists all possible hyphenation points
of the words you pass to it in your log file.

6. Correct your hyphenation patterns, and repeat steps 5 and 6 as many times
as necessary.
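
The check in step 5 can be sketched as follows. The command is \showhyphens;
the test words are arbitrary, and the result depends on whichever patterns
are loaded for the current language.

```latex
\documentclass{article}
\begin{document}
% Writes the break points TeX finds to the terminal and to the log file.
\showhyphens{hyphenation typographical convention}
% A single stubborn word can be fixed with an explicit exception
% instead of changing the patterns.
\hyphenation{data-base}
\showhyphens{database}
\end{document}
```

The words show up in the log with hyphens at every permitted break point,
inside an "Underfull \hbox" message that can otherwise be ignored.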

In my experience, you spend days and weeks rather than hours and days to
build an acceptable hyphenation system. It is also helpful to share this
part of the work with a colleague since this is best done in collaboration.
Working alone, too many mistakes escape your attention.

A very good description of the process is found in Petr Sojka's article
"Hyphenation on Demand" (TUG'99 proceedings), along with an excellent list
of references.

I found it helpful to use Jan Pazdziora's Perl module TeX-Hyphen (available
from CPAN) as it allowed me to build wrappers for handling large parts of
the build process.

Hth,

Oliver.
--
Dr. Oliver Corff e-mail: corff@xxxxxxxxxxxxxxxxxx