Re: Unicode and composition mappings



Scripsit Andreas Prilop:

Is there simple-to-use software available that does normalizations?

Internet Explorer 7

I don't think it does.

Software for normalization can be found e.g. via
http://www.unicode.org/onlinedat/products.html

Look at
http://www.unics.uni-hannover.de/nhtcapri/combining-marks.html
with Internet Explorer 7 and Firefox 2.

It may look like normalization, but it's an illusion.

If you e.g. cut the part that looks like "À = À" and paste it into WordPad, click on the location after the first "À", and press Alt+X (on a sufficiently new version of WordPad), then it magically transforms to "A300", because the program converts the combining grave accent U+0300 to its hex code value. Nothing like that happens for the second occurrence of "À": using Alt+X, you turn it into C0.

This illustrates that the two occurrences of "À" are really different beasts: the first one is "A" followed by U+0300, whereas the second one is the single letter "À", U+00C0. The browser has _not_ normalized anything.
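The same check can be made without WordPad. The following is only a minimal sketch in Python, using the standard unicodedata module; the helper name code_points and the two variable names are made up for this illustration. It lists the code points and character names of the two strings and shows that they do not compare equal as character data.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Hypothetical sample strings for illustration:
decomposed = "A\u0300"   # "A" followed by the combining grave accent
precomposed = "\u00C0"   # the single letter A with grave

for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    names = [unicodedata.name(ch) for ch in text]
    print(label, code_points(text), names)

# The two strings render alike but are different character sequences.
print(decomposed == precomposed)
```

The decomposed string has two characters and the precomposed one has a single character, which is what the Alt+X experiment reveals one character at a time.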

Displaying the two things in identical ways is correct and appropriate, but it takes place at the formatting level, not at the character level. And normalization is a character-level operation. Combining a letter and a diacritic in visual presentation might even take place at the _glyph_ level (i.e., the rendering engine might render such a combination using a single glyph from a font), but even that wouldn't be a character-level issue.
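For contrast, here is roughly what actual normalization looks like when software does perform it. This is a sketch only, assuming Python and its standard unicodedata module (not anything a browser does); the variable names are chosen for the illustration. NFC composes "A" plus U+0300 into U+00C0, and NFD decomposes it again, so the character data itself changes.

```python
import unicodedata

# Hypothetical sample strings for illustration:
decomposed = "A\u0300"   # "A" followed by U+0300 COMBINING GRAVE ACCENT
precomposed = "\u00C0"   # U+00C0 LATIN CAPITAL LETTER A WITH GRAVE

# NFC: canonical decomposition followed by canonical composition.
nfc = unicodedata.normalize("NFC", decomposed)

# NFD: canonical decomposition.
nfd = unicodedata.normalize("NFD", precomposed)

print(len(decomposed), len(nfc))    # length before and after composing
print(nfc == precomposed)           # composed form equals U+00C0
print(nfd == decomposed)            # decomposed form equals "A" + U+0300
```

After such an operation the string really does contain different code points, which is exactly what the Alt+X test shows has not happened to the text on the page.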

Your page is a nice utility for testing _rendering_-level issues. The results naturally depend on the browser and on the fonts available. For example, though I see no difference in rendering (of the decomposed form and the precomposed form) for many characters, I see a big difference for Z with circumflex. Many things can happen; a simplistic implementation just takes a base character and a glyph for a diacritic and does an "overprint", and it might even use glyphs from different fonts, since many fonts don't have glyphs for many combining diacritics. That's bad, but it's a quality-of-implementation issue.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
