Re: How to terminate a text file line in Unicode (in Java)



Scripsit Stefan Ram:

When writing into a Unicode text file in Java, given that the
stream encoding was set to »UTF-8«, what is the proper, best
or canonical way to terminate a line?

Let's ignore the programming aspect as well as the encoding (UTF-8 vs. something else, such as UTF-16) for the time being. The primary question is how to terminate a line in Unicode.

This has to do with the question whether the specification of
the line terminator of a proper »Unicode text file« is the
responsibility of Unicode or of the operating system (or other
protocols used).

Both, but basically the latter.

Unicode defines several line breaking characters; it does not define "the" line breaking character. One of them is U+2028 LINE SEPARATOR (LS), which unambiguously means a line break, but it is rarely used. The other line break characters (such as CR, LF, and NEL) may have different semantics depending on the operating system or other software. Using them does not violate the Unicode standard in any way, though it creates portability problems.

See section 5.8 (Newline Guidelines) in the Unicode standard;
http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf
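
To illustrate the consumer side of this, here is a minimal Java sketch, assuming Java 8 or later, where the regular expression \R matches any of the usual Unicode line break sequences (CR LF as a unit, LF, VT, FF, CR, NEL, LS, PS). The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class LineBreakSplit {
    // "\\R" is the linebreak matcher of java.util.regex (Java 8+).
    // It treats CR LF as one break and also accepts LF, VT, FF, CR,
    // NEL (U+0085), LS (U+2028) and PS (U+2029).
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        String mixed = "one\r\ntwo\nthree\u2028four\u0085five";
        System.out.println(splitLines(mixed)); // [one, two, three, four, five]
    }
}
```

A reader written this way accepts whatever the producer chose, which sidesteps the portability problem on the receiving end only.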

So the pragmatic question is: what do the recipients (i.e., the software that will process your file) recognize as a line break?

And also, whether Java does any translations of "\n" and other
codes, when writing to UTF-8.

There are three conceptual levels involved here. First, "\n" means something in Java, and you need to check Java references for that. In many languages, escapes like "\n" are defined as indicating a line break without specifying a particular character or string; in practice, the compiler or run-time library maps them, in a system-dependent manner, to a character or a string that works as a line break in the underlying system. Java is different: "\n" always denotes the single character U+000A LINE FEED, and it is not translated when written to a stream. The system-dependent line break is obtained by other means, such as println(), BufferedWriter.newLine(), the %n format specifier, or System.lineSeparator(). Second, this character or string has to be expressed according to some character code standard, such as ASCII or Unicode. Finally, the character or string has to be represented using some encoding, such as UTF-8 - but this is smooth sailing as soon as we know the character or string, as coded in some known code, and the encoding has been decided on.
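
A small Java sketch of the first level, under the assumption that the text goes through an OutputStreamWriter set to UTF-8 (here into an in-memory buffer rather than a real file). The class name and the writeLines helper are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class NewlineInJava {
    // Writes two lines, each followed by the given terminator, through a
    // UTF-8 writer and returns the bytes that were produced.
    static byte[] writeLines(String terminator) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write("first" + terminator);
            out.write("second" + terminator);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // The escape for LF is always one character, so this prints 13
        // on every platform: 5 + 1 + 6 + 1 bytes.
        System.out.println(writeLines("\n").length);
        // The platform convention has to be asked for explicitly; this
        // prints 13 where it is LF and 15 where it is CR LF.
        System.out.println(writeLines(System.lineSeparator()).length);
    }
}
```

If the recipients of the file are known to expect a particular terminator, passing it explicitly as above is more predictable than relying on the platform default.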

So a related question removes
Java from the discussion by asking: What is the proper byte
sequence to terminate a line of a »UTF-8 text file«?

The proper character or string depends on the agreements on line breaking characters. The rest is simple and algorithmic. There is nothing special happening here; the line break character(s) are encoded like any other characters. In most cases, a line break will be represented as CR, LF, or a CR LF pair. CR is octet 0D and LF is octet 0A in UTF-8.
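
The octets can be checked with a few lines of Java. This is only a sketch; the class and the hex helper are names invented for the example. LS is included to show that it takes three octets in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class TerminatorBytes {
    // Returns the UTF-8 octets of s as space-separated uppercase hex.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("CR    = " + hex("\r"));     // CR    = 0D
        System.out.println("LF    = " + hex("\n"));     // LF    = 0A
        System.out.println("CR LF = " + hex("\r\n"));   // CR LF = 0D 0A
        System.out.println("LS    = " + hex("\u2028")); // LS    = E2 80 A8
    }
}
```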

A finer detail would be the question, whether the last
line of a »proper Unicode text file« needs to be terminated
by something, or whether the lines are separated.

No terminator is needed. Of course, specific data formats or conventions or programs may require a trailing line break, whereas some programs may treat it as an indication of an empty line at the end of the file.
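
Both interpretations can be seen within Java itself. The following sketch (class and method names are invented for the example) reads the same text once with BufferedReader, which treats a trailing line break as a terminator, and once with String.split using a negative limit, which treats it as a separator followed by an empty last line.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class TrailingTerminator {
    // Terminator view: a final line break ends the last line and
    // does not start a new one.
    static List<String> asTerminated(String text) {
        return new BufferedReader(new StringReader(text))
                .lines()
                .collect(Collectors.toList());
    }

    // Separator view: a final line break is followed by one more,
    // empty, line. The limit -1 keeps trailing empty strings.
    static List<String> asSeparated(String text) {
        return Arrays.asList(text.split("\n", -1));
    }

    public static void main(String[] args) {
        System.out.println(asTerminated("a\nb\n").size()); // 2
        System.out.println(asSeparated("a\nb\n").size());  // 3
        System.out.println(asTerminated("a\nb").size());   // 2
        System.out.println(asSeparated("a\nb").size());    // 2
    }
}
```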

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
