Re: How to terminate a text file line in Unicode (in Java)
- From: "Jukka K. Korpela" <jkorpela@xxxxxxxxx>
- Date: Thu, 12 Jul 2007 21:08:54 +0300
Stefan Ram wrote:
When writing into a Unicode text file in Java, given that the
stream encoding was set to »UTF-8«, what is the proper, best
or canonical way to terminate a line?
Let's ignore the programming aspect as well as the encoding (UTF-8 vs. something else, such as UTF-16), for the time being. The primary question is how to terminate a line in Unicode.
This has to do with the question whether the specification of
the line terminator of a proper »Unicode text file« is the
responsibility of Unicode or of the operating system (or other
protocols used).
Both, but basically the latter.
Unicode defines several line breaking characters. It does not define "the" line breaking character. The characters include U+2028 LINE SEPARATOR (LS), which unambiguously means line break, but it is rarely used. The other line break characters (CR, LF, the CR LF pair, NEL and so on) are interpreted differently by different operating systems and other software. Using them does not violate the Unicode standard in any way, though it creates portability problems.
See section 5.8 (Newline Guidelines) in the Unicode standard;
http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf
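To make the variety concrete, here is a minimal Java sketch that treats every one of these conventions as a line break when reading. It relies on the \R "linebreak matcher" of java.util.regex.Pattern (Java 8 and later), which matches CR LF as a unit or any single one of LF, VT, FF, CR, NEL, LS and PS. The class and method names are made up for illustration; whether such lenient splitting is appropriate depends on what your recipients expect.

```java
import java.util.Arrays;
import java.util.List;

class LineBreakSplit {
    // Splits text at any line break recognised by \R.
    // The limit -1 keeps trailing empty strings instead of dropping them.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\R", -1));
    }

    public static void main(String[] args) {
        // CR LF, LF, LS (U+2028) and NEL (U+0085) mixed in one string.
        String mixed = "one\r\ntwo\nthree\u2028four\u0085five";
        System.out.println(splitLines(mixed)); // [one, two, three, four, five]
    }
}
```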
So the pragmatic question is: what do the recipients (i.e., software that will process your file) recognize as line break?
And also, whether Java does any translations of "\n" and other
codes, when writing to UTF-8.
There are three conceptual levels involved here. First, "\n" means something in the programming language, and you need to check the language reference for that. In many languages, escapes like "\n" are defined as indicating line break without tying it to a particular external representation, and the I/O library maps it, in a system-dependent manner, to the character or string that works as line break in the underlying system. Java is different: "\n" always denotes the single character U+000A (LINE FEED), and a Writer passes it through untranslated; the system-dependent line break is what println() and System.getProperty("line.separator") give you. Second, this character or string has to be represented according to some character code standard, such as ASCII or Unicode. Finally, the character or string has to be represented using some encoding, such as UTF-8 - but this is smooth sailing as soon as we know the character or string, as coded in some known code, and the encoding has been decided on.
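For Java specifically, the three levels can be observed directly. The following is only a sketch: the class and method names are hypothetical, and an in-memory stream stands in for a real file. It writes through an OutputStreamWriter set to UTF-8 and shows which octets come out for "\n" and for the platform line separator.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NewlineLevels {
    // Sends text through a UTF-8 writer and returns the octets produced.
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Level 1: in Java source, "\n" is one char with code 10 (U+000A).
        System.out.println("\n".length() + " char, code " + (int) "\n".charAt(0));

        // Levels 2 and 3: the writer encodes the chars it is given, nothing more.
        System.out.println(Arrays.toString(encode("a\nb"))); // [97, 10, 98]

        // The platform convention is a separate string; its octets vary by system.
        System.out.println(Arrays.toString(encode("a" + System.lineSeparator() + "b")));
    }
}
```

If the recipients expect a particular convention, writing that string explicitly (for example "\r\n") is more predictable than relying on println().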
So a related question removes
Java from the discussion by asking: What is the proper byte
sequence to terminate a line of a »UTF-8 text file«?
The proper character or string depends on the agreements on line breaking characters. The rest is simple and algorithmic. There is nothing special happening here; the line break character(s) are encoded like any other characters. In most cases, a line break will be represented as CR, LF, or a CR LF pair. CR is octet 0D and LF is octet 0A in UTF-8, exactly as in ASCII.
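As a quick check of the octets, a small Java sketch (the helper name utf8Hex is made up for illustration) prints the UTF-8 encoding of the common line break strings, plus NEL and LS for comparison:

```java
import java.nio.charset.StandardCharsets;

class LineBreakBytes {
    // Returns the UTF-8 encoding of s as space-separated uppercase hex octets.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("CR    " + utf8Hex("\r"));     // 0D
        System.out.println("LF    " + utf8Hex("\n"));     // 0A
        System.out.println("CR LF " + utf8Hex("\r\n"));   // 0D 0A
        System.out.println("NEL   " + utf8Hex("\u0085")); // C2 85
        System.out.println("LS    " + utf8Hex("\u2028")); // E2 80 A8
    }
}
```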
A finer detail would be the question, whether the last
line of a »proper Unicode text file« needs to be terminated
by something, or whether the lines are separated.
No terminator is needed. Of course, specific data formats or conventions or programs may require a trailing line break, whereas some programs may treat it as an indication of an empty line at the end of the file.
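The two readings of a trailing line break can be seen within Java's own library. In this sketch (class and method names are hypothetical, and LF is assumed as the line break), BufferedReader treats the final LF as a terminator, whereas splitting on LF with a negative limit treats it as a separator followed by an empty last line.

```java
import java.io.BufferedReader;
import java.io.StringReader;

class TrailingBreak {
    // Terminator view: a final line break ends the last line and adds nothing.
    static int countAsTerminated(String text) {
        return (int) new BufferedReader(new StringReader(text)).lines().count();
    }

    // Separator view: a final line break is followed by one more (empty) line.
    static int countAsSeparated(String text) {
        return text.split("\n", -1).length;
    }

    public static void main(String[] args) {
        System.out.println(countAsTerminated("a\nb"));   // 2
        System.out.println(countAsTerminated("a\nb\n")); // 2
        System.out.println(countAsSeparated("a\nb"));    // 2
        System.out.println(countAsSeparated("a\nb\n"));  // 3
    }
}
```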
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/