Re: How to terminate a text file line in Unicode (in Java)
- From: "Jukka K. Korpela" <jkorpela@xxxxxxxxx>
- Date: Fri, 13 Jul 2007 17:22:45 +0300
Scripsit Stefan Ram:
My specific application is a demonstration program showing how
to write a »text file« with Java and then how to read it again
in the context of a Usenet discussion.
In this case, technically, I might use any character sequence
as a line terminator that does not occur within a line,
That's correct, though the text file isn't really a (plain) text file if its line terminator is not one of the characters designated for such use in character code standards. This implies that in such a case, it cannot be smoothly displayed and otherwise processed with tools for plain text files (like Notepad or Emacs).
The most likely two candidates in Java are
\n Unicode 10 (decimal)
%n The line separator of the operating system
(might be a sequence of characters)
I'm not a specialist in Java issues, but it seems to me that the specifications for the language designate "\n" as line break and specifically as line feed, LF, U+000A, i.e. as Unicode 10 decimal. This is somewhat obscure (since the operating system need not use such a convention) and reflects a lack of rigorous standardization of the language.
I guess both candidates are feasible, with no clear preference, but the context and purpose may make one of them preferable. If you think about the possibilities of using the file in the particular environment (operating system), %n might be slightly better. If you think about wider processability, \n might be better. When the file is used, as such, in another environment - with a different line break convention - \n is safer than %n. It is more probable that an unknown recipient is able to handle U+000A as line break in a plain text file than that it can handle your operating system's line break, if it is something exotic.
Generally, I'd vote for \n, since normally the purpose of writing a UTF-8 file is to produce something that is portable across systems.
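The two candidates can be compared in a minimal sketch like the following. It assumes a reasonably modern Java runtime; the class name LineTerminatorDemo, its helper methods and the temporary file are made up for illustration and do not come from the original discussion.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineTerminatorDemo {
    // "\n" is always the single character U+000A, on every platform.
    static final String LF = "\n";

    // "%n" in a format string expands to the platform line separator,
    // the same value that System.lineSeparator() returns.
    static String platformSeparator() {
        return String.format("%n");
    }

    // Writes the given lines as UTF-8, ending each one with the given terminator.
    static void writeLines(Path file, String terminator, String... lines) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write(terminator);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file, created only for this demonstration.
        Path file = Files.createTempFile("demo", ".txt");
        writeLines(file, LF, "first line", "second line");
        System.out.println(Files.size(file) + " bytes written to " + file);
    }
}
```

Passing LF gives the same bytes on every system, whereas passing platformSeparator() gives whatever the local operating system uses, so the resulting file may differ from one machine to another.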
Or, if I would want to write a tutorial itself as a »UTF-8
text file« for further distribution (»Content-Type:
text/plain; charset=UTF-8«), what should be used as a line
separator in this case?
CR LF (U+000D U+000A), because that's mandatory for text/plain by the definition of this Internet (MIME) media type, and any subtype of text:
"The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden.
This rule applies regardless of format or character set or sets involved."
Source: RFC 2046, clause 4.1, available e.g. at
http://www.mhonarc.org/~ehood/MIME/2046/rfc2046.html#4.1
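One way to bring text into this canonical form before sending it as text/plain might look like the sketch below. The class and method names are hypothetical, and it treats only CR, LF and CR LF as line breaks, which is an assumption about the input.

```java
import java.nio.charset.StandardCharsets;

public class CrlfCanonicalizer {
    // Converts every line break (CR LF, lone CR, or lone LF) to CR LF.
    // "\r\n" is listed first so an existing CR LF pair is matched as one
    // unit and is not turned into two line breaks.
    static String toCrlf(String text) {
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }

    // Encodes the canonicalized text as UTF-8 bytes, ready for transmission.
    static byte[] toCanonicalUtf8(String text) {
        return toCrlf(text).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] body = toCanonicalUtf8("line one\nline two\n");
        System.out.println(body.length + " bytes"); // two 8-character lines plus two CR LF pairs
    }
}
```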
Thus, anything delivered with Internet message headers indicating it as "text" (in the media type sense) MUST use CR LF for line breaks. Of course, "MUST" is to be understood as a normative requirement; you might be able to violate it without serious consequences. In practice, programs tend to be more permissive, accepting a lone CR or a lone LF as line break as well, and this has been explicitly specified for some subtypes, such as text/html, see
http://www.w3.org/TR/html401/struct/text.html#didx-line_break
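On the reading side, Java's BufferedReader.readLine() is an example of such permissive behavior: it is documented to treat a line feed, a carriage return, or a carriage return followed immediately by a line feed as a line terminator. A small sketch (the class name and sample string are invented for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class PermissiveLineReader {
    // Splits text into lines, accepting CR LF, lone CR, and lone LF alike.
    static List<String> readLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line); // the terminator itself is not included
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Mixed conventions in one input: CR LF, then CR, then LF.
        System.out.println(readLines("one\r\ntwo\rthree\nfour"));
        // prints [one, two, three, four]
    }
}
```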
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/