Re: How to terminate a text file line in Unicode (in Java)



Stefan Ram wrote:

My specific application is a demonstration program showing how
to write a »text file« with Java and then how to read it again
in the context of a Usenet discussion.

In this case, technically, I might use any character sequence
as a line terminator that does not occur within a line,

That's correct, though the file isn't really a (plain) text file if its line terminator is not one of the characters designated for that use in character code standards. In that case, tools for plain text files (like Notepad or Emacs) cannot be expected to display or otherwise process it smoothly.

The most likely two candidates in Java are

\n    Unicode 10 (decimal)
%n    The line separator of the operating system
      (might be a sequence of characters)

I'm not a specialist in Java issues, but it seems to me that the specifications for the language designate "\n" as line break and specifically as line feed, LF, U+000A, i.e. as Unicode 10 decimal. This is somewhat obscure (since the operating system need not use such a convention) and reflects a lack of rigorous standardization of the language.
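
To make the difference between the two candidates concrete, here is a minimal sketch that prints the code points involved. The class and method names (LineTerminators, lfCodePoint, formatN) are made up for illustration. It relies only on the documented behavior that the %n format specifier expands to the platform line separator, so the second part of the output depends on the system it runs on.

```java
class LineTerminators {
    // Code point of the character denoted by the Java escape "\n".
    static int lfCodePoint() {
        return "\n".codePointAt(0);
    }

    // What the format specifier %n expands to on the current platform.
    static String formatN() {
        return String.format("%n");
    }

    public static void main(String[] args) {
        System.out.println("\\n is U+" + String.format("%04X", lfCodePoint()));
        // Print the code points of %n rather than the characters themselves,
        // since the result may be one character or a sequence of characters.
        formatN().chars().forEach(
            c -> System.out.println("%n contains U+" + String.format("%04X", c)));
    }
}
```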

I guess both candidates are feasible, with no clear preference, but the context and purpose may make one of them preferable. If the file will mainly be used in the particular environment (operating system) where it is written, %n might be slightly better. If wider processability matters, \n might be better. When the file is used, unmodified, in another environment with a different line break convention, \n is safer than %n: an unknown recipient is more likely to handle U+000A as a line break in a plain text file than to handle your operating system's line break, if that is something exotic.

Generally, I'd vote for \n, since normally the purpose of writing a UTF-8 file is to produce something that is portable across systems.
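
As a rough illustration of the \n choice for the demonstration program, the following sketch writes a UTF-8 file with an explicit LF after every line and reads it back. The class name Utf8LfFile and its methods are hypothetical, and the code assumes Java 9 or later (for List.of). Files.readAllLines is documented to accept LF, CR, and CR LF as line terminators, so the reading side does not depend on which of them the writer chose.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class Utf8LfFile {
    // Write each line followed by an explicit LF (U+000A),
    // independent of the platform's own line separator.
    static void writeLines(Path file, List<String> lines) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }

    // Read the lines back; the terminators themselves are not returned.
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");
        writeLines(file, List.of("first line", "caf\u00e9 in the second line"));
        readLines(file).forEach(System.out::println);
        Files.delete(file);
    }
}
```

Using BufferedWriter.newLine() or %n in place of the explicit '\n' would give the operating system's convention instead.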

Or, if I would want to write a tutorial itself as a »UTF-8
text file« for further distribution (»Content-Type:
text/plain; charset=UTF-8«), what should be used as a line
separator in this case?

CR LF (U+000D U+000A), because that's mandatory for text/plain, and for any other subtype of text, by the definition of this Internet (MIME) media type:

"The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden.
This rule applies regardless of format or character set or sets involved."

Source: RFC 2046, clause 4.1, available e.g. at
http://www.mhonarc.org/~ehood/MIME/2046/rfc2046.html#4.1

Thus, anything delivered with Internet message headers indicating it as "text" (in the media type sense) MUST use CR LF for line breaks. Of course, "MUST" is to be understood as a normative requirement; you might be able to violate it without serious consequences. In practice, programs tend to be more permissive, accepting a lone CR or a lone LF as line break as well, and this has been explicitly specified for some subtypes, such as text/html, see
http://www.w3.org/TR/html401/struct/text.html#didx-line_break
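
For the tutorial case, one way to get from whatever line breaks a text happens to contain to the CR LF form is a simple normalization pass before the file is written. This is only a sketch: the class name CrlfCanonicalizer and the method toCrlf are invented for illustration, and the code treats only CR, LF, and CR LF as line breaks.

```java
class CrlfCanonicalizer {
    // Replace every CR LF, lone CR, or lone LF with CR LF.
    // The alternation tries "\r\n" first, so an existing CR LF is not doubled.
    static String toCrlf(String text) {
        return text.replaceAll("\\r\\n|\\r|\\n", "\r\n");
    }

    public static void main(String[] args) {
        String mixed = "one\ntwo\r\nthree\rfour";
        String canonical = toCrlf(mixed);
        // Show the result with the line break characters made visible.
        System.out.println(canonical.replace("\r", "<CR>").replace("\n", "<LF>"));
    }
}
```

The normalized string can then be encoded as UTF-8 and written without any further line separator translation.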

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
