Re: Tidy using unicode does not validate



On 16 Mar, 19:53, grou...@xxxxxxxxxx wrote:

> Byte-Order Mark found in UTF-8 File.

There are two flavours of UTF-8 file: with and without a BOM (byte-order
mark) at the start of the file. The encoding of the text itself is the
same in both.

The with-BOM flavour (described as "UTF-8Y" in some Windows tools) is
_obviously_ UTF-8 and so is easier for capable tools to recognise and
deal with unambiguously.
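
To make "recognise" concrete: a tool only has to look at the first three bytes of the file. The Python below is a rough sketch of that check, not what Tidy or any particular validator actually does; the helper names `has_utf8_bom` and `decode_utf8` are made up for the illustration.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 BOM (bytes EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # The 'utf-8-sig' codec skips a BOM on decode and also accepts
    # input that has no BOM at all.
    return data.decode("utf-8-sig")
```

Reading with `utf-8-sig` is one way to accept both flavours on input while keeping the BOM out of the decoded text.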

However, you should remember that files in ASCII, ISO-8859-* or UTF-8
are byte-for-byte identical until you start using non-ASCII characters.
If you add a BOM to a UTF-8 file, then it is no longer ASCII or
ISO-8859-* at all, no matter what characters it contains. For this
reason BOMs are often advised against, because they will confuse older
non-UTF-8-aware editors.
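
A small Python sketch of that point, using a made-up sample string: pure-ASCII text encodes to the same bytes in all three encodings, and prepending the BOM is what breaks the equivalence. The variable names and the `is_ascii` helper are only for illustration.

```python
import codecs

text = "<p>plain ASCII</p>"  # hypothetical sample content

ascii_bytes = text.encode("ascii")
latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# With only ASCII characters, all three encodings give identical bytes.
same_without_bom = ascii_bytes == latin1_bytes == utf8_bytes

# Prepending the BOM adds the three bytes EF BB BF.
with_bom = codecs.BOM_UTF8 + utf8_bytes


def is_ascii(data: bytes) -> bool:
    """Return True if every byte in data is a valid ASCII character."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# What a tool that assumes ISO-8859-1 would show at the top of the file:
# the BOM bytes come out as three stray Latin-1 characters.
as_latin1 = with_bom.decode("iso-8859-1")
```

Here `is_ascii(utf8_bytes)` holds but `is_ascii(with_bom)` does not, even though the visible text is unchanged.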

I use UTF-8 throughout, and I don't use BOMs. I also try to impose
this on our team with a literal clue of iron. If I started actually
poking a few of them with it, I might even stop them re-encoding my
source in UTF-16 or Windows wibble when I'm not looking....


This is one of those problems that's not difficult, but isn't well
understood because you can get a long way relying on the tools and not
understanding any of it yourself. In the end though, it's worth
putting in the small amount of effort to understand it; then it just
ceases to be a problem. Until of course the minions with their UTF-16
defaults sneak back in...


> India where [...] these things never happen.

If you would like a megabyte of cheap Indian Java source where these
things _certainly_ happen, then I've got plenty of it.
