Re: What is better encoding method?



Bart Van der Donck wrote:
Lasse Reichstein Nielsen wrote:
"Bart Van der Donck" <bart@xxxxxxxxxx> writes:
Yes, but those code points do not necessarily represent the same
character in the \x80-\x9F range. My test seems to show that even
MSIE prefers ISO-8859-1 instead of the expected Windows-1252 there.

A quick test shows that if n is a number between 128 and 255, and
hh is a hex representation of it, then the following all give the same
result:
String.fromCharCode(n)
"\xhh"
"\u00hh"
unescape("%hh")
unescape("%u00hh")
(which is a string with .charCodeAt(0)==n, however much sense that
makes).
[...]
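
For anyone who wants to repeat that quick test, something along these
lines should do. It is only a sketch: the helper names fiveForms and
allAgree are made up for illustration, and eval() is used purely to
build the "\xhh" and "\u00hh" literals at run time.

```javascript
// Hypothetical helper: builds the five forms listed above for one
// value n in the range 128..255 (always two hex digits).
function fiveForms(n) {
  const hh = n.toString(16).toUpperCase();
  return [
    String.fromCharCode(n),
    eval('"\\x' + hh + '"'),      // the string literal "\xhh"
    eval('"\\u00' + hh + '"'),    // the string literal "\u00hh"
    unescape("%" + hh),
    unescape("%u00" + hh),
  ];
}

// True when every form is a one-character string whose code is n.
function allAgree(n) {
  return fiveForms(n).every(function (s) {
    return s.length === 1 && s.charCodeAt(0) === n;
  });
}

// Collect any values for which the five forms disagree.
const mismatches = [];
for (let n = 128; n <= 255; n++) {
  if (!allAgree(n)) mismatches.push(n);
}
console.log(mismatches.length === 0
  ? "all five forms agree for 128..255"
  : "mismatches: " + mismatches.join(", "));
```

On an engine that follows the spec the mismatch list should stay empty,
which includes the \x80-\x9F range.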

The code point table would probably be identical across all these
expressions; it's probably decided by the js engine itself.

<quote cite="ECMA 262, 3rd Ed. Section 6">
6 Source Text

ECMAScript source text is represented as a sequence of characters in
the Unicode character encoding, version 2.1 or later, using the UTF-16
transformation format. The text is expected to have been normalised to
Unicode Normalised Form C (canonical composition), as described in
Unicode Technical Report #15. Conforming ECMAScript implementations
are not required to perform any normalisation of text, or behave as
though they were performing normalisation of text, themselves.

SourceCharacter ::
any Unicode character

ECMAScript source text can contain any of the Unicode characters. All
Unicode white space characters are treated as white space, and all
Unicode line/paragraph separators are treated as line separators.
Non-Latin Unicode characters are allowed in identifiers, string
literals, regular expression literals and comments.
</quote>

It doesn't look like the page's own charset has any influence.

The/a character set asserted by an HTTP content type header would
probably be employed in deciding how to translate incoming javascript
source into the "sequence of characters in the Unicode character
encoding" that is needed prior to the tokenisation of the code.
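
As a rough illustration of that translation step (not of what any
particular browser does internally), the same two bytes can be decoded
under two different charsets. This sketch assumes Node.js and its Buffer
class, neither of which appears in the thread; the variable names are
made up.

```javascript
// Two raw bytes as they might arrive over the wire.
const bytes = Buffer.from([0xC3, 0xA9]);

// Decoded as UTF-8 they form a single character, U+00E9.
const asUtf8 = bytes.toString("utf8");

// Decoded as ISO-8859-1 ("latin1" in Node) each byte becomes its own
// character, U+00C3 followed by U+00A9.
const asLatin1 = bytes.toString("latin1");

console.log(asUtf8.length, asUtf8.charCodeAt(0).toString(16));
console.log(asLatin1.length,
            asLatin1.charCodeAt(0).toString(16),
            asLatin1.charCodeAt(1).toString(16));
```

Either way, what the script sees afterwards is a string of Unicode code
units; the charset only decided which ones.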

I didn't find a way
to force charCodeAt() to a specific code page either.
<snip>

You wouldn't, as by the time you are dealing with javascript you are
past the point where the normalisation to Unicode has happened and so
code pages are not an issue.
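
A small sketch of what that means in practice. The euro sign sits at
byte 0x80 in Windows-1252, but in a script it is simply U+20AC, and
code 0x80 is the C1 control character U+0080. If Windows-1252 semantics
are wanted they have to be applied by hand; the lookup table below is a
hypothetical, deliberately partial one with just three entries.

```javascript
const euro = "\u20AC";                 // the euro sign as a Unicode escape
const c1 = String.fromCharCode(0x80);  // U+0080, not the euro sign

console.log(euro.charCodeAt(0).toString(16)); // prints "20ac"
console.log(c1.charCodeAt(0).toString(16));   // prints "80"
console.log(euro === c1);                     // prints false

// Hypothetical partial table: Windows-1252 byte -> Unicode code point.
const CP1252_SAMPLE = { 0x80: 0x20AC, 0x85: 0x2026, 0x99: 0x2122 };

// Hypothetical helper: maps a byte through the table above and falls
// back to the byte value itself when there is no entry.
function cp1252ByteToChar(b) {
  const mapped = CP1252_SAMPLE[b];
  return String.fromCharCode(mapped === undefined ? b : mapped);
}

console.log(cp1252ByteToChar(0x80) === euro); // prints true
```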

Richard.
