Re: convert NCR to \u?



Matthias Reuter wrote:
Ken Williams wrote:
Hi, I'm trying to convert text in "numeric character reference" (NCR)
format to the JavaScript Unicode escape (\u) format. For example,
&#49548;&#44060; should become \uC18C\uAC1C.

That's a one-liner:

"&#49548;&#44060;".replace(/&#(\d+);/g, function (search, match) { return
"\\u" + parseInt(match, 10).toString(16); });

To be precise, at least a two-liner, for legibility :)

"&#49548;&#44060;".replace(/&#(\d+);/g, function (search, match) {
  return "\\u" + parseInt(match, 10).toString(16).toUpperCase(); });

It also matters that the `return' keyword and return value expression start
on the same line, else `undefined' is returned due to automatic semicolon
insertion.
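
A minimal sketch of that pitfall, using two hypothetical helper names that are not part of the code above: the first breaks the line after `return', the second does not.

```javascript
// Hypothetical helpers, for illustration only.
function brokenEscape(codePoint) {
  return  // a semicolon is inserted here, so the function returns undefined
    "\\u" + codePoint.toString(16).toUpperCase();  // never evaluated
}

function workingEscape(codePoint) {
  // `return' and its expression start on the same line.
  return "\\u" + codePoint.toString(16).toUpperCase();
}

console.log(brokenEscape(49548));   // undefined
console.log(workingEscape(49548));  // \uC18C
```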

However, I would write it as a general-purpose function:

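function charRefToUnicodeEscape(s)
{
  return String(s).replace(
    /&#(\d+);/g,
    function(m, p1) {
      /*
       * Uppercase only the hex digits; uppercasing the whole result
       * would also turn "\u" into "\U" and alter the surrounding text.
       */
      var hex = parseInt(p1, 10).toString(16).toUpperCase();

      /* \uXXXX requires exactly four hex digits, so pad with zeros. */
      while (hex.length < 4) hex = "0" + hex;

      return "\\u" + hex;
    });
}

var s = ...;
/* ... */
s = charRefToUnicodeEscape(s);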

(Or make it a method of String.prototype.)
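
The method variant might look like the following sketch. The method name is hypothetical, and whether augmenting a built-in prototype is acceptable in a given code base is an assumption; the body is self-contained and pads each value to the four hex digits that \uXXXX requires.

```javascript
// Hypothetical method name; adding to String.prototype is an assumption here.
String.prototype.charRefToUnicodeEscape = function () {
  return String(this).replace(/&#(\d+);/g, function (m, p1) {
    var hex = parseInt(p1, 10).toString(16).toUpperCase();
    while (hex.length < 4) hex = "0" + hex;  // \uXXXX needs four hex digits
    return "\\u" + hex;
  });
};

console.log("&#49548;&#44060;".charRefToUnicodeEscape());  // \uC18C\uAC1C
```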

The issue remains that the HTML Document Character Set is UCS, which
supports code points beyond the Basic Multilingual Plane (U+10000 and
greater), while a single ECMAScript Unicode escape sequence does not:
\uFFFF is the specified maximum. So those characters cannot be represented
by one escape sequence in ECMAScript; they need a surrogate pair of two
escape sequences, which the function above does not produce.
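
While one \uXXXX escape cannot go past \uFFFF, ECMAScript strings are sequences of UTF-16 code units, so a code point above U+FFFF can be written as two escapes forming a surrogate pair. The following is a sketch only; both function names are hypothetical, and it assumes decimal references holding valid code points (at most 1114111, i.e. U+10FFFF).

```javascript
// Hypothetical helper: one code point -> one or two \uXXXX escapes.
function codePointToEscape(cp) {
  function hex4(n) {
    var h = n.toString(16).toUpperCase();
    while (h.length < 4) h = "0" + h;  // pad to four hex digits
    return "\\u" + h;
  }

  if (cp <= 0xFFFF) return hex4(cp);

  // Split the 20-bit offset into a high and a low surrogate.
  var offset = cp - 0x10000;
  var high = 0xD800 + (offset >> 10);
  var low = 0xDC00 + (offset & 0x3FF);
  return hex4(high) + hex4(low);
}

// Hypothetical wrapper around the regular expression used above.
function charRefToEscapes(s) {
  return String(s).replace(/&#(\d+);/g, function (m, p1) {
    return codePointToEscape(parseInt(p1, 10));
  });
}

console.log(charRefToEscapes("&#128512;"));  // \uD83D\uDE00
```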

However, the solution to that problem would be simple (and oft-mentioned
before):

Do not output or store character references, but output raw code units and
declare the proper character encoding (e.g. UTF-7, -8, -16 or -32).
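
If the data already contains references, one possible way to get back to the characters themselves is sketched below. The function name is hypothetical, and the sketch assumes a runtime that provides String.fromCodePoint (ES2015 or later) and input with decimal references only. Declaring the encoding, for example with charset=UTF-8 in the Content-Type header, happens outside the script.

```javascript
// Hypothetical helper; assumes decimal references with valid code points.
// String.fromCodePoint throws a RangeError for values above 0x10FFFF.
function charRefToText(s) {
  return String(s).replace(/&#(\d+);/g, function (m, p1) {
    return String.fromCodePoint(parseInt(p1, 10));
  });
}

var text = charRefToText("&#49548;&#44060;");
console.log(text === "\uC18C\uAC1C");  // true
```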


PointedEars