Converting to UCS-2 or UTF-16 for use by a C extension



I'm working on a C extension that embeds an ANTLR parser, and I need
to convert a Ruby input string into UCS-2 or possibly UTF-16 encoding.

I've got a working implementation but I suspect that it is flawed and
just wanted to ask if this is the right way to do it. The basic idea
is as follows (in pseudo-code):

// 1. unpack to array of bytes (assuming the string is UTF-8)
utf8 = input.unpack("C*");

// 2. repack
packed = utf8.pack("U*");

// 3. convert using Iconv
ucs2 = Iconv.iconv("UCS-2", "UTF-8", packed).first

// 4. freeze
ucs2.freeze

// 5. get pointer, and length (in 16 bit words)
pointer = StringValuePtr(ucs2); // this bit in C
count = ucs2.length / 2;

// 6. hand off to the parser...
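
For comparison, here is a minimal sketch of the Ruby-side steps (3 to 5) as runnable code. It rests on assumptions that are not part of my setup above: Ruby 1.9 or later, where String#encode is built in and stands in for Iconv, and a parser that wants little-endian UTF-16 with no byte-order mark. The sample input is made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9+ (String#encode) rather than Iconv.
input = "price: \u20AC5"   # hypothetical UTF-8 input containing a euro sign

# Convert to UTF-16, little-endian, no byte-order mark (an assumption
# about what the parser expects), then freeze as in step 4.
utf16 = input.encode("UTF-16LE").freeze

# Number of 16-bit code units to hand to the parser. bytesize always
# counts bytes, whatever the string's encoding.
count = utf16.bytesize / 2
```

For text inside the Basic Multilingual Plane, UTF-16 and UCS-2 produce the same bytes, so this would cover the UCS-2 case too. Characters outside it become surrogate pairs and take two code units each.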

My doubts are basically as follows:

- I'm doing the unpack/repack because I am not sure that my string is
encoded internally as UTF-8... it *seems* to be, because if I type a
string like "€" in irb then I can see that it's composed of three
bytes in UTF-8 ("\342\202\254")

- Is it in UTF-8 only because my system's locale is set that way?
Might it be different on other people's machines? (And if so, how
would I find out what the encoding is?)

- In the case that the encoding is *not* UTF-8, does my "round-trip"
unpack/pack trick actually get it into UTF-8? (I don't think it will!
In which case the round-trip is a waste of time.)

- And once I've got the String in UCS-2, does StringValuePtr give me
access to the raw UCS-2 encoded data like I think it does? (seems to)

- Does calling length on the UCS-2 encoded string always give the
result in bytes? (I am almost certain that it does)

- Is there some more elegant way to get an arbitrary Ruby string into
UCS-2 so that it can be handed off to the C parser?
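
To make the round-trip and length doubts concrete, here is a small sketch of what I would expect to see. It assumes Ruby 1.9 or later, where strings carry an encoding and String#bytesize exists. On Ruby 1.8, length counts bytes, so the last two checks would behave differently there.

```ruby
# Sketch only: assumes Ruby 1.9+ (encoding-aware strings).
euro = "\u20AC"                    # the euro sign, three bytes in UTF-8

# "C*" yields raw bytes; "U*" then treats each byte as a code point
# and encodes it as UTF-8, so the non-ASCII bytes get encoded twice.
bytes    = euro.unpack("C*")       # [226, 130, 172]
repacked = bytes.pack("U*")        # 6 bytes, no longer the euro sign

# Using "U*" on both sides decodes and re-encodes whole code points.
codepoints = euro.unpack("U*")     # [8364]
same       = codepoints.pack("U*") # identical to the original string

# After conversion, length counts characters and bytesize counts bytes.
utf16 = euro.encode("UTF-16LE")
chars = utf16.length               # 1
size  = utf16.bytesize             # 2
```

If this holds, the "C*"/"U*" combination only leaves pure ASCII input unchanged, and on 1.9+ the word count should come from bytesize rather than length. String#encoding reports which encoding a given string is tagged with.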

Cheers,
Wincent
