Re: Unicode



On 30/09/2007, Felipe Contreras <felipe.contreras@xxxxxxxxx> wrote:
Hi,

On 9/29/07, John Joyce <dangerwillrobinsondanger@xxxxxxxxx> wrote:

Yes but what about stuff already encoded in UTF-16?

That's why I said read up on Unicode!
After you read that stuff you'll understand why it's no problem.
I'm not going to explain it. Many people understand it, but might
make mistakes when explaining it.
Read the Unicode stuff carefully. It's vital for many things.

The only thing you might run into is a BOM or endianness, but it's
doubtful it will be an issue in most cases.

This might get you started.
http://www.unicode.org/faq/utf_bom.html#37
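
A minimal sketch of what "BOM or endianness" means in practice, assuming a modern Ruby (2.0 or later, well after this thread). The helper name sniff_bom is hypothetical, not a library call. It only looks at the leading bytes, so it is a heuristic: data without a BOM returns nil, and the encoding then has to be known some other way.

```ruby
# Hypothetical helper: guess a Unicode encoding from a leading byte order mark.
# Returns nil when no BOM is present.
def sniff_bom(bytes)
  b = bytes.b  # binary copy, so raw bytes are compared
  # UTF-32 is checked before UTF-16 because FF FE 00 00 also starts with FF FE.
  return Encoding::UTF_8    if b.start_with?("\xEF\xBB\xBF".b)
  return Encoding::UTF_32LE if b.start_with?("\xFF\xFE\x00\x00".b)
  return Encoding::UTF_32BE if b.start_with?("\x00\x00\xFE\xFF".b)
  return Encoding::UTF_16LE if b.start_with?("\xFF\xFE".b)
  return Encoding::UTF_16BE if b.start_with?("\xFE\xFF".b)
  nil
end

p sniff_bom("\xFF\xFEo\x00".b)  # little-endian UTF-16 BOM, then "o"
p sniff_bom("plain ascii")       # no BOM
```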


Even Joel Spolsky wrote a brief bit on Unicode... mostly trumpeting
how programmers need to know it and how few actually do.
The short version is that UTF-16 is basically wasteful for ASCII-heavy
text. It uses 2 bytes for low code points (the ASCII range) where
UTF-8 uses 1.

As you suggested I read the article:
http://www.joelonsoftware.com/articles/Unicode.html

I didn't find anything new. It's just explaining character sets in a
rather non-specific way. ASCII uses 7 bits, so it can store 128
characters, so it can't store all the characters in the world, so
other character sets are needed (really? I would have never guessed
that). UTF-16 basically stores characters in 2 bytes (that means more
of the characters in the world). UTF-8 also allows more characters, but
it doesn't necessarily need 2 bytes: it uses 1, and if the character is
beyond 127 then it will use 2 bytes or more. The original design could
be extended up to 6 bytes, though the current standard (RFC 3629) stops
at 4.
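
The size claims above can be checked directly. This is a sketch assuming Ruby 2.0 or later with UTF-8 source encoding. The four sample code points are arbitrary picks, one per UTF-8 length class.

```ruby
# Sample code points chosen for illustration: ASCII, Latin-1 range,
# a CJK ideograph, and one outside the Basic Multilingual Plane.
samples = ["A", "\u00E9", "\u65E5", "\u{1F600}"]

samples.each do |ch|
  printf("U+%04X  UTF-8: %d byte(s)  UTF-16: %d byte(s)\n",
         ch.ord, ch.bytesize, ch.encode("UTF-16LE").bytesize)
end
```

The last sample needs two 16-bit units (a surrogate pair) in UTF-16, so "2 bytes per character" is not a safe assumption there either.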

So what exactly am I looking for here?

UTF-8 and UTF-16 are pretty much the same. They encode a single
character using one or more units, where these units are 8-bit or
16-bit respectively. The only things you get by converting to UTF-16
are space efficiency for code points that need 3 bytes in UTF-8 but fit
in one 16-bit unit (such as Japanese characters), and endianness
issues. Note that some characters may (and some must) be composed of
multiple code points (a base character code point plus additional
accent code point(s)).
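
To illustrate the point about characters built from several code points, here is a sketch assuming Ruby 2.5 or later (String#unicode_normalize arrived in 2.2 and String#grapheme_clusters in 2.5). The variable names are only for illustration.

```ruby
decomposed = "e\u0301"                          # "e" + COMBINING ACUTE ACCENT
composed   = decomposed.unicode_normalize(:nfc) # the single code point U+00E9

puts decomposed.length                    # number of code points
puts composed.length
puts decomposed.grapheme_clusters.length  # user-perceived characters
puts decomposed == composed               # compared code point by code point
```

Both strings display as the same accented letter, yet they differ in length and do not compare equal until one of them is normalized.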


You really need to spend an afternoon reading about Unicode. It
should be required in any computer science program as part of an
encoding course. Americans in particular are often the ones who know
the least about it....

What is there to know about Unicode? There's a couple of character
sets, use UTF-8, and remember that one character != one byte. Is there
anything else for practical purposes?

I'm sorry if I'm being rude, but I really don't like it when people
tell me to read stuff I already know.

My question is still there:

Let's say I want to rename a file "fooobar", and remove the third "o",
but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
of course there will still be a 0x00 in there. That's if the string is
recognized at all.

Why is there no issue with UTF-16 if only UTF-8 is supported?

If you handle UTF-16 as something else, you break it regardless of the
language support. If you know (or have a way to find out) that it's
UTF-16, you can convert it. If there is no way to find out, all
language support is moot.
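
A sketch of the "fooobar" case, assuming a modern Ruby (1.9 or later) where strings carry an encoding; the Ruby of this thread's era had no such API. The input is assumed to be known UTF-16LE, nothing here detects it.

```ruby
# Bytes as they might arrive from a UTF-16LE source (assumed, not detected).
raw = "fooobar".encode("UTF-16LE").b

# Wrong: treat the data as single bytes and cut out the third "o" (byte 6).
# Its 0x00 partner stays behind and every following unit is misaligned.
broken = raw[0, 6] + raw[7..-1]
puts broken.bytesize                  # odd, so no longer whole 16-bit units

# Right: declare what the bytes are, convert, edit by character, convert back.
name  = raw.dup.force_encoding("UTF-16LE").encode("UTF-8")
fixed = name[0, 3] + name[4..-1]      # drop the character at index 3
back  = fixed.encode("UTF-16LE")

puts fixed
puts back.bytesize
```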

Thanks
Michal
