Re: Size limit on NSString cStringUsingEncoding?



Steve Edwards <gfx@xxxxxxxxxxx> wrote:

OK, this seems the best way to go.
However, just 'cause it's bugging me:

( original [str length] = 49,804,307)

const char *cStr = [str cStringUsingEncoding:NSUTF8StringEncoding];
This now works with the UTF-8 encoding.

//-- Turn it back in to NSString to see its length:

NSString *newStr = [NSString stringWithUTF8String:cStr];
unsigned long newLen = [newStr length];

newLen now equals 46,804,207!

What's a missing hundred bytes between friends!?

Unicode does all kinds of weird things that could cause a difference in
length. For example, if you look at an accented character such as e'
(pretend that's all together), you can represent it as either a single
precomposed character, or a plain 'e' followed by a combining character
which means "put a ' over the preceding character". The second version
will take up two characters, even though it appears as only one glyph.
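
To make the two representations concrete, here is a minimal sketch in
plain C (which also compiles as Objective-C). The helper name
utf8_code_points is made up for this illustration and is not part of any
Apple API. It assumes its input is valid UTF-8 and counts code points,
which for these particular characters matches what -length would report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts Unicode code points in a NUL-terminated
   UTF-8 string by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Called on "\xC3\xA9" (U+00E9, the precomposed form) it returns 1; called
on "e\xCC\x81" (a plain 'e' followed by U+0301, the combining acute
accent) it returns 2, although both render as the same glyph.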

Another possibility is if you have a null character near the end of the
file. This is a legal character for UTF-8, but it will indicate the end of
the string when you treat it as a C string.
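
One way to check for this, sketched in plain C: given a buffer whose
real length is known, find out how much of it a C-string function would
actually see. The name visible_c_length is hypothetical, chosen only for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns how many bytes of buf a C-string
   function would see, i.e. the offset of the first NUL byte,
   or len if the buffer contains no NUL at all. */
size_t visible_c_length(const char *buf, size_t len)
{
    assert(buf != NULL);
    const char *nul = memchr(buf, '\0', len);
    return nul ? (size_t)(nul - buf) : len;
}
```

If the result is smaller than the real byte count, everything after that
offset is lost on the round trip through a C string. On the Cocoa side,
a length-carrying path such as -dataUsingEncoding: avoids relying on the
terminator in the first place.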

--
Michael Ash
Rogue Amoeba Software
