Re: How to clean an xml files from non-utf-8 chars?



On Wed, Sep 17, 2008 at 12:47 PM, Jeremy Hinegardner
<jeremy@xxxxxxxxxxxxxxx> wrote:
On Wed, Sep 17, 2008 at 09:44:23PM +0900, James Gray wrote:
On Sep 17, 2008, at 4:07 AM, Krzysieq wrote:

I have a problem. I'm trying to use Ruby to parse some test results from
JMeter that are stored in XML files. Unfortunately, while they should be
UTF-8, some of them aren't, probably because some of the DB data isn't. In
any case, this makes other toys break down, like XSLT transformations and
anything else that relies on the XML files being UTF-8.

Does anyone know how to get rid of such characters?

If you can figure out the encoding they are actually in, I recommend using
Iconv's transliterate mode:

require "iconv"
Iconv.conv("UTF-8//TRANSLIT", old_encoding_name, data)
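
For anyone reading this on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0, so the call above needs the separate iconv gem there. A minimal sketch of the same "known source encoding" conversion using String#encode follows. Note that this is not transliteration; unmappable characters are simply replaced with "?". The helper name to_utf8_from and the ISO-8859-1 sample data are assumptions for illustration only.

```ruby
# Hypothetical helper (not from the thread): reinterpret the raw bytes as
# `old_encoding_name` and transcode them to UTF-8. Anything that cannot be
# converted is replaced with "?" rather than transliterated.
def to_utf8_from(data, old_encoding_name)
  data.dup
      .force_encoding(old_encoding_name)
      .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Sample input: "café" stored as ISO-8859-1 bytes (0xE9 is "é" in Latin-1).
latin1_bytes = "caf\xE9".b
utf8 = to_utf8_from(latin1_bytes, "ISO-8859-1")
puts utf8
```

This only helps if the real source encoding is known; guessing the wrong one produces valid UTF-8 that contains the wrong characters.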

This is the approach we have taken in some of our code; basically, we wanted
to replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used
that mode before.

module UTF8
  module Cleanable
    #
    # Converts the string representation of this class to a utf8 clean
    # string. This assumes that #to_s on the object will result in a utf8
    # string. All chars that are not valid utf8 char sequences will be
    # silently dropped.

To silently drop chars with Iconv, you'd want to do:

Iconv.conv("UTF-8//IGNORE", old_encoding_name, data)

TRANSLIT just works a little harder and tries to convert your
characters into a series of UTF-8 chars if possible.
I'm not sure if it drops chars that can't be transliterated...
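
On Ruby 2.1 and later, where Iconv is no longer in the standard library, the "silently drop" behavior can be approximated with String#scrub. This is a sketch of a different technique, not the Iconv call above, and it assumes the data is meant to be UTF-8 already; the helper name drop_invalid_utf8 and the sample bytes are made up for illustration.

```ruby
# Hypothetical helper (not from the thread): treat the input as UTF-8 and
# drop every byte sequence that is not valid UTF-8, roughly what
# `iconv -c` does on the command line.
def drop_invalid_utf8(data)
  data.dup.force_encoding("UTF-8").scrub("")
end

# Sample input: a stray 0xFF byte and a truncated two-byte sequence (0xC3).
dirty = "abc\xFFdef\xC3".b
clean = drop_invalid_utf8(dirty)
puts clean
```

Passing "" to scrub removes the bad bytes; calling scrub with no argument would insert U+FFFD replacement characters instead, which makes the damage visible in the output.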

-greg

--
Technical Blaag at: http://blog.majesticseacreature.com | Non-tech
stuff at: http://metametta.blogspot.com
