Re: Loading a data file containing character fields with different encodings
- From: Fernando Nunes <domusonline@xxxxxxxxx>
- Date: Fri, 12 Sep 2008 20:23:00 +0100
gerrit.schutte@xxxxxxxxx wrote:
On 12 Sep., 18:53, Ian Michael Gumby <im_gu...@xxxxxxxxxxx> wrote:I don't think that its an invalid assumption.
The data is coming from one database that contains UTF-8 characters and it appears that he's attempting to load
UTF-8 characters along with Latin-1 characters.
>From what was described, it seems that the loader utility barfs on the fact that there are both latin1 and utf-8 encoding within the same record.
I'm going from memory and a fast read of the e-mails, but I thought the OP had said that he could try different locales and either one or the other field would read correctly. Which would imply its the fact that its the two encodings in the same file.
He could write a simple script to split the file in to two files. One containing the latin-1 character set column, the second containing the utf-8 column and of course both files containing the primary key information. Then use the loader tool.
However, it would be just as easy to write the loader script that converts the encoding to a "unicode" intermediate format and then load with the correct database encoding.
Depending on how fancy you want the script to get, you can do a lot of different things. You can even write a TCL/TK front end to do some GUI input work...
The point is that its not a major deal, at least as described.
HTH
-G
From: domusonl...@xxxxxxxxx_________________________________________________________________
Subject: Re: Loading a data file containing character fields with different encodings
Date: Fri, 12 Sep 2008 17:28:30 +0100
To: informix-l...@xxxxxxxx
Ian Michael Gumby wrote:
Huh?You're assuming the Database was created with UTF8?
Not at all.
Its very possible to have data encoded in two different code sets.
You can have column 1 containing a 'latin-1' transliteration of a street
name while column 2 contains the street name in the
character set of the country. (Thai, Cyrilic, Greek, Arabic, Hebrew,
whatever)
Perhaps the standard database tools will have issues, in loading but you
can write very simple python scripts to do this.
(And yes, I have done exactly this to solve this problem.)
The python solution is as follows:
1) Using the codec module, open the file as a UTF-8 encoded file.
2) Read the line in using the utf-8 encoding.
3) Try to use the unicode() method to convert from utf-8
3a) If you fail, then try to use unicode() method to convert from
'latin1' (I'm assuming that the ISO characterset is ok under latin-1.
4) Now you have the column data in a generic unicode format.
5) Insert the data using the databases encoding.
This works using cx_Oracle and Oracle.
I can use pretty much the same script against IDS of course using the
correct python adapter from Carston Hasse ?sp?
It will load your data and work in the same order of time as the
standard sql loader.
Pretty straight forward.
--
Fernando Nunes
Portugal
http://informix-technology.blogspot.com
My email works... but I don't check it frequently...
_______________________________________________
Informix-list mailing list
Informix-l...@xxxxxxxx
http://www.iiug.org/mailman/listinfo/informix-list
Get more out of the Web. Learn 10 hidden secrets of Windows Live.http://windowslive.com/connect/post/jamiethomson.spaces.live.com-Blog...
Ian, Fernando -
thank you both for your help.
Fernando -
yes you got it. Within one line of data there is one column
encoded as non UTF-8. This column goes to a field of type TEXT.
And there is another column within the same line of data
that is encoded as UTF-8. This column goes to a field of type VARCHAR.
I am told the source of the data file is a 'UTF-8 DB' on an older
informix version
( 7.31, I am told) running on UNIX (Solaris, I assume). It would not
be surprising if
the data in the TEXT column is not kept accurately at the source DB
already since
it is not made use of. But it is kept!
Could you please elaborate a little on how to enforce the 'old (wrong)
behaviour' you
mentioned?
Ian -
yes, your suggestions are convincing. However they do not
make me happy yet because
1) I can guess only what the non UTF-8 codesets in the (huge) file
are. I can see from the
data that is refers to 22 languages and there are 'funny' other then
8859-1 encodings among.
2) I am reluctant to modify the data files since I am trying to set up
a system as close as
possible to the source system the data was exported from.
The problem you're having will probably not be solved by what I was thinking... At least completely. Let's take the long story path:
Older versions of IDS would do two bad things:
1- Would accept a connection from a client with the wrong DB_LOCALE setting. This causes a very common issue: You create a DB on Unix with the default codeset (8859-1 AKA latin-1). Then you use windows applications and you don't configure the correct DB_LOCALE and CLIENT_LOCALE settings (DB_LOCALE=en_us.819;CLIENT_LOCALE=en_us.CP1252). So it used DB_LOCALE=CLIENT_LOCALE=en_us.CP1252). This caused that no conversion was done, but you were puting CP1252 characters into an 819 database!
2- Whenever DB_LOCALE != CLIENT_LOCALE codeset conversion must happen. But there are cases when there is no possible conversion (several CP1252 characters don't have 819 representation). In these cases, depending on the codeset conversion tables (specific from/to each codeset pair), it could do one of these options (I believe I don't remember them all):
a) All characters without possible conversion will be converted into a substitute character. So it's not possible to convert them back (!)
b) Each character without possible conversion will be converted to a different "unused" character in the destination codeset. This would allow conversion back
c) The character would be converted to a "similar" character
d) ?... I think they were four... but I maybe wrong ;)
So in recent versions these was changed. AFAIK all 11.x don't behave like this.
In v10, the issue 1 was fixed probably in 10.00.xC4 but I'm not sure and issue 2 was fixed in 01.00.xC7
In v9 I would have to look...
Note that CSDK also have implications on issue 1, and I think 2.90.TC6 was changed so that it looks at the database real LOCALE in order to choose the default value for DB_LOCALE
How can we revert this new (and correct) behavior...
For issue 1, set IFMX_UNDOC_B168163 to anything in the engine environment and restart it.
For the second issue if you want the old behavior you have to set a option in the ONCONFIG. I would have to check the name (it was reference as one but apparently it was changed), but YOU DON'T WANT TO DO THIS ;)
So... From scratch. You have a file with two columns, on using UTF8 and the other latin-1. And you want this data into a table within an UTF8 database.
I think you have two options:
1- Insert your data as UTF8 and get a correct database.
2- Insert your data "as is" and keep having a... err... corrupted (?) database
Solution for option 1:
1- Generate a third column in your file. You'll use it to uniquely identify your rows
2- Create a table temp_utf8_col with two columns (integer, column using UTF8)
3- Create a table temp_CP1252_col with two column (integer, column using CP1252)
4- split your file in two (sequential column + utf8 and sequential column + 88591)
5- Set DB_LOCALE=CLIENT_LOCALE=en_us.utf8 and load the utf8 file into temp_utf8_col
6- Set DB_LOCALE=en_us.utf8 and CLIENT_LOCALE=en_us.CP1252 and load the CP1252 file into temp_CP1252_col
7- connect to the database and:
INSERT INTO final_table
SELECT col_utf8, col_88591
FROM temp_utf8_col utf8, temp_CP1252_col cp1252
WHERE utf8.col1 = cp1252.col2
You may have problems with this if there are CP1252 characters that don't have representation in UTF8. This may seem impossible, but if you look closely at some CP1252 characters... try writing a "-" in MS Word, or "quote... text... quote" and you'll see what I mean
Solution for option 2:
- Set the variable mentioned above for issue 1
- restart the engine
- Follow the steps as for option 1, but when setting DB_LOCALE/CLIENT_LOCALE use the same value for both. In other words, change step 6 to:
- Set DB_LOCALE=CLIENT_LOCALE=en_us.CP1252 and load the CP1252 file into temp_CP1252_col
Obviously this serves as an example. You can do the steps in a different way, create indexes on col1 on one of the tables for efficiency etc.
But:
- Option 2 is a dirty trick... Your data will be "corrupted" in some way
- Option 1 may mean that you won't be able to load the file with CP1252 unless you change some data (Ian's idea of using a script language to convert to unicode may be good, but if there are codes without conversion I don't know how it handles...
- The variable to revert to old behavior may not be available on IDS 11.50. I'd have to check...
- It would help to check the original environment taking into consideration the "long story". If you understand how the data got "mixed up" it may help you to fix it...
Well... I think that's it... I should put this in the blog...
Regards,
Fernando Nunes
Portugal
http://informix-technology.blogspot.com
My email works... but I don't check it frequently...
.
- References:
- Loading a data file containing character fields with different encodings
- From: gerrit . schutte
- Re: Loading a data file containing character fields with different encodings
- From: Fernando Nunes
- Re: Loading a data file containing character fields with different encodings
- From: Fernando Nunes
- Re: Loading a data file containing character fields with different encodings
- From: gerrit . schutte
- Loading a data file containing character fields with different encodings
- Prev by Date: Re: Loading a data file containing character fields with different encodings
- Next by Date: Re: Loading a data file containing character fields with different encodings
- Previous by thread: Re: Loading a data file containing character fields with different encodings
- Next by thread: Re: Loading a data file containing character fields with different encodings
- Index(es):
Relevant Pages
|