Recently I was tasked with identifying all rows in a PostgreSQL database that contained non UTF8 data for migrating from SQL_ASCII encoding to UTF8. Because of the database's size, just using iconv and diff on a dump file wouldn't work.
I started with this perl script from Jon Jensen: http://blog.endpoint.com/2010/03/postgresql-utf-8-conversion.html
It got me a little way down the road, but what I really wanted was to know which table had the bad row and better yet, be able to easily import the bad rows with a COPY command for cleanup and eventual use as temporary tables that we can be used with UPDATE statements to fix the production DB.
So, I turned to Python. I found a isUTF8(text) definition here: http://www.haypocalc.com/wiki/Python_Unicode#isUTF8.28.29
I took that and wrote pg_find_bad_utf8.py which outputs COPY commands with only the non-UTF8 rows to STDOUT and also outputs a schema-only dump command to STDERR which only dumps the tables with bad data for ease of creating a temporary DB.
You can find the script on github here: https://github.com/pgexperts/pg_find_bad_utf8
What other methods do you folks use?
Some years ago I did a similar conversion. Most of my text data was ASCII or LATIN1, but some MS Windows "special" characters had crept in, e.g., "intelligent" quotes. I wrote a plpgsql function that used the PG ascii() function to analyze the text columns and raise notices whenever it found non-ASCII or non-LATIN1 characters. There are some other string functions (convert*) that could be used for flagging or converting strings in the database itself.
ReplyDeleteNice Joe. Did you save those functions or is it on a blog somewhere?
DeleteI think I may have them around somewhere, but they were quick and dirty, "get the job done" deals (on a fairly small db). I'd probably rewrite them nowadays (from what I recall, they used brute-force: a for loop to extract each character and then compare its ascii() value to > 127 or the range 127-160 for latin1). Now I'd probably explore the convert functions.
ReplyDelete