Recently I was tasked with identifying all rows in a PostgreSQL database that contained non-UTF8 data, in preparation for migrating from SQL_ASCII encoding to UTF8. Because of the database's size, just running iconv and diff on a dump file wouldn't work.
I started with this Perl script from Jon Jensen: http://blog.endpoint.com/2010/03/postgresql-utf-8-conversion.html
It got me a little way down the road, but what I really wanted was to know which table contained each bad row and, better yet, to be able to easily import the bad rows with a COPY command for cleanup, eventually loading them into temporary tables that can be used with UPDATE statements to fix the production DB.
So, I turned to Python. I found an isUTF8(text) definition here: http://www.haypocalc.com/wiki/Python_Unicode#isUTF8.28.29
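The check itself is simple; a minimal sketch along the same lines (not the exact function from the linked page) just tries to decode the raw bytes as UTF-8:

    def is_utf8(data):
        """Return True if the byte string decodes cleanly as UTF-8."""
        try:
            data.decode('utf-8')
        except UnicodeDecodeError:
            return False
        return True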
I took that and wrote pg_find_bad_utf8.py, which writes COPY commands containing only the non-UTF8 rows to STDOUT, and writes to STDERR a schema-only dump command covering just the tables with bad data, for ease of creating a temporary DB.
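To make that concrete, here is a rough sketch of the general approach only, not the actual script: it assumes a data-only dump in COPY format on stdin (pg_dump -a) and the database name "mydb" is a placeholder.

    #!/usr/bin/env python
    # Sketch: re-emit each COPY block with only the rows that fail the UTF-8 check.
    import sys

    def is_utf8(data):
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    def main():
        copy_header = None      # current "COPY table (cols) FROM stdin;" line
        bad_rows = []           # non-UTF8 rows seen under that header
        bad_tables = []         # tables that had at least one bad row

        for line in sys.stdin.buffer:
            if line.startswith(b'COPY '):
                copy_header = line
                bad_rows = []
            elif line.startswith(b'\\.'):
                if copy_header and bad_rows:
                    # COPY block containing only the bad rows -> STDOUT
                    sys.stdout.buffer.write(copy_header)
                    sys.stdout.buffer.writelines(bad_rows)
                    sys.stdout.buffer.write(b'\\.\n')
                    bad_tables.append(copy_header.split()[1].decode('utf-8', 'replace'))
                copy_header = None
            elif copy_header and not is_utf8(line):
                bad_rows.append(line)

        if bad_tables:
            # Schema-only dump command for just the affected tables -> STDERR
            tables = ' '.join('-t %s' % t for t in bad_tables)
            sys.stderr.write('pg_dump --schema-only %s mydb\n' % tables)

    if __name__ == '__main__':
        main()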
You can find the script on github here: https://github.com/pgexperts/pg_find_bad_utf8
What other methods do you folks use?