Published on Tue, May 14, 2013
So what is “UTF-8 BOM” mean ? its byte order mark for UTF-8, some bytes (0xEF,0xBB,0xBF) are added at the start of the file to indicate that the file having unicode characters in it. BOM Characters “9”.
As per Unicode documentation, the presence of BOM in file are useless, because it causes problems with non-BOM-aware software’s to identify or parse the leading characters having at the start. Same has been quoted at the bottom of the Wikipedia page:
Related errors in
PostgreSQL:
ERROR: invalid input syntax for integer:
“9” (in psql-client)
SQL state: 22P02 (in PgAdmin-III)
Test case & fix on
Windows:
Sample file “state_data.txt” created in NOTEPAD with unicode characters
in it:
9,Karnataka,कर्नाटक
10,Kerala,केरळा
Table to import data:
create table states(state_code int, state_name char(30), state_in_hindi text);
Error:
postgres=# copy test from 'c:/Pgfile/state_data.txt' with delimiter ',' CSV;
ERROR: invalid input syntax for integer: "9"
CONTEXT: COPY test, line 1, column state_code: "9"
To fix, I have used a tool “bomremover.exe” to remove leading characters from a file as its on windows, if its on linux, then there are many tips & tricks available on net to wipe BOM from a utf-8 format file.
Tool Download link
and usage:
http://www.mannaz.at/codebase/utf-byte-order-mark-bom-remover/
Eg:-
C:\Pgfile>bomremover.exe . *
Added '.\state_data.txt' to processing list.
Press enter to process all files in the list. (1 files in total)
Processing file '.\state_data.txt'...
Finished. Press Enter to Exit
After running bomremover.exe on file, re-run COPY command which will succeed to import data.
state_code | state_name | State_name_in_hindi
------------+------------+---------------------
9 | Karnataka | αñòαñ░αÑ<8d>αñ¿αñ╛αñƒαñò
10 | Kerala | केरळा
(2 rows)
Some of the editors, avoids default saving text with UTF8-BOM:
–Raghav