Friday, December 08, 2017

Correctly generate CSV that Excel can automatically open

Software that generates UTF-8 CSV should include the byte order mark (BOM) at the start of the text stream. If these bytes are missing, programs like Excel won't know the encoding, and simply double-clicking the file to open it in Microsoft Excel won't work as expected on either Windows or macOS.
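
As a rough illustration of what "including the BOM" means for a generator, here is a minimal shell sketch. The function name `write_csv_with_bom` and the output file `with-bom.csv` are made up for this example; a real application would normally emit the three bytes from its own CSV-writing code.

```shell
# Hypothetical helper: emit the UTF-8 BOM, then copy the CSV data from stdin.
write_csv_with_bom() {
  # \357\273\277 is the octal spelling of the bytes EF BB BF (the UTF-8 BOM).
  printf '\357\273\277'
  cat
}

# Sample data containing non-ASCII characters, written with a leading BOM.
printf 'name,city\nJosé,Málaga\n' | write_csv_with_bom > with-bom.csv
```

The octal escapes are used because `\x` escapes in `printf` are not supported by every shell.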

You might want to run a simple test yourself. Say you have a UTF-8 CSV without a BOM that renders as garbled text when opened in Excel. If you open that file in Notepad and save it under a different name, selecting UTF-8, the new file will render correctly. If you compare the two files (on a *nix system) you will notice that the difference is three bytes at the start of the file that identify the encoding of the CSV:
$ diff <(xxd -c1 -p original.csv) <(xxd -c1 -p saved-as-utf8.csv)
0a1,3
> ef
> bb
> bf
Tell the software developer in charge of generating the CSV to correct it. As a quick workaround you can use gsed (GNU sed) to insert the UTF-8 BOM at the beginning of the file:
gsed -i '1s/^\(\xef\xbb\xbf\)\?/\xef\xbb\xbf/' file.csv
This command inserts the UTF-8 BOM only if it is not already present, so it is idempotent: running it again leaves the file unchanged.
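
If GNU sed is not available, the same idempotent behavior can be sketched with more basic tools. This is only an illustration: the function names `has_utf8_bom` and `add_utf8_bom` are hypothetical, and the sketch assumes `head -c`, `od` and `mktemp` exist on your system.

```shell
# Hypothetical helper: succeeds if the file starts with the bytes EF BB BF.
has_utf8_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Hypothetical helper: prepend the UTF-8 BOM only when it is missing.
add_utf8_bom() {
  if ! has_utf8_bom "$1"; then
    tmp=$(mktemp) &&
    # Build BOM + original content in a temporary file first.
    { printf '\357\273\277'; cat "$1"; } > "$tmp" &&
    # Copy back in place so the original file keeps its permissions.
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
  fi
}
```

Calling `add_utf8_bom file.csv` a second time does nothing, because the check at the top already finds the BOM.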
