Japanese-language CSV export for Excel
MS Office is a complicated piece of junk software. Especially when it comes to Excel handling non-Roman encodings in anything outside of native Excel worksheets.
I have faced this problem when I was asked to output data from database (everything is stored in UTF-8, of course), to a CSV file. Which I, of course, did - and used UTF-8 as the encoding for CSV file.
Now we have: Unicode, already 10 years in use, and CSV, in use for even a longer time, and Excel, being developed by the biggest corporation in the world during all that time. What could go wrong you wonder?
Well.. Excel can and mostly always goes wrong.
The problem - Excel DOESN'T KNOW how to import Unicode UTF-8 CSV files! It just doesn't know about Unicode when it comes to CSV import! Absolutely unbelievable. More than that - if you use non-Japanese version of MS Office (I use English version) - you can't even export Japanese-language Excel document to CSV file! All Japanese characters will be converted to underscores.
The solution.
It took me several hours of googling, but I finally brought all the pieces together and here is what you need to do to create Excel-readable CSV with Japanese characters in it:
- When creating CSV, you must use the ShiftJIS encoding and Windows line-breaks \r\n
- In order to read the created CSV file you must use Japanese version of MS Office
And no, you can't test everything works as it should on non-Japanese version of MS Office. There's just no support for ShiftJIS there!
Of course you can use OpenOffice which read Unicode (and any other encoded)-CSV files without a problem. It doesn't work good on Mac, but is pretty nice on Windows.
I can talk for hours on what the hell is going with Microsoft and its Office package, but what I think is this situation is just not acceptable in the year 2007 (and was not acceptable even in year 2003 when the prev version of Office came out). But oh well.. at least there's a solution which will possibly save someone a couple of hours and some couple of millions nerve cells ;)
January 29th, 2008 at 6:32 pm
I got exactly the same problem. And the same frustration. Also the same irritation!
Excel does not bother checking the encoding of the incoming csv file (unless all encoded characters are affected, like utf16. sjis…, unlike utf-8).
I didn't want to affect the whole data and change the encoding of the exported csv text (like UTF-16 or SJIS) in order to keep some consistency along the whole data processing chain (utf-8 encoding).
So finally I changed the output to a XML text.
Advantages: keep utf-8 and a human readable output. Also you can specify data types accurately.
Drawbacks: use the internal coding of Excel and discover the incredible lack of flexibility of the software… When specifying a type, data cannot be blank (except String) and has to be provided the exact same way!
I also uncovered an interesting Excel specificity: any time hh:mm is actually a date, and dates start from 1899-12-31(?!), so 26:00 is actually 1900-01-01T02:00:00.000 ; that's not really a problem ; however some Excel versions consider 1900 to be a leap year, and when hours:minutes exceed 2 months…
January 29th, 2008 at 6:43 pm
Wow bloody hell! And I thought I had a pretty deep trip into Excel's code nonsense! But you definitely outdone me by a huge margin! :)
Thanks for sharing!
By the way, I have also saw some info on the net that you can export your tabular data as a HTML table, and import that into Excel. Haven't tried that myself, so I'm not sure what will happen with non-Roman characters inside these tables, even given the correct document encoding - but that might be another thing to try.