AdWords
2K members online now
2K members online now
Get started with AdWords - learn the basics to get set up for success
Guide Me
star_border
Reply

Character Encoding used by the editor

Visitor ✭ ✭ ✭
# 1
Visitor ✭ ✭ ✭

Hello,

 

I am trying to import csv files directly into the Editor. These files are generated by a script, using Unicode/UTF-8. 

My problem is that Adwords Editor doesn't encodes correctly the characters, so I must open the csv in Excel and save it again. 
I was wondering if anybody new which encoding is used by the Editor. The Adwords Help only says that the file have to be Unicode, without specifying the encoding.

 

Thanks in Advance

1 Expert replyverified_user
1 ACCEPTED SOLUTION

Accepted Solutions
Marked as Best Answer.
Solution
Accepted by topic author francescodellav
September 2015

Re: Character Encoding used by the editor

Google Employee
# 2
Google Employee

Have your script insert a byte order mark at the beginning of the file - three bytes EF BB BF. Then Editor will correctly identify it as UTF-8.

 

Microsoft Excel has two options to export files that Editor can import: "Comma Separated Values (CSV)", and "Unicode Text". The former uses a system-default ANSI codepage encoding, the latter is UTF-16LE.

 

Editor first looks for a byte-order mark (which Excel doesn't write, but other tools do, e.g. Open Office); absent that, it looks for a pattern of zero bytes associated with UTF-16 and UTF-32 encodings; and absent that, it assumes a system-default ANSI codepage (so that we can import ""Comma Separated Values" format from Excel). Since there is no tell-tale sign that would allow one to easily distinguish UTF-8 from ANSI encoding, UTF-8 has to be indicated with a byte-order mark.

View solution in original post

Marked as Best Answer.
Solution
Accepted by topic author francescodellav
September 2015

Re: Character Encoding used by the editor

Google Employee
# 2
Google Employee

Have your script insert a byte order mark at the beginning of the file - three bytes EF BB BF. Then Editor will correctly identify it as UTF-8.

 

Microsoft Excel has two options to export files that Editor can import: "Comma Separated Values (CSV)", and "Unicode Text". The former uses a system-default ANSI codepage encoding, the latter is UTF-16LE.

 

Editor first looks for a byte-order mark (which Excel doesn't write, but other tools do, e.g. Open Office); absent that, it looks for a pattern of zero bytes associated with UTF-16 and UTF-32 encodings; and absent that, it assumes a system-default ANSI codepage (so that we can import ""Comma Separated Values" format from Excel). Since there is no tell-tale sign that would allow one to easily distinguish UTF-8 from ANSI encoding, UTF-8 has to be indicated with a byte-order mark.

Re: Character Encoding used by the editor

Visitor ✭ ✭ ✭
# 3
Visitor ✭ ✭ ✭

I tried the byte order mark solution and it may work correctly on Windows. It didn't work for me on AdWords Editor for Mac (or Excel on Mac).

 

Google's documentation specifically states that for Adwords Editor on Mac that you must use UTF-16. This is ludicrous to require UTF-16 for languages that should only require UTF-8. But whatever... See bullet #4.

https://support.google.com/adwords/editor/answer/56368?hl=en

 

I work for Finch and we have a lot of Nordic customers. The special chars like ø and å would be correctly encoded in the CSV while in UTF-8, but the AdWords Editor for Mac would continue to display them incorrectly. I finally used the iconv tool to convert the files that I was creating and now it works fine. For example:

 

iconv -f UTF-8 -t UTF-16 in.csv > out.csv

 

If you're using php then you can use the iconv php library to format the strings while you're writing the file, but in my code it was easier to handle file after it was completed.

 

I hope this helps someone else.

 

 

 

Re: Character Encoding used by the editor

Google Employee
# 4
Google Employee
Editor uses the same detection logic on Windows and on Mac. UTF-8 with a byte order mark should work. Could you upload somewhere (e.g. Google Drive) an example file that doesn't work for you, and post a link?

The documentation doesn't say that Editor requires UTF-16. It only suggests that, when exporting from Microsoft Excel, you should select UTF-16 option. The only alternative that Excel provides is a system-default ANSI codepage, which may result in data corruption as the codepage cannot represent the full range of Unicode characters.

Other tools can generate UTF-8 files with a byte order mark, and Editor happily accepts them, to the best of my knowledge.

Re: Character Encoding used by the editor

Visitor ✭ ✭ ✭
# 5
Visitor ✭ ✭ ✭
Thank you for the information. I must have been doing the BOM incorrectly. I've never had to deal with it before.

I'm not using excel at all. I'm taking UTF-8 encoded XML and converting it to CSV to use with AdWords editor. I write the file with PHP fopen and fputcsv. All the special characters work fine in the XML and CSV while editing with a plain text editor like vi. iconv detects that the file is UTF-8. However, when I try importing to AdWords editor then the special characters all fail. Converting it to UTF-16 fixes the problem.

According to information here, AdWords editor should be reading the encoding type, not the BOM and that the BOM is "not recommended" for UTF-8 files.
http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom

I'll investigate a way for PHP to write a BOM correctly.

Thanks!

Re: Character Encoding used by the editor

Google Employee
# 6
Google Employee
What is this "encoding type" of which you speak? Unfortunately, other than BOM, there is no indication in a UTF-8 file that it is, indeed, a UTF-8 file.

Re: Character Encoding used by the editor

Visitor ✭ ✭ ✭
# 7
Visitor ✭ ✭ ✭
I'm not an encoding master. Here's what I have figured out...

The BOM is not set in my test csv file, yet the standard Linux/OS-X 'file' command can correctly determine that it is UTF-8:

jason$ file --mime-encoding test-utf8.csv
test-utf8.csv: utf-8

Additionally, other tools like vi or TextWrangler can read the file and display the characters correctly and identify the file as UTF-8. However, AdWords Editor doesn't detect that it's UTF-8 because it doesn't have the BOM and doesn't handle the special chars.

The link that I sent says:

"Use of a BOM is neither required nor recommended for UTF-8, but may be encounter in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

The wikipedia article says:
"While Unicode standard allows BOM in UTF-8, it does not require or recommend it. Byte order has no meaning in UTF-8 so a BOM only serves to identify a text stream or file as UTF-8 or that it was converted from another format that has a BOM."

So rather than messing around trying to figure out how to get PHP to properly write BOM (I've tried and not been successful), I just use iconv to convert it to UTF-16 as shown above and then the file works.

It solves my problem and I hope it solves someone else's problem.

Re: Character Encoding used by the editor

Google Employee
# 8
Google Employee
I suspect all those tools simply default to UTF-8 when there is no indication otherwise. Unfortunately, Editor cannot do that - the default encoding produced by Excel is *not* UTF-8, and if we don't handle files produced by Excel as best we can, we'll have a riot on our hands. We learned this the hard way - the first version of Editor that implemented CSV Import did assume UTF-8 by default.

I'd love to use Unicode everywhere and forget that codepages ever existed, but sadly, reality intrudes.