Random notes from mg

a blog by Marius Gedminas

Marius is a Python hacker. He works for Programmers of Vilnius, a small Python/Zope 3 startup. He has a personal home page at http://gedmin.as. His email is marius@gedmin.as. He does not like spam, but is not afraid of it.

Thu, 07 Jan 2010

Latin-1 or Windows-1252?

Michael Foord wrote about some Latin-1 control character fun in a blog that's hard to read (the RSS feed syndicated on Planet Python is truncated, grr!) and hard to reply to (I thought there were no comments on the blog, but it turned out my Chromium's AdBlock+ hid the comment link so I couldn't find it), but never mind that.

Unfortunately the data from the customers included some \x85 characters, which were breaking the CSV parsing.

0x85 is a control character (NEXT LINE or NEL) in Latin-1, but it's a printable character (HORIZONTAL ELLIPSIS) in Microsoft's code page 1252, which is often mistaken for Latin-1. I would venture a suggestion that the encoding of the customer data was not latin-1 but rather cp1252.

>>> '\x85'.decode('cp1252')
u'\u2026'
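
The session above is Python 2. For comparison, here is a rough Python 3 sketch of the same check, decoding one byte string both ways. The sample bytes are made up for illustration, not taken from the customer data, and I don't know exactly how the CSV parsing broke in Michael's case; a stray NEL being treated as a line boundary is just one plausible way.

```python
# Hypothetical sample: a byte string with 0x85 in the middle.
raw = b"wait\x85 done"

as_latin1 = raw.decode("latin-1")   # 0x85 -> U+0085, NEXT LINE (a control character)
as_cp1252 = raw.decode("cp1252")    # 0x85 -> U+2026, HORIZONTAL ELLIPSIS

print(repr(as_latin1))
print(repr(as_cp1252))

# str.splitlines() counts U+0085 as a line boundary, so the Latin-1
# decoding falls apart into two "lines" while the cp1252 one stays whole.
print(as_latin1.splitlines())
print(as_cp1252.splitlines())
```

If the cp1252 decoding produces sensible punctuation (ellipses, curly quotes, dashes) where Latin-1 produces control characters, that is a decent hint the data is really cp1252.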
posted at 23:29 | 3 comments
Actually, his blog does allow comments, so I added a comment and linked it back to here.
posted by dave-ilsw at Fri Jan 8 03:44:42 2010
D'oh!  It was my Chromium's AdBlock+ plugin that hid the "Comments" and "Traceback" links on Michael's blog.
posted by Marius Gedminas at Fri Jan 8 13:07:54 2010
You're probably right. The customer swears blind that it is latin-1 encoded but I bet it is CP1252.
posted by Michael Foord at Fri Jan 8 13:28:44 2010
