Random notes from mg

a blog by Marius Gedminas

Marius is a Python hacker. He works for Programmers of Vilnius, a small Python/Zope 3 startup. He has a personal home page at http://gedmin.as. His email is marius@gedmin.as. He does not like spam, but is not afraid of it.

Thu, 07 Jan 2010

Latin-1 or Windows-1252?

Michael Foord wrote about some Latin-1 control character fun in a blog that's hard to read (the RSS feed syndicated on Planet Python is truncated, grr!) and hard to reply to (I thought there were no comments on the blog, but my Chromium's AdBlock+ had hidden the comment link so I couldn't find it), but never mind that.

"Unfortunately the data from the customers included some \x85 characters, which were breaking the CSV parsing."

0x85 is a control character (NEXT LINE or NEL) in Latin-1, but it's a printable character (HORIZONTAL ELLIPSIS) in Microsoft's code page 1252, which is often mistaken for Latin-1. I would venture a suggestion that the encoding of the customer data was not Latin-1 but rather cp1252.

>>> '\x85'.decode('cp1252')
u'\u2026'
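
For comparison, here is a rough Python 3 sketch of the same byte decoded both ways (the session above is Python 2). The sample record is made up for illustration, and the splitlines() check is only one plausible way a stray NEL could upset line-oriented parsing; I don't know that this is what actually happened to the customer data.

```python
# Python 3 sketch. The record below is hypothetical sample data,
# not the real customer data from the quoted post.
raw = b"Smith,see notes\x85,42"

# Latin-1 maps byte 0x85 to U+0085, the NEL control character.
as_latin1 = raw.decode("latin-1")

# cp1252 maps byte 0x85 to U+2026, HORIZONTAL ELLIPSIS.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))

# str.splitlines() treats U+0085 as a line boundary, so the Latin-1
# decoding falls apart into two pieces while the cp1252 one stays whole.
print(as_latin1.splitlines())
print(as_cp1252.splitlines())
```

If the data really is cp1252, decoding it as such turns the troublesome control character into ordinary punctuation, and the record stays on one line.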
posted at 23:29