For work-related reasons I found myself trying to parse Comma-Separated-Value (CSV) files and to feed their data into a database. Definitely not a complicated task: the most difficult part was getting permission to access the resources, and figuring out what data to put where.
I decided to use Python 3 because I needed to be fast in implementing it. It so happens that the input files are UTF-8 encoded, and Python 3 is well known for its good Unicode support. To be 100% honest, after all this work on PFT I would have preferred Perl… My choice fell on Python also because my colleagues know how to use it.
In a few days my prototype was done, tested on my local system and ready to be installed with pip. Docker is really handy when it comes to testing! Unfortunately I made a big mistake: I did not realize that the target system is so old that Python 3 is not even packaged for it. Experience is what you get when you fail, they say. At least I knew Python 2 was available, so I worked on back-porting my code.
The back-porting procedure was more or less painless. The only part worth mentioning is the hack required to handle the CSV format together with Unicode. This helped me understand a number of things about Unicode and Python, and why UTF-8 is exactly what it takes to adapt legacy systems.
The problem
Both Python 2 and Python 3 come with a module named csv. In the simplest case everything boils down to a single call:
csv_reader = csv.reader(iterable)
Where the iterable parameter can be any object iterating over strings (ideally the lines of a text file), while the returned csv_reader is an iterator over lists of strings (each list is a row, each item in it is a column). I put the word "strings" in bold because everything depends on the notion of a string: it used to mean an array of bytes, while nowadays it means a Unicode string.
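For instance, feeding it an in-memory list of lines (a minimal sketch; pure-ASCII input behaves the same on Python 2 and 3):

import csv

for row in csv.reader(['a,b,c\n', 'd,e,f\n']):
    print(row)   # ['a', 'b', 'c'], then ['d', 'e', 'f']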
Let's consider the following one-line CSV text file (test.csv):
☺,👪,🐻,🐯
And the following Python script (parse.py):
from __future__ import print_function
import csv

with open('test.csv', 'rb') as fd:
    lines = [line.decode('utf-8') for line in fd]

for line in lines:
    print(repr(line), type(line))

for i, row in enumerate(csv.reader(lines)):
    for j, val in enumerate(row):
        print('row', i, 'col', j, 'val', val)
A Python 3 execution will produce the following output:
'☺,👪,🐻,🐯\n' <class 'str'>
row 0 col 0 val ☺
row 0 col 1 val 👪
row 0 col 2 val 🐻
row 0 col 3 val 🐯
While in Python 2 we get the following (problematic) one:
u'\u263a,\U0001f46a,\U0001f43b,\U0001f42f\n' <type 'unicode'>
Traceback (most recent call last):
File "parse.py", line 10, in <module>
for i, row in enumerate(csv.reader(lines)):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u263a' in position 0: ordinal not in range(128)
The perfectly legit question is: who the hell asked for str.encode(…, 'ascii')? And the answer is: the csv module did. Apparently this is a well-known issue, since a workaround is reported in the online documentation of the Python 2 csv module.
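For what it's worth, the same failure mode can be reproduced in two lines: Python 2 coerces unicode objects to byte strings through the ascii codec, and that is (as far as I can tell) what happens inside the csv module. A sketch, Python 2 only:

# Python 2 only: implicitly coercing unicode to str goes through
# the ascii codec, which is what blows up inside the csv module.
value = u'\u263a'
str(value)  # UnicodeEncodeError: 'ascii' codec can't encode character ...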
The workaround
The suggested workaround is encoding the input of csv.reader as UTF-8, and decoding back from UTF-8 any returned string.
This is my monkey-patch module, largely inspired (copy-paste-tweak) by the relevant Python 2 documentation:
from __future__ import print_function, absolute_import
import sys

if sys.version_info.major == 3:
    from csv import reader
else:
    import csv

    def _utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    def reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        csv_reader = csv.reader(_utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
        for row in csv_reader:
            # decode UTF-8 back to Unicode, cell by cell:
            yield [unicode(cell, 'utf-8') for cell in row]
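Assuming the shim above is saved as, say, unicsv.py (a module name I am making up here), the parsing script only needs to import reader from it instead of using csv.reader, and it then runs unchanged on both interpreters:

from __future__ import print_function
from unicsv import reader  # the monkey-patch module above (hypothetical name)

with open('test.csv', 'rb') as fd:
    lines = [line.decode('utf-8') for line in fd]

for i, row in enumerate(reader(lines)):
    for j, val in enumerate(row):
        print('row', i, 'col', j, 'val', val)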
Why it works
Clearly it works because of some good properties of UTF-8, but realizing exactly why was not immediate.
By chance I was reading a nice document about how to write secure applications under Unix, when I stumbled upon an interesting section about the additional security risks of using UTF-8.
To summarize the concept: in the original definition of UTF-8, one to six bytes can be used to represent a 31-bit value (the current standard restricts this to four bytes). If a value does not need all 31 bits, it can and must be encoded with the shortest possible sequence. Look here for more detailed information. The possible vulnerabilities come from the fact that a lax decoder would accept multiple representations of tricky values (e.g. 0x00), letting them slip past naive byte-level checks. The rationale is: first decode, then do security checks.
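A classic example of the problem is the overlong encoding of '/' (0x2F): a strict decoder must reject it, and Python's UTF-8 codec indeed does. A quick sketch, runnable under both Python 2 and 3:

from __future__ import print_function

# 0x2F ('/') squeezed into a two-byte sequence: an illegal "overlong" form.
overlong_slash = b'\xc0\xaf'
try:
    overlong_slash.decode('utf-8')
except UnicodeDecodeError as e:
    print('rejected:', e)

print(b'/'.decode('utf-8'))  # the only valid encoding is the single byte 0x2F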
While studying how this works, I also realized how UTF-8 maintains backward compatibility with ASCII: it represents every ASCII character with the same single byte any "old-style" software expects, while all non-ASCII symbols are represented with multi-byte sequences whose bytes never fall in the ASCII range (not even the continuation bytes).
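The first half of the property is a one-liner to verify:

# Pure ASCII text encodes to the very same bytes under both codecs.
assert u'hello,world'.encode('utf-8') == u'hello,world'.encode('ascii')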
In the case of our CSV problem, the parser is designed to split rows on the , (COMMA) symbol. In any input encoded as UTF-8, the byte 0x2C can only represent COMMA: the same value can never appear inside the byte sequence representing any other Unicode symbol. The parser will therefore do the Right Thing™: the "weird" multi-byte sequences pass through opaquely, since none of their bytes corresponds to an ASCII control character or delimiter.
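This is easy to check on the test line: once encoded as UTF-8, the only bytes below 0x80 are the three commas and the final newline. A sketch, runnable under both Python 2 and 3:

line = u'\u263a,\U0001f46a,\U0001f43b,\U0001f42f\n'
encoded = line.encode('utf-8')

# Every byte of a multi-byte symbol has the high bit set, so the only
# ASCII-range bytes left are the separators themselves:
ascii_bytes = [b for b in bytearray(encoded) if b < 0x80]
print(ascii_bytes)            # [44, 44, 44, 10]: three commas, one newline
assert encoded.count(b'\x2c') == 3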