Encoding and debugging

It's hard to succinctly describe why text encoding is a problem. Basically, it stems from the early days of computing when Western characters mattered. Others, not so much — they were barely an afterthought.

By the way, the em-dash on the last sentence? That can be represented many different ways based on the encoding of the text.

Different encodings translate the bits that underpin all manner of characters into what we actually see on screen. Many have the same for A through Z, 0 through 9 and basic punctuation, so you don't even know you have a problem.

For more information, read this sweeping overview.

When it's a problem, the trick is to decode your input early into unicode, do whatever you need to do with your code, and then encode back into something like UTF8 when it's been written to a file.

We encountered this a bit during our scrape at the beginning of the workshop. We had a few columns in the table with these other characters, and we were combining them with regular strings into individual lines that would be written to a CSV. We encoded them before they were written so they didn't break our script.

Files to use:

  • encoding.py: A script where we're trying to read a text file, print the lines to the screen and then write the output to another text file. It's not working out so well.

  • broken_code.png: Flow chart for common errors in Python, what they mean and what else you can check when things aren't working out exactly as you expect them to.

  • some_text.txt: A simple text file, encoded in Windows-1252, known by Python as 'cp1252.'

As always, a working version of the script is in completed.