Handling character encodings in Python or any other language can at times seem painful. Places such as Stack Overflow have thousands of questions stemming from confusion over exceptions like
UnicodeEncodeError. This tutorial is designed to clear the
Exception fog and illustrate that working with text and binary data in Python 3 can be a smooth experience. Python’s Unicode support is strong and robust, but it takes some time to master.
This tutorial is different because it’s not language-agnostic but instead deliberately Python-centric. You’ll still get a language-agnostic primer, but you’ll then dive into illustrations in Python, with text-heavy paragraphs kept to a minimum. You’ll see how to use concepts of character encodings in live Python code.
One of the toughest things to get right in a Python program is Unicode handling. If you’re reading this, you’re probably in the middle of discovering this the hard way.
The main reasons Unicode handling is difficult in Python is because the existing terminology is confusing, and because many cases which could be problematic are handled transparently. This prevents many people from ever having to learn what’s really going on, until suddenly they run into a brick wall when they want to handle data that contains characters outside the ASCII character set.
If you’ve just run into the Python 2 Unicode brick wall, here are three steps you can take to start thinking about strings and Unicode the right way…
“Readers of this blog on my twitter feed know me as a person that likes to rant about Unicode in Python 3 a lot. This time will be no different. I’m going to tell you more about how painful “doing Unicode right” is and why. “Can you not just shut up Armin?”. I spent two weeks fighting with Python 3 again and I need to vent my frustration somewhere. On top of that there is still useful information in those rants because it teaches you how to deal with Python 3. Just don’t read it if you get annoyed by me easily.
There is one thing different about this rant this time. It won’t be related to WSGI or HTTP or any of that other stuff at all. Usually I’m told that I should stop complaining about the Python 3 Unicode system because I wrote code nobody else writes (HTTP libraries and things of that sort) I decided to write something else this time: a command line application. And not just the app, I wrote a handy little library called click to make this easier.
Note that I’m doing what about every newby Python programmer does: writing a command line application. The “Hello World” of Python programs. But unlike the newcomer to Python I wanted to make sure the application is as stable and Unicode supporting as possible for both Python 2 and Python 3 and make it possible to unittest it. So this is my report on how that went…”