3 Common Unicode Issues and How to Fix Them

Unicode Issues when Manipulating Strings and Modules

Usually, text and string manipulation is not hard: we can read from a file, write to a file, and compare and order strings. At times, though, it is not so straightforward, because there are some potential pitfalls.

SyntaxError When Loading Modules with Unexpected Encoding

Every Python enthusiast has probably encountered this at least once in their programming or engineering career: a SyntaxError whose message starts with "Non-UTF-8 code".

If you load a .py module containing non-UTF-8 data with no encoding declaration, you get the error above, because Python 3 assumes that source files are UTF-8. A likely scenario is opening a .py file that was saved on Windows with the cp1252 codec.

To solve this, add a magic coding comment at the top of the file:

# coding: cp1252
print('Olá, Mundo!')
# Olá, Mundo!
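
To see both behaviours side by side, here is a minimal sketch that writes the same cp1252-encoded source to a temporary file twice, once without and once with the coding comment, and runs each in a child interpreter. The helper name run_source and the file name ola.py are made up for this illustration, and PYTHONIOENCODING is set only so that the child's output can be decoded reliably.

```python
import os
import subprocess
import sys
import tempfile


def run_source(source_bytes):
    """Write raw bytes to a temporary .py file and run it with this interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'ola.py')  # hypothetical file name
        with open(path, 'wb') as fp:
            fp.write(source_bytes)
        # Force UTF-8 on the child's stdout/stderr so we can decode them safely.
        env = dict(os.environ, PYTHONIOENCODING='utf-8')
        return subprocess.run([sys.executable, path], capture_output=True, env=env)


# The same line of code, saved the way a cp1252 editor would save it.
body = "print('Olá, Mundo!')\n".encode('cp1252')

undeclared = run_source(body)
declared = run_source(b'# coding: cp1252\n' + body)

# Without the declaration the child fails with a SyntaxError.
print(undeclared.returncode != 0)
print(b'SyntaxError' in undeclared.stderr)

# With the declaration the module loads and prints the greeting.
print(declared.stdout.decode('utf-8').strip())
```

If you control the file, re-saving it as UTF-8 is usually the more durable fix; the coding comment is mainly useful when the file has to stay in its original encoding.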

Text File Issues

The best practice for handling text I/O is the Unicode sandwich. Decode bytes to str as early as possible on input, for example when opening a file for reading. The middle layer of the sandwich is the program's business logic, which works strictly with str objects. On output, encode str back to bytes as late as possible.

Figure: The Unicode sandwich, the current best practice for handling text. Image from the book Fluent Python.
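
The three layers of the sandwich can be sketched in a few lines. This is only an illustration: the sample bytes and the upper() call stand in for whatever input and business logic your program really has.

```python
# Bottom slice: bytes arrive from the outside world (file, socket, ...).
raw_input = b'caf\xc3\xa9 au lait'

# Decode to str as early as possible.
text = raw_input.decode('utf-8')

# Filling: the business logic only ever sees str objects.
shouted = text.upper()

# Top slice: encode back to bytes as late as possible.
raw_output = shouted.encode('utf-8')

print(text)
print(shouted)
print(raw_output)
```

When you use open() in text mode, the decoding and encoding steps are done for you by the file object, which is why the encoding argument matters so much.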

Sometimes decoding fails with a UnicodeDecodeError exception, or worse, silently produces garbled text. This happens because we did not specify the encoding argument when opening the file.

When open() is called without the encoding argument, Python falls back to the value of locale.getpreferredencoding(). The preferred encoding differs from one platform to another: on Windows it is often cp1252, while on Linux and macOS it is usually UTF-8. A file written with one encoding and read back with another is what triggers the problem.
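
A quick way to check what your own machine would pick is to ask for the preferred encoding and compare it with what open() reports when no encoding is given. This is just a sketch; the values printed depend on your platform and settings, so no expected output is shown.

```python
import locale
import os

# The encoding Python considers "preferred" for text data on this machine.
preferred = locale.getpreferredencoding(False)

# The encoding open() actually uses when the argument is omitted.
with open(os.devnull) as fp:
    default_for_open = fp.encoding

print(preferred)
print(default_for_open)
```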

Hence the solution would be to always provide the encoding argument!

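# Write the file with an explicit encoding.
fp = open('cafe.txt', 'w', encoding='utf_8')
print(fp)
# <_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf_8'>
fp.close()

# Reopen it without the encoding argument: Python uses the platform default.
fp2 = open('cafe.txt')
print(fp2)
# <_io.TextIOWrapper name='cafe.txt' mode='r' encoding='cp1252'>  (on a Windows machine)
print(fp2.encoding)
# cp1252
fp2.close()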
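
To make the consequences of a mismatch concrete, here is a small sketch that forces both failure modes by passing the "wrong" encoding explicitly, so it behaves the same on every platform. The temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cafe.txt')

    # Case 1: written as UTF-8, read back as cp1252 -> no error, but garbled text.
    with open(path, 'w', encoding='utf_8') as fp:
        fp.write('café')
    with open(path, encoding='cp1252') as fp:
        garbled = fp.read()
    with open(path, encoding='utf_8') as fp:
        correct = fp.read()

    # Case 2: written as cp1252, read back as UTF-8 -> UnicodeDecodeError.
    with open(path, 'w', encoding='cp1252') as fp:
        fp.write('café')
    error = None
    try:
        with open(path, encoding='utf_8') as fp:
            fp.read()
    except UnicodeDecodeError as exc:
        error = exc

print(garbled)   # cafÃ©
print(correct)   # café
print(type(error).__name__)  # UnicodeDecodeError
```

The first case is the more dangerous one, because nothing fails and the bad data can travel a long way before anyone notices.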

Sorting Text Issue

Python sorts strings by comparing their code points. Unfortunately, this produces unacceptable results for anyone who uses non-ASCII characters: accented letters such as ç sort after z, and on top of that, different locales have different sorting rules.
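
Here is what the default behaviour looks like on a short list of Brazilian fruit names. The sketch only uses the built-in sorted() and ord() functions.

```python
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']

# Plain sorted() compares code points, character by character.
by_code_point = sorted(fruits)
print(by_code_point)
# ['acerola', 'atemoia', 'açaí', 'caju', 'cajá']

# The reason: 'ç' and 'á' have larger code points than any ASCII letter.
print(ord('t'), ord('ç'))  # 116 231
print(ord('u'), ord('á'))  # 117 225
```

A Portuguese speaker would expect 'açaí' before 'acerola' and 'cajá' before 'caju'.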

The standard way to sort non-ASCII text in Python is to use the locale.strxfrm function which, according to the locale module docs, transforms a string to one that can be used in locale-aware comparisons.

import locale

# set locale to Portuguese with UTF-8 encoding
my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=locale.strxfrm)

print(sorted_fruits)
# ['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

There are a few caveats though:

  • The locale must be installed on the OS, otherwise setlocale raises a
    locale.Error: unsupported locale setting exception.

  • The locale must be correctly implemented by the makers of the OS, which is not always the case.

  • Locale settings are global, so calling setlocale affects the whole process. Your application or framework should set the locale when the process starts, and should not change it afterward.
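
One way to live with these caveats is to try a few locale names once at startup and fall back to plain code point sorting if none is installed. The sketch below assumes that; the function name configure_collation is made up for this example, and 'Portuguese_Brazil.1252' is an assumed Windows-style name that may not exist on your system.

```python
import locale


def configure_collation(candidates):
    """Try each locale name in order; return the one that worked, or None."""
    for name in candidates:
        try:
            locale.setlocale(locale.LC_COLLATE, name)
        except locale.Error:
            # Locale not installed on this OS: try the next candidate.
            continue
        return name
    return None


# Call this once, when the process starts.
active = configure_collation(['pt_BR.UTF-8', 'Portuguese_Brazil.1252'])

# Use locale-aware keys only if a locale was actually set.
sort_key = locale.strxfrm if active else None

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=sort_key)

print(active)
print(sorted_fruits)
```

The output depends on which locales are installed, so the fallback result may still be in code point order.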

How to Properly Sort Text?

The simplest solution is to use a library, as always.

Pyuca is a pure Python implementation of the Unicode Collation Algorithm (UCA).

The UCA details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard.

import pyuca
coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=coll.sort_key)
print(sorted_fruits)
# ['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

This is simple and works on GNU/Linux, macOS, and Windows.

Conclusion

All sorts of bugs can creep into our code, and encoding bugs are no joke! A Unicode exception can break the app, so it's better to catch it as early as possible.
