Python string formatting and UTF-8 problems workaround

By Muhammad Najmi bin Ahmad Zabidi
July 1, 2015

Recently I worked on a program that required me to filter hundreds of lines of blog titles. Along the way I stumbled upon a few interesting problems, some of which are outlined in the following paragraphs.

Non-ASCII characters issue

During testing, one title went missing from the output. When I investigated, I found that it was simply because the title contained non-ASCII characters, such as accented letters.

Here is the code snippet that I was previously using:

for e in results:
    simple_author = e['author'].split('(')[1][:-1].strip()
    if freqs.get(simple_author, 0) < 1:
        print parse(e['published']).strftime("%Y-%m-%d"), "--", simple_author, "--", e['title']

And here is the fixed version:

for e in results:
    simple_author = e['author'].split('(')[1][:-1].strip().encode('UTF-8')
    if freqs.get(simple_author, 0) < 1:
        print parse(e['published']).strftime("%Y-%m-%d"), "--", simple_author, "--", e['title'].encode('UTF-8')

To fix the issue I faced, I added .encode('UTF-8') so that the author name and the title are explicitly encoded as UTF-8 before being printed. Here is an example title that would otherwise have been left out:

2014-11-18 -- Unknown -- Novo website do Liquid Galaxy em Português!
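
To make the fix easier to try in isolation, here is a minimal self-contained sketch of the same loop. The sample entry, the empty freqs dictionary, the report_lines helper and the timestamp format are all assumptions made for illustration; the original program reads its entries from an RSS feed and uses its own parse function for dates.

```python
from datetime import datetime

# Hypothetical feed entry, shaped like the dictionaries used in the loop above.
results = [
    {
        "author": u"noreply@example.com (Unknown)",
        "published": "2014-11-18T10:00:00",  # assumed timestamp format
        "title": u"Novo website do Liquid Galaxy em Portugu\xeas!",
    },
]
freqs = {}  # hypothetical lookup: author name -> count


def report_lines(entries, freqs):
    """Build one UTF-8 encoded report line per entry whose author is not in freqs."""
    lines = []
    for entry in entries:
        # Take the text between the parentheses, e.g. "Unknown".
        author = entry["author"].split("(")[1][:-1].strip()
        if freqs.get(author, 0) < 1:
            day = datetime.strptime(entry["published"], "%Y-%m-%dT%H:%M:%S")
            line = u"{0} -- {1} -- {2}".format(
                day.strftime("%Y-%m-%d"), author, entry["title"])
            # Encode explicitly instead of relying on the default codec.
            lines.append(line.encode("UTF-8"))
    return lines


lines = report_lines(results, freqs)
```

Building the whole line as Unicode text and encoding it once at the end keeps the encoding step in a single place, which is one possible variation on the fix shown above.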

Python 2.7 uses ASCII as its default encoding, which is not sufficient for scraping web content that often contains non-ASCII characters encoded as UTF-8. To be more precise, this program fetches an RSS feed in XML format, and the feed contains such characters. When my initial code tried to output them with the default ASCII codec, it could not encode them and raised an error.

Here is an example of the encoding error raised when the program handled non-ASCII characters with the ASCII codec:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 40: ordinal not in range(128)
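
The failure can be reproduced without the feed by encoding the example title directly. This is only a sketch: try_encode is a hypothetical helper, not part of the original program, and the title is written with a \xea escape so the snippet does not depend on the encoding of the source file.

```python
title = u"Novo website do Liquid Galaxy em Portugu\xeas!"


def try_encode(text, encoding):
    """Return (encoded_bytes, None) on success or (None, error) on failure."""
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as exc:
        return None, exc


# ASCII cannot represent the accented character, so this attempt fails.
ascii_bytes, ascii_error = try_encode(title, "ascii")

# UTF-8 can represent every Unicode character, so this attempt succeeds.
utf8_bytes, utf8_error = try_encode(title, "UTF-8")

print(ascii_error.start)  # 40
```

The ASCII attempt fails at index 40, the position of the "ê" in the title, which matches the position reported in the error message above.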

Right and Left text alignment

In addition to the error previously mentioned, I also had the chance to dig into several ways of formatting output.

The following format is the one I used as the initial output format:

print("Name".ljust(30)+"Age".rjust(30))
Name                                                     Age

Using the “ljust” and “rjust” methods

In the example above, I improve readability by left-justifying “Name” in a 30-character field and right-justifying “Age” in a second 30-character field.

Let’s try with the “*” fill character. The syntax is str.ljust(width[, fillchar])

print("Name".ljust(30,'*')+"Age".rjust(30))
Name**************************                           Age

And now let’s add a fill character to .rjust as well:

print("Name".ljust(30,'*')+"Age".rjust(30,'#'))
Name**************************###########################Age

str.ljust pads the string to 30 characters counted from the left, including the four characters of “Name”. str.rjust then fills another 30 characters, including the three letters of “Age”, which gives us the desired output.
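
The two calls can be wrapped in a small helper to make the widths explicit. This is a sketch; two_columns and its parameter names are made up for illustration.

```python
def two_columns(left, right, width=30, left_fill=" ", right_fill=" "):
    """Left-justify `left` and right-justify `right`, each in `width` characters."""
    return str(left).ljust(width, left_fill) + str(right).rjust(width, right_fill)


header = two_columns("Name", "Age", left_fill="*", right_fill="#")
print(header)       # Name**************************###########################Age
print(len(header))  # 60
```

Note that ljust and rjust only pad. A string that is already longer than the requested width is returned unchanged rather than truncated, so an overly long value will push the second column to the right.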

Using the “format” method

Alternatively, the same alignment can be achieved with the format string method:

print("{!s:<{fill}}{!s:>{fill}}".format("Name", "Age",fill=30))
Name                                                     Age

And with the same progression, it is also possible to do something like:

print("{!s:*<{fill}}{!s:>{fill}}".format("Name", "Age",fill=30))
Name**************************                           Age
print("{!s:*<{fill}}{!s:#>{fill}}".format("Name", "Age",fill=30))
Name**************************###########################Age

“format” can also center text. To put the desired string in the middle of the trail of fill characters, simply use the ^ (caret) character:

print("{!s:*^{fill}}{!s:#^{fill}}".format("Age","Name",fill=30))
*************Age**************#############Name#############
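
Because the fill character, the alignment and the width are all part of the format specification, they can be passed in as arguments through nested replacement fields. A minimal sketch, where pad is a hypothetical helper name:

```python
def pad(text, align, width=30, fill=" "):
    """Pad text to width; align is '<' (left), '>' (right) or '^' (center)."""
    # The inner {fill}, {align} and {width} fields are substituted first,
    # producing a format spec such as "*^30".
    return "{!s:{fill}{align}{width}}".format(
        text, fill=fill, align=align, width=width)


print(pad("Name", "<", fill="*"))  # Name**************************
print(pad("Age", ">", fill="#"))   # ###########################Age
print(pad("Age", "^", fill="*"))   # *************Age**************
```

Each call returns a string exactly 30 characters long, so the pieces can be concatenated to build the same rows as in the examples above.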

Feel free to refer to Python’s documentation on Unicode here:

https://docs.python.org/2/howto/unicode.html

The “format” method is covered here:

https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch02s13.html

