• Home

  • Custom Ecommerce
  • Application Development
  • Database Consulting
  • Cloud Hosting
  • Systems Integration
  • Legacy Business Systems
  • Security & Compliance
  • GIS

  • Expertise

  • About Us
  • Our Team
  • Clients
  • Blog
  • Careers

  • VisionPort

  • Contact
  • Our Blog

    Ongoing observations by End Point Dev people

    Python string formatting and UTF-8 problems workaround

    Recently I worked on a program which required me to filter hundred of lines of blog titles. Throughout the assignment I stumbled upon a few interesting problems, some of which are outlined in the following paragraphs.

    Non Roman characters issue

    During the testing session I missed one title and investigating why it happened, I found that it was simply because the title contained non-Roman characters.

    Here is the code’s snippet that I was previously using:

    for e in results:                                                                                                                        
        simple_author=e['author'].split('(')[1][:-1].strip()                                                             
        if freqs.get(simple_author,0) < 1:                                                                                               
            print parse(e['published']).strftime("%Y-%m-%d") , "--",simple_author, "--", e['title']
    

    And here is the fixed version

    for e in results:                                                                                                                        
        simple_author=e['author'].split('(')[1][:-1].strip().encode('UTF-8')                                                             
        if freqs.get(simple_author,0) < 1:                                                                                               
            print parse(e['published']).strftime("%Y-%m-%d") , "--",simple_author, "--", e['title'].encode('UTF-8') 
    

    To fix the issue I faces I added .encode(‘UTF-8’) in order to encode the characters with the UTF-8 encoding. Here is an example title that would have been otherwise left out:

    2014-11-18 -- Unknown -- Novo website do Liquid Galaxy em Português!
    

    Python 2.7 uses ASCII as its default encoding but in our case that wasn’t sufficient to scrape web contents which often contains UTF-8 characters. To be more precise, this program fetches an RSS feed in XML format and in there it finds UTF-8 characters. So when the initial Python code I wrote met UTF-8 characters, while using ASCII encoding as the default sets, it was unable to identify them and returned an error.

    Here is an example of the parsing error it gave us while fetching non-roman characters while using ASCII encoding:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 40: ordinal not in range(128)
    

    Right and Left text alignment

    In addition to the error previously mentioned, I also had the chance to dig into several ways of formatting output.

    The following format is the one I used as the initial output format:

    print("Name".ljust(30)+"Age".rjust(30))
    Name                                                     Age
    

    Using “ljust” and “rjust” method

    I want to improve the readability in the example above by left-justify “Name” by 30 characters and “Age” by another 30 characters distance.

    Let’s try with the “*” fill character. The syntax is str.ljust(width[, fillchar])

    print("Name".ljust(30,'*')+"Age".rjust(30))
    Name**************************                           Age
    

    And now let’s add .rjust:

    print("Name".ljust(30,'*')+"Age".rjust(30,'#'))
    Name**************************###########################Age
    

    By using str, it counts from the left by 30 characters including the word “Name” which has four characters

    and then another 30 characters including “Age” which has three letters, by giving us the desired output.

    Using “format” method

    Alternatively, it is possible to use the same indentation approach with the format string method:

    print("{!s:<{fill}}{!s:>{fill}}".format("Name", "Age",fill=30))
    Name                                                     Age
    

    And with the same progression, it is also possible to do something like:

    print("{!s:*<{fill}}{!s:>{fill}}".format("Name", "Age",fill=30))
    Name**************************                           Age
    print("{!s:*<{fill}}{!s:#>{fill}}".format("Name", "Age",fill=30))
    Name**************************###########################Age
    

    “format” also offers a feature to indent text in the middle. To put the desired string in the middle of the “fill” characters trail, simply use the ^ (caret) character:

    print("{!s:*^{fill}}{!s:#^{fill}}".format("Age","Name",fill=30))
    *************Age**************#############Name#############
    

    Feel free to refer the Python’s documentation on Unicode here:

    https://docs.python.org/2/howto/unicode.html

    And for the “format” method it can be referred here:

    https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch02s13.html

    python


    Comments