Saturday, October 23, 2010

Python repeated string concatenation: should we change our habits?

One of the most common errors newbies do is related to string concatenation in Python. In Python, string are immutable objects; consequently, concatenating two string effectively creates a new object (and possibly releases the two former object). The common advise is to build strings using the popular str.join method.
sep.join(string_framents_sequence)
which frequently just turns into:
''.join(string_framents_sequence)
Of course, sometimes the advice goes as far as "never" ever concatenate two strings". In some sense, the advise is good, but it is often given for the wrong reasons. In fact, most of the times the '%' (or format) idiom is simply clearer. Compare:
'[%d:%d] %d %s>' % (min, hour, ret_code, path)
with:
'[' + min + ':' + hour + '] ' + ret_code + ' ' + path + '>'
Anyway, recently it has been disputed whether the str.join alternative is so much faster than repeated concatenation, given the fact that the += operator has been improved. My goal is not to investigate this matter.

Baseline

At first, I decided to benchmark this simple function:
def baseline(loop_count):
    for num in xrange(loop_count):
        str(num)
Since my first testing code is creating strings from a bunch of integer, here I am just benchmarking performance of the bare operation, without actually building the string. The second step is to accumulate all the fragments. The idea is to discover the point where thrashing starts, that is to say the range where the operation behavior remains linear. Outside these zone, we must figure out other factors and in fact the benchmark is not relevant. A third baseline tests uses list comprehensions.
def baseline2(loop_count):
    lst = []
    for num in xrange(loop_count):
        lst.append(str(num))
    return lst

def baseline3(loop_count):
    return [str(int) for int in xrange(loop_count)]
If we examine the plot of the execution times, we can see that when we keep in memory all the string fragments, thrashing starts after 107 iterations. The graph is logarithmic in both axis. As a side note, using list comprehensions is twice as fast than manually adding stuff to the list one by one. Benchmark of String Fragments Creation

Tested Code

These are the functions I actually benchmarked. Full code for the benchmark can be found in my git repository.
def str_sum(loop_count):
    out_str = ''
    for num in xrange(loop_count):
        out_str += str(num)
    return out_str

def str_join(loop_count):
    str_list = []
    for num in xrange(loop_count):
        str_list.append(str(num))
    out_str = ''.join(str_list)
    return out_str

def str_join_lc(loop_count):
    return ''.join(str(num) for num in xrange(loop_count))

def string_io(loop_count):
    file_str = StringIO()
    for num in xrange(loop_count):
        file_str.write(str(num))
    out_str = file_str.getvalue()
    return out_str
str_sum is essentially string concatenation using operator +=. str_join and str_join_lc are string concatenations performed with str.join. The str_join_lc variant uses list comprehensions. string_io is a StringIO based method (here it is the StringIO from cStringIO). Code has been adapted from the code benchmarked here. However, I decided not to use the `` operator, even though it is faster. Since it is going away from Python, I believe it is not relevant anymore. Moreover, using str in place of `` uniformly slows down all the methods.

Results

In this graph, I present results relative to Python2.6. Essentially, nowadays the faster method to create the strings has been proven to be the once advised str.join. However, I believe that difference is not great enough to change habits.
Here I present also plots for Python 2.7 and Python 2.6; essentially there is no significative difference with relative results. Python 2.7 is just faster than Python 2.6 with every considered method (and Python 2.5 is slower).
So, it appears that indeed concatenating strings with the += operator is simply faster. Out of curiosity I also tried a variant:
def str_sum_b(loop_count):
    out_str = ''
    for num in xrange(loop_count):
        out_str = out_str + str(num)
    return out_str
and I found no real difference with the variant with +=. Wow! The benchmark was executed on my MacPro. Python2.5 and Python2.6 are the Python variants distributed with the OS (Snow Leopard). Python 2.7 has been dowloaded from Python.org. In this other post I investigated string concatenation using pre-built list of string fragments. The results are quite different, indeed! Technorati Tags: , , , ,

No comments: