Saturday, October 30, 2010

Python String Concatenation (again): the revenge of join!

This is my second post on the efficiency of string concatenation in Python (the first one is here). In that post I proposed some benchmarks, but I believe the results are partial for two basic reasons:
  1. we benchmarked only the concatenation of huge numbers of very small strings;
  2. we created the strings inside the timed loop, implicitly penalizing the methods based on str.join.
In this post I address the second issue: I moved the creation of the list outside the benchmarked code and simplified the functions accordingly.
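Concretely, the fragment list is now built once, before any timing starts; something along these lines (a sketch of the setup, since the exact driver is not shown in this post):

seq = [str(num) for num in xrange(loop_count)]

The simplified functions: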
from cStringIO import StringIO  # Python 2; the pure Python StringIO module also works

def str_sum(seq):
    # augmented concatenation (+=)
    out_str = ''
    for item in seq:
        out_str += item
    return out_str

def str_sum_b(seq):
    # plain concatenation with explicit rebinding
    out_str = ''
    for item in seq:
        out_str = out_str + item
    return out_str

def str_join(seq):
    # a single join over the whole sequence
    return ''.join(seq)

def str_join_lc(seq):
    # join over a generator expression
    return ''.join(item for item in seq)

def string_io(seq):
    # accumulate the fragments in an in-memory file
    file_str = StringIO()
    for item in seq:
        file_str.write(item)
    out_str = file_str.getvalue()
    return out_str
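Before timing anything, it is worth checking that all the variants agree on a small input; a minimal sanity check (my own sketch, not part of the benchmark):

# hypothetical check: all variants must produce the same string
fragments = [str(num) for num in xrange(100)]
reference = str_sum(fragments)
for func in (str_sum_b, str_join, str_join_lc, string_io):
    assert func(fragments) == reference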
The functions above are the same ones we benchmarked last time, simplified where necessary. str_sum and str_sum_b are basically unchanged. str_join, on the other hand, is completely different: its first three lines were only there to create the list, so they have disappeared entirely. For reference, this was the old version:

def str_join(loop_count):
    # old version: the list is built inside the benchmarked code
    str_list = []
    for num in xrange(loop_count):
        str_list.append(str(num))
    out_str = ''.join(str_list)
    return out_str
As a consequence, the function is much shorter. I maintained the _lc variant; however, I will remove it from serious benchmarks, as it is inevitably slower than the regular version. I also introduced some new functions to test:
import operator as op

def str_sum_c(seq):
    # reduce with operator.add
    return reduce(op.add, seq)

def str_sum_d(seq):
    # reduce with operator.iadd; str has no __iadd__, so this
    # degenerates to a plain add at every step
    return reduce(op.iadd, seq)

def str_sum_e(seq):
    # sum with '' as the start value -- rejected by Python, see below
    return sum(seq, '')
These are also excluded from the benchmarks. The _e variant is not even valid Python: calling sum on a sequence of strings gives this instructive error:
TypeError: sum() can't sum strings [use ''.join(seq) instead]
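The refusal is easy to reproduce from the interpreter:

>>> sum(['a', 'b', 'c'], '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]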
The other two were excluded because they are too slow (times in seconds):

items    str_sum    str_sum_c   str_sum_d
100000   0.010572   2.781853    2.704670
Unfortunately, reduce appears to be very slow. Consider these two additional functions:
def str_sum_f(seq):
    # reduce with the (unbound) str.__add__ method
    return reduce(str.__add__, seq)

def str_sum_g(seq):
    # caveat: str has no __iadd__, so op.iadd returns a new string;
    # the result is never rebound, out_str stays '' and the whole
    # concatenation is silently discarded -- which explains the
    # suspiciously low time below
    out_str = ''
    for item in seq:
        op.iadd(out_str, item)
    return out_str
items    str_sum_b   str_sum_f   str_sum_g
100000   0.011448    2.812612    0.018933
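The behavior of operator.iadd on immutable types is easy to verify from the interpreter: it returns the concatenated string, but the original is left untouched, so the caller must rebind the result.

>>> import operator as op
>>> s = 'a'
>>> op.iadd(s, 'b')
'ab'
>>> s
'a'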
I decided to benchmark only a relatively small subset of the presented functions. The measures I plotted are the totals over 10 executions of each function; the average time can be obtained simply by dividing by 10. In the first plot, I consider strings composed of a number of fragments one order of magnitude smaller than the ones benchmarked in my other post.
In this other plot, I present the results for the same order of magnitude as before. Again, the times shown are totals over 10 executions of each function: the spikes and irregularities were not eliminated by averaging. Consequently, I don't consider the plot meaningful until I discover the reason for such behavior (which was completely unexpected).
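For reference, measurements of this kind can be reproduced with the standard timeit module; a minimal sketch under my assumptions (the actual driver is not shown in this post, and 100000 is just an example size):

import timeit

# hypothetical driver: time 10 executions of each function over a
# pre-built list of fragments, mirroring the setup described above
setup = ("from __main__ import str_sum, str_sum_b, str_join, string_io; "
         "seq = [str(num) for num in xrange(100000)]")
for name in ('str_sum', 'str_sum_b', 'str_join', 'string_io'):
    total = timeit.timeit('%s(seq)' % name, setup=setup, number=10)
    print name, total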
In the future I will try a benchmark consisting of the concatenation of larger strings.
