Saturday, October 30, 2010

Python String Concatenation (again): the revenge of join!

This is my second post on the efficiency of string concatenation in Python (the first one is here). In that post I proposed some benchmarks, but I believe the results are partial for two basic reasons:
  1. we benchmarked only the concatenation of huge numbers of very small strings;
  2. we created the strings inside the timed loop, implicitly penalizing the methods based on str.join.
In this post I address the second issue: I moved the creation of the list outside the benchmarked code and simplified the functions accordingly.
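Concretely, the fragment list is now built once, before any timing starts; something along these lines (a sketch of the setup, since the exact driver is not shown in this post):

seq = [str(num) for num in xrange(loop_count)]

The simplified functions: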
from cStringIO import StringIO  # Python 2; the pure Python StringIO module also works

def str_sum(seq):
    # augmented concatenation (+=)
    out_str = ''
    for item in seq:
        out_str += item
    return out_str

def str_sum_b(seq):
    # plain concatenation with explicit rebinding
    out_str = ''
    for item in seq:
        out_str = out_str + item
    return out_str

def str_join(seq):
    # a single join over the whole sequence
    return ''.join(seq)

def str_join_lc(seq):
    # join over a generator expression
    return ''.join(item for item in seq)

def string_io(seq):
    # accumulate the fragments in an in-memory file
    file_str = StringIO()
    for item in seq:
        file_str.write(item)
    out_str = file_str.getvalue()
    return out_str
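Before timing anything, it is worth checking that all the variants agree on a small input; a minimal sanity check (my own sketch, not part of the benchmark):

# hypothetical check: all variants must produce the same string
fragments = [str(num) for num in xrange(100)]
reference = str_sum(fragments)
for func in (str_sum_b, str_join, str_join_lc, string_io):
    assert func(fragments) == reference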
The functions above are the same ones we benchmarked last time, simplified where necessary. str_sum and str_sum_b are basically unchanged. str_join, on the other hand, is completely different: its first three lines were only there to create the list, so they have disappeared entirely. For reference, this was the old version:

def str_join(loop_count):
    # old version: the list is built inside the benchmarked code
    str_list = []
    for num in xrange(loop_count):
        str_list.append(str(num))
    out_str = ''.join(str_list)
    return out_str
As a consequence, the function is much shorter. I maintained the _lc variant; however, I will remove it from serious benchmarks, as it is inevitably slower than the regular version. I also introduced some new functions to test:
import operator as op

def str_sum_c(seq):
    # reduce with operator.add
    return reduce(op.add, seq)

def str_sum_d(seq):
    # reduce with operator.iadd; str has no __iadd__, so this
    # degenerates to a plain add at every step
    return reduce(op.iadd, seq)

def str_sum_e(seq):
    # sum with '' as the start value -- rejected by Python, see below
    return sum(seq, '')
These are also excluded from the benchmarks. The _e variant is not even valid Python: calling sum on a sequence of strings gives this instructive error:
TypeError: sum() can't sum strings [use ''.join(seq) instead]
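The refusal is easy to reproduce from the interpreter:

>>> sum(['a', 'b', 'c'], '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]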
The other two were excluded because they are too slow (times in seconds):

items    str_sum    str_sum_c   str_sum_d
100000   0.010572   2.781853    2.704670
Unfortunately, reduce appears to be very slow. Consider these two additional functions:
def str_sum_f(seq):
    # reduce with the (unbound) str.__add__ method
    return reduce(str.__add__, seq)

def str_sum_g(seq):
    # caveat: str has no __iadd__, so op.iadd returns a new string;
    # the result is never rebound, out_str stays '' and the whole
    # concatenation is silently discarded -- which explains the
    # suspiciously low time below
    out_str = ''
    for item in seq:
        op.iadd(out_str, item)
    return out_str
items    str_sum_b   str_sum_f   str_sum_g
100000   0.011448    2.812612    0.018933
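The behavior of operator.iadd on immutable types is easy to verify from the interpreter: it returns the concatenated string, but the original is left untouched, so the caller must rebind the result.

>>> import operator as op
>>> s = 'a'
>>> op.iadd(s, 'b')
'ab'
>>> s
'a'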
I decided to benchmark only a relatively small subset of the presented functions. The measures I plotted are the totals over 10 executions of each function; the average time can be obtained simply by dividing by 10. In the first plot, I consider strings composed of a number of fragments one order of magnitude smaller than the ones benchmarked in my other post.
In this other plot, I present the results for the same order of magnitude as before. Again, the times shown are totals over 10 executions of each function: the spikes and irregularities were not eliminated by averaging. Consequently, I don't consider the plot meaningful until I discover the reason for such behavior (which was completely unexpected).
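For reference, measurements of this kind can be reproduced with the standard timeit module; a minimal sketch under my assumptions (the actual driver is not shown in this post, and 100000 is just an example size):

import timeit

# hypothetical driver: time 10 executions of each function over a
# pre-built list of fragments, mirroring the setup described above
setup = ("from __main__ import str_sum, str_sum_b, str_join, string_io; "
         "seq = [str(num) for num in xrange(100000)]")
for name in ('str_sum', 'str_sum_b', 'str_join', 'string_io'):
    total = timeit.timeit('%s(seq)' % name, setup=setup, number=10)
    print name, total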
In the future I will try a benchmark consisting of the concatenation of larger strings.
