This is my second post on the efficiency of string concatenation in Python (the first is here). In the first post I proposed some benchmarks, but I believe the results are incomplete for two basic reasons:
- we benchmarked only the concatenation of huge numbers of very small strings
- we created the strings inside the benchmarked loop, implicitly penalizing the methods based on str.join
In this post I address the second issue: I moved the creation of the list outside the benchmarked code and simplified the functions accordingly.
from cStringIO import StringIO  # or: from StringIO import StringIO

def str_sum(seq):
    # Repeated augmented assignment.
    out_str = ''
    for item in seq:
        out_str += item
    return out_str

def str_sum_b(seq):
    # Same as str_sum, with an explicit binary addition.
    out_str = ''
    for item in seq:
        out_str = out_str + item
    return out_str

def str_join(seq):
    return ''.join(seq)

def str_join_lc(seq):
    # Same as str_join, through a generator expression.
    return ''.join(item for item in seq)

def string_io(seq):
    # Accumulate the fragments in a file-like buffer.
    file_str = StringIO()
    for item in seq:
        file_str.write(item)
    out_str = file_str.getvalue()
    return out_str
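A minimal sketch of the kind of harness this implies, with the fragment list built once outside the timed code (the make_seq helper and the exact timeit invocation are my illustration here, not necessarily the actual script):

import timeit

def make_seq(loop_count):
    # Build the fragments once, outside the benchmarked code.
    return [str(num) for num in xrange(loop_count)]

seq = make_seq(100000)
for func in (str_sum, str_sum_b, str_join, string_io):
    timer = timeit.Timer(lambda: func(seq))
    # number=10 matches the "10 executions" convention used for the plots.
    print '%-10s %f' % (func.__name__, timer.timeit(number=10))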
These are the same functions we benchmarked last time, simplified where necessary. str_sum and str_sum_b are basically unchanged. On the other hand, str_join is completely different: the first three lines of its old body were only needed to create the list, so they have disappeared entirely.
# Old version, for comparison: the list was created inside the function.
def str_join(loop_count):
    str_list = []
    for num in xrange(loop_count):
        str_list.append(str(num))
    out_str = ''.join(str_list)
    return out_str
As a consequence, the function is much shorter. I maintained the _lc variant; however, I will remove it from serious benchmarks, as it is inevitably slower than the regular version: str.join converts the generator to a list internally anyway, so the generator expression only adds overhead.
I also introduced some new functions to test here:
import operator as op

def str_sum_c(seq):
    # Fold the sequence with the binary addition operator.
    return reduce(op.add, seq)

def str_sum_d(seq):
    # Same, with the in-place variant of the operator.
    return reduce(op.iadd, seq)

def str_sum_e(seq):
    return sum(seq, '')
They are all disabled: the _e variant is not even valid Python. Calling sum on a sequence of strings gives this instructive error:
TypeError: sum() can't sum strings [use ''.join(seq) instead]
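The restriction targets strings specifically; as a quick check of my own (not part of the original benchmarks), sum happily concatenates other sequence types:

print sum([[1, 2], [3]], [])  # prints [1, 2, 3]: lists concatenate fine
print sum([(1,), (2,)], ())   # prints (1, 2): so do tuples
# sum(['a', 'b'], '')         # raises the TypeError quoted above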
The other two were excluded because they are too slow (times in seconds):
| its | str_sum | str_sum_c | str_sum_d |
| --- | --- | --- | --- |
| 100000 | 0.010572 | 2.781853 | 2.704670 |
Unfortunately, the reduce-based variants appear to be very slow. The reason is that each step of the fold allocates a brand-new, ever-longer string (quadratic behavior overall), and going through a function call also defeats the in-place concatenation optimization that CPython applies to a plain out_str += item. Consider these additional functions:
def str_sum_f(seq):
    # Fold with the unbound str method instead of the operator module.
    return reduce(str.__add__, seq)

def str_sum_g(seq):
    out_str = ''
    for item in seq:
        # Note: op.iadd returns the concatenation, but the result is
        # discarded here; out_str never grows and the function returns ''.
        # This also explains the suspiciously low timing below.
        op.iadd(out_str, item)
    return out_str
| its | str_sum_b | str_sum_f | str_sum_g |
| --- | --- | --- | --- |
| 100000 | 0.011448 | 2.812612 | 0.018933 |
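The low figure for str_sum_g is therefore misleading: operator.iadd cannot mutate an immutable string, it returns a new one, and the function as written discards that result, so no real concatenation is being measured. A quick check (my own illustration) makes the semantics explicit:

import operator as op

s = 'abc'
t = op.iadd(s, 'def')  # on immutable types this is just s + 'def'
print t                # abcdef -- a brand-new string
print s                # abc    -- the original is untouched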
I decided to benchmark only a relatively small subset of the functions presented. The times I plotted are the totals of 10 executions of each function; the average time can be obtained simply by dividing by 10. In the first plot I consider strings composed of a number of fragments one order of magnitude smaller than the ones benchmarked in my other post.
In this other plot I present the results at the same order of magnitude. The times are again the totals of 10 executions of the same function, yet the spikes and irregularities were not smoothed out by the averaging. Consequently, I don't consider the plot meaningful until I discover the reason for this behavior (which was completely unexpected).
In the future I will try a benchmark consisting of the concatenation of larger strings.
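A possible setup for that experiment (the fragment sizes and counts here are placeholders of my own choosing):

def make_large_seq(loop_count, fragment_size):
    # Same shape of workload, but each fragment is fragment_size characters.
    return ['x' * fragment_size for num in xrange(loop_count)]

seq = make_large_seq(10000, 100)  # ten times fewer fragments, 100 chars each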
Technorati Tags: Algorithms, Efficiency, Optimization, Python, String Append, String Concatenation