Saturday, November 13, 2010

Have you got enought Fiber?

I recently bought last edition of the Pickaxe and started reading through. The "Fibers, Threads and Processes" chapter caught my attention immediately. I'm not on par with latest ruby features (especially, I have never used 1.9), so I decided to understand what fibers are.

The example in the book closely resembles the one presented in the Pickaxe when explaining Fibers. However, I performed some slight modifications are I prefer "filter" programs, reading from stdin and printing to stdout, than the ones which require a command line argument (or worse, have a fixed constant in the program code). Apparently this strategy is very poorly supported by IDEs but well... shit happens. Afterall I'm already tolerating that the environment is not usable before 2 whole minutes... ah, Netbooks!

Anyway, this is the code:
require 'fiber'
require 'fiber'

words = Fiber.new do
  STDIN.each_line do |line|
    line.scan(/\w+/) do |word|
      Fiber.yield word
    end
  end
  STDIN.close
end

counts = Hash.new(0)
while word = words.resume
  counts[word] += 1
end

counts.keys.sort.each {|k| print "#{k}: #{counts[k]}\n"}


So, the idea is that basically the block passed to the fiber is not executed.
When the resume message is sent to the fiber, the control is transferred inside the fiber. The fiber is executed up to a yield message; at this point the control is transferred back to the code which called the Fiber and the value returned by the resume message are the arguments of the yield message.

Ok, rather easy. This is basically equivalent to Python generators. As it happens most of the times, in Ruby you pass a block so to someone, in Python you define a function.

So I decided to provide a rather faithful Python version:
import re
import sys
import collections

def yield_words(iterable):
    for line in iterable:
        for word in re.finditer('\w+', line):
            yield word.group(0)
    
count = collections.defaultdict(lambda: 0)
for word in yield_words(sys.stdin):
    count[word] += 1
 
for word, occurrencies in count.iteritems():
    print word, ':', occurrencies


The structure is essentially the same. The last component in Python and Ruby is extremely similar. Of course in Ruby you pass a block, in Python you have a for loop on an iterable.

The second component is also quite similar. However, in Python we simply iterate over a for loop. We don't have to call explicitly a "resume" command/method (although we could). I think that the essential difference is that Python generators are iterators. On the other hand Ruby Fibers are something slightly different

I decided to keep the structure of the first block as similar as possible. Chaining Python iterators with functions in functools and intertools, the yield_words function could have been written like:

import itertools as it
import functools as ft

def yield_words(iterable):
    return (word.group(0) for word in
            ft.reduce(it.chain, 
                      (re.finditer('\w+', line) for line in iterable)))


Not really sure that chaining all that stuff increases readability, though. As a side note, all the presented versions are really interactive. If you use stdin instead of piping a file (you were not typing lots of text in, were you?), the program is fully interactive.

That is to say, for example, that is, each time you read a line it is immediately processed, as if the two program parts (the "producer", which asks input and the "consumer" which puts it in the dictionary) were mingled. However, the logical separation is quite evident and the code clarity is greatly improved.


No comments: