01 February 2006

Generators

Generators are a breakthrough in programming. Python got them from Icon, and I'm pretty sure C# got them from Python (they're called iterators in C#). I confidently predict that in ten years every major programming language, except C++, will have them.*

To the programmer, a generator is a function that returns a sequence of values. It's like a function that returns an array; but instead of building the array, then returning the whole array at the end, a generator yields the items one by one.

For example, here's a Python function that returns all the headings in an XHTML document.

    HEADINGS = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')

    def getHeadings(node):
        headings = []
        for c in node.childNodes:
            if c.nodeType == c.ELEMENT_NODE:
                if c.localName in HEADINGS:
                    headings.append(c)
                else:
                    for h in getHeadings(c):
                        headings.append(h)
        return headings

Here's the same function, reworded to be a generator.

    def iterHeadings(node):
        for c in node.childNodes:
            if c.nodeType == c.ELEMENT_NODE:
                if c.localName in HEADINGS:
                    yield c
                else:
                for h in iterHeadings(c):
                        yield h

When you call getHeadings(), it processes the entire document and returns a Python list. When you call iterHeadings(), it returns an iterator instead. But you use them the same way: in a for loop.

    doc = xml.dom.minidom.parse(filename)
    for h in iterHeadings(doc):
        print h.toxml()

How is this better than the version that builds a list? It uses less memory, since the whole list never has to exist at once. It yields the first value sooner, because it doesn't scan the entire document up front. And you can process part of the document and then break out of the loop; the rest won't be searched.
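To see the laziness at work, here's a tiny example (not from the XHTML code above; the names are made up for illustration). The generator could run forever, but because values are produced one at a time, the caller can break out after taking a few, and no extra work is done:

```python
# An infinite generator: counts upward forever from n.
def count_from(n):
    while True:
        yield n
        n += 1

# The caller pulls just three values, then breaks.
# The generator never runs past the point where we stop asking.
first_three = []
for value in count_from(10):
    first_three.append(value)
    if len(first_three) == 3:
        break
# first_three is now [10, 11, 12]
```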

What generators are good for

Suppose you want to write some code that produces (or parses, or whatever) a sequence of values (or objects, whatever). For example, suppose you're parsing a server log file and spitting out a stream of log records. Now you need to choose between a push processing model and a pull model. (There are other ways, but these two models are good because they let you scan lots of data without using lots of memory, and they don't require you to use threads.)

You could write a method void parse(File filename, LogRecordHandler handler) that opens the file, reads the stream of log records, and calls a handler object method for each record in the file. The caller provides the handler object. This is the push model: the parse() method has control, and it pushes data to the caller via handler methods.
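Here's a minimal Python sketch of that push model. All the names (parse, handle_record, CountingHandler) are illustrative inventions, not a real API, and the "parsing" is just a stand-in:

```python
# Push model: parse() has control and pushes each record to a
# handler object supplied by the caller.
class CountingHandler:
    def __init__(self):
        self.count = 0

    def handle_record(self, record):
        self.count += 1

def parse(lines, handler):
    for line in lines:
        record = line.strip()          # stand-in for real log parsing
        handler.handle_record(record)

handler = CountingHandler()
parse(["GET /index\n", "GET /about\n"], handler)
# handler.count is now 2
```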

Another way to do it is to write a LogFileReader class. It could have a next() method that returns the next record from the log file. This is the pull model: the user creates a LogFileReader object and explicitly pulls each record out of it.
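A sketch of the pull model, again with invented names. Here the reader hands back one record per call, and signals end-of-input by returning None (one arbitrary convention among several):

```python
# Pull model: the caller creates a reader and explicitly asks
# for each record in turn.
class LogFileReader:
    def __init__(self, lines):
        self._it = iter(lines)

    def next_record(self):
        """Return the next record, or None at end of input."""
        for line in self._it:
            return line.strip()        # stand-in for real log parsing
        return None

reader = LogFileReader(["GET /index\n", "GET /about\n"])
records = []
record = reader.next_record()
while record is not None:
    records.append(record)
    record = reader.next_record()
# records is now ["GET /index", "GET /about"]
```

Notice how much bookkeeping the producer needs: it has to save its position between calls (here hidden in the iterator; in a real parser it would be explicit state) so that each next_record() call can resume where the last one left off.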

(SAX is pushy; DOM node iterators are pull-y.)

Each approach sucks, for a different reason. Push is easier to write but harder to use. It tends to be great for trivial use cases, but then later, when the caller wants more control or the data gets more complex, it turns out to be a pain. The caller ends up writing a bunch of complicated, boring dispatch code, because in even slightly complex cases, the handler object has to have some state to remember where it is and what's going on, and wants to handle incoming data differently depending on what's happening. And the flow of control, which I think properly belongs in the caller's hands, is instead left to the library being called (in our example, the parse() method).

Pull is easier on the caller: looping over the results from an iterator is quite natural, and special things like breaking out of the loop, or even passing the iterator to a whole different routine to handle part of the log differently, are trivial to implement. But pull is harder to implement for the producer.

The problems are style problems. In the push model, the consumer is the push-ee, and the push-ee code is convoluted and ugly. In the pull model, the producer is the pull-ee, and the pull-ee code is convoluted and ugly. To summarize: active code, code that does something, is concise, direct, and readable. Passive code, code that has a sequence of things done to it, is a mess.

Generators allow both the library and the caller to use active language. The caller's loop and the library's generator run in tandem. The library generator is written to make it look like it's pushing values to the caller. The caller's loop looks like it's pulling values from the generator. Both use active, procedural language.
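A sketch of the same log-parsing job done with a generator (names still invented). The producer reads like push code: it just yields each record as it finds one. The consumer reads like pull code: a plain for loop. Neither side has to write dispatch code or save-and-restore state:

```python
# The producer looks like it's pushing: a simple loop that
# yields each record as it goes.
def parse_records(lines):
    for line in lines:
        yield line.strip()             # stand-in for real log parsing

# The caller looks like it's pulling: an ordinary loop.
seen = []
for record in parse_records(["GET /index\n", "GET /about\n"]):
    seen.append(record)
# seen is now ["GET /index", "GET /about"]
```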

*Ruby blocks do most of the same stuff as generators. The main thing Ruby blocks can't do is act as a first-class iterator that you can pass around to other functions, or pause and resume later. For example, zip() can't be efficiently implemented over Enumerables in Ruby. Not a big deal.

I don't expect Ruby to get new generator syntax. Instead it might get an Enumerable::iterator() method that uses continuations to do essentially the same thing. It'll be hairy; but occasionally you need an iterator over a data structure, and continuations do the trick.

For C++ to die out, C++ programmers will have to die out. It'll be a while.

Correction: Ruby has a stdlib class Generator that turns a Ruby iterator into an object with next? and next methods. Nice. I hear it used to be continuation-based, but it'll use lightweight threads in an upcoming release.
