26 February 2006

Quiz

I received the following e-mail today from joldford@strato.com:

hello jason orendorff,
If you don't mind me asking are you a programer by profession? I noticed Microsoft has new Visual programming products, it is hyped for easier, quicker, more affordable programing, are you familiar with it? http://msdn.microsoft.com/developercenters/ http://msdn.microsoft.com/vstudio/products/
Do you freelance (independently write, produce, or make) any desktop programs for Windows 98, XP? Or Mac?
Regards,
John Oldford

The question is, is this spam?

23 February 2006

To be continued

An MIT mailing list on lightweight languages poses the warm-up question: What is the value of this expression:

  (call/cc call/cc)

Ready for the main event? Figure out this one:

  ((call/cc call/cc) (call/cc call/cc))

The answer is helpfully missing.

I went through several contradictory answers before figuring out a good way to reason about continuations. I'll put a hint in the comments.

22 February 2006

SICM?

Structure and Interpretation of Computer Programs, by Abelson and Sussman, is the best book on programming I've ever mostly read. You can read the full text online for free. You'll learn Scheme as you go.

This course and the accompanying book, Structure and Interpretation of Classical Mechanics, coauthored by Sussman, approach basic physics via computer simulation rather than algebra and calculus. Shades of A New Kind of Science. Interesting.

21 February 2006

To do

  • Explore possible APIs for discrete finite graphs, along the lines of the stuff for arrays and hash tables that has become canon in PL design these days. (I guess awk started this trend?) Graphs are everywhere: directory trees with symlinks; PL syntax trees; objects in memory and the pointers that tie them together; network architecture—any relational database with foreign keys is a graph.

  • Come to think of it, what is it exactly about arrays and hash tables that makes them so insanely useful? Arrays: ordered, integer-indexed, random-access in constant time, non-sparse, mutable, growable in constant time on average, efficiently sortable, non-unique collection of values. Hash tables: non-ordered, arbitrarily-indexed, random-access in constant time on average, mutable, growable in constant time on average, unique-per-key collection of key-value pairs. There are nigh-infinite combinations of collection properties like these; why these two combinations in particular?

  • That reminds me: submit a patch for Python that implements a priority_queue type to add to the collections module. It would basically deprecate the heapq module. heapq is just a collection of functions; this would be a class. Unlike heapq, it would accept key=, cmp=, and reversed= keyword arguments that would govern the sort order (see the sketch after this list). This is a really good idea, worth doing. It's the only idea I know of that clearly belongs in collections right away.

    One problem with this is that I've never implemented a Fibonacci heap before. I only know about binary heaps, like the one in heapq, which are slower. For example, merging two binary heaps is O(n); with Fibonacci heaps it's O(1).
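
To make the interface concrete, here's a minimal sketch built on heapq's binary heap (not a Fibonacci heap, and cmp= is left out); the names are just what I'd propose, nothing official:

    import heapq
    import itertools

    class priority_queue(object):
        def __init__(self, items=(), key=None, reversed=False):
            self._key = key or (lambda x: x)
            self._sign = -1 if reversed else 1   # numeric keys assumed for reversed=
            self._count = itertools.count()      # tie-breaker keeps push order stable
            self._heap = []
            for item in items:
                self.push(item)

        def push(self, item):
            heapq.heappush(self._heap,
                           (self._sign * self._key(item), next(self._count), item))

        def pop(self):
            return heapq.heappop(self._heap)[-1]

        def __len__(self):
            return len(self._heap)

    q = priority_queue(["spam", "eggs", "ni"], key=len, reversed=True)
    print(q.pop())   # "spam": the longest string comes out first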

20 February 2006

Python-style generators and listcomps in JavaScript?

According to a random Mozilla developer, JavaScript may get generators and array comprehensions. The proposed syntax is extremely similar to Python.

Possible side projects

I think I've picked my next side project. It sits nicely on the border between pointless and useful, and it's something particularly interesting and profitable for me. I've been thinking about how to do it, and I'm pretty sure I have a good approach. Best of all, it isn't anything I've mentioned here yet. The only thing is, it will take some focused effort.

One more side project idea:

  • Blog about the cool ideas that own my brain. I suspect there are a few killer counterintuitive ideas I've run into that knocked me flat and now guide my entire philosophy—reliable systems made of unreliable parts (e.g. the Internet); order without design (the most important idea in economics and biology); etc. The main question is whether enough of them are non-math and non-computing to make them worth writing about.

17 February 2006

Possible side projects

  • As an April Fool's joke, write a PEP to add Lisp-style macros to Python, complete with quoting, the whole nine yards. (I have a nagging feeling this idea is just short of brilliant, so if you happen to be a Python developer, please don't give me away! If this entry mysteriously disappears from my blog, you'll know I'm committed...)

Possible side projects

Considering my goal of working on one frivolous side project at a time, I've decided to post (even more of) my stupid ideas here, so I always have plenty of frivolous things to choose from.

  • It would be cool if subversion had a command svn post-patch -u <url> that would basically do svn diff, put a summary of the files touched into a text file, open an editor so the user can add an explanation of the patch, tar+gzip it all together, and post the resulting patch to a specified HTTP URL. svn could even do something dumb to find the appropriate URL within the source tree itself, rather than requiring the user to specify it.

    The idea is, if you're a big open-source project using subversion, you could install some hypothetical web-based tool to receive and track patches. Your core developers could browse and apply patches with a few clicks. (A rough sketch of the client side follows.)
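
Here's a back-of-the-envelope sketch of the client side as a plain script (2006-era Python 2; the URL, file names, and archive layout are all invented for illustration):

    import os
    import subprocess
    import tarfile
    import tempfile
    import urllib2   # urllib.request in Python 3

    def post_patch(url, message):
        workdir = tempfile.mkdtemp()
        # 1. Capture the diff and a summary of the files touched.
        diff = subprocess.Popen(["svn", "diff"],
                                stdout=subprocess.PIPE).communicate()[0]
        status = subprocess.Popen(["svn", "status", "-q"],
                                  stdout=subprocess.PIPE).communicate()[0]
        open(os.path.join(workdir, "patch.diff"), "w").write(diff)
        open(os.path.join(workdir, "summary.txt"), "w").write(status + "\n" + message)
        # 2. Bundle everything into one archive.  (A real tool would open $EDITOR
        #    here so the user could write the explanation interactively.)
        archive = os.path.join(workdir, "patch.tar.gz")
        tar = tarfile.open(archive, "w:gz")
        for name in ("patch.diff", "summary.txt"):
            tar.add(os.path.join(workdir, name), arcname=name)
        tar.close()
        # 3. Post the archive to the project's patch-tracking URL.
        request = urllib2.Request(url, open(archive, "rb").read(),
                                  {"Content-Type": "application/x-gzip"})
        urllib2.urlopen(request)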

16 February 2006

What about fantasy football?

fishsupreme linked an essay on Al Qaeda’s Fantasy Ideology. This made me smile:

It is, to be frank, something like “Dungeons and Dragons” carried out not with the trappings of medieval romances — old castles and maidens in distress — but entirely in terms of ideological symbols and emblems. The difference between them is that one is an innocent pastime while the other has proven to be one of the most terrible scourges to afflict the human race.

Dungeons and Dragons, an innocent pastime? My, how times have changed.

Possible side projects

  • Help daniele with his really cool idea for a web site. (Bonus: It'll be a great way to learn Ruby on Rails.)
  • Make a write-only object model for generating C code (a toy sketch appears after this list). I wonder if this exists. I have not come up with the right set of search terms to find it. The point is, if you create little programming languages, it's easier and more portable for your compiler to output C than machine language. On the other hand, you can't have continuations if you're stuck with C. I guess it's easier to produce Scheme code anyway...
  • Write a library that allows any of a nice set of VMs (CLR, JVM, Python VM, etc.) to load any other on demand and use its libraries. The benefit is, you wouldn't have to choose a programming language for a particular project based on the libraries you want to use. There are tons of implementations of this for specific pairs of VMs (e.g. a version of Python that embeds the CLR) but nothing generic that I can see. This is probably too hard.
  • Download pypy and see if I can write an objectspace where everything is a thunk. Lazy Python! (There is no point whatsoever to doing this.)
  • Make a language called Chef 2 that's like Chef, but with a larger standard library of verbs and tools to give the recipes more variety. Possibly the theme could be more of a horror-flick thing. (Equally pointless.)
  • Write some foldout toys in Haskell. Nothing specific in mind.
  • Write more interpreters.
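
The write-only C-emitting object model, as a toy Python sketch (class names are hypothetical):

    import sys

    class CReturn(object):
        def __init__(self, expr):
            self.expr = expr
        def emit(self, out, indent=0):
            out.write("    " * indent + "return %s;\n" % self.expr)

    class CFunction(object):
        def __init__(self, rettype, name, params, body):
            self.rettype, self.name = rettype, name
            self.params, self.body = params, body
        def emit(self, out, indent=0):
            out.write("%s %s(%s) {\n" % (self.rettype, self.name, ", ".join(self.params)))
            for stmt in self.body:
                stmt.emit(out, indent + 1)
            out.write("}\n")

    # Prints a complete, compilable C function: int answer(void) { return 42; }
    CFunction("int", "answer", ["void"], [CReturn("42")]).emit(sys.stdout)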

Interfaces and duck typing

It's nice to be able to add interfaces to a class with minimum effort.

In Java, you have to add the implements clause at the top, copy all the methods in, and implement them all. There's nothing terribly wrong with this.

In Ruby or Python, there are no explicit interfaces, and mistakes show up as AttributeErrors at run time. One advantage is, you don't have to "pretend"-implement interface methods that you don't actually support. For example, Python sockets mostly resemble files, but they don't have a seek() method.

The thing is, you can never really tell if an object provides an interface or not. Static analysis is hopeless because the interface isn't defined anywhere. Runtime analysis can't work because there's no tag on your object, nothing to check. You can use hasattr(f, 'read'), but even then there's no guarantee that read() was ever intended to work like file.read().

Some Python people say you ought to just try the operation, and if it fails, cope with the error. I think this is ridiculous.

Here's an alternative syntax, somewhere in between Java and Python.

    package std;

    interface bytes read(int len);
    interface bool isAtEnd();
    interface long tell();
    interface void seek(long offset, SeekDir whence);
    interface void close();

Here I'm defining interfaces. Each one is a single method. Now suppose I have a spam generator. I can add methods that make it act kind of like a file.

    class Spammer {
        ...
        std.read(len) { return generateSpam(len); }
        std.isAtEnd() { return false; }
        std.close() {}
    }

The spammer supports read() and isAtEnd() (always false, because there's always more spam). close() is harmless enough. But code that uses seek() isn't going to work with a Spammer, which I think is fine.

The key here is that by naming the method std.read, the author is indicating (with a minimum of typing) that the class implements a specific, documented interface. In introspectable languages (like Java and Python) this would be easily queryable at run time.
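
Here's a rough Python approximation of that runtime query, tagging methods with the interface they claim to implement; the decorator and the "std.read" names are stand-ins, not a real proposal:

    def implements(interface_name):
        def tag(method):
            method.interface = interface_name
            return method
        return tag

    class Spammer(object):
        @implements("std.read")
        def read(self, length):
            return "spam " * length

        @implements("std.isAtEnd")
        def isAtEnd(self):
            return False

    def provides(obj, interface_name):
        # Does any method of obj claim to implement this interface?
        return any(getattr(getattr(obj, name), "interface", None) == interface_name
                   for name in dir(obj))

    print(provides(Spammer(), "std.read"))   # True
    print(provides(Spammer(), "std.seek"))   # False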

(Incidentally, a statically typed language can infer the return type and parameter types, so it's actually less typing to define std.read(len) than bytes read(int len). Not sure if this is actually a good idea. But it's trivially possible.)

15 February 2006

Katrina revisited

I still don't get the sense that people appreciate how close we were to losing a lot more lives. All it would have taken was one more bureaucratic snafu, a little more fear of liability, a little more concern for appearances, and many thousands more would have died.

I'm talking about what the government got right, the first-order effect: calling for a complete evacuation, no excuses, this means you. It was a day late in coming, I guess. But remember, it had never been done before. Mayor Nagin, Governor Blanco, and President Bush got on the phone and did it (shrewdly spreading the responsibility among the three of them). I don't know how many lives this saved, but I'd guess tens of thousands at a minimum: far more than the actual death toll.

In engineering, if you're looking at thousands dead and you manage to cut that number by 80%, you're having a pretty good day. Somehow we expect a lot more of disaster response. Saving thirty thousand lives is gross incompetence; saving thirty-one thousand is just tolerable; saving thirty-one thousand and twelve is a good job.

A screwy perspective, but maybe it's the right one. If you want to save all the lives, you have to ask for it, and in America we ask by raising a ruckus and throwing the bums out when they get it wrong. Hmm.

SPARK

poormattie pointed me at SPARK. It's a commercial programming language and toolset for making extremely reliable programs. The idea is that testing isn't good enough: you need to prove your program correct.

Here's how to construct the SPARK language. (I am not 100% sure on all the details; this is from reading the SPARK web site, which contains no code examples.)

  1. Start with Ada. In fact all SPARK programs are Ada programs, and you can use whatever Ada compiler you like to compile them.
  2. Remove all the interesting features (threading, allocating objects with new, exceptions, generics, even recursion). I speculate that by doing this you have already achieved 80% of the reliability improvement you're going to get. You've effectively discouraged programmers from taking any mental shortcuts whatsoever or doing anything hard.
  3. Add an annotation language for preconditions, postconditions, and assertions. The annotations go in Ada comments, so your Ada compiler will ignore them.
  4. Add a separate file format (outside your source files) for machine-verifiable proofs. You might want to prove, for example, that a given assert will always be true, if the preconditions and invariants are met; that all methods of a class preserve the class invariants; that a procedure always terminates; that a given array index is always in the right range; that a given call to a function always passes parameters that meet its preconditions; etc.

You can buy a static analysis tool that checks your code and makes sure it's SPARK language (i.e. that you're not using any Ada features that SPARK forbids). Then you can buy a proof assistant that helps you write proofs. Presumably one of these tools also verifies proofs and tells you how much of your program is actually covered by your proofs. (For example, your development project might institute a policy that every assertion needs to be proved, and every array access needs to be proved in-bounds; the tool presumably tells you when you've done this and what's missing.)

There is a definite appeal to this kind of tool for some projects, and the notion that test-driven development is horribly misguided makes me smile. But SPARK in particular incurs a huge penalty. The first thing you lose is libraries—I'm guessing not many Ada packages are designed to avoid exceptions and the heap. Growable arrays and hash tables are right out.

Suppose you're willing to take that hit. It seems like there's also a cultural problem. For example, suppose your policy is that you need to prove all division operations won't cause divide-by-zero. This encourages programmers to change code like this:

  Velocity := Distance / DeltaT;

to this:

  if DeltaT = 0 then
    -- XXX very bad default value, will cause the missile to fall out
    -- of the sky; this can never happen though
    Velocity := 0;
  else
    Velocity := Distance / DeltaT;
  end if;

For some programmers, it'll always be easier to make this kind of change than to keep the code as-is and prove that DeltaT is never 0. Likewise, it's easier to delete an assertion than to prove it. (Maybe the solution is to have separate people responsible for code and proofs, but the web site doesn't mention this.)

I also saw nothing on the site about how robust the tools are against source code changes. Having to keep reproving things as the code undergoes minor changes sounds like a serious drag. Keeping unit tests up to date is hard enough.

06 February 2006

Managed frivolity

I am struck by a sudden impulse to organize my free time to maximize frivolous output. I think the most important thing is to spend all my free time on one useless project until it's done.

My first project is to update my web site with pictures of Abby and new pictures of James. We have some really good ones.

Update: Done!

Puzzle

daniele posed this problem last night:

A plane is booked full. Each passenger has a boarding pass with a specific assigned seat. Boarding is about to begin.

Normally each passenger would board the plane, go to his or her assigned seat, and sit down. But today, the first passenger to board is a little crazy, and takes a seat completely at random.

Here is how the remaining passengers will behave: they'll board one at a time. Each one will go first to his or her assigned seat and sit down if possible. But if someone's already sitting there, that passenger will pick an unoccupied seat at random (somebody else's seat) and sit there.

As it happens, the last passenger to board the plane is a VIP who must have his assigned seat, seat 1-A, or heads will roll. What is the probability that when the VIP boards, seat 1-A will be unoccupied?

Neat.
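
If you'd rather check your answer empirically than have it spoiled, here's a quick Monte Carlo sketch (Python; the seat count is arbitrary):

    import random

    def vip_gets_seat(n=100):
        # Seats 0..n-1; passenger i is assigned seat i.  Passenger 0 sits at random.
        seats = [None] * n
        seats[random.randrange(n)] = 0
        for p in range(1, n - 1):
            if seats[p] is None:
                seats[p] = p            # own seat is free: take it
            else:
                free = [s for s in range(n) if seats[s] is None]
                seats[random.choice(free)] = p
        return seats[n - 1] is None     # is the VIP's seat still open?

    trials = 10000
    print(sum(vip_gets_seat() for _ in range(trials)) / float(trials))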

02 February 2006

Laziness in Haskell

People actually say things about Haskell like:

When you see something like '(n-1)' and 'y + a1/6', it is a
red flag. These are exactly the kinds of expressions that lead to
memory exhaustion.

Those expressions mean the same thing as they would in C: simple arithmetic.

The reason they exhaust memory is that Haskell is lazy. When Haskell sees a calculation, it never performs it right away; at most it creates a thunk consisting of the calculation to be performed and references to the arguments. Then it can do the calculation later, when and if the result value is needed. Sometimes it takes more memory to keep the thunk around than to just calculate it and be done. In the case of (n-1), Haskell has to remember two values and the - operation, even if n has already been calculated.

But it gets worse. If you run a calculation many times in a loop, and each iteration uses the result of previous iterations, Haskell can end up building a huge tree of nested thunks, just because it's too lazy to do the calculations (which it'll eventually have to do anyway) and free up the memory.
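
Python isn't lazy, but you can sketch the effect by representing each deferred calculation as a zero-argument closure (a stand-in for a thunk) and watching them pile up:

    def thunk(f, x):
        # Defer f(x) instead of computing it now; x may itself be a thunk.
        return lambda: f(x() if callable(x) else x)

    # Roughly what a lazy loop does with (n-1): nothing is computed, we just
    # accumulate 500 nested closures in memory.
    acc = 1000000
    for _ in range(500):
        acc = thunk(lambda n: n - 1, acc)

    print(acc())   # only now does the whole chain of subtractions run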

Another programmer lists techniques for avoiding laziness in Haskell. She sounds a bit disenchanted with it.

What good is laziness? It's very clever for implementing backtracking. Just write code to produce all the possible results at each step. At runtime, you'll only ask for one result; the rest will never be evaluated. Laziness is also good for any case where there's an infinite (or just fantastically large) amount of analysis you could do, if you had infinite memory and CPU—like lookahead in a chess game—but you'd rather write unconstrained code first and apply pruning as an afterthought. And of course I think lazy lists are a brilliant programming technique—the Haskell equivalent of a generator is just an ordinary Haskell list.
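
The closest Python analogy is an infinite generator that you prune after the fact (a toy example):

    import itertools

    def naturals():
        n = 0
        while True:      # conceptually infinite; nothing runs until asked
            yield n
            n += 1

    squares = (n * n for n in naturals())          # still nothing computed
    print(list(itertools.islice(squares, 5)))      # [0, 1, 4, 9, 16]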

Generators for code factoring

Another neat feature of generators is that they let you factor out things that are logically separate but are all munged together in your existing code.

For example, you might have a complicated while loop that gets stuff from a queue and breaks when the queue is exhausted. The queueing algorithms, locking, and so forth might be partly wormed in with the rest of your code, which actually processes whatever you've gotten from the queue. You want to refactor this into two parts: looping code and processing code. (Maybe because you want to reuse the basic loop structure to process different stuff.)

Where I work (C++), they'd just copy a bunch of code. Without generators or blocks, refactoring this code is a huge pain. You'd either have to rewrite the looping code as a pullable iterator; or else rewrite the processing code as a pushable visitor. Generators let you separate them without rewriting either one.
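
A minimal Python sketch of the split, with the locking and the real queueing elided (the names are made up): the looping half becomes a generator, and the processing half stays an ordinary loop.

    import collections

    def drain(queue):
        # Looping half: everything about "get stuff until the queue is
        # exhausted" lives here, and nowhere else.
        while queue:
            yield queue.popleft()

    def process(item):
        # Processing half: knows nothing about how items arrive.
        return item.upper()

    work = collections.deque(["spam", "eggs"])
    for item in drain(work):
        print(process(item))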

01 February 2006

Generators

Generators are a breakthrough in programming. Python got them from Icon, and I'm pretty sure C# got them from Python (they're called iterators in C#). I confidently predict that in ten years every major programming language, except C++, will have them.*

To the programmer, a generator is a function that returns a sequence of values. It's like a function that returns an array; but instead of building the array, then returning the whole array at the end, a generator yields the items one by one.

For example, here's a Python function that returns all the headings in an XHTML document.

    HEADINGS = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')

    def getHeadings(node):
        headings = []
        for c in node.childNodes:
            if c.nodeType == c.ELEMENT_NODE:
                if c.localName in HEADINGS:
                    headings.append(c)
                else:
                    for h in getHeadings(c):
                        headings.append(h)
        return headings

Here's the same function, reworded to be a generator.

    def iterHeadings(node):
        for c in node.childNodes:
            if c.nodeType == c.ELEMENT_NODE:
                if c.localName in HEADINGS:
                    yield c
                else:
                    for h in iterHeadings(c):
                        yield h

When you call getHeadings(), it processes the entire document and returns a Python list. When you call iterHeadings(), it returns an iterator instead. But you use them the same way: in a for loop.

    doc = xml.dom.minidom.parse(filename)
    for h in iterHeadings(doc):
        print h.toxml()

How is this better than the version that builds a list? It uses less memory (since you never have to have the whole list in memory at once). It yields the first value sooner, because it doesn't scan the entire file up front. You can process some of the file and then break out of the loop; the rest of the file won't be searched.

What generators are good for

Suppose you want to write some code that produces (or parses, or whatever) a sequence of values (or objects, whatever). For example, suppose you're parsing a server log file and spitting out a stream of log records. Now you need to choose between a push processing model and a pull model. (There are other ways, but these models are good because they let you scan lots of data without using lots of memory, and they don't require you to use threads.)

You could write a method void parse(File filename, LogRecordHandler handler) that opens the file, reads the stream of log records, and calls a handler object method for each record in the file. The caller provides the handler object. This is the push model: the parse() method has control, and it pushes data to the caller via handler methods.

Another way to do it is to write a LogFileReader class. It could have a next() method that returns the next record from the log file. This is the pull model: the user creates a LogFileReader object and explicitly pulls each record out of it.

(SAX is pushy; DOM node iterators are pull-y.)

Each approach sucks, for a different reason. Push is easier to write but harder to use. It tends to be great for trivial use cases, but then later, when the caller wants more control or the data gets more complex, it turns out to be a pain. The caller ends up writing a bunch of complicated, boring dispatch code, because in even slightly complex cases, the handler object has to have some state to remember where it is and what's going on, and wants to handle incoming data differently depending on what's happening. And the flow of control, which I think properly belongs in the caller's hands, is instead left to the library being called (in our example, the parse() method).

Pull is easier on the caller: looping over the results from an iterator is quite natural, and special things like breaking out of the loop, or even passing the iterator to a whole different routine to handle part of the log differently, are trivial to implement. But pull is harder to implement for the producer.

The problems are style problems. In the push model, the consumer is the push-ee, and the push-ee code is convoluted and ugly. In the pull model, the producer is the pull-ee, and the pull-ee code is convoluted and ugly. To summarize: active code, code that does something, is concise, direct, and readable. Passive code, code that has a sequence of things done to it, is a mess.

Generators allow both the library and the caller to use active language. The caller's loop and the library's generator run in tandem. The library generator is written to make it look like it's pushing values to the caller. The caller's loop looks like it's pulling values from the generator. Both use active, procedural language.
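
The log-file example from above, as a generator (the record format and file name are invented):

    def read_records(path):
        # Library side: reads the file and "pushes" out one record at a time.
        f = open(path)
        try:
            for line in f:
                line = line.strip()
                if line:
                    yield line.split(None, 3)   # hypothetical record format
        finally:
            f.close()

    # Caller side: an ordinary loop, with flow control in the caller's hands.
    for record in read_records("server.log"):
        if record[0] == "ERROR":
            break   # stop early; the rest of the file is never parsed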

*Ruby blocks do most of the same stuff as generators. The main thing Ruby blocks can't do is act like an iterator, which you could pass around to various functions, or stop iterating and resume later. For example, zip() can't be efficiently implemented over Enumerables in Ruby. Not a big deal.

I don't expect Ruby to get new generator syntax. Instead it might get an Enumerable::iterator() method that uses continuations to do essentially the same thing. It'll be hairy; but occasionally you need an iterator over a data structure, and continuations do the trick.

For C++ to die out, C++ programmers will have to die out. It'll be a while.

Correction: Ruby has a stdlib class Generator that turns a Ruby iterator into an object with next? and next methods. Nice. I hear it used to be continuation-based, but it'll use lightweight threads in an upcoming release.