Our Zope 3 application had a speed problem: a view that exports an XML file of the user data stored in the ZODB took two thirds of an hour to run. Clearly, the time to optimize it had arrived.

I've been indoctrinated by reading various sources (Federico Mena-Quintero's blog is a good one) that you must do two things before you start fiddling with the code:

  1. Create a reproducible benchmark
  2. Profile

Step one: benchmark. A coworker convinced me to reuse the unit/functional test infrastructure. I've decided that benchmarks will be functional doctests named benchmark-*.txt. In order not to clutter (and slow down) the usual test run, I've changed the test collector to demote benchmarks to a lower level, so they are only run if you pass the --all (or --level 10) option to the test runner. Filtering out the regular tests to get only the benchmarks is also easy: test.py -f --all . benchmark.
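For illustration, the collector change amounts to something like this (a minimal sketch, assuming zope.app.testing's FunctionalDocFileSuite and the zope.testing test runner's level attribute; the real collector has a bit more to it):

import os
import unittest
from zope.app.testing.functional import FunctionalDocFileSuite

def test_suite():
    suite = unittest.TestSuite()
    here = os.path.dirname(__file__)
    for filename in sorted(os.listdir(here)):
        if filename.startswith('benchmark-') and filename.endswith('.txt'):
            benchmark = FunctionalDocFileSuite(filename)
            # demote: only run with --all (or --level 10)
            benchmark.level = 10
            suite.addTest(benchmark)
    return suite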

Next, the measurement itself. Part of the doctest is used to prepare for the benchmark (e.g. create 1000 random users in the system, with a fixed random seed to keep it repeatable). The benchmark itself is enclosed in function calls like this:

>>> benchmark.start('export view (1000 users)')
...
>>> benchmark.stop()

The result is appended to a text file (benchmarks.txt in the current directory) as a single line:

export view (1000 users), 45.95

where 45.95 is the number of CPU seconds (measured with time.clock) between start/stop calls.
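The helper itself is tiny. A minimal sketch, assuming it is exposed to the doctests as a module-level benchmark object:

import time

class Benchmark(object):
    """Record CPU time between start() and stop() and log it."""

    def start(self, name):
        self.name = name
        self.started = time.clock()  # CPU seconds

    def stop(self):
        elapsed = time.clock() - self.started
        f = open('benchmarks.txt', 'a')
        f.write('%s, %.2f\n' % (self.name, elapsed))
        f.close()

benchmark = Benchmark()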

Next, I experimented with various numbers of users (from 10 to 2000) and recorded the running times:

export view (10 users), 0.25
export view (10 users), 0.26
export view (100 users), 2.33
export view (200 users), 4.81
export view (200 users, cached), 4.47
export view (200 users), 4.91
...
export view (2000 users), 155.96
export view (2000 users, cached), 159.91

The "cached" entries are from the same doctest run, with the benchmark repeated a second time, to see whether ZODB caches have any effect (I didn't expect any, since functional tests use an in-memory storage, and besides, all the objects were created in the same thread).

Clearly, the run time grows nonlinearly. Some fiddling with Gnumeric (somewhat obstructed by Ubuntu bug 45341) showed a pretty clear N**2 curve. However, it is no fun to copy and paste the numbers into a spreadsheet manually. A couple of hours playing with matplotlib and cursing broken package dependencies later, I had a 109-line Python script that plots the results and shows the same curve.
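The core of that script is small; here is a stripped-down sketch that parses benchmarks.txt and plots the running time against the number of users (the regular expression and file name match the format described above):

import re
import pylab

pattern = re.compile(r'export view \((\d+) users.*\), ([0-9.]+)')
sizes, times = [], []
for line in open('benchmarks.txt'):
    match = pattern.match(line)
    if match:
        sizes.append(int(match.group(1)))
        times.append(float(match.group(2)))

pylab.plot(sizes, times, 'o')
pylab.xlabel('number of users')
pylab.ylabel('CPU seconds')
pylab.show()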

Step two: profiling. (Well, actually, I did the profiling bit first, but at least I refrained from changing the code until I had the benchmark. And profiling would have been easier if the benchmark had already been in place and I hadn't had to do everything manually. For the purposes of this narrative, pretend that I did the right thing and created the benchmark first.)

A long time ago I was frustrated by the non-Pythonicity of the profile module (my definition of a Pythonic API is an API that I can use repeatedly without having to go reread the documentation every single time) and wrote a @profile function decorator. This came in handy:

from zope.publisher.browser import BrowserView
from zope.app.pagetemplate import ViewPageTemplateFile

class ExportView(BrowserView):

    template = ViewPageTemplateFile('templates/export.pt')

    @profile   # the decorator is sketched below
    def __call__(self):
        return self.template()
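A minimal sketch of such a decorator, built on the standard library's profile and pstats modules, looks something like this (my actual decorator may differ in the details):

import pstats
from profile import Profile

def profile(fn):
    """Print profiler statistics every time fn is called."""
    def wrapper(*args, **kw):
        p = Profile()
        try:
            return p.runcall(fn, *args, **kw)
        finally:
            # show the 20 most expensive functions by cumulative time
            pstats.Stats(p).sort_stats('cumulative').print_stats(20)
    wrapper.__name__ = fn.__name__
    return wrapper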

One run and I see that 80% of the time is spent in a single function, SubjectVocabulary (users are called subjects internally, for historical reasons). It is registered as a vocabulary factory and iterates through all the users of the system, creating a SimpleTerm for each and stuffing all of those into a SimpleVocabulary. It is very simple and worked quite well until now. Now, however, it is called once per user (to convert the value of a Choice schema field into a user name), which results in O(N**2) behavior of the export view.
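For illustration, the factory is essentially this (a sketch: getSubjects and the attribute names are hypothetical stand-ins for however the application enumerates its users):

from zope.schema.vocabulary import SimpleTerm, SimpleVocabulary

def SubjectVocabulary(context):
    # getSubjects() is hypothetical; the point is that every call
    # iterates over every user in the system
    terms = [SimpleTerm(subject, token=subject.__name__, title=subject.title)
             for subject in getSubjects(context)]
    return SimpleVocabulary(terms)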

A few minutes of coding, a set of new benchmarks, and here's the result:

[Benchmark time graphs]