My post on hunting memory leaks in Python received a lot of feedback via email. And both^H^H^H^H most of the emails asked for the source code.

Aside: if someone could recommend a good spam-proof comments plugin for PyBlosxom, I'd be very grateful. I see six plugins on the main site and it's unclear which combination is the one I need.

So, here's the mysterious checks module I used in the post. There's nothing complicated in it, just a bunch of ad-hoc functions I wrote in an afternoon. I'm sure you could do better.

import gc
import inspect

The gc module gives us access to the Python garbage collector, which tracks object references. The inspect module has useful functions for distinguishing certain kinds of built-in objects such as modules or call stack frames.
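
If you haven't played with these modules before, a quick interpreter session shows the kind of thing they give you:

>>> import gc, inspect
>>> tracked = gc.get_objects()   # a list of every object the collector currently tracks
>>> inspect.ismodule(gc)
True
>>> inspect.isframe(inspect.currentframe())
True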

def count(typename):
    return sum(1 for o in gc.get_objects() if type(o).__name__ == typename)

A very simple function that counts the number of objects by the class or type name. If you happen to use old-style classes, well, it won't work for you. Before the great class and type unification in Python 2.2, all instances of all classes were of type 'instance', and that's what you'll see today if you still use old-style classes.
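
Here's what I mean (the class names are made up, obviously):

>>> class OldStyle:
...     pass
...
>>> class NewStyle(object):
...     pass
...
>>> type(OldStyle()).__name__
'instance'
>>> type(NewStyle()).__name__
'NewStyle'

Old-style instances do carry the class name in obj.__class__.__name__, so the function could be taught to handle them, but I never needed that.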

Also, this simple function doesn't distinguish between classes with the same name defined in different modules. I didn't need that distinction: I prefer unique class names anyway, so that I can easily navigate my source tree with ctags.
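
Typical use looks something like this; MyBigFatObject is a made-up name, substitute whatever class you suspect of leaking:

from checks import count

print count('MyBigFatObject')   # how many instances are alive right now?
# ... exercise the code that is supposed to release them ...
print count('MyBigFatObject')   # if this number keeps growing, something is hanging on to them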

def by_type(typename):
    return [o for o in gc.get_objects() if type(o).__name__ == typename]

Essentially the same thing, but this function returns a list of objects instead of counting them.
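
Same deal, again with a placeholder class name:

from checks import by_type

leaked = by_type('MyBigFatObject')
print len(leaked)    # the same number count() would give you
obj = leaked[0]      # grab one of them for closer inspection at the prompt

Just remember that leaked itself now holds a reference to every one of those objects, which is exactly the sort of thing the extra_ignore argument of find_backref_chain below is for.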

def at(addr):
    for o in gc.get_objects():
        if id(o) == addr:
            return o
    return None

This is a way to "dereference" a "pointer" in Python, the reverse of id(). If you see an interesting object in the graph and want to get hold of it at the pdb prompt for some manual poking around, you can do that by calling at() with the id shown in the graph.

Actually, it's not a true reverse of id(). It can only find objects tracked by the garbage collector, such as lists, dicts, instances, functions and so on, but not trivial value objects (such as strings or ints) that never contain references to other objects.
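
For objects the collector does track, it round-trips with id(); for untracked ones you simply get None back:

>>> from checks import at
>>> lst = [1, 2, 3]
>>> at(id(lst)) is lst
True
>>> at(id(42)) is None   # plain ints aren't tracked by the collector
True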

def find_backref_chain(obj, predicate, max_depth=20, extra_ignore=()):

This function, given an object and a predicate, finds the shortest chain of references leading to that object from some source object for which predicate(source_object) is true. It does a bog-standard breadth-first search over the object reference graph. The max_depth argument limits the search so that I don't have to wait too long for a negative answer (I'm pretty sure any real chain should be shorter than 20 references).

The extra_ignore argument lets me specify additional ids of objects that I want to be ignored. For example, if I stored the results of by_type('Foo') in a variable, I'd have an extra list object referencing each of the Foo objects. If I don't want that list to clutter up my searches, I can pass extra_ignore=[id(my_list)].

    queue = [obj]
    depth = {id(obj): 0}
    parent = {id(obj): None}

The standard BFS elements: a FIFO queue, a mapping from a node to its depth (depth[x] = the length of the shortest path from obj to x), and a mapping that lets me reconstruct the shortest path easily (parent[x] = the next node on the shortest path from x back to obj).

    ignore = set(extra_ignore)
    ignore.add(id(extra_ignore))
    ignore.add(id(queue))
    ignore.add(id(depth))
    ignore.add(id(parent))
    ignore.add(id(ignore))

The objects that implement the search should not themselves become part of the search!

    gc.collect()

Just to make sure we have no collectable garbage lying around.

    while queue:
        target = queue.pop(0)
        if predicate(target):

Found it! Let's reconstruct the reference chain now.

            chain = [target]
            while parent[id(target)] is not None:
                target = parent[id(target)]
                chain.append(target)
            return chain

Otherwise, continue looking...

        tdepth = depth[id(target)]
        if tdepth < max_depth:

... but don't dive deeper than the user asked.

            referrers = gc.get_referrers(target)
            ignore.add(id(referrers))

We definitely don't want objects created during the search itself to become part of the search space (you could easily get into an infinite loop). Leaving the id of the referrers list in the ignore set is harmless even if the list gets garbage collected at the end of the inner loop and its id is reused by some other new object: we don't want any new objects in the search space anyway.

            for source in referrers:

We look at each source object referring to the current target object...

                if inspect.isframe(source) or id(source) in ignore:
                    continue

... ignoring the ones we want to ignore. Note that here I also skip call-stack frames, even though I probably shouldn't: frame objects hold references to a function's local variables, and it's entirely plausible that some long-running function is the source of your memory leak. In my particular case it wasn't, and all the frame objects appeared in my graph solely because of locals I had defined at the pdb prompt. Adding this check simplified my object graphs without affecting the outcome.

                if id(source) not in depth:
                    depth[id(source)] = tdepth + 1
                    parent[id(source)] = target
                    queue.append(source)

The standard BFS steps again.

    return None # not found

And if we ever get out of that loop, it means the search failed.
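
To give you an idea of the intended use, here's roughly what I'd type, again with the made-up MyBigFatObject, and with "some module refers to it" as the predicate (any one-argument function returning a true/false value will do):

import inspect
from checks import by_type, find_backref_chain

leaked = by_type('MyBigFatObject')
chain = find_backref_chain(leaked[0], inspect.ismodule,
                           extra_ignore=[id(leaked)])
if chain is None:
    print 'nothing module-shaped refers to it within 20 hops'
else:
    # chain[0] satisfies the predicate (here: a module), chain[-1] is our object
    for link in chain:
        print type(link).__name__, hex(id(link))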

Since this post is getting excessively long, I'll describe the show_backrefs function that you all really wanted to see in the next post.

Update: continue reading.

Update 2: the 'checks' module was open-sourced as objgraph.