Random notes from mg

a blog by Marius Gedminas

Marius is a Python hacker. He works for Programmers of Vilnius, a small Python/Zope 3 startup. He has a personal home page at http://gedmin.as. His email is marius@gedmin.as. He does not like spam, but is not afraid of it.

Tue, 26 Oct 2004

Removing spam from Mailman's queue

Mailman is a wonderful mailing list manager, but when you have thousands of spam messages sitting in the moderation queue, its web interface is not enough.

The messages live as Python pickles on the file system, in the Mailman data directory. The file name pattern is heldmsg-listname-number.pck. Newer versions of Mailman [1] come with a script, discard, that takes a list of path names on the command line and discards them all. In other words, to get rid of all held messages, all you have to do is type

/usr/lib/mailman/bin/discard /var/lib/mailman/data/heldmsg*

(you may have to change the directory names to suit your mailman installation).

[1] Mailman 2.1.5 has the discard script; Mailman 2.1.2 doesn't.
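
Before discarding everything, it can be useful to see which lists the held messages belong to. The following is a minimal sketch, not part of Mailman: it assumes the heldmsg-listname-number.pck naming described above, and parse_held_name and count_by_list are hypothetical helper names. List names may themselves contain hyphens, so the number is taken from the last hyphen-separated field.

```python
import os
import re
from collections import Counter

# Assumed naming: heldmsg-<listname>-<number>.pck (the list name may contain '-')
HELD_RE = re.compile(r'^heldmsg-(?P<listname>.+)-(?P<number>\d+)\.pck$')


def parse_held_name(path):
    """Return (listname, number) for a held-message path, or None if it does not match."""
    match = HELD_RE.match(os.path.basename(path))
    if match is None:
        return None
    return match.group('listname'), int(match.group('number'))


def count_by_list(paths):
    """Count held messages per list, skipping paths that do not match the pattern."""
    counts = Counter()
    for path in paths:
        parsed = parse_held_name(path)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts
```

Feeding it glob.glob('/var/lib/mailman/data/heldmsg-*.pck') gives a per-list tally (adjust the directory as before).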

However, I want to be really sure that the messages I'm discarding are spam. The most straightforward way to do that is to extract the RFC 2822 messages from Mailman's pickles and pipe them to spamassassin. I could not find a script for message extraction included with Mailman, so I had to write my own (mmextract.py):

#!/usr/bin/env python
"""
Extract an email message from a Mailman pickle.

Usage: mmextract.py filename > outputfile
"""

import sys
import cPickle
sys.path.insert(0, '/usr/lib/mailman') # you might need to change this

def main(argv=sys.argv):
    if len(argv) < 2:
        print __doc__
        return 1
    # Pickles are binary data, so open the file in binary mode
    msg = cPickle.load(open(argv[1], 'rb'))
    print msg.as_string()

if __name__ == '__main__':
    sys.exit(main())

The rest is a matter of simple shell scripting:

for fn in /var/lib/mailman/data/heldmsg*; do
    ./mmextract.py "$fn" | spamassassin -L -e > /dev/null || echo "$fn"
done | xargs /usr/lib/mailman/bin/discard

(untested, but it should work).
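
The same loop can also be written in Python. This is a rough sketch, assuming Python 3.5 or newer (for subprocess.run) and relying on the same thing the shell version relies on: spamassassin -e exits with a non-zero status when the message is spam. The names is_spam, extract and spam_paths are hypothetical, and like the shell version this has not been run against a real Mailman queue.

```python
import subprocess


def is_spam(message_bytes, command=('spamassassin', '-L', '-e')):
    """Feed one message to the checker; a non-zero exit status means spam."""
    result = subprocess.run(list(command), input=message_bytes,
                            stdout=subprocess.DEVNULL)
    return result.returncode != 0


def extract(path, extractor='./mmextract.py'):
    """Run the extraction script on one pickle and return the raw message bytes."""
    result = subprocess.run([extractor, path], stdout=subprocess.PIPE, check=True)
    return result.stdout


def spam_paths(paths):
    """Return the subset of held-message paths that the checker flags as spam."""
    return [path for path in paths if is_spam(extract(path))]
```

The list returned by spam_paths is what would be handed to the discard script.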

posted at 15:17

Thu, 21 Oct 2004

Subversion annoyances

The most annoying Subversion error message:

$ svn up
svn: Won't delete locally modified directory 'foo/bar'
svn: Left locally modified or unversioned files

This happens when a subdirectory of 'foo/bar' is removed from the upstream repository, and Subversion tries to remove it locally but fails because it finds some files that are listed in svn:ignore (e.g. editor backup files, compiled object files, compiled Python modules).

Now you have to figure out which subdirectory of 'foo/bar' Subversion wants to remove. Then you have to manually remove junk files from it. Then you have to repeatedly try various combinations of svn up and svn cleanup until Subversion finally agrees to continue the interrupted svn up operation.
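
One way to find the junk files is svn status --no-ignore, which marks ignored items with an I in the first column. Here is a small sketch that pulls those paths out of the command's output. The helper name ignored_paths is hypothetical, and the layout is assumed to be status columns followed by whitespace and then the path.

```python
def ignored_paths(status_output):
    """Return the paths that `svn status --no-ignore` reports as ignored ('I')."""
    paths = []
    for line in status_output.splitlines():
        if line.startswith('I'):
            # Drop the status column and the padding, keep the path that follows
            paths.append(line[1:].strip())
    return paths
```

Running it on the output of svn status --no-ignore foo/bar shows which leftover files are keeping a directory alive.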

Please, Subversion folks -- if you cannot delete a directory because it contains just junk files (those ignored in the output of svn status), just print a meaningful warning message (and name the correct directory rather than its parent!) and continue with the update.

Debian bug 246131.

Update: It seems to be fixed in Subversion 1.1. Unfortunately Subversion 1.1 is not in Debian.

posted at 21:05

Mon, 04 Oct 2004

Setting umask for Subversion

If you want to set up a shared Subversion repository, accessible over SSH, you need to make the following three directories group-writeable (and setgid):

  1. /path/to/svn/repository/db
  2. /path/to/svn/repository/locks
  3. /path/to/svn/repository/dav (not sure about this one; it is probably unnecessary if you only want SSH access)
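
As a sketch of the permission change itself, the helper below does roughly what chmod g+rwxs does for one directory. It assumes a POSIX system, that you own the directory, and that the directory already belongs to the developers' group (changing the group is not shown). The name share_with_group is hypothetical.

```python
import os
import stat


def share_with_group(path):
    """Add group read/write/execute and the setgid bit to a directory.

    Only the directory itself is changed; files already inside it are not touched.
    Returns the permission bits after the change.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IRWXG | stat.S_ISGID)
    return stat.S_IMODE(os.stat(path).st_mode)
```

It would be called once for each of the directories listed above.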

You also need to make sure that all user accounts that access the repository have the correct umask (002 instead of the default 022). If you do not do that, the repository will break when two different developers access it, and you'll have to go fix the permissions and run svnadmin recover.
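
To see why the umask matters, here is a small demonstration (a sketch, assuming a POSIX system without default ACLs on the temporary directory; mode_of_new_file is a hypothetical helper). It creates a scratch file under each umask and reports the resulting permission bits.

```python
import os
import shutil
import stat
import tempfile


def mode_of_new_file(umask):
    """Create a scratch file under the given umask; return its permission bits."""
    scratch_dir = tempfile.mkdtemp()
    old_umask = os.umask(umask)
    try:
        path = os.path.join(scratch_dir, 'probe')
        # Request rw-rw-rw-; the kernel clears whatever bits the umask names
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
        os.close(fd)
        return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old_umask)
        shutil.rmtree(scratch_dir)


print(oct(mode_of_new_file(0o022)))  # 0o644: other group members cannot write
print(oct(mode_of_new_file(0o002)))  # 0o664: other group members can write
```

With umask 022, every file one developer creates in the repository is read-only for everyone else in the group.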

Setting the umask is tricky because there are a lot of places where you think you could set it, but most of them do not work. Also, testing is difficult because interactive SSH sessions act differently from noninteractive ones. Here are some red herrings:

The correct solution is to put umask 002 in /etc/bash.bashrc, and make sure that users' .bashrc files do not override it.

posted at 14:54

Sat, 02 Oct 2004

Sending Unicode emails in Python

Sending a properly encoded email that contains non-ASCII characters is not as trivial as it should be. Here's more or less what I want:

# U+263A and U+263B are smiley faces (☺ and ☻)
sender = u'Sender \u263A <sender@example.com>'
recipient = u'Recipient \u263B <recipient@example.com>'
subject = u'Smile! \u263A'
body = u'Smile!\n\u263B'
send_email(sender, recipient, subject, body)

The hard part is getting all the Unicode strings to be properly encoded in the email. Details like multiple recipients, additional headers, attachments, SMTP configuration and error handling are ignored for the purposes of this article.

Here's the solution:

from smtplib import SMTP
from email.MIMEText import MIMEText
from email.Header import Header
from email.Utils import parseaddr, formataddr

def send_email(sender, recipient, subject, body):
    """Send an email.

    All arguments should be Unicode strings (plain ASCII works as well).

    Only the real name part of sender and recipient addresses may contain
    non-ASCII characters.

    The email will be properly MIME encoded and delivered through SMTP to
    localhost port 25.  This is easy to change if you want something different.

    The charset of the email will be the first one out of US-ASCII, ISO-8859-1
    and UTF-8 that can represent all the characters occurring in the email.
    """

    # Header class is smart enough to try US-ASCII, then the charset we
    # provide, then fall back to UTF-8.
    header_charset = 'ISO-8859-1'

    # We must choose the body charset manually
    for body_charset in 'US-ASCII', 'ISO-8859-1', 'UTF-8':
        try:
            body.encode(body_charset)
        except UnicodeError:
            pass
        else:
            break

    # Split real name (which is optional) and email address parts
    sender_name, sender_addr = parseaddr(sender)
    recipient_name, recipient_addr = parseaddr(recipient)

    # We must always pass Unicode strings to Header, otherwise it will
    # use RFC 2047 encoding even on plain ASCII strings.
    sender_name = str(Header(unicode(sender_name), header_charset))
    recipient_name = str(Header(unicode(recipient_name), header_charset))

    # Make sure email addresses do not contain non-ASCII characters
    sender_addr = sender_addr.encode('ascii')
    recipient_addr = recipient_addr.encode('ascii')

    # Create the message ('plain' stands for Content-Type: text/plain)
    msg = MIMEText(body.encode(body_charset), 'plain', body_charset)
    msg['From'] = formataddr((sender_name, sender_addr))
    msg['To'] = formataddr((recipient_name, recipient_addr))
    msg['Subject'] = Header(unicode(subject), header_charset)

    # Send the message via SMTP to localhost:25
    smtp = SMTP("localhost")
    smtp.sendmail(sender, recipient, msg.as_string())
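
The body-charset loop in send_email is the part most worth trying in isolation. Here is a minimal sketch of it as a standalone helper. The name pick_charset is hypothetical, and the snippet is written for Python 3, where every str is Unicode.

```python
def pick_charset(text, candidates=('US-ASCII', 'ISO-8859-1', 'UTF-8')):
    """Return the first candidate charset that can encode all of text."""
    for charset in candidates:
        try:
            text.encode(charset)
        except UnicodeError:
            # This charset cannot represent some character; try the next one
            continue
        return charset
    raise ValueError('no candidate charset can encode the text')
```

Plain ASCII text yields 'US-ASCII', text such as 'caf\u00e9' yields 'ISO-8859-1', and the smiley body from the example above yields 'UTF-8'.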

I wish I could write it like this:

from smtplib import SMTP
from email.MIMEText import MIMEText

def send_email(sender, recipient, subject, body):
    """Science-fictional simple version of send_email."""

    # The email module should be able to deal with Unicode message bodies and
    # headers and pick an appropriate charset automatically.  Today (on Python
    # 2.3) it just bombs out with a Unicode error when as_string() is called.
    msg = MIMEText(body)        # won't work
    msg['From'] = sender        # won't work
    msg['To'] = recipient       # won't work
    msg['Subject'] = subject    # won't work

    # At least the SMTP module is smart enough to discard the real name part
    # that it doesn't need
    smtp = SMTP("localhost")
    smtp.sendmail(sender, recipient, msg.as_string())
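
Later Python releases moved much closer to this. As a hedged sketch, assuming Python 3.6 or newer (none of this existed in Python 2.3): email.message.EmailMessage accepts Unicode strings for headers and body and encodes them when the message is serialized. The name build_message is a hypothetical helper, and the display name and address are passed separately here rather than parsed from one string.

```python
from email.headerregistry import Address
from email.message import EmailMessage


def build_message(sender_name, sender_addr, recipient_name, recipient_addr,
                  subject, body):
    """Build a message from Unicode strings; encoding happens on serialization."""
    msg = EmailMessage()
    # Address keeps the (possibly non-ASCII) display name apart from the address
    msg['From'] = Address(sender_name, addr_spec=sender_addr)
    msg['To'] = Address(recipient_name, addr_spec=recipient_addr)
    msg['Subject'] = subject
    msg.set_content(body)
    return msg

# Sending would then look like this (not executed here):
#     import smtplib
#     with smtplib.SMTP('localhost') as smtp:
#         smtp.send_message(msg)
```

Non-ASCII header values are RFC 2047 encoded when the message is written out, and a body containing non-ASCII characters is labelled as UTF-8.
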
posted at 02:29