a blog by Marius Gedminas

X.org and stuck Ctrl

Warning: this is going to be a long story with a semi-happy ending. Skip it unless you enjoy tales of woe and debugging.

So, I open up my laptop, plug in a USB keyboard and mouse, start up Inkscape and start fooling around. Suddenly I notice strange behaviour:

  • I cannot select two objects in Inkscape by holding down Ctrl or Shift and clicking -- only the last object I clicked on becomes selected.
  • Menus don't pull down, although the highlighted bar follows my mouse
  • I cannot type any text into text boxes
  • Moving the mouse into a screen corner doesn't trigger the Expose-like effect
  • I cannot drag windows around by grabbing the title-bar
  • I cannot drag windows around by alt+dragging
  • Dragging the mouse inside an xterm creates a vertical selection

That last hint seems to indicate that the Ctrl key could be stuck. I try pressing it and releasing, then the other one, then both Ctrl keys on the USB keyboard. Surely, if X sees a press and a release event for each Ctrl it will realize none of them are still down? No such luck. I try to unplug the USB keyboard and mouse next. No results.

This is not the first time this has happened to me. Previously I found no exit out of this state other than killing X.org with (great pleasure and) Ctrl+Alt+Backspace. Surely there must be a better way?

I ssh in and start xev. MotionNotify events have state 0x4, which is control, I think. By the way, I only see mouse events in xev, keyboard events don't make it. The keyboard itself is alive at some level, as Caps Lock turns on its LED, and Ctrl+Alt+F1 gives me a (garbled and unusable) text console. Holding down Alt or Shift doesn't change the state of events seen by xev, though.
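The state bits in that mask are the standard modifier masks from X11's X.h, so they can be decoded mechanically. A quick sketch (decode_state is my own helper, not part of any X library):

```python
# Standard X11 modifier masks from X.h; decode_state is my own helper,
# not part of any X library.
X_MODIFIER_MASKS = [
    ("Shift",   1 << 0),
    ("Lock",    1 << 1),
    ("Control", 1 << 2),
    ("Mod1",    1 << 3),  # usually Alt
    ("Mod2",    1 << 4),  # usually Num Lock
    ("Mod3",    1 << 5),
    ("Mod4",    1 << 6),  # usually Super
    ("Mod5",    1 << 7),
]

def decode_state(state):
    """Return the names of the modifiers set in an X event state mask."""
    return [name for name, mask in X_MODIFIER_MASKS if state & mask]

print(decode_state(0x4))  # the state xev keeps showing: ['Control']
```

State 0x4 decodes to Control and nothing else, which fits every symptom above.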

Did my keyboard map get lost? Is some X client grabbing all the keys? How do I recover?

I use x2x to connect to the laptop from my desktop. I can now use my desktop's mouse and keyboard to control the laptop. x2x works by injecting X events via XTest. I see the mouse events x2x injects, but xev again shows nothing on the keyboard front.

I try to guess which X client might have the keyboard grab (if there is one). I killall compiz. I'm surprised that gnome-session doesn't restart it (or spawn a different window manager in its place). I start metacity manually. The problem is not gone. I kill metacity and start Compiz again.

I have a VNC server running (vino), but I've forgotten its password.

I notice a weird message in my xev log:

KeymapNotify event, serial 40, synthetic NO, window 0x0,
    keys:  4294967195 0   0   0   32  0   0   0   0   0   0   0   0   0   0   0   
           0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   

Huh? 4294967195 keys? That looks like an unsigned 32-bit int underflow. I scroll back and find the first KeymapNotify event seen by xev:

KeymapNotify event, serial 25, synthetic NO, window 0x0,
    keys:  0   0   0   0   32  0   0   0   0   0   0   0   0   0   0   0   
           0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   

Looks normal. Then I notice this at the end of the xev log:

FocusOut event, serial 40, synthetic NO, window 0x3e00001,
    mode NotifyWhileGrabbed, detail NotifyNonlinear

What does "NotifyWhileGrabbed" mean? How do I find the rogue app and kill it? xrestop shows me 35 X clients; do I just kill them one by one until the problem disappears? Some of those clients are <unknown> and show no PID.
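That 4294967195 in the KeymapNotify output, by the way, decodes suspiciously neatly if you assume a signed char got sign-extended to 32 bits somewhere on its way to xev's printf. This is pure speculation on my part, but the bits line up:

```python
# 4294967195 looks like a signed char that was sign-extended to 32 bits
# before being printed as unsigned -- speculation, but the bits line up.
value = 4294967195
print(hex(value))        # 0xffffff9b
signed = value - 2**32 if value >= 2**31 else value
print(signed)            # -101, i.e. the same bits read as a signed int
print(value & 0xFF)      # 155 == 0x9b, the byte the keys[] vector held
```

So the keys[] byte itself was probably an unremarkable 0x9b, and only the printing was off.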

I suspend the laptop (which is a very stupid idea when your other machine's mouse and keyboard are redirected to it via x2x) and resume it, hoping that gnome-screensaver will somehow overpower the existing application's grab with its own. Gnome-screensaver is nowhere in sight. At least my x2x connection comes back to life and I have my keyboard & mouse back on the desktop.

I notice that I can no longer make any kinds of selections in gnome-terminal. Why? Window focus no longer follows mouse. xev no longer sees mouse motion events. It sees a couple of MappingNotify events when I plug in a (different) USB keyboard, though.

I killall gnome-screensaver (which was invisible, remember?) and can now again see motion events in xev and select windows with the mouse.

I start randomly killing applications. gnome-power-manager. nautilus. inkscape. firefox. gtk-window-decorator. gnome-panel. notification-daemon. gnome-terminal. pulseaudio (no reason). vino-server. update-notifier. seahorse-agent. gnome-keyring-daemon. bluetooth-applet. fast-user-switch-applet. multiload-applet-2. mixer_applet2. system-config-printer. gnome-settings-daemon.

And the system becomes ugly (no GNOME theme) but alive. I get a second xev window from an earlier attempt to type 'xev' in a terminal that was unable to receive key events.

Of course, since I killed all the actual applications and half of the necessary support programs my session is now useless, so I'll have to log out and log back in again. But at least in the future I'll know: when something like this goes wrong, killall gnome-settings-daemon.

Now I'd like to report a bug (the thought of hurling a brick through the responsible developer's window never crossed my mind, honest!), but without a reliable way of reproducing the problem will it be of any use?

I restart gnome-settings-daemon and it promptly invokes xrandr to set up a dual-head mode that confuses compiz. By "confuses" I mean it displays a rotating cube in the top-left 1280x700 area of my 2560x1024 extended desktop, filling the rest with whatever was in video memory the last time I had a dual-head mode.

I try to start up Firefox on my desktop and put a link to the relevant bug, but Firefox quietly refuses to start up. Well, 'Segmentation fault' at the end of ~/.xsession-errors is quiet, isn't it? Thankfully, 'firefox http://someurl' for some reason works and opens a window. I cannot find the Compiz bug I remember (could it have been #135418?), but this looks like a better fit anyway: #317431 (and #206998 might be a duplicate). And here's my gnome-settings-daemon bug: #335201.

This is all on Ubuntu 8.10 (Intrepid Ibex).

Some days I just hate Linux. Then I remember that it's worse on other systems...

Largest apps on my laptop #2

I recently posted some data about applications taking up the most RAM on my laptop. That was after 9 days of uptime, while this is after 12 hours:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16033 root      20   0  527m 109m  11m S    3  5.5   6:52.82 Xorg
17834 mg        20   0  244m 105m  24m S    6  5.3   3:26.49 firefox
26425 mg        20   0 96872  58m  12m S    0  2.9   0:01.71 evince
16747 mg        20   0 90704  54m  19m S    0  2.8   0:10.70 tomboy
27169 mg        20   0  166m  53m  23m S    0  2.7   0:07.05 banshee-1
27167 mg        20   0 77392  31m  17m S    0  1.6   0:01.16 pidgin
16706 mg        20   0 86508  27m  18m S    0  1.4   0:35.35 gnome-panel
16708 mg        20   0 75196  20m  14m S    0  1.0   0:02.80 nautilus
20092 mg        20   0 61880  19m  11m S    1  1.0   0:07.08 gnome-terminal
16614 mg        20   0 58456  15m 9836 S    0  0.8   0:09.82 gnome-settings-

I don't have GNOME Do any more, and Evince has only one of the two PDFs open. I don't see multiload-applet on the first page of top output, which seems to confirm a slow leak there. The slow-leak comparison doesn't quite apply to Banshee or Pidgin, but Pidgin's numbers are quite striking anyway (from 70 megs VIRT to 1.6 gigs VIRT in 9 days; thankfully RES only grows 2x during that time).
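The Pidgin leak rate is easy to ballpark, assuming roughly linear growth and taking the 77392K VIRT from the 12-hour snapshot above as the starting point:

```python
# Rough leak rate for Pidgin's VIRT, assuming linear growth over a
# 9-day session; the starting point is the 77392K figure from the
# 12-hour snapshot above.
virt_start_mb = 77392 / 1024  # about 75.6 MB
virt_end_mb = 1652            # the 1652m top reported after 9 days
days = 9
rate = (virt_end_mb - virt_start_mb) / days
print(round(rate))            # roughly 175 MB of VIRT per day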

OS: Ubuntu 8.10, up-to-date with all the updates from -security, -updates, -proposed-updates and -backports.

Incidentally, I have 12 hours of uptime because my battery died while the laptop was suspended during my flight back home (either that, or it woke up in the backpack, which is a scary thought). Apparently Ubuntu tried to hibernate when the battery was very low, which was a nice gesture. It didn't work out so well when resuming, since the kernels didn't match -- I had installed a kernel update, but hadn't rebooted. I don't think I've ever used hibernation successfully on Linux.

Largest apps on my laptop

After reading Alexander Larsson's post on de-bloating nautilus I thought it would be interesting/useful to see what apps are eating my RAM, as a statistical data point if nothing else.

OS: Ubuntu 8.10. Uptime: 9 days, 20:45 (desktop session also started 9 days ago; I suspend and never log out). Top ten apps, according to top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29594 mg        20   0  472m 269m  32m S    3 13.6  23:14.38 firefox
 7383 root      20   0  591m 182m  14m S    4  9.2 557:37.73 Xorg
 5199 mg        20   0  114m  67m  13m S    0  3.4   0:35.71 evince
17256 mg        20   0  230m  64m  24m S    0  3.3   4:10.93 banshee-1
29306 mg        20   0 1652m  62m  19m S    0  3.1   7:38.97 pidgin
 8060 mg        20   0  105m  59m  17m S    0  3.0   1:23.20 tomboy
 8015 mg        20   0  137m  42m  20m S    0  2.2  41:55.95 gnome-panel
 8017 mg        20   0  138m  41m  16m S    0  2.1   2:26.75 nautilus
 8063 mg        20   0 45876  30m 8788 S    0  1.6  33:22.14 multiload-apple
 8087 mg        20   0 80708  28m  15m S    0  1.4   4:45.49 gnome-do

Update: compare these numbers with what I get after just 12 hours of uptime.

Firefox started leaking memory quite rapidly lately, possibly after I upgraded to 3.0.6+nobinonly-0ubuntu0.8.10.1 exactly a week ago. I have to restart it once a day if I don't want my 2 gigs of RAM to fill up completely. I hadn't needed to do that before; memory usage stayed pretty constant.

I cannot explain the X.org numbers. pmap doesn't show me RSS numbers, but 150 megs of VIRT are attributed to the heap (an anonymous read-write mapping), while 256 megs look like the frame buffer ("resource2"). xrestop sees a total of 21027K of resources (pixmaps etc.) attributed to all the clients. I have a vague suspicion that this number doesn't include the OpenGL textures used by Compiz, but I'm pretty clueless about those things. compiz --replace reduces Xorg's RSS by 9 megabytes and increases VIRT by 5 megabytes. The increase is mirrored by xrestop, which now shows 25652K in total.
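While pmap doesn't break out RSS here, /proc/<pid>/smaps does report it per mapping, so you can sum it yourself. A sketch (sum_rss is my own helper, and the sample is a shortened, made-up smaps excerpt):

```python
# Sum the per-mapping Rss: lines of /proc/<pid>/smaps output (kilobytes).
# sum_rss is my helper; the sample is a shortened, made-up excerpt.
def sum_rss(smaps_text):
    total_kb = 0
    for line in smaps_text.splitlines():
        if line.startswith("Rss:"):
            total_kb += int(line.split()[1])
    return total_kb

sample = """\
08048000-08056000 r-xp 00000000 08:01 1234   /usr/bin/something
Rss:                 56 kB
b7e00000-b7f00000 rw-p 00000000 00:00 0      [heap]
Rss:               1024 kB
"""
print(sum_rss(sample))  # 1080
```

Pointing the same loop at Xorg's real smaps file would show which mappings actually hold the resident memory.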

I have two PDFs open in the background, since I intend to read them within the next couple of days; this explains the Evince numbers.

I'm not happy about the Banshee memory usage. I wouldn't mind that much if it didn't insist on minimizing to the system tray.

I'm even less happy about Pidgin. Banshee at least has the excuse that it is built on top of Mono, which adds a whole new runtime & virtual machine. And why on Earth does a chat program need 1.6 gigs of virtual memory?

Tomboy: Mono again. But Tomboy is a killer application and I need it.

GNOME Panel: self-explanatory. The multitude of applets that I think I need fills up both panels and slows down login times as well.

Nautilus: pretty consistent with Alexander's numbers, looks like I can expect improvements in Ubuntu 9.04 or at least 9.10.

Multiload applet: looks like I shouldn't have blamed GNOME Panel's memory usage on applets. 30 megs RSS for four ticking graphs seems a bit biggish, but maybe memory fragmentation is at fault.

GNOME Do: people keep blogging about its coolness; then I install it, try it out twice, and forget about it. The default GNOME Run dialog suits what I need (and am used to) better.

Keep the buildbot green!

This post started as a comment to Michael Rooney's question: Failing tests: When are they okay?, and then it became a bit too long for a comment.

[picture of a green traffic light -- Green light by morberg, cc:by-nc]

For me the most important aspect of a build is to accurately represent my knowledge about the health of the product. New problems must be noticed as soon as possible. This won't happen if the developers are used to seeing (and ignoring) broken builds.

For this reason you want to distinguish known failures from unknown failures. For example, it's okay to commit a test that reproduces a bug even if you don't have a fix for that bug yet, but do it in a way that keeps the buildbot green. (Two common ways of doing that are marking the test in a special way so the test runner knows it's expected to fail, or disabling the test so that it doesn't run at all.) The worst thing ever is fragile tests that fail only some of the time, especially if everyone grows accustomed to them. I speak from experience. I still have nightmares...
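In newer versions of Python's unittest, for instance, both approaches look roughly like this (frobnicate is a made-up stand-in for the buggy code under test):

```python
import unittest

def frobnicate(s):
    # Made-up stand-in for the buggy code under test: it should
    # upper-case its argument, but doesn't.
    return s

class BugReproducers(unittest.TestCase):

    @unittest.expectedFailure
    def test_frobnicate_uppercases(self):
        # Reproduces the bug; the runner counts this as an *expected*
        # failure, so the build stays green until someone fixes it.
        self.assertEqual(frobnicate("x"), "X")

    @unittest.skip("disabled until the frobnicate bug is fixed")
    def test_frobnicate_uppercases_disabled(self):
        # The blunter alternative: don't run the test at all.
        self.assertEqual(frobnicate("x"), "X")
```

Either way the suite stays green while the known failure remains visible in the test report.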

Collaboration is not reason enough to break the trunk. You can use branches or send patches via email, whichever works best. Patches are often simpler when you're taking over someone's unfinished work because they got stuck and asked for help, or when you switch machines while pair-programming. Sometimes I use shell one-liners like 'ssh othermachine svn diff /path/to/source/tree | patch -p42' to get the changes into my checkout. Branches are more appropriate for longer-term collaboration. It's perfectly fine to have a broken test suite on a branch -- you can always discard the branch; that's what you do to prototypes. Reimplementing something you've already done, in a cleaner fashion, is often a simple and rather pleasant way of merging.

If the tools you have aren't polished enough and you don't feel comfortable creating new branches even when they're necessary, invest a day every now and then in improving your tools (shameless plug: eazysvn, because eazysvn switch -c newbranch does not require you to lose your train of thought while remembering how to type long Subversion URLs for svn cp).

That's all theory; in practice, IMHO, it's acceptable to take shortcuts. Small self-contained checkins are best (this topic deserves a blog post of its own), but if you're forced to wait 20 minutes for the full test suite before every one of them, you won't make small checkins. It's fine to run just the subset of tests covering the code you've changed before every checkin, even if that means you'll sometimes break the build by accident. However, it's your responsibility to clean up any breakage before you leave at the end of the day (or at least to feel guilty when you don't).

Back to the original question: I can imagine only one set of circumstances where the right thing to do is to knowingly commit a broken test to trunk. Imagine that you discovered a show-stopper bug, but the fix is elusive. By committing a failing test you force the whole team to notice it, drop everything else and work on the problem. And you also prevent somebody from accidentally releasing a broken version of the product. (Your release process includes a step ensuring that all the tests pass, right?)

Playing with Pylons

I've an opportunity to get to know Pylons. Here's an unsorted list of first (and second) impressions:

  • Pylons has great documentation, though I did stumble upon a few broken links
  • Pylons has a great development environment (instant and automatic server restarts; interactive Python console in your web browser on errors)
  • It seems that nobody using Paste is interested in logging the startup and shutdown time of the web server
  • SQLAlchemy overwhelms with TMTOWTDI
  • zc.buildout can be replaced by a 4-line shell script using virtualenv and easy_install; this will save you headaches
  • setuptools is made of pure craziness, but we can't live without it

These aren't directly related to Pylons:

  • distributed version control systems are great for throwaway prototypes (especially when you want to compare several ways to do it)
  • non-distributed version control systems aren't
  • py.test is weird and takes some getting used to, but has some nice properties as a test runner; shame about breaking compatibility with unittest
  • automated functional tests for system deployment in a freshly cloned Xen virtual machine are cool, albeit slow-ish

Update: About the naive notion that using easy_install instead of zc.buildout would help me avoid headaches? Muahahahahaha. Ha. Haha. Muahhaaaaaa. Wrong.

Also, TMTOWTDI is maybe too strong a word for SQLAlchemy's plethora of choices. And you really want to be using 0.5. And Pylons is even more awesome than I first thought. Obligatory grain of salt (*thud*): I haven't finished writing my first page yet. Integrating new stuff into existing elaborate functional test suites takes time.