Finding and Fixing a Five-Second Stall

2012-12-17 — The Witness

By Casey Muratori

The great thing about programming on Windows is that it is the only commercially viable platform where you can ship software to users without getting approval from a giant bureaucracy (well, perhaps I should say it used to be). The not-so-great thing about programming on Windows is that, well, the Windows API is a horrific nightmare.

Granted, it is not uniformly horrific. It ranges from only slightly scary (xinput, comdlg32) to full psycho (DirectShow, Event Tracing, TabletPC, etc.). But no matter which part of the spectrum you encounter, programming on Windows tends to be about doing a lot of unnecessary research, experimentation, and debugging due to a confluence of poorly designed APIs.

Since time spent debugging Windows issues isn’t often chronicled, I thought I’d go ahead and describe my experience this Saturday of tracking down a Windows bug in The Witness.

The Symptom

Apparently, not everyone on The Witness team was seeing the symptom. But trust me, it was there.

On both Jon’s machine and mine, there was a period during the initial loading of the game where you couldn’t do things like ALT-TAB back to the debugger. If you set a breakpoint then ran the game, when the breakpoint eventually hit and the debugger came to the front, it would wait about five seconds before it allowed you to actually interact with it. Sometimes, in rare cases, it would go into some sort of psycho mode whereby mouse movements stuttered, and the computer would literally beep at you as the mouse moved, for several seconds. Then everything would return to normal.

While not technically a showstopper, this is the kind of bug that’s worth fixing if you can, because it costs you real development time every day. Five seconds wasted every breakpoint is bad news, and people often underestimate the effects of frustration that can build up over time due to flaky and unpleasant development environments. So I decided to try to track it down.

Because Windows is a closed-source platform, there’s no great way to track down problems of this nature. However, having programmed on Windows for over twenty years now, I have a nose for it, so I felt like I knew the best place to start: since there was definitely a UI blackout somewhere during startup, there had to be something occurring early in the execution of the program that was causing at least that particular symptom. So before doing anything else, I started stepping through the entire startup sequence of the program to see if anything out of the ordinary struck me.

I very quickly came upon this code:

// @@ The hook appears to install properly, but is never invoked :(

keyboard_hook = SetWindowsHookEx(WH_KEYBOARD_LL, LowLevelKeyboardProc, GetModuleHandle(NULL), 0);

Now, I don’t want to make it sound like the “find the source of the problem” phase of Windows debugging is straightforward. It usually takes a very long time, sometimes multiple days. But as luck would have it, this was literally the first piece of code I suspected, and it turned out to be the culprit.

I commented out the SetWindowsHookEx, ran the program, and poof! All the symptoms were gone.

The Problem

From experience, I’ve learned that it’s always best to fully understand a problem before you fix it. If you just patch over its symptoms but never figure out what the problem really was, it will often come back to haunt you.

In this context, that meant two things: first, I should figure out specifically why the keyboard hook was causing the UI stalls; second, I should determine why the code had been trying to call SetWindowsHookEx in the first place, even though apparently it never quite worked (according to the “is never invoked” comment).

The first part turns out to be obvious in hindsight, but it took me a little while to put all the pieces together and be sure I had a solid explanation for what we were seeing. As with any puzzle, looking at the finished picture is easy, but figuring out what that picture might be when all you have are little pieces coming in one at a time is much more difficult. Here’s the best picture I could come up with, but of course there’s no way to completely verify it external to Microsoft:

Normal Windows hooks are pieces of code which execute in the address space of the eventual message recipient, not the installer of the hook. It is for this reason that, normally, if you want to install a global hook, you must put the hook code in a DLL, because that DLL must be mapped into the address space of every running executable on the machine that can receive Windows messages.

But low-level hooks are completely different. A low-level hook like WH_KEYBOARD_LL is a piece of code that resides in only the process that registers the hook. Windows remembers which thread registered the hook, and when any other executable is about to have a hooked message posted to its message queue, Windows actually waits until it can switch contexts back to the hook thread, run the hook in that context, and then finally deliver the message.

This leads directly to the symptom that we were experiencing. When the game is loading, its main thread isn’t yielding to Windows frequently like it is during normal play. Since the main thread is the one that registered the hook, Windows must wait for that thread to finish all the load-time work it was doing and call a function which yields to Windows in order for Windows to actually call the hook procedure. Since no keyboard processing can occur until the hook procedure has run, the keyboard becomes unresponsive as messages back up waiting to be processed through the hook. This completely explains the UI problems during startup, and fits with the oddity that the mouse never seemed to be affected, just the keyboard.

When the game hit a breakpoint, the keyboard hook couldn’t execute at all because the debugger had halted the thread required to execute it. This would have meant that keyboard input would be completely halted, and the debugger would never be able to receive keystrokes, if it weren’t for the fact that Windows actually uses a timeout on hook calls. If the timeout value is exceeded waiting for the hook thread to become available, the hooking executable is assumed to have crashed, and it will silently remove that hook and continue with normal operation.

That timeout value? Five seconds, right in line with our observed pause (technically, it’s whatever is set in the registry under HKEY_CURRENT_USER\Control Panel\Desktop\LowLevelHooksTimeout, in milliseconds, but the default is five seconds).
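To make the two cases concrete, here is a toy model of the behavior described above. It is only a sketch of the reasoning, not how Windows is actually implemented; the names HookOutcome, model_hook_dispatch, and DEFAULT_HOOK_TIMEOUT_MS are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not a Windows API. */
#define DEFAULT_HOOK_TIMEOUT_MS 5000u

typedef struct {
    uint32_t input_delay_ms; /* how long a key event waits before it is delivered */
    bool hook_ran;           /* did the hook procedure get to see the event?      */
    bool hook_removed;       /* did the system give up and drop the hook?         */
} HookOutcome;

/* thread_busy_ms: how long the hook-owning thread takes to get back to
 * pumping messages (load-time work, or "forever" when stopped in a debugger).
 * timeout_ms: the low-level hook timeout in effect. */
HookOutcome model_hook_dispatch(uint32_t thread_busy_ms, uint32_t timeout_ms)
{
    HookOutcome out;
    if (thread_busy_ms < timeout_ms) {
        /* The thread yields in time: input is merely delayed. */
        out.input_delay_ms = thread_busy_ms;
        out.hook_ran = true;
        out.hook_removed = false;
    } else {
        /* The thread never yields in time: input stalls for the full
         * timeout, then the hook is silently removed. */
        out.input_delay_ms = timeout_ms;
        out.hook_ran = false;
        out.hook_removed = true;
    }
    return out;
}
```

In this model, a thread that is busy loading for 300 ms delays each keystroke by 300 ms, while a thread halted at a breakpoint costs the full five seconds once and then loses its hook.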

So, no mystery as to the pathology. But what about the second part? What was the code actually trying to do?

The Windows Logo Keys

Although SetWindowsHookEx might seem like an odd call to make in a game executable, it’s actually quite common. It’s there to prevent the Windows logo keys on modern keyboards from ruining full-screen games. Because the keys are placed in a location that’s easy to hit by mistake, many gamers would accidentally hit them, causing their games to be instantly deactivated by Windows in favor of bringing up the start menu. So widely acknowledged was this problem that, when these keys were first introduced, Microsoft itself published a recommended work-around. From Disabling Shortcut Keys in Games:

This article describes how to temporarily disable keyboard shortcuts in Microsoft Windows to prevent disruption of game play for full screen games… Use a low-level keyboard hook to filter out the Windows key from being processed.

The article goes on to show sample code that is basically the same code that was being used in The Witness. It’s a simple SetWindowsHookEx call with WH_KEYBOARD_LL whose hook routine does nothing but block VK_LWIN and VK_RWIN keys from being processed. The only difference between the code in The Witness and the code in the sample is that The Witness would conscientiously remove the hook when it no longer had focus, and then reinstall the hook when focus was regained. The sample code, by contrast, just set a global variable.
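For reference, a hook routine of this kind might look roughly like the sketch below. This is neither the code from The Witness nor Microsoft's sample verbatim. The helper should_swallow_key and the global g_game_has_focus are hypothetical names, and the fallback key-code definitions are only there so the decision logic compiles off Windows.

```c
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Virtual-key codes copied from winuser.h so the decision logic
 * also compiles on non-Windows machines. */
#define VK_LWIN 0x5B
#define VK_RWIN 0x5C
#endif

/* Hypothetical helper: swallow the key only if it is one of the
 * Windows logo keys and the game currently has focus. */
bool should_swallow_key(unsigned vk_code, bool game_has_focus)
{
    return game_has_focus && (vk_code == VK_LWIN || vk_code == VK_RWIN);
}

#ifdef _WIN32
/* Hypothetical flag, assumed to be updated by the window procedure
 * whenever the game gains or loses focus. */
bool g_game_has_focus = false;

LRESULT CALLBACK LowLevelKeyboardProc(int nCode, WPARAM wParam, LPARAM lParam)
{
    if (nCode == HC_ACTION) {
        const KBDLLHOOKSTRUCT *key = (const KBDLLHOOKSTRUCT *)lParam;
        if (should_swallow_key(key->vkCode, g_game_has_focus)) {
            return 1; /* nonzero return: the event goes no further */
        }
    }
    /* Everything else is passed along untouched. */
    return CallNextHookEx(NULL, nCode, wParam, lParam);
}
#endif
```

Keeping the decision in a small pure function makes it easy to check without installing a hook at all.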

So if this was what the code was doing, why the frowny face? If you look back at the comment, it states very clearly that the hook was never being invoked, even though no error message came back from Windows. And, although I normally remap my Windows keys via the registry, I went ahead and mapped them back to see if The Witness was properly blocking their effects.

Much as the frowny face foretold, it wasn’t.

Obeying the Static Discipline

I don’t know anything about digital circuit design, but I once watched a lecture by Gill Pratt where he explained something called “the static discipline”. In essence, it is the requirement that any component in a circuit must accept voltages within a certain range for “1” and “0”, and then must output its own results in a certain range. These ranges are defined so as to require that components always produce equal or better conditioned voltages than their inputs, thus ensuring that the digital signals being propagated through the circuit don’t degrade into noise due to lots of tiny losses.

For some reason, I took this concept to heart in a programming sense and have found it is a good rule to code by. My version of the static discipline, adapted for software, is that whenever you are making a modification to a piece of code, you should always leave it in a state of stability equal to or better than how you found it. And preferably the latter.

All too often, people go in to fix a problem or add a feature to a piece of code, and they just hammer on it until it does that one new thing. The resulting code is then usually more fragile, less well designed, more unnecessarily complex, etc. To try to prevent myself from having this effect, I try to observe the static discipline.

And so it was with this Windows key situation. I could see what the code was supposed to be doing, and although I could reasonably fix my bug by just commenting out the hook (because it didn’t work), I felt it was the more disciplined thing to do to figure out why the hook wasn’t working, and to implement something that did what it was trying to do. Since there was no obvious bug, I had to start experimenting.

After adding some debug outputs and running Spy to keep an eye on the window messages, the first thing I tried was removing the part of the code that unregistered the hook when the window lost focus. Since I didn’t know exactly how the windowing code worked at all levels, I figured it wasn’t entirely out of the question that the hook could be getting disabled prematurely.

Although that turned out not to be the case, this experiment did pay off unexpectedly (as many do): although the keyboard hook still never got called when The Witness window had focus, the hook did start getting called when it lost focus. Yes, oddly enough, the hook was working just fine for everyone else’s window, but on the one window where it actually needed to work, it wasn’t working.

This led to the obvious question: what was The Witness doing differently?

Raw Input

This was a case where having lots of Windows programming experience probably saved a lot of time. Instead of having to experiment blindly, I knew right off the bat that Windows Raw Input probably had something to do with the situation. I’ve debugged lots of Windows front end code in the past, and any time Raw Input, DirectInput, XInput, or anything else with the word “input” in it is involved, you know that things are going to be going a little haywire.

I commented out the device registration call for Raw Input, and lo and behold, the hook started working for The Witness window, no problem. So, although I was completely unable to find any documentation on MSDN that discussed what was going on under the hood, it must be the case that Raw Input processing happens before low-level keyboard hook processing in such a way as to prevent the keyboard hook processing from ever seeing keyboard messages on a Raw Input-registered window. Normally, this wouldn’t be a problem, but in our case, the Microsoft-recommended fix for the Windows logo key problem relies on keyboard hooks. The Witness needs Raw Input to get good low-level relative mouse movements, but it also needs to disable the Windows logo keys; sadly, this is the kind of situation that is all too routine for anyone who programs Windows professionally.

I couldn’t think of any plausible ways that I could convince Raw Input to start using the Windows keyboard hook, since I don’t have access to any of the Windows source code and I have no idea how it works internally. So I focused my efforts on Raw Input.

I experimented with just about everything I could think of. I tried not calling DefWindowProc on WM_INPUT calls, to see if it was the default window processing that was using the Windows logo key events to switch to the start menu. No such luck. I tried using RIDEV_NOLEGACY during Raw Input initialization to see if preventing keyboard messages from being generated would solve the problem. It didn’t. After many failed attempts, and finding no relevant information on the web, I decided to call it a night and sleep on it.

When I woke the next morning, for some reason I remembered the most important thing anyone programming Windows must remember: never, ever trust the documentation. It’s always either inaccurate or incomplete. So when I sat down at my keyboard, the first thing I did was try using the RIDEV_NOHOTKEYS flag when initializing Raw Input.

You see, the previous day I hadn’t bothered to try this flag, because it explicitly says in the documentation that it does not block system-level hotkeys, only application-level hotkeys (like the kind you create yourself with RegisterHotKey). But why was I believing that? That was just something somebody wrote down when they needed to create a documentation page. They probably never even saw the actual code for Raw Input in the first place.

Surprise surprise, RIDEV_NOHOTKEYS worked. After all the fussing, all that had to be done to make The Witness impervious to Windows Logo keys was one little flag during Raw Input initialization. No keyboard hook necessary.
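As a rough sketch of what that looks like, assuming the flag is applied to a keyboard registration (the actual registration call in The Witness isn't shown here): the helper names keyboard_raw_input_flags and register_keyboard_raw_input are hypothetical, as are the KEYBOARD_USAGE macros, and the fallback flag definition is only there so the flag logic compiles off Windows.

```c
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Flag value copied from winuser.h so the flag logic also
 * compiles on non-Windows machines. */
#define RIDEV_NOHOTKEYS 0x00000200
#endif

/* HID usage page 0x01 (generic desktop), usage 0x06 (keyboard). */
#define KEYBOARD_USAGE_PAGE 0x01
#define KEYBOARD_USAGE      0x06

/* Hypothetical helper: compute dwFlags for the keyboard registration. */
uint32_t keyboard_raw_input_flags(int block_logo_keys)
{
    return block_logo_keys ? (uint32_t)RIDEV_NOHOTKEYS : 0u;
}

#ifdef _WIN32
/* Hypothetical wrapper: register the keyboard for Raw Input on `window`. */
BOOL register_keyboard_raw_input(HWND window, int block_logo_keys)
{
    RAWINPUTDEVICE device;
    device.usUsagePage = KEYBOARD_USAGE_PAGE;
    device.usUsage     = KEYBOARD_USAGE;
    device.dwFlags     = keyboard_raw_input_flags(block_logo_keys);
    device.hwndTarget  = window;
    return RegisterRawInputDevices(&device, 1, sizeof(device));
}
#endif
```

Since the documentation claims this flag does not affect system hotkeys, treat the behavior as something to verify on the Windows versions you ship on rather than as a guarantee.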

Zero-Sum Game

It’s worth taking a step back for a moment and asking the question, why did all of this happen? Why did I have to spend several hours working on code that has nothing to do with my project, is not valuable in any intrinsic way, and whose specifics are relevant only in a fleeting temporal sense (there will be no need to know anything about RIDEV_NOHOTKEYS when the next user input model for Windows is inevitably introduced)?

The answer is the same on Windows as it is on most every other platform that exists today: the platform providers just don’t spend real resources designing and improving their APIs. If there had simply been an API call, WindowIgnoresWindowsLogoKeys(HWND), then everyone would always just call that, and the platform developers and Microsoft could field it and do the right thing, no matter how many times they revised their platform and its infrastructure.

Better yet, way back when they first started catering to games with Windows 95, they could have introduced a call like BeginFullScreenGame(), and that could have done literally everything for you: initialize the window, set up the input stream, prepare the sound device, etc. Then they wouldn’t have had to ever worry about the Windows logo keys at all, because they would have been automatically disabled in any application that initialized itself with that call.

But instead of any of these things, Microsoft chose to ship a new feature (the Windows logo hotkey) without ever thinking about the consequences. Then they patched over its problems by telling developers to install a keyboard hook (SetWindowsHookEx), something that has nothing to do with the actual problem but which was the most expedient way for them to get the problem solved without doing any real work on their platform. Then they broke that method, which they’d previously advocated, when they introduced Raw Input, so (presumably) they introduced a new way to do the same thing (RIDEV_NOHOTKEYS), although they failed to document it. Then they didn’t even bother to publish an obvious tech article explaining this change, or even revise their old documents saying that SetWindowsHookEx was the preferred solution.

As broken as all that sounds, it is absolutely par for the course. On every project, in every discipline, developers just like me spend hundreds of hours doing nothing but diagnosing and repairing problems with the platforms on which we ship our software. It may sound like it’s just “part of the job”, but the pernicious thing about it is that development resources are always finite. External to the platform, development is always a zero-sum game. The hours I spent tracking down this bug were hours that won’t get spent making parts of The Witness better, and there’s nothing I or the team can do about that. If we want to ship a polished project, it’s a sacrifice we have to make.

And it’s a sacrifice that’s not just made by us, but by thousands of teams on thousands of projects every year. The cost of obtuse, easy-to-misuse APIs is multiplied across all these projects, turning the hundreds of hours we will spend on The Witness platform work (across multiple platforms) into hundreds of thousands of hours when you consider that this same work has to be done by every other developer, too. It’s really thousands of zero-sum games being played, all across the industry, and the end result is universally worse software because the underlying platforms take up too much effort.

But there is a way for this not to be a zero-sum game, and that’s if there’s a shift in thinking internal to the platform providers. If Microsoft, for example, started realizing that software on Windows gets better if their APIs are clear, concise, and difficult to misuse, they might start changing their development processes to make API quality a priority. They might try to hire and cultivate good API designers, and have real, talented developers who do nothing but work on the APIs to the internal systems.

If that happened, the relatively small amount of hours spent on this endeavor internal to Microsoft (or Sony, or Apple, or Nintendo, or Google) would have a massively leveraged effect. If nobody has to think about things like the Windows keys again, that’s four hours right there that every software project gets back. That is literally thousands of hours, just that. Don’t need to worry about task switching or window minimization? That’s a day, at least — which makes it thousands of days saved. Straightforward user input? Easy access to camera inputs? Each little thing adds up until every developer is getting weeks of their schedule back to devote to working on their actual software, not to mention all the bugs that wouldn’t slip through the cracks because the developers may not have been experienced enough with the particular platform to know to look for them.

So how do we get there? Honestly, I have no idea. In the meantime, I guess we just keep writing articles like this one that detail the right solutions as we find them, and hope that saves everyone some of the vast amount of time that we all lose to platform-related problems each year. Beyond that, it’s really up to the platform providers to start taking these issues more seriously.