Fixing Multi-monitor and Multi-window OpenGL on Windows

2016-02-06 — API Design

By Casey Muratori

As it stands, it’s really nasty to do multi-monitor and multi-window output on Windows OpenGL.

In order to output OpenGL-rendered graphics on Windows your program needs to create an OpenGL “context” using the Windows OpenGL interface layer, wgl. While the concept of an OpenGL context was originally intended to mean “the state of OpenGL for this thread of execution,” it also, perhaps by accident, happened to mean “the OpenGL stream that targets this particular Windows device context.”

These two concepts were coupled because the wgl call used to create the context is implicitly paired with the GDI call later used to display the results of OpenGL rendering:

HGLRC WINAPI wglCreateContext(HDC hdc);

BOOL WINAPI SwapBuffers(HDC hdc);

As you can see, an OpenGL context at creation time is targeting a specific device context, and when GDI later wants to display the results, it uses that specific device context to know which OpenGL stream you’re talking about.

Later, the replacement way of creating a context for OpenGL 3.x and later kept the same pairing:

HGLRC wglCreateContextAttribsARB(HDC hDC, HGLRC hShareContext, const int *attribList);

So even though OpenGL has been expanded dramatically since the days of the original wglCreateContext, this coupling has been retained.

This causes a nasty problem for multi-target output.

With the benefit of hindsight, I think it’s fair to say that this implicit behavior was always a bad idea. It just never caused much of a problem, because most early OpenGL usage involved a single thread targeting a single window on a single monitor. Since this trivial scenario only involves one device context, there was never an issue.

However, now that it is typical for modern applications to be multi-threaded and target multiple windows on multiple monitors, the implicit device context targeting becomes a real issue.

Let’s start with the common case.

The most obvious problem the implicit device context target causes is that it is by definition impossible to draw to more than one window from the same OpenGL context without repeatedly calling wglMakeCurrent() to switch its target. Even if all the application wants to do is draw one full-screen image to one monitor, and one full-screen image to another monitor, Windows treats this as two separate windows with two separate device contexts, thus at least one wglMakeCurrent() per window is required every frame.

This is bad for two reasons. First, wglMakeCurrent() cannot be assumed to be a lightweight operation in OpenGL drivers, and in general the OpenGL driver has no way of knowing a priori that all you’re doing with the wglMakeCurrent() is flip-flopping between two consistently targeted windows. Thus, if a driver wants to support this path efficiently, it introduces yet more “pattern discovery” into the GPU driver as it tries to determine that the user is not actually targeting a fresh window requiring new setup, but rather simply switching between two consistent windows in an attempt to update them both.

Second, wglMakeCurrent() is not an OpenGL call per se. It’s a Windows call, so it is not platform-independent and cannot be used by the platform-independent part of a graphics layer. While this may seem like a minor issue, it turns out to be rather unfortunate, since operations that target the window back buffer now must thunk to the platform layer and back in order to ensure that the magic “target” of the operation has been set up before they can continue issuing what would otherwise have been platform-independent OpenGL code.
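
To make the per-frame cost concrete, here is a minimal toy model in C. It does not call wgl or GDI at all: ToyDC, ToyContext, toy_make_current, toy_draw and toy_render_frames are hypothetical stand-ins invented for this sketch, and the only thing it demonstrates is the bookkeeping, namely that a single context with one implicit target needs one switch per window per frame.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: these are NOT the real wgl/GDI types or calls. */
typedef struct { int id; } ToyDC;

typedef struct {
    const ToyDC *current_target; /* the one DC this context implicitly targets */
    int make_current_calls;      /* how many times the target was switched */
} ToyContext;

/* Stand-in for wglMakeCurrent(): retarget the context at another window. */
static void toy_make_current(ToyContext *ctx, const ToyDC *dc)
{
    ctx->current_target = dc;
    ctx->make_current_calls++;
}

/* Stand-in for platform-independent drawing code: it can only reach
   whatever window the platform layer made current beforehand. */
static int toy_draw(const ToyContext *ctx)
{
    assert(ctx->current_target != NULL);
    return ctx->current_target->id;
}

/* Render `frames` frames to every window from one context and return
   the total number of target switches that were needed. */
static int toy_render_frames(ToyContext *ctx, const ToyDC *windows,
                             size_t window_count, int frames)
{
    for (int frame = 0; frame < frames; ++frame) {
        for (size_t i = 0; i < window_count; ++i) {
            toy_make_current(ctx, &windows[i]); /* platform-specific step */
            int drawn_to = toy_draw(ctx);       /* portable rendering code */
            assert(drawn_to == windows[i].id);
            (void)drawn_to;
        }
    }
    return ctx->make_current_calls;
}
```

With two windows and three frames, toy_render_frames reports six switches, and every one of them sits between the portable drawing code and its target. A real driver may or may not make each switch cheap; the model only counts them.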

Things get worse when more threads get involved.

In keeping with the original intention of “one GL context per thread”, it is actually well-specified in OpenGL for an application to give each of its threads (say, one per CPU core) its own GL context so that streams of commands can be constructed separately. Assuming these threads work off a job queue, whichever thread dequeues some OpenGL work from the queue can simply do it on its own context, and everything is fine.

However, all that falls apart once you need to target an output surface. Because each thread’s OpenGL context is implicitly targeting only one device context, a thread that tries to do OpenGL work that targets a specific device context must call wglMakeCurrent() using the HDC of the window in question. But it is illegal for two OpenGL contexts to target the same HDC at the same time, so now the two competing threads must introduce a mutex around the wglMakeCurrent().

While it may seem like this would be necessary in any case, the truth is that no such mutex is required during construction of the frame’s GPU commands — rather, the mutex (or more specifically, fence) required here is one on the GPU command retirement, namely that the streams of commands built by the threads should retire in a specific order on the GPU, regardless of when they were articulated to the driver.
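
The distinction between ordering construction and ordering retirement can be sketched with a small toy model in C. Nothing here talks to a GPU or a driver: ToyStream, compare_retire_order and toy_retire are hypothetical names made up for illustration, and the sort is only a stand-in for whatever fence mechanism enforces the order on real hardware.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a command stream built by one worker thread. */
typedef struct {
    int retire_order; /* position the application wants on the GPU timeline */
    int built_by;     /* which worker happened to build this stream */
} ToyStream;

/* qsort comparator: ascending by requested retirement position. */
static int compare_retire_order(const void *a, const void *b)
{
    const ToyStream *sa = (const ToyStream *)a;
    const ToyStream *sb = (const ToyStream *)b;
    return (sa->retire_order > sb->retire_order) -
           (sa->retire_order < sb->retire_order);
}

/* Streams arrive in whatever order the workers finished building them.
   Sorting by retire_order stands in for fence-based ordering: only
   retirement is ordered, construction is left unconstrained. */
static void toy_retire(ToyStream *streams, size_t count)
{
    assert(streams != NULL || count == 0);
    if (count > 0) {
        qsort(streams, count, sizeof(streams[0]), compare_retire_order);
    }
}
```

If three workers finish building in the order 2, 0, 1, toy_retire still yields 0, 1, 2, and no worker had to wait on another while building. That is the property a mutex around wglMakeCurrent() gives up.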

There is a very simple API change that would solve these problems.

All of these problems come from one root cause: the fact that OpenGL contexts implicitly target a single Windows device context. But there’s no real reason that has to remain true. We could add a single pair of API calls and fix it.

OpenGL already has a robust set of APIs that deal with the concept of framebuffers. This was a necessary addition to the API because developers wanted to be able to render to textures, so instead of always having rendering target the implicit device context, they needed ways of targeting a texture instead. Thus we already have:

void glGenFramebuffers(GLsizei n, GLuint * framebuffers);

void glBindFramebuffer(GLenum target, GLuint framebuffer);

When these functions were added, the implicit framebuffer handle of 0 became shorthand for the implicit device context target, and any other handle was one that the application could create and target for its own internal use.

So in some sense, the problem of how to tell the GL where output is supposed to go was already solved, and already works quite well. Thus all we really need to do to solve all the Windows-specific targeting problems is remove the implicit device context targeting of framebuffer handle 0 and replace it with explicit device context targeting with specific framebuffer handles. We can do this with two very simple extensions to wgl:

CGLuint wglAcquireDCFramebuffer(HDC hDC);

void wglReleaseDCFramebuffer(HDC hDC, CGLuint framebuffer);

Now, an application can use a single OpenGL context to target any number of output surfaces, just by acquiring framebuffer handles for their device contexts and then using the existing glBindFramebuffer call to say which one they mean:

HDC DC0; // NOTE(casey): The device context from window 0

HDC DC1; // NOTE(casey): The device context from window 1

CGLuint Framebuffer0 = wglAcquireDCFramebuffer(DC0);

CGLuint Framebuffer1 = wglAcquireDCFramebuffer(DC1);

// NOTE(casey): Do any common rendering to textures here

// ...

// NOTE(casey): Draw Window 0's specific output

glBindFramebuffer(GL_FRAMEBUFFER, Framebuffer0);

// ...

// NOTE(casey): Draw Window 1's specific output

glBindFramebuffer(GL_FRAMEBUFFER, Framebuffer1);

// ...

// NOTE(casey): Show everything

SwapBuffers(DC0);

SwapBuffers(DC1);

Note that the wglReleaseDCFramebuffer call isn’t even really necessary for most usage, because device contexts may never have to be explicitly released, but it might be nice for completeness in the case where OpenGL rendering ceases on a device context and the application would like to let the GL driver know it can free any associated resources.
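
Since the two proposed calls do not exist in any driver, the following is only a toy model in C of the bookkeeping they imply. Every name in it (ToyTargetDC, ToyDriver and the toy_ functions) is hypothetical, handles are plain unsigned integers, and two behaviors are assumptions of this sketch rather than part of the proposal: acquiring the same device context twice returns the same handle, and handle 0 means that no window target is bound.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TARGETS 8

/* Hypothetical stand-in for an HDC; not the real Windows type. */
typedef struct { int id; } ToyTargetDC;

typedef struct {
    const ToyTargetDC *targets[TOY_MAX_TARGETS]; /* slot i holds handle i + 1 */
    unsigned bound;                              /* bound handle, 0 = none */
} ToyDriver;

/* Models the proposed wglAcquireDCFramebuffer(): returns a nonzero handle
   for the DC. Reusing the handle of an already acquired DC is an
   assumption of this sketch. Returns 0 if the table is full. */
static unsigned toy_acquire_dc_framebuffer(ToyDriver *drv, const ToyTargetDC *dc)
{
    size_t free_slot = TOY_MAX_TARGETS;
    assert(dc != NULL);
    for (size_t i = 0; i < TOY_MAX_TARGETS; ++i) {
        if (drv->targets[i] == dc) return (unsigned)(i + 1);
        if (drv->targets[i] == NULL && free_slot == TOY_MAX_TARGETS) free_slot = i;
    }
    if (free_slot == TOY_MAX_TARGETS) return 0;
    drv->targets[free_slot] = dc;
    return (unsigned)(free_slot + 1);
}

/* Models the proposed wglReleaseDCFramebuffer(): frees the slot. */
static void toy_release_dc_framebuffer(ToyDriver *drv, const ToyTargetDC *dc,
                                       unsigned fb)
{
    assert(fb >= 1 && fb <= TOY_MAX_TARGETS);
    assert(drv->targets[fb - 1] == dc);
    (void)dc;
    drv->targets[fb - 1] = NULL;
    if (drv->bound == fb) drv->bound = 0;
}

/* Models glBindFramebuffer() selecting an acquired window target. */
static void toy_bind_framebuffer(ToyDriver *drv, unsigned fb)
{
    assert(fb >= 1 && fb <= TOY_MAX_TARGETS && drv->targets[fb - 1] != NULL);
    drv->bound = fb;
}

/* Returns the DC that drawing would currently land on, or NULL. */
static const ToyTargetDC *toy_current_target(const ToyDriver *drv)
{
    return drv->bound ? drv->targets[drv->bound - 1] : NULL;
}
```

In this model, switching between windows is just a bind on a single context. There is no make-current step anywhere, which is the point of the proposal.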

There is one caveat.

Specifically, if multiple GL contexts call wglAcquireDCFramebuffer on the same device context, the GL driver has to be smart enough to be thread-safe with its processing of commands targeted at those contexts. It would still be entirely up to the application to ensure that these commands are issued in a mutually exclusive fashion, so that the order in which the driver receives them is always the order the application wants.

This doesn’t strike me as a particularly good way to do things, because it forces the driver to do implicit ordering of things, which may restrict how threading can work in the driver and lead to unnecessary complexity or performance problems. Thus I would additionally propose that, were such an extension to actually be added, there should be one requirement:

Whenever two or more OpenGL contexts target the same framebuffer, they are required to use a fence.

To relieve the driver from any responsibility of serializing command streams across threads, any application using two OpenGL contexts to target the same device context for output should be required to insert a fence that serializes them:

// NOTE(casey): Thread 0's code

glBindFramebuffer(GL_FRAMEBUFFER, ContextBuffer0);

// ...

GLsync Sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// NOTE: Flush so the fence is actually submitted before another context waits on it

glFlush();

// NOTE(casey): Thread 1's code

glWaitSync(Sync, 0, GL_TIMEOUT_IGNORED);

glBindFramebuffer(GL_FRAMEBUFFER, ContextBuffer0);

// ...

This forces the application to be specific about the order in which the two threads’ commands are to be issued, and allows the driver to delay examination of the thread commands for as long as it likes while still being able to construct the proper order afterward. Although it unnecessarily serializes other GL calls occurring before and after the relevant portion, fixing that problem is a separate issue that deals with OpenGL’s dependency model and is well beyond the scope of any attempt to fix multi-target rendering specifically.