Just before Christmas break, I
fixed a bug in the transform manipulator and checked it in. Having a transform manipulator is a nice addition to any editor, and we were all happy to have it working. But shortly thereafter, I had to leave for a cross-country family Christmas trip. This meant no more real
Witness work for about ten days, because my laptop only has Linux on it, and right now the Linux version of
The Witness is still under construction.
This hiatus was the primary motivation behind writing my previous entry as a cliffhanger: although I had fixed the bug, I didn’t have time to adequately document the bug via debugger inspection, so it would have been hard to write anything meaningful showing how the bug manifested itself in practice. I wanted to wait until I had a chance to go through every last little bit of it, so I could be as specific as possible. In that sense, it was meant to be a bit of a cliffhanger for me as well.
But as it turned out, it was much more of a cliffhanger than I’d originally anticipated.
The good news: congratulations to those of you who posted “floating point precision” as a suggestion for the cause of the bug. You aren’t any dumber than I am.
The bad news: we’re all dumber than we should be.
For those of you who don’t already know all about floating point precision problems, I’ll explain the problem we all thought we were solving. The original code, verbatim, looked like this:
I’ve done exactly the same operation, but instead of using the d value of the plane equation to solve for t, I just move the ray itself to be centered around a point on the plane, so that the d value is implicitly zero. My reasoning was very simple: if I subtract two things that I know are close together (the camera location and the picked point on the manipulator), I will have less cancellation error than if I subtract two things that may be wildly different in scale (the d value and the dot product of the camera location and the plane normal).
This is pretty weak sauce as far as numerical precision improvements go. I am far from an expert on the subject (read some
Forman Acton, for example, if you want to see how the pros roll). But, when I did this, it fixed the bug perfectly, so I assumed, at least for the moment, that it must be a reasonable fix.
But, honestly, I didn’t really have a good feeling about it.
Sometimes, when there’s a lot to do, you’ll live with a bug fix because it works, rather than because it makes sense. But invariably, at least for me, I end up going back to look more closely at something if I don’t feel like I fully understood it. I am sure this is mostly just an obsessive personality trait, but regardless, I rarely find the time to be wasted, because I almost always learn something from the process.
In this specific case, I didn’t have time to look into the problem in depth before I left. I planned to do so for the eventual writeup, but of course, I couldn’t do much without being able to run the code, and while I was on Christmas break, that wasn’t an option. Unfortunately, this meant that I had the entire Christmas break to stew about the bug that I wasn’t sure I’d really fixed properly.
From the beginning, the thing that didn’t sit right with me about the fix was that when Jon originally reported the problem to me, he didn’t mention anything about it happening more on the outskirts of the island than on the interior. Jon’s a very experienced programmer, so if something like that were happening, he would probably notice and include that in the report, because he’d know it would help me find the bug faster.
Instead, it seemed pretty clear that the bug was related more to the camera angle than the position in the world. While obviously the camera angle affects the input values, and could have a number of effects on the numerical precision of the program, it’s not intuitively the kind of thing that I would expect. If these sorts of algorithms were that sensitive in this way, I would have expected to encounter similar problems with camera-relative plane intersections many times before — but I never had.
The more I thought about that, the more it gnawed at me that I couldn’t go see exactly what was happening and prove to myself why the fix worked. I did have the game on my laptop, because I’d done some initial Linux porting work, so I realized that maybe, if I was lucky, I could reconstruct the problem without actually being able to load and run the editor.
That is when I became very thankful that I had been writing these blog entries, because it meant I had stopped to take this screenshot before I left, specifically so I could do the writeup over break:
Do you see what I see? That lovely entity number up in the corner is pure gold for step one of reconstructing the problem. Because I could see the entity number, I was able to go into the repository and get the data file that says where that entity is placed in the world. That let me know the numbers of the center point of the manipulator: -304.114, 193.4923, 4.7582.
If you’re an experienced 3D programmer, then you probably are thinking the same thing that I was thinking when I saw those: how the hell could I be having numerical precision problems with such low numbers?
The Witness’s scale is one unit per meter, which means that, at the scale of the dragging we’re talking about, there should be absolutely no problems with solving a single plane equation, regardless of how sloppily you did it.
But if that was the case, then how did I fix the bug? This lead me to step two of reproducing the problem: writing a test program. I wrote a little C program that would take numbers around that point and, for a large set of programmatically generated planes and rays, perform the intersection using both the old ray intersection routine and my “improved” one.
Perhaps unsurprisingly, these results reinforced my suspicions that there wasn’t enough of a difference between the two routines to account for the original bug. In all the tests, my “improved” routine was no more than one decimal digit more precise; so if the original routine produced intersection points that were actually, say, 0.0001 units away from the plane, the new routine might produce ones that were 0.00001 away.
Clearly, the fix just plain shouldn’t have worked. But it did.
Why?
The Wrong Part of the Post
I would love to be able to say that I guessed the real reason before stepping through the code, but I absolutely didn’t.
When I returned from Christmas break, I eagerly booted up the Witness and stepped through the code very carefully, keeping accurate track of all the numbers as I went, waiting to find that super-elusive numerical thing that I had overlooked. To my complete surprise, it turned out to have
absolutely nothing to do with precision.
In fact, somehow, although I didn’t know it at the time, I actually did write about the root cause of the problem in my original post. It just wasn’t in the part that I thought. Indeed, rather than “Always Look Behind You” foretelling the eventual solution, it was “Foreign Languages” that contained the hint: due to my naivete with
The Witness codebase, I misunderstood what it was doing with planes. It turns out that this one line from the original routine was the entire culprit:
This is my fault on its face, because hey, it
does say normal, and normals are traditionally supposed to be normalized. But in keeping with my reference to foreign languages, this is something I really just didn’t think about, because I am used to the way I’ve always done things with planes: nothing is normalized automatically, everything is just left unnormalized, because back in the day, normalization was very expensive and you never did it in a math routine unless it was absolutely necessary.
So plane_
from_point_and_normal was actually doing what I thought it would: it just stored plane_n as the a, b, c values, and computed d with the inner product. But get_normal() was not doing what I thought it would. Rather than just returning me the vector {a,b,c}, it was normalizing them before giving them to me, which meant all my calculations with them would no longer agree with the d value computed in plane_from_point_and_normal.
And therein, clearly, lies the entirety of the bug.
Just to underscore how much of a native language issue this really is, when I showed Jon the preview of this article, he said, “If you are thinking of the plane as a geometric object, of course the normal is unit length, what else would it be?? But if you are thinking of it as some specific equation all the time, then it is natural to care about the a, b, c, you put in.” So I can only assume, were he using my math library, he would’ve been tripped up by the fact that I don’t normalize normals automatically, exactly the inverse of the problem I encountered with his.
I honestly don’t believe that this sort of thing is a case of us being presumptuous or sloppy; it’s just the price you have to pay for being able to program complex things at a reasonable speed. If we constantly have to think about what it means to work with a vector, or a plane, or a quaternion, we’re going to be a lot slower. It’s much more efficient to have a consistent perspective, and always rely on it.
And of course, the cost is bugs like this — but that’s a very small cost relative to the benefits.
So, where does that leave you?
Clearly, automatic partial credit goes to everyone who wrote in suggesting that numerical precision was the problem. You came to the same conclusion I did with the information I knew at the time, which was all you had to go on. You certainly can’t be blamed for not finding the real bug since you would have needed the source code to the Witness math library to even see it. And it is true that the numerical precision of the routine could have been improved, as I believe I was able to verify separately with my test program.
But by the same token, nobody gets automatic full credit. There were a lot of reasons why, even with only the information in the original article, you could have said, “I don’t think it’s numerical precision, Muratori.” Maybe I am ascribing overly heroic attributes to the man, but I feel like
Forman Acton would have known, from looking at the plane intersection snippet, that there wasn’t any way to lose enough precision in there to account for the bug as I described it, at least not in any way that would have been substantially improved by the suggested fixes people wrote in (or the fix I actually did, for that matter).
So in order to actually get full credit, you’re going to have to earn it, I think, but on the honor system. Ask yourself, if you had encountered this bug, and you had “fixed” it as I had, would you have gone back and figured out what was really happening before moving on to something else?
If you can honestly answer yes, then you get full credit, for this is about as good as you can hope to do as a working programmer, I think. You’re no
Forman Acton, of course… but then again, who among us is?