What happened with 27.2: The tale of a legendary hotfix
That was a crazy week. Let’s talk about what happened with 27.2 and what we had to do this past week.
With the Windows version of 27.2, we updated all of our dependencies. A dependency is a library made from external source code; something which is not OBS source code, but that OBS depends upon for major functionality and features. Sometimes this can be a feature such as software H.264 encoding, which relies on the x264 encoder library, or a feature such as the browser source, which relies on a much bigger dependency: Chromium (the browser engine that powers Google Chrome). More specifically, the browser source utilizes the Chromium Embedded Framework (CEF) to render a webpage as a source, or to render a webpage as a panel inside of OBS.
On Windows, before OBS 27.2, our browsers were stuck on Chromium version 75 because we had to use a complex custom Chromium patch to be able to use it with reasonable performance, and that patch was incompatible with newer versions. Chromium 75 was almost three years old as of 27.2, and many important features and changes have been added to Chromium since then; features and changes which are essential to modern production components such as stream overlays and advanced production displays. It was getting very outdated, so needless to say, it was a high priority for us to update Chromium as a dependency, and thanks to the effort of OBS Project members Pat, Dillon, Matt, pkv, as well as some wonderful people who contribute to the Chromium Embedded Framework, we were finally able to update it to version 95.
However, updating dependencies, especially dependencies that large, can pose some challenges: during the 27.2 release, we started to get sporadic reports of people’s entire computers freezing, requiring a full system reboot. We immediately began investigating, and at first I slowed, then eventually reverted the release of 27.2 as more reports came in confirming the issue. On the first day after 27.2, we were frantically trying to find affected users who would have the patience to allow us to examine and understand what was happening. Fortunately, we found some very kind users who were able to force a bluescreen during the system freeze and generate debug dump files of their system kernels. This allowed us to get the first hint of what was happening: it was very likely a bug in the graphics driver. We noticed it was centered around graphics operations, and it was only happening with one graphics card manufacturer. Fortunately, we have contacts with all the major graphics card manufacturers, so we immediately got in contact with them to file a driver bug report.
Since we were in no position to expect a graphics card manufacturer to debug and fix a suspected driver bug in a timely manner, let alone expect users to update to those drivers within any reasonable timeframe after their release, we had only two options: either find a workaround soon, or revert Chromium back to version 75. Considering so much effort in this update was spent updating Chromium to improve stream production features for users, reverting Chromium back to 75 was not the option I wanted to take. I had to do something, and soon, as Twitch was deprecating their v5 API in two weeks, meaning that old versions of OBS would no longer be able to use the “Connect Account” feature with Twitch.
R1CH, a contributor to OBS, had figured out a way to reproduce the system freeze: add a bunch of very active browser sources, and run OBS at a fractional framerate of 1300/1 (i.e. 1300 frames per second). This did the trick, and we were now able to reproduce the issue ourselves and debug it a bit more easily. While debugging the bluescreen kernel dumps, I immediately suspected what was likely triggering the problem: the IDXGIKeyedMutex API. This API is used to lock and synchronize shared graphics memory between two different processes or threads on the system; our latest Chromium update had been modified to use it, a big change from how Chromium 75 functioned. Because I am incredibly stubborn, and because I already hated that API, around two or three days after 27.2 was released, I decided that I had to find out whether or not that was the trigger. For the next day or so after that, I modified and compiled Chromium to remove IDXGIKeyedMutex almost everywhere I could see it. Because Chromium is such a monumentally large project involving tens of millions of lines of code, and because it uses so many layers of abstraction and interprocess communication, I had doubts I would be able to accomplish it; some of our contributors suggested we should just let it go and revert back to Chromium 75. But being incredibly stubborn, after a few days of learning how Chromium works internally, I managed to remove almost all usage of keyed mutexes.
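For readers who have not run into keyed mutexes before, here is a minimal sketch of the handoff pattern they enforce. It is a toy model in plain Python threads, not the real DXGI API and not Chromium's or OBS's code: the class `KeyedMutexModel` and the function `run_handoff` are hypothetical names made up for illustration. The only things borrowed from the real API are the two operations, `AcquireSync(key, timeout)` and `ReleaseSync(key)`, and the rule that an acquire with a given key only succeeds once the resource has been released with that same key.

```python
import threading


class KeyedMutexModel:
    """Toy stand-in for keyed-mutex semantics (NOT the real DXGI API)."""

    def __init__(self, initial_key=0):
        self._cond = threading.Condition()
        self._key = initial_key  # key the resource was last released with
        self._held = False

    def acquire_sync(self, key, timeout=None):
        """Wait until the resource is free and was last released with `key`.

        Returns True on success, False if the timeout expired first.
        """
        with self._cond:
            ok = self._cond.wait_for(
                lambda: not self._held and self._key == key, timeout
            )
            if ok:
                self._held = True
            return ok

    def release_sync(self, key):
        """Release the resource; only an acquire using `key` can take it next."""
        with self._cond:
            self._held = False
            self._key = key
            self._cond.notify_all()


def run_handoff(frames=5):
    """A producer (think: browser) and a consumer (think: compositor) take turns."""
    mutex = KeyedMutexModel(initial_key=0)
    shared = {"frame": None}  # stands in for the shared texture
    seen = []

    def producer():
        for i in range(frames):
            assert mutex.acquire_sync(0, timeout=5)  # wait for the consumer
            shared["frame"] = i                      # "render" a new frame
            mutex.release_sync(1)                    # hand over to the consumer

    def consumer():
        for _ in range(frames):
            assert mutex.acquire_sync(1, timeout=5)  # wait for a finished frame
            seen.append(shared["frame"])             # "sample" the texture
            mutex.release_sync(0)                    # hand back to the producer

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Calling `run_handoff()` returns the frames in strict order, because each side has to wait for the other on every single frame. That lockstep is what keeps frames paced correctly, and it also means both sides are constantly blocking on each other across process boundaries.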
And I couldn’t believe it: it solved the system freeze. It did the trick! I felt like Luke Skywalker in the Death Star’s garbage compactor right after it was deactivated.
However, it introduced a new issue: although it solved the system freeze issue, removing synchronization inevitably caused frame stuttering and frame pacing issues when rendering the browser source. It was noticeable enough that I knew that my job wasn’t finished. For the next sleepless day or two after that, in an attempt to solve this issue, I went to work trying to figure out some other way to synchronize textures shared between Chromium and OBS. Because Chromium code is so incredibly abstract and relies on so many separate, independent interprocess parts working together, the task was incredibly difficult. After a day or two of no success, another contributor suggested that I just let it go: it was good enough as-is, and we had the v5 API deadline. I originally conceded and let it go, but that same night while I was in the shower, I had an idea of how to solve it!
After that shower, I told the other contributors that I was going to try one last thing that night, and that if I couldn’t do it before the night was over, I’d give up. In an all-or-nothing last-ditch effort, I spent the rest of the night reprogramming a couple of key parts of Chromium to share a single texture with OBS, which would be automatically updated by a simple copy operation from the backbuffer texture: exactly the same way that the patch for version 75 accomplished it.
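The shape of that idea can be sketched in a few lines. This is only a CPU-side toy model under stated assumptions, not the actual Chromium patch: `SharedTextureModel`, `ProducerModel`, `present`, and `sample` are hypothetical names, and the byte buffers stand in for GPU textures. In a real Direct3D 11 implementation the copy would be a GPU command (presumably something along the lines of `ID3D11DeviceContext::CopyResource`), with ordering handled on the GPU side rather than by Python.

```python
class SharedTextureModel:
    """Stand-in for the single texture shared between the browser and OBS."""

    def __init__(self, width, height):
        self.pixels = bytearray(width * height * 4)  # RGBA, 4 bytes per pixel
        self.frame_id = 0  # hypothetical counter, only here for illustration


class ProducerModel:
    """Stand-in for the browser side: renders to its own backbuffer."""

    def __init__(self, shared):
        self.shared = shared
        self.backbuffer = bytearray(len(shared.pixels))

    def present(self, fill_value):
        # "Render" a frame into the private backbuffer.
        self.backbuffer[:] = bytes([fill_value]) * len(self.backbuffer)
        # One plain copy publishes the finished frame; no keyed mutex involved.
        self.shared.pixels[:] = self.backbuffer
        self.shared.frame_id += 1


def sample(shared):
    """Stand-in for the OBS side: reads the latest frame without waiting."""
    return bytes(shared.pixels)
```

The point of the design is that the consumer never blocks on the producer: it samples whatever the most recent copy left in the shared texture, and the producer only ever writes finished frames into it.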
My effort proved fruitful, and not only did it fix the frame stuttering issue, but it also vastly improved performance of the Chromium 95 build. Not only did we fix the system freeze and fix frame stuttering, we also greatly improved browser source performance!
Words can’t describe how good it felt. My elation went from simply getting out of the garbage compactor on the Death Star to blowing up the entire Death Star in one fell swoop. We tested it with everyone we could: everyone confirmed that all of their issues were fixed, and that the browser source was performing better than ever. After an entire week of sleepless toil, we’re now here with 27.2.1, a legendary hotfix. It was the worst week of my life that somehow turned into the best week of my life.
I want to give a big shout-out to the very patient users who purposely crashed their PCs to get us kernel dumps, a big shout-out to R1CH for figuring out a way to reproduce the system freeze reliably, and a big shout-out to Matt, pkv, Flaeri, Shaolin, RytoEX, Ace, and everyone else who spent time testing different builds on different systems. Thank you all. Without all of our wonderful contributors working together, none of this would have been possible.
What an incredibly crazy week. I can finally get some sleep again.