Really, it was a software bug that was easily fixed. It turns out rewriting eight lines of code would have quickly done the trick.
Instead, it took a month of recurring errors before stack.io CEO Hany Fahim and his team solved the problem. Maybe it was partly the product of pride, of stubbornness and a healthy dose of curiosity. After all, it’s hard to resist a good mystery.
What’s with that error message?
It happened more than four years ago, but the whole episode is still fresh in Hany’s mind. He’d just come back from lunch on a wet and blustery November afternoon when someone pointed to an error message: “Content is not a valid SSH key.”
“SSH keys are like personal identification cards that allow secured communication with Linux systems,” says Hany. “Our team has developed a web application that allows clients to manage their SSH keys and map them to systems they need access to.”
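To make that error message a little more concrete, here is a minimal sketch in Python of the kind of check that might reject content as “not a valid SSH key.” This is not stack.io’s actual code; the function name and the exact validation rules are assumptions, purely to illustrate what an application like this has to verify.

```python
import base64
import struct


def is_valid_ssh_public_key(content: str) -> bool:
    """Rough check of an OpenSSH public key line, e.g. 'ssh-rsa AAAA... user@host'.

    Hypothetical example -- not the application's real validation logic.
    """
    parts = content.strip().split()
    if len(parts) < 2:
        return False
    declared_type, blob_b64 = parts[0], parts[1]
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except ValueError:  # also covers binascii.Error raised on malformed base64
        return False
    if len(blob) < 4:
        return False
    # The decoded blob begins with a 4-byte length followed by the key type,
    # which should match the declared type (RFC 4253 string encoding).
    (length,) = struct.unpack(">I", blob[:4])
    embedded_type = blob[4:4 + length]
    return embedded_type == declared_type.encode()
```

The interesting part of the story isn’t the check itself, of course, but why code along these lines, after passing thousands of times, could suddenly start failing.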
A list of potential culprits
This application was written in Python, a popular programming language, using a framework called Django. Remember that name, Python, because it makes an important return to our story. So do Unicode and an expiring cache… there’s a whole list of potential culprits.
One of the most important steps in troubleshooting is being able to isolate and replicate an issue at will. Doing so allows you to reliably test a fix. The team tried to replicate the issue many times, without success.
“We didn’t think anything more of it,” says Hany. “Little did we know we were about to embark on a multi-week bug-hunting quest.”
It’s baaack
A week later, another error message about an invalid SSH key. By then, the team had already located the piece of code responsible for the error. It was only eight lines long and had run successfully thousands of times, and yet it had now broken twice in a week. No one knew what the real cause was.
Maybe there’s a quick solution
At this point, maybe the best idea is to rewrite the offending code. After all, it’s only eight lines long.
Indeed, the team had actually planned for this; a fully rewritten version was prepared and tested. But the decision was made to keep it in their back pocket for now.
But it’s more fun finding the complete answer
“From a business perspective, we should have simply applied this and put an end to this fiasco,” says Hany. “But not knowing what was really going on was eating me up inside. I justified to the team that continuing the hunt was paramount precisely because we did not understand the bug. It could affect other systems, maybe even cause more serious events such as data loss or a security breach.”
So it happens again around Christmas, and Hany calls for all hands on deck to solve the problem. “Finally, we’re able to reproduce the error,” he says. “But again, things don’t add up. Based on what we’re seeing, this should always fail, but somehow it works… most of the time.”
“Personally, I succumb to the sunk-cost fallacy,” he admits, “a cognitive bias whereby you continue spending time and money on a problem simply because you’ve spent so much time already. I was hell-bent on solving this problem one way or another.”
The fix was insultingly simple
The fix, when it was finally found, was almost insultingly simple. But maybe it wasn’t that simple, because, as we found out, a single bug in a complex system is rarely enough to cause failure. Failures occur when multiple bugs are triggered under the right circumstances.
But you’ll have to listen to the “Unpacking an Anomaly” episode of the Tales From the Ops Side podcast to find out what this simple solution was and why it proved so hard to find.