After having his lunch interrupted by an urgent call from a client, stack.io CEO Hany Fahim put on his Sherlock Holmes hat to investigate a missing file full of sensitive data.
“Constant change results in one of the most attractive aspects of working behind the scenes of the internet – how varied each day can be,” observes Hany.
I’ll take that to go
Hany was enjoying lunch in late 2017 when he received an urgent request from a client. An investigation was needed – now.
“I’ll take that to go,” Hany told the staff at his favourite pita joint.
The problem was a failed file transfer that had occurred several weeks earlier. Each night, the client was to transfer a file containing sensitive data to their partner. The transfer was an automated task called a cron job.
The hunt begins
Hany worked with the client to troubleshoot the issue. After getting a summary of the problem, he began asking questions.
Was logging enabled? Have you confirmed that logging was actually working?
Hany checked the system logs that would give high-level information about what might have happened. From the logs, he saw the cron job did execute on the night in question. But that alone did not show whether the transfer had gone through.
“Was it successful? That was the ultimate question. There was no easy answer,” says Hany. There was no direct information to indicate success or failure.
A deep dive
Hany started to check using some indirect methods. The file was small, so it wouldn’t register on a network graph.
“I needed to dig deeper,” says Hany.
There was nothing in the database logs. Maybe low disk space? No, the disks were fine. Also, there wasn’t an alert history for that night.
“We went through nearly every available option and struck them off the list. I was out of ideas,” says Hany.
He told the client there were no errors and it seemed the program ran successfully. He called it a day.
Breaches, penalties, and lawyers
After Hany started for home, the client called and told him how dire the situation was. The client used terms such as breach of contract and penalties. Lawyers were involved. This was serious. The client needed hard evidence.
At home, back at the computer, he started on the client’s problem again.
“All I could think about was where else could I find that hard evidence?” confesses Hany.
He spent the evening checking every log file, trying to find some data related to the client’s program.
Nothing was found.
As evening turned to night, and night to morning, Hany was searching Google for possible answers. He was getting nowhere and called it a night.
Light at the end of a dark long night
The next morning, over strong coffee, he spotted something. One of his open browser tabs was a six-year-old post from Stack Overflow, a popular question and answer site. He had disregarded the post, but reading it again, he saw that the second answer to a user’s question might help with his client’s problem. It mentioned Process Accounting.
Process Accounting is a collection of tools used to record and summarize the commands run on Linux systems. Hany’s team had installed it on all their systems. All the client’s process accounting records were stored in several files in a non-standard location – which was why Hany hadn’t seen them.
“The file was right there. The hard evidence we were so desperately seeking was contained in that file,” says Hany.
The file was not in human-readable form. Hany had to read up on Process Accounting commands but still ended up with a screen of numbers and unusable data. What he needed to find was the exit code – which would show whether the program had succeeded in sending the file.
Finding the error, finally
A user on a Stack Overflow page had written a program to get the exit code information out of the jumble of data. After acquiring the program, scanning thousands of lines of code, and making some changes, Hany was finally able to get the data he needed.
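To give a feel for what pulling an exit code out of such a file involves, here is a minimal Python sketch. It is not the program Hany used. It assumes the records follow the 64-byte Linux `acct_v3` layout described in the acct(5) manual page, stored little-endian; the names `AcctRecord` and `parse_records` and the command name `upload.sh` are made up for illustration. The exit code is returned as the raw stored value, without further decoding.

```python
import struct
from typing import Iterator, NamedTuple

# Assumption: records use the Linux struct acct_v3 layout from acct(5),
# little-endian, 64 bytes each. Other accounting versions differ.
ACCT_V3 = struct.Struct("<BBHIIIIIIf8H16s")


class AcctRecord(NamedTuple):
    command: str     # ac_comm, command name (at most 16 bytes)
    pid: int         # ac_pid
    start_time: int  # ac_btime, seconds since the epoch
    exitcode: int    # ac_exitcode, raw value as stored in the record


def parse_records(data: bytes) -> Iterator[AcctRecord]:
    """Yield one AcctRecord per complete 64-byte record in data."""
    for offset in range(0, len(data) - ACCT_V3.size + 1, ACCT_V3.size):
        fields = ACCT_V3.unpack_from(data, offset)
        if fields[1] != 3:  # ac_version
            raise ValueError(f"unexpected record version {fields[1]}")
        # Strip the NUL padding that follows the command name.
        command = fields[18].split(b"\0", 1)[0].decode("ascii", "replace")
        yield AcctRecord(command, fields[6], fields[8], fields[3])


if __name__ == "__main__":
    # Synthetic record for a hypothetical "upload.sh" job; not real data.
    sample = ACCT_V3.pack(0, 3, 0, 256, 1000, 1000, 4242, 1,
                          1500000000, 0.0, 0, 0, 0, 0, 0, 0, 0, 0,
                          b"upload.sh")
    for record in parse_records(sample):
        print(record.command, record.pid, record.exitcode)
```

On a real system, the file would be read in binary mode and the records filtered by command name and start time to find the run on the night in question.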
The client’s program had run for approximately 10 minutes trying to upload the data file. On an average night, it took less than one minute for the transfer to complete. Checking with the client, Hany learned that the program would try to connect to their partner’s system for 10 minutes. If it was unsuccessful, it would time out and abort. A run of roughly 10 minutes therefore pointed to a connection that was never established.
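The behaviour described here, retrying until a deadline and then giving up, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the client’s actual program: the function name `try_until_deadline`, the five-second retry interval, and the exit codes 0 and 1 are all hypothetical.

```python
import time
from typing import Callable


def try_until_deadline(attempt: Callable[[], bool],
                       deadline_seconds: float = 600.0,
                       retry_interval: float = 5.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call attempt() until it succeeds or the deadline passes.

    Returns 0 on success and 1 on timeout, mirroring a process exit code.
    """
    start = clock()
    while clock() - start < deadline_seconds:
        if attempt():
            return 0
        sleep(retry_interval)
    return 1
```

A job built this way leaves two traces when the remote side is unreachable: a run time close to the full deadline and a non-zero exit code, which is the combination the accounting records revealed.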
“That was the smoking gun they needed,” concludes Hany. The problem had been solved. Moving forward, the stack.io team made process accounting a more prominent weapon in their arsenal.
For the full story with all the technical details, make sure to listen to the podcast.