The Mystery Blips

A young operations engineer learns a lesson about creative debugging

Nov 19, 2022

My first team at FB was SRO: Site Reliability Operations.

This is different from SRE: Site Reliability Engineering. Think much less proactive/creative/forward-thinking/preventative coding and investigating, and much more OH GOD PUT THE FIRE OUT NOW.

Our primary job responsibility was oncall. Oncall. That's pretty much it. We would have 3/6 hour split shifts, spread across the week. We were responsible for the ENTIRE SITE. And I mean that. The whole thing, from Messenger to databases to the web tier to caching. If a service going down could take down the site or some critical functionality, its alarms came to us.

Let's just say the oncall was stressful.

The top ten most stressful hours of my life, bar none, were all SRO oncall.

You're just sitting there, watching charts. For 3 hours. Or you're the backup oncall, ready to take over watching charts if your primary needs to pee or get a snack. That's really how we did things then. Watch charts and alerts, and wait for something to break. No pagers, no alerting noises, just... Watching the charts, waiting for something to break[1]. It seems almost barbaric in comparison to the systems FB would have for oncall by the time I left, but it's what we had and everyone relied on us to keep the site up.

Once something died, you could literally watch the money draining away from the company while the site was down. Could see the millions of people unable to use the site. And you'd know that you, just you, and no one else, were in charge of getting it back up.

Again. Stressful.

Anyway. The dynamic was this: roughly a dozen people are sitting around quietly typing, one person is slightly on edge, and one person is either waiting for the hammer to fall or is in FULL CRISIS MODE.

If nothing was broken, you'd just sit there. Scanning charts visually, waiting for things to turn red if they dropped too much, looking for anomalies. Very, very often, the shape of the red charts instantly confirmed a SEV. Important metrics dive off a cliff, and it's time to pull folks into IRC, wake them up if need be, escalate to an IMOC - whatever was called for to get things resolved.

So, usually, if an alarm fired off and its chart went red: something was seriously broken.

But sometimes, there would be Blips.

You'd find some unexplained massive spike in Egress (the amount of data flowing out of FB to its users), and then it would just... Go away. No more Blip. Or you'd have a 3-minute spike in 5XX errors (server-side errors that often meant either a slow or broken experience for the user), but then after those 3 minutes they'd go back down and you'd never see them again.

Chasing down the Blips can teach you a lot about how a complex system functions. Sometimes it can help you find real problems before they happen. And a lot of the time, they'll teach you only one thing: That complex systems are chaotic, and sometimes things just harmlessly Blip and you'll never know why.

Our general policy was that if a Blip was big enough to set off an alert, we'd give it at least a cursory look.

So one morning, I'm in the office. I'm oncall. I'm staring at the charts. It's one of my very first shifts.

And I get the mother of all Blips. Massive spike in egress, posts, likes[2], everything. Blip.

So I start to take a look. No alerts from other systems, no corresponding spike in 5XX, nothing seems to be under load, no large changes pushed out at that time, just a big fat blip. The metrics all seemed to be a little bit low before the blip, but that was the only anomaly.

I decide it's just one of those things, and let it be. The site seems perfectly healthy.

Seventeen minutes pass. I stare at charts, chitchat on IRC, and relax.

BLIP. Holy hell. The dashboard glitters red. All the major metrics go nuts. Egress, posts, likes.

Now I'm really worried. These blips aren't happening on round numbers like 11:30 or 11:45, which would indicate an automated system. Something is up.

I take to IRC. "Does anyone know if there are any large changes happening to the site right now?"

A couple people chime in, nothing going on. I'm wondering if it's time to file a SEV, so I pull in some slightly more experienced SROs to take a look.

"Wow. Look at that depressed traffic before the blip. Maybe something is delaying reporting of metrics and then dumping it all at once?"

We continue down this path a while. Looking through logs, comparing charts. Traffic trends downward and downward, and I'm starting to feel worried...

BLIP.

Holy hell. Through the roof. Egress, posts, likes. Skyrocket and crash back down. But this time, they stay elevated for several minutes before going back to normal.

These spikes are not happening on a regular cadence. They're shooting up higher each time. It's time to call in the big boys. This looks like a SEV.

I run over to someone's desk, an SRO who had been with the team for years. I explain what's going on, and he comes to my desk to look at the charts.

Concern spreads across his face. "Wow. That's a disturbing pattern. You guys checked for events, right?"

"Events? Yeah, no deploys, no major system changes, nothing going on with the network, no..."

"What? No, I mean like... Events. News. Did you check for that?"

"What?"

"Check the news. Look for major events happening."

So I go to a news aggregator and start to browse while they look on. It's a quiet day.

"I don't see anything, nothing is really happening. Maybe a queueing system has…"

"I know what it is. Go to <xyz website>."

I type in the URL, smack Enter, wait for it to load, and...

Everyone around me lets out a collective "OHHHHHH".

I'm staring at the screen dumbfounded. Really?

It's a soccer game.

Final score: 2-1.

[1] On the off days, we would work on automation and fixing up things that had broken (and later on in my tenure, we'd work on automating ourselves out of a job - but that's a story for another day).

[2] Funny aside, one of the very best metrics we had was the number of Likes at any given moment. You would be *shocked* at how regular Likes were from day to day and week to week. The cyclical nature of it was hypnotic. Somehow, week after week, if you added up all the likes happening in a particular minute on FB, you'd get a number eerily similar to the number from a week ago, and very similar to the number from a day ago (weekends aside). There's a deep beauty to the thought that untold millions of people using the app randomly, but oh so slightly habitually, aggregated together, make such a predictable pattern.
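If you wanted to turn that regularity into a check, a minimal sketch might look like the one below. To be clear, this is not what FB ran. The function names, the 20% threshold, and the sample counts are all made up for illustration. The idea is just to compare the current minute against the same minute a day ago and a week ago, and only call it a Blip if it strays from both.

```python
def relative_deviation(current, baseline):
    """Fractional change of current vs. baseline (0.25 means 25% higher)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def is_anomalous(current, day_ago, week_ago, threshold=0.2):
    """Flag a minute only if it deviates from BOTH baselines by more than threshold.

    The threshold is a hypothetical value, not a real alerting setting.
    """
    return all(
        abs(relative_deviation(current, base)) > threshold
        for base in (day_ago, week_ago)
    )


# Made-up per-minute Like counts, purely for illustration.
print(is_anomalous(1_050_000, day_ago=1_000_000, week_ago=1_020_000))  # False
print(is_anomalous(1_600_000, day_ago=1_000_000, week_ago=1_020_000))  # True
```

Requiring both baselines to disagree is a deliberate choice in this sketch: it keeps a single odd day (a holiday, say) from flagging every minute. It would still light up for a goal in a soccer game, which is sort of the point of the story.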
