Early in my career, I was responsible for managing a large fleet of printers spread across a sprawling campus. We're talking several hundred networked printers. Reaching some of them physically often required a 10- or 15-minute walk, and many were used only sporadically. I rarely knew what I'd find until I arrived. Simple paper jam? Driver issue? Printer currently on fire? I found out only after the long walk. Making this even more frustrating for everyone, a problem with one of the rarely used printers could go unnoticed for weeks, revealing itself only when someone finally tried to print.
Finally, it occurred to me: wouldn't it be nice if I knew about the problem and the cause before someone called me? I found my first monitoring tool that day, and I was absolutely hooked.
Since then, I've helped numerous people overhaul their monitoring systems. In doing so, I noticed the same challenges repeat themselves regularly. If you're responsible for managing the systems at your organization, read on; I have much advice to dispense.
So, without further ado, here are my top five reasons why your monitoring is crap and what you can do about it.
By far, the most common reason for monitoring being screwed up is a reliance on antiquated tools. You know that's your issue when you spend too much time working around the warts of your monitoring tools or when you've got a bunch of custom code to get around some major missing functionality. But the bottom line is that you spend more time trying to fix the almost-working tools than just getting on with your job.
The problem with using antiquated tools and methodologies is that you're just making it harder for yourself. I suppose it's certainly possible to dig a hole with a rusty spoon, but wouldn't you prefer to use a shovel?
Great tools are invisible. They make you more effective, and the job is easier to accomplish. When you have great tools, you don't even notice them.
Maybe you don't describe your monitoring tools as "easy to use" or "invisible". The words you might opt to use would make my editor break out a red pen.
A quick gut check: do you spend more time working around your monitoring tools than working with them? Are you maintaining piles of custom code to paper over missing functionality? If you answered yes to either, you're relying on bad, old-school tooling. My condolences.
The good news is your situation isn't permanent. With a little work, you can fix it.
If you're ready to change, that is.
It is somewhat amusing (or depressing?) that we in Ops so readily replace entire stacks, redesign deployments over a week, replace configuration management tools and introduce modern technologies, such as Docker and serverless—all without any significant vetting period.
Yet, changing a monitoring platform is verboten. What gives?
I think the answer lies in the reality of the state of monitoring at many companies. Things are pretty bad. They're messy, inconsistent in configuration, lack a coherent strategy, have inadequate automation...but it's all built on the tools we know. We know their failure modes; we know their warts.
For example, the industry has spent years and a staggering number of development hours bolting things onto Nagios to make it more palatable (such as nagios-herald, NagiosQL and OMD), instead of asking, "Are we throwing good money after bad?"
The answer is yes. Yes we are.
Not to pick on Nagios—okay, yes, I'm going to pick on Nagios. Every change to the Nagios config, such as adding or removing a host, requires a config reload. In an infrastructure relying on ephemeral systems, such as containers, the entire fleet may turn over every few minutes. If you have two dozen containers churning every 15 minutes, that's up to 48 host additions and removals in that window, so Nagios could easily be reloading its config more than once a minute. That's insane.
And what about your metrics? The old way to decide whether something was broken was to check the current value of a check output against a threshold. That clearly results in some false alarms, so we added the ability to fire an alert only if N number of consecutive checks violated the threshold. That has a pretty glaring problem too. If you get your data every minute, you may not know of a problem until 3–5 minutes after it's happened. If you're getting your data every five minutes, it's even worse.
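To make that lag concrete, here's a minimal sketch (in Python, with illustrative threshold and interval values, not taken from any particular tool) of the classic "alert only after N consecutive violations" logic and the detection delay it builds in:

```python
# A minimal sketch of "alert after N consecutive threshold violations".
# The threshold, interval and metric here are illustrative only.

CHECK_INTERVAL_SECONDS = 60    # how often the check runs
THRESHOLD = 90.0               # e.g., percent CPU
CONSECUTIVE_REQUIRED = 3       # violations needed before an alert fires

def should_alert(samples, threshold=THRESHOLD, required=CONSECUTIVE_REQUIRED):
    """Return True once the last `required` samples all exceed `threshold`."""
    if len(samples) < required:
        return False
    return all(value > threshold for value in samples[-required:])

if __name__ == "__main__":
    history = [42.0, 95.0, 96.0]
    print(should_alert(history))   # False: only two violations in a row so far
    history.append(97.0)
    print(should_alert(history))   # True: three consecutive violations
    # Worst case, a real problem goes unreported for roughly
    # CONSECUTIVE_REQUIRED * CHECK_INTERVAL_SECONDS (three minutes here),
    # and far longer with a five-minute check interval.
```

With a one-minute interval and three required violations, you learn about a sustained problem three or more minutes after it begins, which is exactly the lag described above.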
And while I'm on my soapbox, let's talk about automation. I remember back when I was responsible for a dozen servers. It was a big day when I spun up server #13. These sorts of things happened only every few months. Adding my new server to my monitoring tools was, of course, on my checklist, and it certainly took more than a few minutes to do.
But the world of tech isn't like that anymore. Just this morning, a client's infrastructure spun up a dozen new instances and spun down half of them an hour later. I knew it happened only after the fact. The monitoring systems knew about the events within seconds, and they adjusted accordingly.
The tech world has changed dramatically in the past five years, and our beloved tools of choice haven't quite kept pace. Monitoring must be 100% automated, both in registering new instances and services and in de-registering them when they go away. Gone are the days when you could tolerate a 5 (or 15!) minute delay in learning that something went wrong; many of the top companies know within seconds that something isn't right.
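What does 100% automated registration look like in practice? Here's a rough sketch of the idea in Python. The monitoring endpoint, payload shape and function names are hypothetical stand-ins; in reality, you'd hook into whatever lifecycle events and registration API (or service-discovery mechanism) your platform and monitoring tool actually provide:

```python
# A rough sketch of lifecycle-driven (de)registration. The endpoint and
# payload below are hypothetical; substitute your monitoring platform's
# real API or service-discovery mechanism.
import requests  # pip install requests

MONITORING_API = "https://monitoring.example.com/api/hosts"  # hypothetical

def on_instance_launched(instance_id: str, address: str, role: str) -> None:
    """Called by the provisioning pipeline the moment an instance comes up."""
    requests.post(
        MONITORING_API,
        json={"id": instance_id, "address": address, "checks": [f"{role}-health"]},
        timeout=5,
    )

def on_instance_terminated(instance_id: str) -> None:
    """Called when the instance goes away, so it never pages anyone again."""
    requests.delete(f"{MONITORING_API}/{instance_id}", timeout=5)
```

The important part isn't this specific API; it's that no human is in the loop when hosts appear or disappear.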
Continuing to rely on methodologies and tools from the old days, no matter how much you enjoy them and know their travails, is holding you back from giant leaps forward in your monitoring.
The bad old days of trying to pick between three equally terrible monitoring tools are long over. You owe it to yourself and your company at least to consider modern tooling—whether it's SaaS or self-hosted solutions.
At the other end of the spectrum is an affinity for new-and-exciting tools. Companies like Netflix and Facebook publish some really cool stuff, sure. But that doesn't necessarily mean you should be using it.
Here's the problem: you are (probably) not Facebook, Netflix, Google or any of the other huge tech companies everyone looks up to. Cargo culting never made anything better.
Adopting someone else's tools or strategy because they're successful with them misses the crucial reasons why those tools and strategies work for them.
The tools don't make an organization successful. The organization is successful because of how its members think. Its approaches, beliefs, people and strategy led the organization to create those tools. Its success stems from something much deeper than, "We wrote our own monitoring platform."
To approach the same sort of success the industry titans are having, you have to go deeper. What do they know that you don't? What are they doing, thinking, saying and believing that you aren't?
Having been on the inside of many of those companies, I'll let you in on the secret: they're good at the fundamentals. Really good. Mind-blowingly good.
At first glance, this seems unrelated, but allow me to quote John Gall, famed systems theorist:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.
Dr. Gall quite astutely points out the futility of adopting other people's tools wholesale. Those tools evolved from simple systems to suit the needs of that organization and culture. Dropping such a complex system into another organization or culture may not yield favorable results, simply because you're attempting to shortcut the hard work of evolving a simple system.
So, you want the same success as the veritable titans of industry? The answer is straightforward: start simple. Improve over time. Be patient.
If there's one argument I wish would die, it's the one where people opine about wanting to "avoid vendor lock-in". That argument is utter hogwash.
What is "vendor lock-in", anyway? It's the notion that if you were to go all-in on a particular vendor's product, it would become prohibitively difficult or expensive to change. Keurig's K-cups are a famous example of vendor lock-in. They can be used only with a Keurig coffee machine, and a Keurig coffee machine accepts only the proprietary Keurig K-cups. By buying a Keurig, you're locked in to the Keurig ecosystem.
Thus, if I were worried about being locked in to the Keurig ecosystem, I'd just avoid buying a Keurig machine. Easy.
If I'm worried about vendor lock-in with, say, my server infrastructure, what do I do? Roll out both Dell and HP servers together? That seems like a really dumb idea. It makes my job way more difficult. I'd have to build to the lowest common denominator of each product and ignore any product-specific features, including the innovations that make a product appealing. This ostensibly would allow me to avoid being locked in to one vendor and keep any switching costs low, but it also means I've got a solution that only half works and is a nightmare to manage at any sort of scale. (Have you ever tried to build tools to manage and automate both iDRAC and IPMI? You really don't want to.)
In particular, you don't get to take advantage of a product's unique features. By trying to avoid vendor lock-in, you end up with a "solution" that ignores any advanced functionality.
When it comes to monitoring products, the lock-in argument holds even less water. Composability and interoperability are core tenets of most of the products available to you, and today's monitoring market favors open APIs. Yes, a single vendor may end up holding all of your data, but it's often trivial to move that same data to another vendor without a major loss of functionality.
One particular problem with this whole vendor lock-in argument is that it's often used as an excuse not to buy SaaS or commercial, proprietary applications. The perception is that by using only self-hosted, open-source products, you gain more freedom.
That assumption is wrong. You haven't gained more freedom or avoided vendor lock-in at all. You've traded one vendor for another.
By opting to do it all yourself (usually poorly), you effectively become your own vendor—a less experienced, more overworked vendor. The chances you would design, build, maintain and improve a monitoring platform better—on top of your regular duties—than a monitoring vendor? They round to zero. Is tool-building really the business you want to be in?
In addition, switching costs from in-house solutions are astronomically higher than from one commercial solution to another, because of the interoperability that commercial vendors have these days. Can the same be said of your in-house solution?
Many years ago, at one of my first jobs, I checked out a database server and noticed it had high CPU utilization. I figured I would let my boss know.
"Who complained about it?", my boss asked.
"Well, no one", I replied.
My boss's response has stuck with me ever since, because it taught me a valuable lesson: "If it's not impacting anyone, is there really a problem?"
The lesson is this: data without context isn't useful. In monitoring, a metric matters only in the context of your users. If low free memory is a condition you've noticed but it isn't impacting users, it's not worth firing an alert.
In all my years of operations and systems administration, I've not once seen an OS metric directly indicate active user impact. At best, such a metric is an indirect indicator of an issue.
Which brings me to the next point. With all of these metrics and logs from the infrastructure, why isn't your monitoring better off? The reason is that Ops can solve only half the problem. nginx worker counts, Tomcat garbage collection times and Redis key evictions are all important metrics for understanding infrastructure performance, but none of them helps you understand the software your business runs. The biggest value of monitoring comes from instrumenting the applications your users rely on. (Unless, of course, your business provides infrastructure as a service—then, by all means, carry on.)
Nowhere is this more clear than in a SaaS company, so let's consider that as an example.
Let's say you have an application that is a standard three-tier web app: nginx on the front end, Rails application servers and PostgreSQL on the back end. Every action on the site hits the PostgreSQL database.
You have all the standard data: access and error logs, nginx metrics, Rails logs, Postgres metrics. All of that is great.
You know what's even better? Knowing how long it takes for a user to log in. Or how many logins occur per minute. Or even better: how many login failures occur per minute.
The reason this information is so valuable is that it tells you about the user experience directly. If login failures rose during the past five minutes, you know you have a problem on your hands.
But you can't see this sort of information from the infrastructure perspective alone. If I paid attention only to nginx, Rails and Postgres performance, I would miss an incident like this entirely: say, a recent code deployment that changed some login-related code and quietly broke logins.
To solve this, become closer friends with your engineering team. Help them identify useful instrumentation points in the code and implement more metrics and logging. I'm a big fan of the statsd protocol for this sort of thing; most every monitoring vendor supports it (or their own implementation of it).
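As an illustration, here's a minimal sketch of what that instrumentation might look like using the statsd wire format over UDP. The metric names and the track_login() wrapper are made up for this example; most statsd client libraries give you equivalent incr() and timing() helpers out of the box:

```python
# A minimal sketch of statsd-style instrumentation around a login path.
# Metric names and track_login() are illustrative, not from any real app.
import socket
import time

STATSD_ADDR = ("127.0.0.1", 8125)  # statsd's default UDP port
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def _send(metric: str) -> None:
    # statsd wire format: "<name>:<value>|<type>", one metric per UDP datagram
    _sock.sendto(metric.encode("utf-8"), STATSD_ADDR)

def incr(name: str) -> None:
    _send(f"{name}:1|c")               # counter

def timing(name: str, millis: float) -> None:
    _send(f"{name}:{millis:.0f}|ms")   # timer

def track_login(do_login, username: str, password: str) -> bool:
    """Wrap the application's login function with the metrics that matter."""
    incr("app.login.attempts")
    start = time.monotonic()
    ok = do_login(username, password)
    timing("app.login.duration", (time.monotonic() - start) * 1000)
    incr("app.login.success" if ok else "app.login.failure")
    return ok
```

Graph app.login.failure per minute, and you have exactly the kind of user-facing signal described above—something no amount of nginx or Postgres metrics will give you.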
If you're the only one who cares about monitoring, system performance and useful metrics will never meaningfully improve. You can't do this alone. You can't even do this if only your team cares. I can't begin to count how many times I've seen Ops teams put in the effort to make improvements, only to realize no one outside the team paid attention or thought it mattered.
Improving monitoring requires company-wide buy-in. Everyone from the receptionist to the CEO has to believe in the value of what you're doing. Just as everyone in the company understands that the business needs to make a profit, everyone needs to understand that better monitoring improves the bottom line and protects that profit.
Ask yourself: why do you care about monitoring?
Is it because it helps you catch and resolve incidents faster? Why is that important to you?
Why should that be important to your manager? To your manager's manager? Why should the CEO care?
You need to answer those questions. When you do so, you can start making compelling business arguments for the investments required (including in the best new tools).
Need a starting point? Tie monitoring to outcomes the business already cares about: catching and resolving incidents faster, less downtime, happier customers and protected revenue.
I recommend having a candid conversation with your team about why they care about monitoring. Be sure to involve management as well. Once you've had those conversations, repeat them with your engineering team. And your product management team. And marketing. And sales. And customer support.
Monitoring impacts the entire company, often in different ways. Do this work, and by the time you find yourself in a conversation with executives to request an investment in monitoring, you'll be able to speak their language.
Go forth and fix your monitoring. I hope you found at least a few useful ideas here. Becoming world-class at this is a long, hard, expensive road, but the good news is that you don't need to be among the best to see massive benefits. A few straightforward changes, added over time, can radically improve your company's monitoring.
To recap: ditch the antiquated tools, resist the urge to copy the tech giants, stop fretting about vendor lock-in, monitor your applications and your users rather than just your infrastructure, and get the whole company bought in to why it all matters.
Good luck, and happy monitoring.
Mike Julian is the Editor of the Monitoring Weekly newsletter, author of O'Reilly's Practical Monitoring, and an independent monitoring consultant at AsterLabs.io. Before striking out as a consultant, he worked as an Ops Engineer for Taos Consulting, Peak Hosting, Oak Ridge National Laboratory and others. You can follow him on Twitter at @mike_julian.