The first thing administrators want to monitor is their resources: file servers, database servers, app servers, etc. That’s understandable, since they’re system administrators. In order to be really successful with monitoring, though, we need to think like customer administrators – for both our internal and external customers. Forget the hardware, and think about the people. How do our customers interact with our systems?
Let’s say we’re administering an ecommerce site. As Customer Administrators, our basic customers are:
- Outside money-waving customers who place orders
- The shipping department, which sends out the orders
- The inventory people who replenish our stock
How does each of these groups interact with our system? The outside customers are the easiest to diagnose: they go to our site, browse our items, and hopefully place orders. We want to set up monitoring on each of those activities: using our server monitoring software, we need to make sure that our site works, that we have items in our store, and that people are placing orders.
Network Admin crying out in pain: “That’s not my responsibility! Why should I monitor whether people are placing orders or not?”
Well, network admins should care about the number of orders placed because it’s a critical customer statistic that will get the IT department into a whole lot of trouble if something goes wrong. You can monitor everything in the shop, but if somebody makes a programming error and the money stops coming in, everybody needs to know as soon as possible – and that’s where server uptime monitoring can step in and make you a hero.
Even better, when monitoring is done right, the IT department can catch when people place an unusually high number of orders – like if a pricing error discounts 19″ flat panel monitors to $5.99 instead of $599. Is it the network department’s job? Nope. But if you can catch it before anybody else, and if you can get the right information to the right people to fix the problem, then it’s easy to justify the IT department as a vital part of doing business online.
So our order monitoring needs to have at least three parts:
- Alert when people haven’t placed any orders in 1 hour
- Alert when we’ve gotten more than X orders this hour (may be 100 or 1,000 or 10,000 depending on the size of the organization)
- Alert when one item’s sales have been more this hour than the last 72 hours combined (or appropriate timing rules)
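Here’s a rough sketch of how those three checks might look in Python. Everything in it is an assumption for illustration: the `check_orders` name, the `(placed_at, item_id, quantity)` tuple layout, and the 1,000-order ceiling are made up, and in real life the orders would come out of your order database rather than a list.

```python
from collections import Counter
from datetime import datetime, timedelta

MAX_ORDERS_PER_HOUR = 1000  # hypothetical "X"; tune it to your order volume


def check_orders(orders, now, max_per_hour=MAX_ORDERS_PER_HOUR):
    """Return alert messages for a list of (placed_at, item_id, quantity) tuples."""
    hour_ago = now - timedelta(hours=1)
    baseline_start = hour_ago - timedelta(hours=72)  # the 72 hours before this hour

    recent = Counter()    # units sold per item in the last hour
    baseline = Counter()  # units sold per item in the prior 72 hours
    orders_this_hour = 0
    for placed_at, item_id, quantity in orders:
        if hour_ago < placed_at <= now:
            orders_this_hour += 1
            recent[item_id] += quantity
        elif baseline_start < placed_at <= hour_ago:
            baseline[item_id] += quantity

    alerts = []
    # Rule 1: nobody has placed an order in the last hour.
    if orders_this_hour == 0:
        alerts.append("No orders placed in the last hour")
    # Rule 2: more orders than the hourly ceiling.
    if orders_this_hour > max_per_hour:
        alerts.append(f"{orders_this_hour} orders this hour (limit {max_per_hour})")
    # Rule 3: an item sold more this hour than in the prior 72 hours combined.
    for item_id, sold in sorted(recent.items()):
        if sold > baseline[item_id]:
            alerts.append(
                f"Item {item_id} sold {sold} this hour "
                f"vs {baseline[item_id]} in the prior 72 hours"
            )
    return alerts
```

Note that the third rule, as written, fires on the first sale of any brand-new item, since its 72-hour baseline is zero. That’s exactly the kind of thing you’d tune once you’ve watched the alerts for a while.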
Start with a small set of monitoring triggers, like the three above, and then hone your alerts as you gain experience with your statistics. Get as curious about order statistics as you would about database server memory use: figure out new ways to analyze them. For example, you may want to alert when any one customer orders more than 3 of an item priced over $200. Even if it doesn’t indicate a problem, you can still suddenly find yourself in the know about your company’s business pulse. Imagine emailing one of your sales crew and saying, “Hey, a guy in Mississippi just ordered fifty flat panels. Maybe he’s setting up a temporary office for people displaced from the hurricane. You wanna contact him and see what else he’s shopping for?”
The more you know about the way your customers interact with your IT systems, the more you can help other departments – and the other departments will remember you as a valuable, in-the-know guy.
Network Admin with a frown: “Yeah, but that’s an easy example. That just relates to people with online stores. We don’t sell stuff online.”
And neither do I, but I can’t give you my exact examples here. I’d have to kill you. It really does work with any company.
Let’s say you run a fantasy sports site. Your customers:
- Start new teams
- Make trades
- Play games against each other
Then set up an alert to tell you when no new teams have been started in X hours. Granted, you’re going to want to put this check into maintenance mode in the offseason, but we can write that into our query.
Set up another alert when no trades have been made in X hours. This would indicate a problem with your trading system.
Set up another alert when any games have a combined score of zero. Depending on the sport, we’d need timing rules in here – for example, in football, we would only want to alert on Sundays after 5pm. Zero scores would indicate a problem with your scoring system.
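The timing rule is the interesting part of that last alert, so here’s a minimal sketch of it in Python. The function names and the `(game_id, home_score, away_score)` tuple layout are hypothetical; the Sunday-after-5pm window is the football example from above.

```python
from datetime import datetime

SUNDAY = 6  # datetime.weekday() counts Monday as 0 and Sunday as 6


def in_alert_window(now):
    """Football timing rule: only alert on Sundays from 5pm onward."""
    return now.weekday() == SUNDAY and now.hour >= 17


def zero_score_games(games, now):
    """Return ids of games with a combined score of zero, or [] outside the window.

    `games` is a list of (game_id, home_score, away_score) tuples.
    """
    if not in_alert_window(now):
        return []
    return [game_id for game_id, home, away in games if home + away == 0]
```

Keeping the schedule check in its own function means each sport can get its own window without touching the scoring check.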
This kind of analysis helps alert us when the system is running but having specific problems. Likewise, when monitoring a mail server, don’t just monitor to see that it’s up, that it’s accepting connections on the SMTP port, and that it’s accepting connections on the POP3 port. Instead, look at how your customers interact with the mail server: they send and receive emails. Therefore, monitor the number of incoming and outgoing emails per hour, and send alerts when either sinks to zero. (Of course, we’ll need business logic on the outgoing emails – we may not want to alert on that outside of business hours.)
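As a sketch of that mail check, assuming you can already pull hourly message counts from your mail server’s logs: the function names are made up, and the Monday-to-Friday, 8am-to-6pm schedule is just a placeholder for whatever your business hours really are.

```python
from datetime import datetime


def is_business_hours(now):
    """Assumed schedule: Monday to Friday, 8am to 6pm."""
    return now.weekday() < 5 and 8 <= now.hour < 18


def mail_flow_alerts(incoming_last_hour, outgoing_last_hour, now):
    """Return alerts when the hourly mail counts sink to zero."""
    alerts = []
    # Incoming mail is checked around the clock.
    if incoming_last_hour == 0:
        alerts.append("No incoming mail in the last hour")
    # A quiet outbox is normal overnight, so only check it during business hours.
    if outgoing_last_hour == 0 and is_business_hours(now):
        alerts.append("No outgoing mail in the last hour")
    return alerts
```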
Network Admin starting to see the light: “Yeah, but my users will still call me first. I’ve got these really intense guys who live by their email, and they call me the instant anything’s wrong.”
Users won’t notice when they get no incoming emails from outside the building for an hour. Some days are just quiet. It’s the Customer Administrator’s job to make sure that the interactions are going properly, and only the company-wide Customer Administrator will know that NOBODY got incoming emails from outside the building for an hour, therefore indicating a system-wide problem.
Monitoring is never done. People who run uptime monitoring systems have to continually refine their alerting methods to strike a balance: we don’t want false alarms when something isn’t really down, and we don’t want the system ignoring an actual outage. Let’s say we set up our alerting so that it’s a little paranoid, and each alert gives us just 1 false alarm per month. Once we’ve got 30 alerts set up, we’re going to get a false alarm about every day. That gets a little annoying, and worse, false alarms train the IT staff to assume that every incoming alert is a false alarm. We can’t have that, and we’ll talk about that more in later posts.
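The arithmetic behind that is simple enough to keep in a helper, so you can see what adding more alerts will cost you. This is only a back-of-the-envelope sketch: the function name is made up, and it assumes every alert has the same false alarm rate and that a month is 30 days.

```python
def false_alarms_per_day(num_alerts, per_alert_per_month=1.0, days_per_month=30):
    """Expected false alarms per day across all alerts combined."""
    return num_alerts * per_alert_per_month / days_per_month
```

With 30 alerts at one false alarm each per month, `false_alarms_per_day(30)` works out to 1.0: one false alarm a day.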
Back to our e-commerce site example: we still have our inventory and shipping departments. Remember, Customer Administrators need to handle all of their customers – internal customers are just as important as external customers. The shipping department is easy to analyze: they interact with our system by shipping out every order that gets placed. Our alerting needs to detect when unusual things are happening in the shipping department, so we want to:
- Alert when we’re backlogged by more than 1 day (with some business logic for weekends)
- Alert during business hours when no packages have been shipped in X hours, but some are waiting
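Those two shipping alerts might look something like the sketch below. All the names are hypothetical, the business hours are an assumed Monday-to-Friday 8am-to-6pm, and the “X hours” defaults to 4 purely as a placeholder. The weekend logic here is the simplest version I could think of: Saturdays and Sundays don’t count toward the age of the backlog.

```python
from datetime import datetime, time, timedelta


def is_business_hours(now):
    """Assumed schedule: Monday to Friday, 8am to 6pm."""
    return now.weekday() < 5 and 8 <= now.hour < 18


def weekday_age(start, now):
    """Time elapsed between start and now, not counting Saturdays and Sundays."""
    age = timedelta()
    cursor = start
    while cursor < now:
        # Walk forward one calendar day at a time, stopping at `now`.
        next_midnight = datetime.combine(cursor.date() + timedelta(days=1), time())
        step_end = min(next_midnight, now)
        if cursor.weekday() < 5:
            age += step_end - cursor
        cursor = step_end
    return age


def shipping_alerts(oldest_waiting_since, last_shipped_at, now, quiet_hours=4):
    """Return shipping alerts.

    oldest_waiting_since: when the oldest unshipped order was placed,
    or None if nothing is waiting to ship.
    last_shipped_at: when the most recent package went out.
    """
    alerts = []
    if oldest_waiting_since is None:
        return alerts  # nothing waiting, so nothing to alert on
    # Rule 1: the backlog is more than 1 day old, ignoring weekends.
    if weekday_age(oldest_waiting_since, now) > timedelta(days=1):
        alerts.append("Shipping backlog is more than 1 day old")
    # Rule 2: business hours, orders waiting, and nothing shipped in a while.
    if is_business_hours(now) and now - last_shipped_at > timedelta(hours=quiet_hours):
        alerts.append(f"No packages shipped in {quiet_hours} hours, but orders are waiting")
    return alerts
```

One known rough edge: the quiet period is measured in plain clock time, so the second rule can fire first thing in the morning if orders came in overnight. Measuring it in business hours instead would be a sensible refinement.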
That second alert, the one for stalled shipments, may seem like overkill: after all, the shipping department will probably alert IT directly if they have a problem with the shipping systems and they can’t get packages out. Probably – but maybe not, and that’s what uptime monitoring is all about. After all, most of the things you monitor should almost always be up.
That’s the point of uptime monitoring: you don’t want to catch the 999 times out of 1,000 when the system is working. You want to catch the 1 time in 1,000 when the system doesn’t work. The failure may be in IT, or it may be operational, like customers ordering huge quantities of mispriced items. The more you get to know your customers, the more effective your monitoring can be, and the more valuable the IT department is to the organization as a whole.
Network Admin: “Ah, so I can get a raise?”
Now you’re thinkin’.