Dawn: Why do we monitor what we monitor? Part 2 of 3 - Apache and MySQL
21 Jul 2010
This article was written from an interview with Sam Minnée
(Continued from part one.)
The Webserver Level
Apache monitoring in Dawn is done using an Apache module called mod_status. (Apache administrators will probably be familiar with it.) It reports on the current Apache processes. We take two metrics from it: the number of working processes and the number of idle processes.
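As a rough illustration of how those two numbers can be extracted, the sketch below parses the machine-readable text that mod_status serves when "?auto" is appended to the status URL on Apache 2.x. The function name and the sample text are made up for this example; this is not Dawn's actual collector.

```python
def parse_worker_counts(status_text):
    """Return (working, idle) from mod_status '?auto' output.

    Either value is None if its line is missing from the text.
    """
    counts = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in ("BusyWorkers", "IdleWorkers"):
            counts[key] = int(value.strip())
    return counts.get("BusyWorkers"), counts.get("IdleWorkers")


# Hypothetical sample of the kind of text mod_status returns.
sample = """Total Accesses: 1234
BusyWorkers: 7
IdleWorkers: 10
Scoreboard: WWWWWWW__________....
"""

working, idle = parse_worker_counts(sample)
print("working:", working, "idle:", idle)
```

In a real setup the text would be fetched from the server's status URL rather than from a string, and access to that URL would normally be restricted to the monitoring host.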
As you might expect, too many working processes can be a problem, but so can too few idle processes.
Apache spawns processes as needed so that a certain number are free at all times. Typically, an Apache server will try to keep about 10 idle processes running and ready to serve requests.
If Apache can't spawn new idle processes, the number of idle processes will start going down. So if you're down to, say, three or so, it's a fairly clear indication that something is amiss.
If you have zero idle processes, it means your Apache server has no capability to serve another request. So we have a low value warning threshold for the idle processes and a high value warning threshold for the working processes. And this can be configured on a site-by-site basis.
The threshold for working Apache processes tends to be much more host-specific. It really depends on the amount of memory, the number of CPUs, and so on. The rough guide we currently use is that we start warning if you've got more than eight processes per core, and consider it critical at 16 processes per core.
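The two kinds of threshold can be sketched as a pair of small checks. The per-core figures of 8 and 16 come from the rough guide above. The idle cut-offs (warn at three or fewer, critical at zero) are assumptions chosen for illustration, as are the function names.

```python
def classify_working(working, cores, warn_per_core=8, crit_per_core=16):
    """High-value check: more than 8 per core warns, 16 per core is critical."""
    if working >= crit_per_core * cores:
        return "critical"
    if working > warn_per_core * cores:
        return "warning"
    return "ok"


def classify_idle(idle, warn_at_or_below=3):
    """Low-value check. The cut-offs here are assumed, not Dawn's defaults."""
    if idle == 0:
        return "critical"  # no spare capacity for another request
    if idle <= warn_at_or_below:
        return "warning"
    return "ok"


# A hypothetical four-core host with 40 working and 2 idle processes.
print(classify_working(40, cores=4))
print(classify_idle(2))
```

Since the thresholds are configurable per site, passing them in as parameters rather than hard-coding them keeps the same checks usable across hosts of different sizes.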
In practice, too many working processes and too few idle processes typically go together, but we monitor both because each can point to a different cause. If there's a reason other than load that prevents enough idle processes from spawning - say an Apache misconfiguration - you'll have a low idle process count even while having a low working process count.
On the other hand, a large number of working processes affects both memory use and CPU utilisation, so the count is a good measure of overall load. When you see overloading, it will almost always include a large number of working Apache processes.
Because Dawn monitors multiple levels at once, you can see combinations of factors, which leads to a quicker diagnosis. For example, if a disk is failing and writes to it are taking a long time, you'll have a lot of Apache processes waiting and taking a long time to run - which means that Apache has to start many more processes. In this case, you can rule out CPU issues by looking at the CPU utilisation metric.
Similarly, if you're connecting to a third party service on every page request and it's taking a few seconds to respond, all of a sudden the number of Apache processes you have running is going to go up, even though the amount of CPU usage is fine. Each of these Apache processes, by the way, is going to use an amount of memory, and you might start to hit memory limits, which means you start swapping to disk, which will compound the problem. If you can look and say: "There are a lot of Apache processes running" before it eats up all of your memory, it gives you the chance to address the issue more quickly than you otherwise might.
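The reasoning in the last two paragraphs, reading the Apache process count alongside CPU utilisation, can be written down as a small rule of thumb. This is only a sketch: the function name, the returned labels and the 80% CPU cut-off are all assumptions made for the example.

```python
def interpret_load(working, cores, cpu_utilisation,
                   warn_per_core=8, cpu_busy=0.80):
    """Combine the worker count with CPU utilisation (0.0 to 1.0).

    cpu_busy is an assumed cut-off for "the CPU is the bottleneck".
    """
    many_workers = working > warn_per_core * cores
    cpu_is_busy = cpu_utilisation >= cpu_busy

    if many_workers and not cpu_is_busy:
        # Processes are piling up without using the CPU, so they are
        # probably waiting: a slow disk or a slow external service.
        return "workers waiting on I/O or an external service"
    if many_workers and cpu_is_busy:
        return "server under heavy load"
    return "no pile-up of workers"


# Hypothetical reading: 50 workers on 4 cores while the CPU is 20% busy.
print(interpret_load(50, cores=4, cpu_utilisation=0.20))
```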
Another way to look at the Apache processes is to look at the average. If the average is too high, it could indicate a systemic problem with your code. For example, this can happen if you call a third party service that takes too long to respond on every page request.
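One simple way to watch the average rather than the instantaneous count is a rolling mean over the most recent samples. The class below is a minimal sketch; the window size and the sample values are arbitrary, and nothing here is taken from Dawn itself.

```python
from collections import deque


class RollingAverage:
    """Mean of the most recent `window` samples."""

    def __init__(self, window):
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, value):
        """Record a sample and return the current average."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


# Hypothetical working-process counts taken at regular intervals.
average = RollingAverage(window=3)
for count in [3, 6, 9, 12]:
    print(count, average.add(count))
```

A short spike barely moves the average, whereas a count that stays high across the whole window does, which is the systemic case described above.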
Then you've got the conditions where you have excess load. There you're looking not so much at "is this higher than it should be?" but rather "are we running out of server resources? Is our server about to break down under the load, and if so, what should we do about that?"
The Database Level
Currently, at the database level, Dawn collects data from MySQL, and also looks at the number of MySQL processes.
A time-consuming Apache request usually, but not always, makes a connection to the database, so there's going to be a fairly close relationship between the number of Apache processes and the number of database processes. In unusual situations, however, they may differ, and it's exactly those unusual situations that we're looking to address.
For instance, if you had a high number of Apache working processes but a low number of database processes, that would indicate that Apache is spending most of its time serving static requests. These are unlikely to cause nearly as much load as dynamic requests, so if that's causing problems, it would be unusual and would therefore warrant investigation.
If you have many more database processes than Apache workers, that would mean other systems are using the database, likely backup systems or offline processes. If this is increasing load, you may have to ask yourself whether these are being run at the right time, or whether there's a way to run them when there is less load.
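The comparison between the two counts can be sketched as follows. The factor of two used to decide that the counts have diverged is an assumption for illustration, as are the function name and the returned labels.

```python
def compare_apache_and_db(apache_working, db_processes, factor=2.0):
    """Flag cases where the two counts diverge by more than `factor`.

    max(..., 1) avoids treating a zero count as an infinite ratio.
    """
    if apache_working >= factor * max(db_processes, 1):
        return "mostly static requests"
    if db_processes >= factor * max(apache_working, 1):
        return "other database clients (backups, offline jobs)"
    return "counts in step"


# Hypothetical readings taken at the same moment.
print(compare_apache_and_db(apache_working=40, db_processes=5))
print(compare_apache_and_db(apache_working=5, db_processes=30))
print(compare_apache_and_db(apache_working=12, db_processes=10))
```

Neither divergent result is a fault in itself. It only matters when it coincides with load problems, which is why it is read alongside the other metrics.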
Either way, having all the data in one place builds a richer picture, and that's the common thread between all these things. Though there's a certain amount of redundancy in the different metrics we collect, by collecting them all and comparing them we have a richer picture of what our webserver is doing.
(Concluded in Part 3.)