This article was written from an interview with Sam Minnée

Overview

One of our main goals with Dawn was flexibility. We don't expect version 1.0 to be exactly right for everyone who wants to use it, so we've built a framework that can grow as we learn more about what people need.

The current Dawn product is based on our experiences and the experiences of our partners and clients. We chose what to monitor based on how we have all resolved issues over the past few years of building, deploying, and maintaining SilverStripe sites. What have we used on an ad-hoc basis to try to resolve issues? Where have we gone, what have we looked at, and where have we discovered issues? We added the best indicators to the Dawn monitoring system so that all of that information is easy to see in one place.

Over time, the experiences of other Dawn customers will also shape how our monitoring works. For now, simply collecting all the data in one place helps immensely. A single number can tell you that something is wrong without telling you what. Instead, we collect all the pieces of data you need, so you can view the metrics side by side and work out what's going on from the relationships among them.

The Operating System Level

At the OS level, we monitor load average, CPU usage, and memory. Load average and CPU usage are closely related, although they tend to surface problems in different ways. You can have quite a high load average even when CPU usage is not very high. That often happens when something other than the CPU is holding up processes. Quite frequently the cause is the disk: processes that are waiting for the disk run slowly and queue up, even though CPU usage isn't high.

Seeing load average and CPU data side by side gives you that information. If your load average is high but your CPU usage isn't, you know it's not a CPU-bound issue. It's probably related to memory or disk.
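
As a rough illustration of that reasoning, here is a minimal Python sketch. It is not Dawn's implementation: the function name, the labels, and both thresholds are assumptions chosen for the example, and the load average is divided by the number of CPU cores so that the same rule applies to servers of different sizes.

```python
import os


def classify_sample(load_avg_1m, cpu_count, cpu_percent,
                    max_load_per_cpu=1.0, cpu_busy_percent=80.0):
    """Classify one reading of load average and CPU usage.

    max_load_per_cpu and cpu_busy_percent are illustrative thresholds
    (assumptions for this sketch), not values taken from Dawn.
    """
    # A load of 4 on a 4-core machine means roughly one runnable
    # process per core, so compare load per core, not raw load.
    load_per_cpu = load_avg_1m / cpu_count
    high_load = load_per_cpu > max_load_per_cpu
    high_cpu = cpu_percent >= cpu_busy_percent

    if high_load and high_cpu:
        return "cpu-bound"
    if high_load:
        # Processes are queueing but the CPU is not busy:
        # look at disk or memory instead.
        return "not-cpu-bound"
    if high_cpu:
        return "busy-but-keeping-up"
    return "healthy"


def current_load_per_cpu():
    """Read the 1-minute load average per core (Unix-like systems only)."""
    load_1m, _, _ = os.getloadavg()
    return load_1m / (os.cpu_count() or 1)
```

For example, `classify_sample(8.0, 4, 20.0)` describes a 4-core server with a load average of 8 and only 20% CPU usage, which this sketch labels `"not-cpu-bound"`.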

We also monitor free memory to determine if there are any issues. Running out of memory is a common pitfall, particularly on a webserver under heavy load. Unless the server is carefully configured, traffic beyond what it can handle makes it run out of memory, and then things start to go wrong.
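
A free-memory check of this kind could be sketched as follows. This assumes a Linux server, where the kernel reports memory figures in `/proc/meminfo`; the function names and the 10% threshold are hypothetical, picked only for illustration, and do not describe how Dawn collects or evaluates the metric.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Return the share of total memory that is still available."""
    # MemAvailable is the kernel's own estimate (Linux 3.14 and later);
    # fall back to MemFree on older kernels.
    available = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    return available / meminfo["MemTotal"]


def is_low_on_memory(meminfo, min_fraction=0.10):
    """Flag a reading when less than min_fraction of memory is available.

    The 10% default is an assumption for this sketch.
    """
    return available_fraction(meminfo) < min_fraction


# Hypothetical reading, in the format /proc/meminfo uses.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          200000 kB
MemAvailable:     400000 kB
"""
```

On a real server the text would come from reading `/proc/meminfo` rather than from `SAMPLE`. With the sample above, 400000 kB of 8000000 kB is available, which is 5%, so `is_low_on_memory(parse_meminfo(SAMPLE))` returns `True`.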

(To be continued in Part 2.)