Dawn: Why do we monitor what we monitor? Part 3 of 3 - The SilverStripe Layer
4 Aug 2010
This article was written from an interview with Sam Minnée.
(Continued from part two.)
The PHP/SilverStripe Level
When we were considering what information Dawn needed to gather from the SilverStripe layer, we decided upon two types. The first is the errors and warnings generated by PHP. Unlike metric data, this data describes specific points in time, so you can correlate particular errors with what the metrics were showing at the time they occurred. In addition to the error message, we also collect a full backtrace so you can see where the error came from.
The other piece of information we collect, which is SilverStripe-specific, is the page generation time. This is the length of time that PHP spends executing.
When you request a page, a number of things happen. The client talks to the web server, and the request takes a certain amount of time to get across the network. Apache then takes the request and figures out what to do with it (which is relatively quick), determines that it needs PHP, and tells mod_php to process the script. The SilverStripe bootstrap (our main PHP script) then kicks off. That is the point at which we start timing.
Then the relevant Sapphire code is loaded, your page is processed, the database connection is made, the templates are rendered, the page is sent to the user, and PHP execution comes to an end. That's when we stop timing. (Specifically, we use PHP's register_shutdown_function() to detect the end of execution, and we write to the log file at that point.)
We collect this data in milliseconds, and it is a measure of clock time rather than CPU time. That means that if lots of processes are running concurrently, the execution time may go up even though the page itself is doing no extra work.
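To make the distinction between clock time and CPU time concrete, here is a minimal sketch in Python (not Dawn's own code, which runs in PHP). The helper name timed_ms and the sleeping workload are hypothetical; the sleep simply stands in for time a request spends waiting on something other than its own computation.

```python
import time

def timed_ms(work):
    """Run work() and return (wall_ms, cpu_ms) for that call."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    work()
    wall_ms = (time.perf_counter() - wall_start) * 1000.0
    cpu_ms = (time.process_time() - cpu_start) * 1000.0
    return wall_ms, cpu_ms

def waiting():
    # Stands in for time spent waiting, e.g. on a database or a busy server.
    time.sleep(0.05)

wall_ms, cpu_ms = timed_ms(waiting)
print(f"wall: {wall_ms:.1f} ms, cpu: {cpu_ms:.1f} ms")
```

The wall-clock figure includes the time spent waiting, while the CPU figure stays close to zero. A clock-time measurement like the one described above therefore rises when the server is busy.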
We collect execution time for different URLs, but we group URLs by "family." If you have a forum installed, for example, there are likely to be many different URLs that are all part of the forum. Because they run the same piece of code and are likely to have the same performance characteristics, we collect all of those requests under the same URL family.
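As a rough illustration of grouping by family, the Python sketch below assumes that a family is simply the first path segment of the URL. That rule is an assumption made for the example; the interview does not describe how Dawn actually decides which family a URL belongs to. The function names and sample URLs are hypothetical as well.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def url_family(url):
    """Map a URL to a family: its first path segment (an assumed rule)."""
    path = urlsplit(url).path          # .path excludes any query string
    segments = [s for s in path.split("/") if s]
    return "/" + segments[0] if segments else "/"

def group_by_family(timings):
    """timings: iterable of (url, milliseconds). Returns {family: [ms, ...]}."""
    families = defaultdict(list)
    for url, ms in timings:
        families[url_family(url)].append(ms)
    return dict(families)

sample = [
    ("/forum/show/12", 180),
    ("/forum/show/12?page=2", 210),
    ("/forum/post/99", 150),
    ("/about-us/", 40),
    ("/", 25),
]
grouped = group_by_family(sample)
print(grouped)
```

Here the three forum requests end up under the single family "/forum", so their timings can be averaged together.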
Minimising the number of URL families we collect helps us with generating the roll-up data, which is averaged over five minutes, 20 minutes, an hour, a day, and so on. We also ignore differences in query parameters: all requests to the same URL with different query parameters are considered to be the same URL family. If we monitored every individual URL, we'd be looking at hundreds of thousands of URLs instead of tens of thousands, for example, and that would start putting undue load on the Dawn services.
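The roll-up idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of Dawn's internals: the roll_up function, the sample data, and the choice to average raw samples directly into fixed-width time buckets are all hypothetical.

```python
from collections import defaultdict

def roll_up(samples, bucket_seconds):
    """Average timings per (time bucket, URL family).

    samples: iterable of (unix_timestamp, family, milliseconds).
    Returns {(bucket_start, family): average_ms}.
    """
    sums = defaultdict(lambda: [0.0, 0])   # key -> [total_ms, count]
    for ts, family, ms in samples:
        bucket_start = int(ts) - int(ts) % bucket_seconds
        entry = sums[(bucket_start, family)]
        entry[0] += ms
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

samples = [
    (0, "/forum", 100),
    (120, "/forum", 200),
    (299, "/", 30),
    (300, "/forum", 400),
    (1199, "/forum", 100),
]
five_min = roll_up(samples, 300)       # five-minute buckets
twenty_min = roll_up(samples, 1200)    # 20-minute buckets
print(five_min)
print(twenty_min)
```

Each bucket holds one average per family, so the amount of roll-up data grows with the number of families rather than the number of distinct URLs. That is why keeping the family count down matters.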
Internally at SilverStripe, we're using Dawn to help us monitor various sites we've built. But we're really keen to hear what our customers think: how Dawn can be improved, and what other ways they can see Dawn being used. We want to hear what you think.