Introducing ‘Design Patterns for Monitoring’ - Monitoring Email
By Vladimir Vuksan ~ December 4th, 2009. Filed under: Design Patterns for Monitoring, Monitoring Email.
In a recent MonitoringForge advisory board e-mail thread, Tara Spalding initiated a discussion on ways to spark content contributions to MonitoringForge, specifically Wiki documentation. She was at LISA ‘09 and discussed with David Nalley of the Fedora project what the most valuable initial contributions would be. To cut a long story short, opinion coalesced around the idea of organizing the Wiki as a set of monitoring design recipes for individual services, e.g. email monitoring design, web monitoring design, etc. These design documents would cover both types of monitoring:
- Service availability/validation, i.e. whether the e-mail service is up or down, etc.
- Performance monitoring/trending
Within those individual areas we would start with simple approaches and work our way up. For example, most people will want to know how to monitor whether their e-mail server is up (usually easy to do) but may not need or want to monitor SMTP authentication. David Nalley did a great job of outlining a sample mail monitoring design recipe as follows:
*Mail
**SMTP
***Availability Monitoring
****Port 25 open
****Port 25 providing expected response
****MTA accepting mail
***Performance Monitoring
****Messages Received/time unit
****Messages Sent/time unit
****4xx errors/time unit
****5xx errors/time unit
****Length of time from successful send till message received by MDA
**IMAP
***Availability Monitoring
****blah blah
***Performance Monitoring
**** blah blah
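As a rough illustration of the first two SMTP availability items above (port open, port providing the expected response), here is a minimal Python sketch that is not tied to any monitoring tool. The function names `check_smtp` and `is_expected_banner` are made up for this example, and it assumes the server greets new connections with a `220` reply, as SMTP servers normally do. The third item, "MTA accepting mail", is left out because it needs a real test mailbox to deliver to.

```python
import socket


def is_expected_banner(banner):
    """Return True if the greeting looks like a normal SMTP '220' reply."""
    return banner.startswith("220")


def check_smtp(host, port=25, timeout=5.0):
    """Check that an SMTP port is open and sends the expected greeting.

    Returns a dict with 'port_open', 'banner_ok' and the raw 'banner'.
    """
    try:
        # Availability check 1: can we open a TCP connection at all?
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return {"port_open": False, "banner_ok": False, "banner": ""}

    with sock:
        try:
            # Availability check 2: read the greeting the server sends first.
            banner = sock.recv(1024).decode("ascii", errors="replace").strip()
        except OSError:
            # Connected, but no greeting arrived within the timeout.
            banner = ""

    return {
        "port_open": True,
        "banner_ok": is_expected_banner(banner),
        "banner": banner,
    }
```

Calling `check_smtp("mail.example.com")` (a placeholder hostname) returns the two results separately, so a monitoring tool can alert differently on "port closed" and "port open but wrong response".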
Where things get tricky is that there are a number of monitoring tools, so the question is how we deal with describing individual implementations. Do we first describe the concept in the abstract and then address each monitoring tool's implementation, or is there a better way?
We would like to get consensus on whether the approach outlined is a good one, and we are seeking input on how to make this process more effective and more useful.

December 8th, 2009 at 10:12 pm
Under SMTP you have “Length of time from successful send till message received by MDA” - what about expanding that a bit?
How about domains with deferred delivery? Bounce rates? I’m more interested in a status table of who is accepting our email and from which of my outbound mailers. For businesses that send a lot of email (and no, I’m not talking about spammers) this can be a big issue. How difficult would this be to implement?
We haven’t even touched on the subject of content-based filtering. And ultimately, if we monitor everything under the sun, we reach the law of diminishing returns, where adding more metrics to be monitored not only slows down the monitoring tool, but also the MTA itself. I would assume that everyone has a different threshold for what is “good enough” when it comes to what metrics you’re monitoring for email, and I’d like to get into a discussion of that.
December 10th, 2009 at 3:07 am
There are numerous things that can be monitored depending on your needs. You may be monitoring tons of things which don’t provide you with any actionable data on a day-to-day basis but may be crucial in root-cause analysis. I guess you can say there are vital stats and not-so-vital stats.
As far as the law of diminishing returns goes, you should be careful what kind of tests you employ. If the tests are stressing the service excessively, they should either be thrown out or the collection period should be changed, i.e. poll every 5 minutes instead of every minute. Also, your threshold of metrics may change over time as you learn more and more about the service you are monitoring.
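One practical note on changing the collection period: if the "per time unit" metrics are derived from cumulative counters, the rate can be normalized so that it stays comparable whether you poll every minute or every 5 minutes. The sketch below is only an illustration. It assumes the MTA (or its logs) exposes an ever-increasing message counter, and `rate_per_minute` is a hypothetical helper, not part of any particular monitoring tool.

```python
def rate_per_minute(prev_count, prev_ts, curr_count, curr_ts):
    """Turn two cumulative counter samples into a per-minute rate.

    prev_ts and curr_ts are sample times in seconds; the result does not
    depend on how far apart the two samples were taken.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    delta = curr_count - prev_count
    if delta < 0:
        # Assumption: a smaller value means the counter was reset
        # (e.g. the MTA restarted), so count from zero again.
        delta = curr_count

    return delta * 60.0 / elapsed
```

With this, 60 messages in a 1-minute window and 300 messages in a 5-minute window both come out as 60 messages per minute, so trend graphs remain comparable after the polling interval changes.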