Automated DBA monitoring for SQL Server: a primer, part 3

  sqlserver

This is the third in a series of posts on the “big picture” of automated DBA monitoring. Part 1 started by describing general considerations for monitoring. Part 2 added alerting.

With some of the broader ideas out of the way, if there’s no monitoring in place and I was the “accidental”/”default” DBA, there are some metrics I’d definitely want to monitor. However, there’s a ton of real-time metrics I haven’t included below like CPU & memory use, number of current connections, disk I/O as at right now. They’re probably not good candidates for “roll-your-own”-type monitoring.

I consider “alerts” in the table below to include an e-mail, SMS or dashboard item that aims to trigger a response to fix. The collection frequency especially is just a guide - closer to real-time might be desirable, but the trade-off is impact on servers and connected apps.

Metric Description Alert? Collection frequency Review daily, weekly, monthly
Data & log file sizes Size of database data & log files   Daily When needed e.g. capacity planning
Server disk free space Free space on drives - although could be covered by other monitoring tools, may be beneficial to monitor for visibility to DBA Less than 5GB free Daily  
Backups Size of backups, duration of backups (useful if standardising backus across fleet e.g. Ola Hallengren) Failed backups, backups created in business hours Daily  
New databases created New databases created, need to be added to inventory (non-production new databases can be helpful indicator of what’s coming in production)   Daily Weekly
Number of databases per server From inventory, helpful for capacity planning   Daily Monthly or less often
Server properties For example: CPU, memory, authentication mode setting, login auditing setting, SP or CU level, Windows Firewall etc.   Weekly, maybe just monitor for changes compared to last week  
SQL Agent job history Number of runs, duration for each run, runs that exceed average duration   Weekly Review exceptions weekly
Errors in SQL Server error log SQL Server error log is fairly verbose, with everything from status messages to critical errors. Over time I’ll develop filters to screen out unimportant messages - I suggest de-duplicating same messages in the same minute, and limiting to 10,000 (or some other sensible number) of messages per collection   Weekly Review errors weekly
Windows error log Critical and error messages, helpful to know what’s happening on the server outside SQL Server   Weekly Review errors weekly
SQL Server Windows services Status (running, stopped), autostart setting, whether enabled; useful for checking service restart successfully after overnight patching Service stopped Daily  
SQL Server service restarts This one’s pretty self-explanatory :-) Notification when SQL Server service restarts    
Index fragmentation Might be best just to schedule Ola Hallengren’s IndexOptimize weekly      
Unused indexes, missing indexes We’re getting down past the server and database level, to table and view level - I’ll need to know a bit more about the database and how it works to best work with indexes, as far as I know there’s no “one size fits all” approach      
Blocking processes Processes running for longer than x minutes, that are blocking more than a threshold of y other processes Maybe, depending on the server and database Could collect and report using sp_WhoIsActive http://whoisactive.com  

Ideally, monitoring is linked to an inventory of servers and databases; that’s a topic for another blog post, though.