This is the third in a series of posts on the “big picture” of automated DBA monitoring. Part 1 started by describing general considerations for monitoring. Part 2 added alerting.
With some of the broader ideas out of the way, if there’s no monitoring in place and I was the “accidental”/”default” DBA, there are some metrics I’d definitely want to monitor. However, there’s a ton of real-time metrics I haven’t included below like CPU & memory use, number of current connections, disk I/O as at right now. They’re probably not good candidates for “roll-your-own”-type monitoring.
I consider “alerts” in the table below to include an e-mail, SMS or dashboard item that aims to trigger a response to fix. The collection frequency especially is just a guide - closer to real-time might be desirable, but the trade-off is impact on servers and connected apps.
Metric | Description | Alert? | Collection frequency | Review daily, weekly, monthly |
---|---|---|---|---|
Data & log file sizes | Size of database data & log files | Daily | When needed e.g. capacity planning | |
Server disk free space | Free space on drives - although could be covered by other monitoring tools, may be beneficial to monitor for visibility to DBA | Less than 5GB free | Daily | |
Backups | Size of backups, duration of backups (useful if standardising backus across fleet e.g. Ola Hallengren) | Failed backups, backups created in business hours | Daily | |
New databases created | New databases created, need to be added to inventory (non-production new databases can be helpful indicator of what’s coming in production) | Daily | Weekly | |
Number of databases per server | From inventory, helpful for capacity planning | Daily | Monthly or less often | |
Server properties | For example: CPU, memory, authentication mode setting, login auditing setting, SP or CU level, Windows Firewall etc. | Weekly, maybe just monitor for changes compared to last week | ||
SQL Agent job history | Number of runs, duration for each run, runs that exceed average duration | Weekly | Review exceptions weekly | |
Errors in SQL Server error log | SQL Server error log is fairly verbose, with everything from status messages to critical errors. Over time I’ll develop filters to screen out unimportant messages - I suggest de-duplicating same messages in the same minute, and limiting to 10,000 (or some other sensible number) of messages per collection | Weekly | Review errors weekly | |
Windows error log | Critical and error messages, helpful to know what’s happening on the server outside SQL Server | Weekly | Review errors weekly | |
SQL Server Windows services | Status (running, stopped), autostart setting, whether enabled; useful for checking service restart successfully after overnight patching | Service stopped | Daily | |
SQL Server service restarts | This one’s pretty self-explanatory :-) | Notification when SQL Server service restarts | ||
Index fragmentation | Might be best just to schedule Ola Hallengren’s IndexOptimize weekly | |||
Unused indexes, missing indexes | We’re getting down past the server and database level, to table and view level - I’ll need to know a bit more about the database and how it works to best work with indexes, as far as I know there’s no “one size fits all” approach | |||
Blocking processes | Processes running for longer than x minutes, that are blocking more than a threshold of y other processes | Maybe, depending on the server and database | Could collect and report using sp_WhoIsActive http://whoisactive.com |
Ideally, monitoring is linked to an inventory of servers and databases; that’s a topic for another blog post, though.