Monitoring Metrics – PAGEIOLATCH_XX and associated metrics

For my writeups of monitoring metrics I will add what I see as necessary correlation metrics to have a more complete description of how I would use it.

There are many metrics available to monitor read/write activity and latency on your system. In fact, there are several DMVs/DMFs dedicated to this, for instance the sys.dm_io_virtual_file_stats DMF. However, in the era of cheap commonly available terabyte sized NVMe flash drives the responsiveness of the drives should not be of immediate concern. Moreover, what you want to monitor is how much I/O activity SQL server is waiting on in general, not the latency of the drive itself (unless you have reason to suspect the disk performance, then these targeted metrics would be of great interest).

I believe that in many cases three metrics are sufficient for high level monitoring of the relationship between SQL server and the storage subsystem. These are the PAGEIOLATCH_XX wait types, the WRITELOG wait type and the buffer cache hit ratio. PAGEIOLATCH and WRITELOG can be found in the sys.dm_os_wait_stats while the buffer cache hit ratio can be calculated from values in sys.dm_os_performance_counters or extracted form the performance counter via perfmon. A fourth metric, Page Life Expectancy, can play a useful supporting role as well for correlation.

If these metrics indicate an issue, then it is time to go hunting for the root cause using more fine-grained/targeted metrics  

PAGEIOLATCH_XX

All data pages are read or altered in memory. PAGEIOLATCH_XX is a wait type that means that a task is waiting for the completion of loading pages from disk into the buffer pool where it can be read and/or manipulated. The PAGEIOLATCH_XX waits measure how long SQL server has been waiting for I/O on datafile pages specifically. The PAGEIOLATCH_EX and PAGEIOLATCH_SH are most commonly seen for writes and reads respectively.

Simplified, I can see two reasons why pages would reside on disk instead of in memory: either they are so seldom used that they are rightly not in the buffer pool, or they have been evicted from the buffer pool due to memory pressure (internal or external). This means that the PAGEIOLATCH_XX wait type can give information regarding the state of your storage, indicate if you need more RAM or if you have queries in need of tuning to use less RAM.

So, is your storage experiencing problems or is it a memory issue? Well, it depends on what it correlates with. To get an indication of storage or memory issues PAGEIOLATCH should be correlated with WRITELOG waits, buffer cache hit ratio, page life expectancy and uptime. If the correlation of these metrics indicates an issue, a more thorough investigation using other more fine grained metrics is needed.

WRITELOG

Writelog waits indicate how often and how long SQL server waits for the information to be flushed from the log buffer (also known as the log cache) to the transaction log file. It accumulates over time, as most wait types do, so it must be seen in correlation with uptime (as most wait types must). It is very sensitive to the responsiveness of the storage system.

Buffer Cache Hit Ratio (BCHR)

The buffer cache hit ratio is a Buffer Manager Performance Object in SQL server. This means that it is accessible thought Performance Monitor outside SQL server itself. To access it from inside SQL server you need to calculate it from sys.dm_os_performance_counters. To get a ratio we would need to divide the total number of cache hits (buffer cache hit ratio) by the total number of cache lookups (Buffer cache hit ratio base).

The buffer cache hit ratio should be as close to 1 (that is 100%) as possible. This is a direct measure of how often SQL server finds the page it is looking for in the buffer pool.

Uptime

As always, uptime is a crucial correlation metric. The reason being that monitoring metrics are reset at each SQLserver restart, so you need to know how relevant the data is. In this instance, when monitoring PAGEIOLATCHES and associated metrics there will be a warm up period for the buffer cache after a restart as the buffer pool is empty, which will affect the metrics quite a bit right after startup.

PLE

Since PLE is an indicator of memory pressure and the state of the buffer pool, it serves as a good correlation metric. I have written more about PLE specifically here.