Monitoring Metrics – Page Life Expectancy (PLE) and associated metrics

For my writeups of monitoring metrics I will add what I see as necessary correlation metrics to have a more complete description of how I would use it.

Page Life Expectancy (PLE)

A few years back, the metric many people would mention when asked what counters they monitor was PLE. In recent years it has been thrown into disfavor by many. The reason being that when you see the PLE drop, it is already too late to find/catch the offending query/action as by nature it is a lagging indicator as the pages are already evicted from the buffer pool when it drops. However, I still believe it is worth monitoring, and I will explain why.

There are two reasons for PLE to drop during normal operations as far as I know (not counting maintenance, manual flushes, restarts etc):

  • One or more queries are doing large operations on a large number of pages and required a large chunk of the buffer pool to put these pages in, hence many of the current pages in the buffer pool are evicted.
  • A query was given a large memory grant, forcing sqlserver to “steal” memory from the buffer pool to include in the memory grant resulting in evicted pages. This will get especially bad if the memory grant ends up in a frequently used cached plan.

Microsoft had a recommendation in the old days that the PLE should be at least 300 (that is 300 seconds). This is fairly old advice and Paul Randall updates this advice here and here with the help of a post by Jonathan Kehayias. Here he explains that it is essential to look at the PLE for the NUMA nodes individually and not at the harmonic mean, which is what standard PLE counter represents.

The real benefit of the PLE as I see it is to track it over time. If the PLE suddenly and unexpectedly decreases to a fairly constant lower level than previously, it can indicate an undesired or unexpected change in the usage of the application that should be investigated. You would probably use other tools and metrics for the investigation itself.

Personally, I would monitor PLE, both from the buffer manager and from the buffer nodes. They are both available through the sys.dm_os_performance_counters dmv. I would use them as one of the “big picture” trackers of my applications behavior over time. It will indicate changes in memory pressure and should be seen together with Lazy Writer activity and instance uptime. This is especially useful when implementing changes, for instance when enabling new features, changing workflows or increasing the number of users and of course when upgrading to a newer version or even patching the OS or SQL server itself. A persistent decrease in PLE might just indicate the offending change warrants more RAM allocated to your instance, or that a specific feature or upgrade has a negative impact and that needs further investigation.

Lazy Writer Activity

The Lazy Writer is responsible for “clearing memory” when sql server detects memory pressure or whenever the pages in the buffer pool must be evicted for other reasons. It evicts pages from memory based on last page activity, so the most recently used pages are kept in memory the longest. Increased activity of Lazy Writer will correlate with a decreased PLE. Lazy Writer activity can be tracked through the sys.dm_exec_requests dmv (look for ‘LAZY WRITER’ in the command column)

Instance Uptime

We need this for the simple reason that many values must be correlated towards uptime to have any meaning. PLE is a good example of this.