A question I’ve seen come up frequently of late is how to track CPU use over time. Further, as with a disk filling up, people want to know how to predict their CPU usage so that they can easily decide “now is when I upgrade the hardware”.
Well, the bad news is, that ain’t easy.
CPU Use Over Time
There are a bunch of ways to look at processor usage. The simplest, and probably most common, is to use Performance Monitor counters such as ‘% Processor Time’. Query this and you can get an average of the processor usage at a moment in time.
Ta-da! Fixed it. I thought you said this was hard, Grant.
Well, hang on. Are you running on a single processor machine? If so, cool, this may work for you. Are you running on a multi-processor machine? Ah, probably most of you said yes. The average is no longer useful. One processor is sitting at 90% and the other is sitting at 20%. They meet in the middle, so we don’t have a problem, right? Well, no, we do. So we have to track this at the individual processor level. Let’s not even talk about how perfmon.exe itself adds CPU overhead. Let’s just capture the individual CPU counters so that we’re not looking at a broad, inaccurate average.
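To make the 90% and 20% example concrete, here is a minimal Python sketch. The function name `summarize_cores` and the 80% "hot" threshold are my own illustrative assumptions, not anything built into Performance Monitor, and the sample values are simply the two numbers from the paragraph above.

```python
from statistics import mean


def summarize_cores(per_core_pct, hot_threshold=80.0):
    """Return the overall average next to the per-core detail it hides.

    hot_threshold is an arbitrary, illustrative cut-off (an assumption).
    """
    return {
        "average": mean(per_core_pct),
        "max": max(per_core_pct),
        # Indexes of the cores at or above the threshold.
        "hot_cores": [i for i, pct in enumerate(per_core_pct) if pct >= hot_threshold],
    }


# The two processors from the example: one busy, one mostly idle.
sample = [90.0, 20.0]
summary = summarize_cores(sample)
print(summary)
```

The average lands comfortably in the middle, while the per-core view still shows one processor running hot.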
Fixed it. See, this is easy.
I suspect more people than not are running on VMs these days. Whether you’re local, hosted, in Azure or AWS, you now can’t trust the Performance Monitor counters in any way. Instead, depending on your hypervisor, you have to look to other counters. For example, Hyper-V exposes a ‘% Total Run Time’ counter to show how much CPU is in use.
Simply tracking processor use and then attempting to extrapolate that out to say, hey, we’re going to run out of processor headroom, just won’t work for most of these systems. That’s especially true in a VM environment.
Oh, and did you notice how I assumed that we’re running on Windows? What if we’re running on Linux? What if we’re running in a container?
Ah, see, measuring the CPU use over time isn’t simply difficult. Understanding where and how the CPU even exists becomes a little nuts.
Well, the key is, a single measure, especially something as amorphous as average CPU use over time, frequently is not enough to tell us if a system is suffering. Instead, we have to look at additional measures to know if we’re in pain. Processor queue length is useful, but not complete. Context switches and batch requests can give us indications of load, but still not enough. Compilations, frequently one of the most costly CPU issues, and recompilations certainly can show us that we’re under CPU load, but still don’t complete the story.
We can add in wait statistics to see if we’re waiting on the CPU. We can also look to additional measures through Extended Events or the Query Store (here’s a fun one to spot the query that uses the most CPU). They can tell us which queries are using up the CPU.
The issue here is really simple. CPU, for most of us, won’t simply fill in a predictable manner the way a disk will. The reason is that CPU use is so dependent on other factors.
Factors Affecting CPU Use
The core of this is simple: what do your queries look like? Tuning queries, getting the right indexes in place, data structures, etc., all affect CPU usage directly. This means that, no matter how we measure CPU, any and all code variations are likely to directly impact how it behaves (up or down), possibly in unpredictable ways. This will completely interfere with any mechanism of plotting out a “my CPU is filling up like a hard drive” calculation.
Then we have to factor in compilations and recompilations. As these go up, or down, so does CPU use. How we use and reuse the plans in cache isn’t just a memory problem. It directly impacts CPU as well.
Now, you may find yourself in a situation where you are seeing excessive processor queue lengths in combination with CPU waits and maybe even high percent processor use. This could be caused by your workload, certainly. Or, you may simply have under-powered processors, or an inadequate number of processors.
Also, speaking of compilation, variability within execution plans can readily blow any kind of simple prediction of CPU use out of the water. Bad parameter sniffing can lead to high CPU use plans some of the time and not others. Watching your CPU intermittently spike or crater without apparent rhyme or reason is certainly going to make that extrapolation difficult.
Finally, and this is the fun one, maybe you’re running other software on the server. The favorite monster for many of us is the inclusion of anti-virus software on our servers. However, it could be almost anything. Heck, I remember when screen savers would smack the CPU around.
With all this variability, straight-line progressions on CPU use are not only difficult to gather, but can easily be wrong. Yes, disk growth can also suffer from odd variability, especially as we implement new or different code. However, for most of us, most of the time, we don’t see the wild variability within disk and disk management that we see on CPU. It’s just the nature of how data growth occurs that makes predictions there easy. It’s also the nature of how CPU use is highly variable that makes predictions there much less useful, and even downright inaccurate.
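To illustrate the contrast, here is a rough Python sketch that fits a straight line to two series and reports how well the line describes each one. Everything in it is an assumption for illustration: the numbers are made up, `linear_fit` is a hypothetical helper, and R² is just one simple way to judge a straight-line fit. It is not a recommended forecasting method.

```python
def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept, r_squared). Hypothetical helper for illustration.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    # A perfectly flat series has no variance to explain; treat the line as exact.
    r_squared = 1.0 if ss_tot == 0 else 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Made-up daily samples, for illustration only.
disk_gb = [100, 102, 104, 106, 108, 110, 112, 114]  # steady growth
cpu_pct = [35, 80, 20, 65, 30, 90, 25, 60]          # spikes and craters

for name, series in (("disk", disk_gb), ("cpu", cpu_pct)):
    slope, intercept, r2 = linear_fit(series)
    print(f"{name}: slope={slope:.2f} per day, R^2={r2:.3f}")
```

With these invented numbers, the line describes the disk series almost perfectly and the CPU series barely at all, so any date extrapolated from the CPU line would be little better than a guess.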