Determining the Cost Threshold for Parallelism

In the previous post, I showed how you can get full listings of your execution plan costs. Knowing what the values you’re dealing with for the estimated costs on your execution plans can help you determine what the Cost Threshold on your system should be. However, we don’t want to just take the average and use that. You need to understand the data you’re looking at. Let’s explore this just a little using R.

Mean, Median, Range and Standard Deviation

I’ve used the queries in the previous blog post to generate a full listing of costs for my plans. With that, I can start to query the information. Here’s how I could use R to begin to explore the data:

library("RODBC", lib.loc="~/R/win-library/3.2")

query <- "SELECT * FROM dbo.QueryCost;"

dbhandle <-
  odbcDriverConnect(
    'driver={SQL Server};server=WIN-3SRG45GBF97\\dojo;database=AdventureWorks2014;trusted_connection=true'
  )
data <- sqlQuery(dbhandle,query)

##data

mean(data$EstimatedCost)

median(sort(data$EstimatedCost))

maxcost = max(data$EstimatedCost)
mincost = min(data$EstimatedCost)
costrange = maxcost - mincost
costrange

sd(data$EstimatedCost)

The mean function gets me my average value, which, in this case, is 0.8755985. If I just accept the average as a starting point for determining my Cost Threshold for Parallelism, I guess I can leave it at the default value of 5 and feel quite comfortable. This is further supported by the median value of 0.0544886 from my data. However, let’s check out the costrange value. Knowing the average (the mean), or even the median (literally, the middle number of the set), doesn’t give you any indication of just how spread out the data is. My costrange, the max minus the min, comes out to 165.567. In other words, there is a pretty wide variation in costs, and suddenly I’m less convinced that I know what my Cost Threshold should be.
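If you want to see several of these numbers at once, base R’s summary() function returns the minimum, quartiles, mean, and maximum in a single call. This is just a quick sketch, assuming the same data frame loaded above:

summary(data$EstimatedCost)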

The next value that matters is the Standard Deviation. This gives you an idea of how spread out your data is. I’m not going to get into explaining the math behind it. My standard deviation value is 8.301819. With this, I know that a pretty healthy chunk of all my values is below an estimated cost of roughly 9, since a single standard deviation on top of my average value of 0.8755985 comes out to about 9.18.
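For anyone curious about the math, sd() is just the square root of the average squared distance from the mean (using n - 1 in the denominator, the sample standard deviation). Here’s a small sketch, again assuming the data frame from above, that reproduces the same number by hand:

# Sample standard deviation computed manually; should match sd()
costs <- data$EstimatedCost
sqrt(sum((costs - mean(costs))^2) / (length(costs) - 1))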

With this knowledge, I can start to make informed choices. I’m not relying simply on an average. I can begin to think through the process using statistics.

Just to help out with the thought process, let’s plot the values too.

Histogram

My first thought with any set of data like this is to look at its distribution, so let’s see what a histogram would look like. This is really easy to do using R:

hist(data$EstimatedCost)

The output looks like this:

Clearly, this doesn’t give me enough to work with. Most of my data, nearly 1500 distinct values, sits at one end of the distribution, and all the rest is scattered elsewhere. I can’t use this to make any kind of decision around my Cost Threshold.
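If you want to squeeze a bit more out of the histogram before moving on, one option is to plot it on a log scale, which spreads out heavily skewed data. This is only a sketch, and it assumes every estimated cost in the table is greater than zero:

# Histogram of the costs on a log10 scale to spread out the skewed values
hist(log10(data$EstimatedCost),
     main = "Histogram of log10(EstimatedCost)",
     xlab = "log10(EstimatedCost)")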

Scatter Plot

The histogram isn’t telling me enough, so let’s try throwing the data into a Scatter plot. Again, this is silly easy in R:

plot(data$EstimatedCost)

The output is a lot more useful:

Now I can begin to see visually what the standard deviation value was telling me. The vast majority of my costs are well below two standard deviations, or approximately 16. However, let’s clean up the presentation just a little bit and make this as clear as we can.
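Before that, if you want to see the two-standard-deviation mark directly on the scatter plot, you can draw a horizontal reference line at the mean plus two standard deviations. A minimal sketch, using the same data frame as before:

# Scatter plot with a dashed reference line at mean + 2 standard deviations
plot(data$EstimatedCost)
abline(h = mean(data$EstimatedCost) + 2 * sd(data$EstimatedCost),
       col = 'blue', lty = 2)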

Density Plus Values

Instead of just plotting the raw values, let’s get the density, or more precisely a kernel density estimate (basically a smoothed graph of the distribution of the data), and plot that:

plot(density(data$EstimatedCost))     # smoothed distribution of the costs
rug(data$EstimatedCost, col = 'red')  # individual cost values along the axis

I went ahead and added the individual values along the bottom of the chart (the red rug marks) so that you can see how the distribution falls as well as the smooth curve:

That one pretty much tells the tale. The vast majority of the values are clumped up at one end, along with a scattering of cost estimates above the value of 5, but not by huge margins.
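To see exactly where the default setting of 5 and a candidate value of 16 land against that clump, you can overlay vertical reference lines on the density plot. A quick sketch:

# Density plot with the rug, plus vertical lines at costs of 5 and 16
plot(density(data$EstimatedCost))
rug(data$EstimatedCost, col = 'red')
abline(v = c(5, 16), col = 'blue', lty = 2)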

Conclusion

With the Standard Deviation in hand, and the quick rule of thumb that roughly 95% of all values in a normal distribution fall within two standard deviations of the mean, I can determine that a value of 16 for my Cost Threshold for Parallelism is going to cover most cases. It will ensure that only a small percentage of queries go parallel on my system, and that those which do go parallel are actually costly queries, not ones that just happen to fall outside the default value of 5.
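A quick sanity check on that choice is to calculate what fraction of the collected plans would still exceed the candidate threshold. The exact number will obviously depend on your own plan cache; this just sketches the idea:

# Proportion of plans whose estimated cost exceeds a candidate threshold of 16
sum(data$EstimatedCost > 16) / nrow(data)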

I’ve made a couple of assumptions that are not completely held up by the data. Using two, or even three, standard deviations to decide how much of the data to cover isn’t strictly supported in this case because I don’t have a normal distribution; in fact, the distribution here is heavily skewed toward one end of the chart. There’s also no data here on how frequently each of these queries is called. You may want to add execution counts into your plans for setting your Cost Threshold.
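Because the distribution is skewed, percentiles are arguably a safer guide than standard deviations. Here’s a small sketch that pulls the cost values that 90%, 95%, and 99% of the plans fall below:

# Cost values below which 90%, 95%, and 99% of the collected plans fall
quantile(data$EstimatedCost, probs = c(0.90, 0.95, 0.99))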

However, using the math that I can quickly take advantage of, and the ability to plot out the data, I can make choices about how my Cost Threshold for Parallelism should be set on my servers with a much higher degree of confidence. From there, I measure the effects of my choices and adjust as necessary.

19 thoughts on “Determining the Cost Threshold for Parallelism”

  • Randall Petty

    Nice. Wonder if this can be done in PowerShell? I’m a little reluctant to add yet another programming language. Might be like those instrument flight rule (IFR) pilots who only put in the minimum time per year flying on instruments — most likely unsafe when they really need to use the skill.

  • Joerg

    OK, fine. But I don’t know why I should raise my Cost Threshold at all. Nearly all of my statements have higher costs when I run them with the query hint “Option(maxdop 1)”. Besides that, they are slower with that hint as well. I assume that setting this hint has the same effect on my query as raising the Cost Threshold.

    • Yeah, they probably will have higher costs because setting that hint will completely change the plan that gets generated. You can’t compare the two plans’ costs at that point.

      The primary difference is that some plans will benefit from parallelism. By setting MAXDOP to 1 (through hints or through the server settings, and I’d suggest the server setting if you’re really going to do that, since it’s easier to undo later), you’re not adjusting the cost threshold to allow some queries to run parallel and others not based on their estimated costs; you’re just forcing serial plans on everything. The reason you would raise your Cost Threshold is precisely the situation you find yourself in: the plans are performing poorly with parallel execution. Raise the Cost Threshold so that those plans get serial plans naturally instead of through hints. Higher-cost plans, the ones most likely to benefit, can then still get a parallel plan without you having to muddy your code with hints. You adjust one setting, in one place, that can be easily changed later, rather than having to go to X number of queries and add a hint. It’s a much more efficient mechanism for controlling your servers.

  • manuel simone

    Hello Grant, very good article.
    I have a problem with an installation. The standard deviation is very high (4432.966) because there is a plan with an estimated cost of 2385560 (!!!!!!). The curious thing is that its CPUCost = 18.2881 and Estimate Rows = 182881000.
    Are such high values in estimated cost normal?
    The instance is SQL Server 2008 R2 SP3.
    Thank you very much.

  • Right, so in your case the distribution isn’t in any way normal. You’d be best served by tossing that value out and seeing how the rest behave. From there you can make a determination on a good choice for cost threshold. Regardless, you sure would want that query to go parallel.

    And no, that’s an extremely high estimated cost. It sounds like you’re moving lots of data with either no filtering at all or your filters aren’t using the indexes you have (if you have them). Nothing but guesses as to the cause.

  • manuel simone

    Hello Grant, thanks for your answer.
    A question.
    Is it possible for a plan with Parallel=0 to use more than one CPU when it is executed?
    The number of plans with parallel=1 is much smaller than the number with parallel=0: 138 for parallel=1 and 296000 for parallel=0.
    This is the result of collecting the plans executed during 12 working hours a day over 5 months. The information is collected outside of working hours and sent to my computer.
    CXPACKET is very high, and the “cost threshold for parallelism” means that only plans with an estimated cost > “cost threshold for parallelism” have parallel=1.
    Thanks for all.

    • Hello again Manuel,

      No. If a plan didn’t go parallel, it’s just using a single thread. Parallel plans use multiple threads. You are correct, the cost threshold for parallelism ensures that only plans with an estimated cost greater than that value will go parallel.

      Don’t worry about cxpackets in isolation. You have to look at your other wait statistics as well. cxpackets just means that you have parallelism occurring. It doesn’t mean it’s a problem. You have to see if you also have issues with CPU waits that would indicate excessive CPU use. Here’s more info: https://www.sqlskills.com/help/waits/cxpacket/

      • manuel simone

        Hello Grant, one last question, please.
        Is it normal that more than 9000 query plans with an estimated cost > “cost threshold for parallelism” have parallel = 0?
        66% of those query plans are prepared or ad hoc (it was not my decision…)
        Thank you.

        • It’s hard to define what “normal” looks like. Normal for your system is abnormal for mine. However, the real question is whether something can exceed the cost threshold and still not go parallel. The answer is yes.

          When a query exceeds the cost threshold it doesn’t automatically go parallel. Instead, the optimizer considers parallel plans. Will a parallel plan result in a lower cost than the serial plan? If so, yeah, you’ll get parallel execution. If not, you won’t see parallel execution even though the cost threshold was exceeded.

          Now, I’m not looking at your system, so I can’t say what “normal” looks like there. Based on the information I have, your plans are exceeding the threshold but going parallel won’t make them run faster in the optimizer’s opinion, so you still have a single threaded execution plan.

          NOTE: If you have a lot of ad hoc queries, you could be seeing stuff that’s all over the map for queries that are essentially very similar or even largely the same (there have to be some differences or you wouldn’t have more than one plan). This is where you also have to concern yourself with how many times each of those queries is being executed so that you have some understanding of what is causing load on your system.

  • SQLDBA

    Hi all,
    A standard deviation function should not be used to calculate the ‘Cost Threshold for Parallelism’ setting value, because it will give an imprecise result. Instead, you need to use a weighted standard deviation function and use the execution count as the weight. Here is a standard deviation explanation video: https://www.youtube.com/watch?v=duKaOx4_TmU

