Jun 01 2015

Parallelism and Columnstore Indexes

Columnstore indexes are fascinating and really cool. Unfortunately, they’re adding an interesting new wrinkle to an old problem.

What’s the Cost Threshold for Parallelism set to on your server? If you just said “The whatsis of whositz?” then the value is 5. The cost threshold is the point at which the estimated cost of an execution plan goes from definitely serial to possibly parallel. This default was set for SQL Server 2000 and hasn’t been changed since. I’ve long argued, loudly, that it’s too low, and I’ve suggested changing it to a much higher value. My advice has ranged from 35 to 50 and several places in between. You could also look at the median or the mode of plan costs on your system and use the higher of those values as a starting point. Getting a cost higher than 5 in a query is insanely simple, which means that queries that are absolutely not going to benefit from parallelism go parallel anyway. That causes unnecessary CPU load on your system and slows those queries down. So, listen to my advice and run out there and change this value immediately.
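In case you’ve never touched it, the Cost Threshold for Parallelism is a server-level setting you can change through the server properties in Management Studio or with sp_configure. A minimal sketch, using 50 purely as an illustrative starting point rather than a recommendation:

EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
-- 50 is only an example value; derive a starting point from your own plan costs
EXEC sys.sp_configure N'cost threshold for parallelism', 50;
RECONFIGURE;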


Columnstore indexes have two methods of processing. The first is row mode, which is actually the same method used by all your standard indexes and tables. You can write a query that does aggregate calculations against a table that has a Columnstore index and you’ll see a performance benefit. However, check the properties on the Columnstore operator in the execution plan. Does it say “Actual/Estimated Execution Mode = Row”? Then you’re getting row-level processing. That can be fine, and you can see a performance improvement using row mode. There is another level of processing though.

Batch mode processing takes roughly 1,000 rows at a time and processes the aggregations across multiple threads simultaneously. Wait, multiple threads? Yes. You have to have parallelism in order to get batch mode. Joe Sack talks about it here, and you can read more about it in the SQL Server documentation. Batch mode processing is faster (depending, etc.). Especially if you’re dealing with larger data sets (more than tens of thousands of rows), you’re going to want to see Batch as the Execution Mode in the Columnstore operators.
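If you want to see the two modes for yourself, here’s a minimal sketch against AdventureWorks (the index name and column list are mine, invented for illustration): create a nonclustered columnstore index, run an aggregate query, and then check the “Actual Execution Mode” property on the Columnstore Index Scan operator in the actual execution plan.

-- Assumes AdventureWorks; the index name and column choices are illustrative only.
-- Note: on SQL Server 2012/2014 a nonclustered columnstore index makes the table
-- read-only for as long as it exists.
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_csTest
ON Sales.SalesOrderDetail (SalesOrderID, ProductID, OrderQty, UnitPrice);

SELECT  sod.ProductID,
        SUM(sod.OrderQty) AS TotalQuantity,
        AVG(sod.UnitPrice) AS AveragePrice
FROM    Sales.SalesOrderDetail AS sod
GROUP BY sod.ProductID;
-- Check the Columnstore Index Scan properties for Row vs. Batch execution mode.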

There is an issue here though. That’s right, we just bumped our Cost Threshold for Parallelism up to 5000 (if 50 is good, 500 is better, and we’ll add a zero just to be sure). OK, 50. How high does the cost have to be in order to see benefits from parallelism and batch mode processing? Ah, well, there’s the rub. When I was testing this, I had about 100,000 rows in a table. Nothing large. Execution times went from 100ms to 50ms. Yep, getting parallel execution, batch mode processing, of the nonclustered columnstore index cut my execution time in half.

Next question should be, oh, well, what was the estimated cost of that query? 30? 20? 15?

Try 3.7.

Yeah, I got a 50% improvement in performance by changing the Cost Threshold from 50 to 3.

In other tests I had a plan with an estimated cost of 0.77. Execution time went from 75ms to 57ms and reads went from 115 to 62 because of batch mode processing.

Now, I am not advocating dropping everyone’s Cost Threshold to 3 (or less). What I am absolutely saying is that you have a new wrinkle that you must add to your calculations. Are you using Columnstore indexes anywhere? Then you need to examine those execution plans. What are their costs? If you adjust your Cost Threshold below them, do you see a benefit? If so, how low did you just go? Now, is that negatively impacting your other queries? You could leave your Cost Threshold as-is and use the QUERYTRACEON query hint with trace flag 8649 (documented by the Sensei Paul White) on your queries against Columnstore indexes, but that comes with a list of caveats as long as your arm. The other alternative is to, again, leave the Cost Threshold alone and use Adam Machanic’s method of forcing parallelism just for the specified queries. These options can be unattractive, but you have to deal with getting parallelism for Columnstore indexes somehow.
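For reference, the trace flag approach looks something like this. It’s a sketch only: the aggregate query is made up, trace flag 8649 is undocumented and unsupported, and QUERYTRACEON requires elevated permissions, so test carefully before relying on it.

SELECT  sod.ProductID,
        SUM(sod.OrderQty) AS TotalQuantity
FROM    Sales.SalesOrderDetail AS sod
GROUP BY sod.ProductID
OPTION (QUERYTRACEON 8649); -- nudges the optimizer toward a parallel plan regardless of cost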

Welcome to the quandary. Tuning in and around parallelism just got harder because of Columnstore Indexes.

Want to talk query tuning some more? I’m doing a live, all day session at Connections on query tuning using execution plans this September, 2015. We’ll go over examples with columnstore and lots more. Click here now to sign up.


May 27 2015

PASS Board Update: May 2015

I’ve had a busy month on the Board.

You may have seen some tweets going by about our compliance with international laws regarding SPAM. This has been something of an education, so if anything I say below is wrong, it’s because I’m wrong, not malevolent (at least in regards to this topic), so please be kind. It appears that the mechanisms we had on our emails for showing where the email came from and how to unsubscribe from it weren’t completely in compliance… IF… we were sending emails that involved advertising. By we, I mean Chapters, not HQ. Now, that IF could mean that we could skip out of meeting this requirement when our emails didn’t involve advertisements, but then it’d have to be in compliance when they did, and we’d have to adjust our footers depending on the type of email… blah, blah, blah. I decided that we can get in compliance now and stay that way. Sure, many, maybe most, of our Chapter emails don’t have to meet these regulations, but if we set it up so that we do, then we never have to worry. I’ve worked with HQ. We’re in compliance. We’re getting the word out to the Chapters so that if they use email other than ours, they too can be in compliance. We’re also making it part of the documentation so that future people will also be in compliance. We have some other work to do in IT to make an adjustment to the unsubscribe process, but that’s going to happen too. In short, this is almost completely fixed. Many thanks to Karen Lopez for all her help through this process. We couldn’t have done it without you.

I’ll bet that’s a lot more sausage making on display than most of you bargained for. This is what being on the Board looks like.

There was also a Board meeting this month. I was traveling so I only got in the last half of the meeting (on the phone in a shared van sitting in traffic at Logan Airport I might add, oh the glamorous life of a Board member). Good information was exchanged, largely setting us up for the in-person Board meeting next month.

Other than that, standard stuff, meeting with the HQ people regularly so that we keep the Chapter side of things running. A few minor decisions to move things forward. Still pushing on the goals and I hope to figure out how to get IT support for what we need to meet a couple of them.

I’ll report back after the in-person meeting next month. Please, please, please, any feedback on me, the Board, PASS, Chapters, my updates and other Board-related blog posts, the whole magilla, I want to hear it.

May 26 2015

Database Engine Tuning Advisor

I would love to see the Database Engine Tuning Advisor (DTA) pulled from the product. Completely. Heck, I feel bad that I included a chapter on it in my query tuning book (all updated for SQL Server 2014 by the way). Let me tell you why we need to pull this tool.

First, I understand its purpose. It’s supposed to be a fast and easy way to get some performance tuning done for people who just don’t have the time or knowledge to go through the full process of gathering metrics, evaluating poor performers, understanding root causes and applying indexes to fix those causes. I also readily acknowledge that it actually is an amazing piece of software. If you don’t agree with that, go read this white paper. With those acknowledgements in place, I still think there are two distinct problems with the approach and a serious fundamental issue with the premise. Let’s talk about the problems of the approach first.

Many, probably even most, of your query performance problems are in your T-SQL. Some of your query performance problems are in your table structure. Some of your query performance problems are in your indexes and statistics, but not a majority. The DTA can suggest indexes, statistics, materialized views, and partitioning (partitioning for performance I might add, not data management, and that opens up a completely different can of worms we can’t address here, I don’t have time). In short, it can’t address your number one problem, your code. Since it can’t address the number one problem, all it can do is work around the edges. You’ll run this thing, think you’ve addressed your issues, but your issues are still there and now you’re just as stuck as before you found the DTA and ran it. It just doesn’t fix the core issue, your code.

The second problem I see with it is that it doesn’t have enough intelligence built into its functionality. I say that with some reluctance because it really is a pretty amazing piece of functional code. But it suffers from a couple of shortcomings there. It’s completely dependent on the load provided to it. If that load is off, its recommendations are off, because it just can’t have the intelligence to recognize that a poorly representative load has been provided. This lack of intelligence is supposed to be offset by the person running the DTA, who must ensure that they are gathering the right information and that they can interpret and test the resultant recommendations. Which brings us to the fundamental issue with the premise.

This is supposed to be run by people with no internals knowledge. Right? BUT! These people are also supposed to make a judgement, based on the recommendations, whether or not they should be accepted. Further, they should test all the recommendations prior to applying them to their production server. They also must gather a well structured and meaningful representative load from their production system in order to supply the DTA with good information. Further, the very sophisticated set of tests around the DTA actually makes a fundamental assumption that could be radically off: that the person designing the database has done a good and thorough job of correctly normalizing the structures. Do you agree with all these assumptions on the part of the DTA? Am I alone in thinking that we have a problem here? If people lack any understanding of the internals, they won’t be able to judge those recommendations. If people don’t have the ability to gather and interpret performance metrics, they won’t be able to test the suggestions of the DTA or provide it with the necessary test load. In short, the DTA can’t be relied on to solve the problem it’s supposed to solve because of the root cause of that problem: people’s lack of knowledge and understanding.

I won’t even get into finding indexes with names like this:
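Purely as a hypothetical illustration of the naming pattern (these particular numbers and columns are invented, not from any real recommendation), DTA-generated indexes tend to look something like this:

-- Hypothetical example of the DTA naming pattern, not an actual recommendation
CREATE NONCLUSTERED INDEX [_dta_index_SalesOrderDetail_5_898102240__K4_K1_3]
ON Sales.SalesOrderDetail (ProductID, SalesOrderID)
INCLUDE (OrderQty);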


Instead of evaluating the suggestions made by the DTA and applying just those that make sense and will have a positive impact, people assume that every single suggestion from the tool is Gospel. They apply them all, without thinking, without knowledge, without appreciation of the possibility, sometimes even the likelihood, of serious negative impact.

I recognize that many people are stuck. They have a SQL Server instance that’s causing them pain and they don’t have the knowledge necessary to fix it. Further, I know a few of you have used this tool successfully in some situations. I agree that there ought to be some way to mechanically and automatically tune the server. However, the DTA is not that tool, despite its clear and obvious sophistication. Let’s get rid of it.

Want to learn how to tune queries? I’m putting on an all day seminar at Connections in September. Click here right now to register. We won’t use the DTA.

May 19 2015

Book Review: Connected

I heard about Connected from a show on NPR (Yes, I listen to NPR, why do people keep saying that?). It was right after another segment talking about how positivity affects your mood and your ability to think and act in a clear fashion. I’ve long been a believer in the ability of your network to impact you, but I really didn’t think about it beyond that. Hearing about the book Connected changed my thinking, so I ran out (meaning, connected to Amazon) and got a copy.

The premise of the book is pretty simple. You have close friends and acquaintances. Your close friends and acquaintances also have friends and acquaintances, which may or may not overlap with yours. Those people also have another set of friends and acquaintances. And here’s the kicker: that third layer, not your friend, or your friend’s friend, but your friend’s friend’s friend, can affect your daily mood, the amount of exercise you do, whether or not you smoke, your involvement in crime, all sorts of things. The book sets out to prove it. Along the way you also learn about things like why you probably only have somewhere between three and eight close friends. Why you probably don’t have more than about 100 people that you communicate with regularly (uh, but what about my 7,000+ Twitter followers?). How these are, to a degree, biological factors hardwired into you. Most interesting of all is how the ripples just fade away at the third layer, over and over again throughout their studies and their testing.

The book was just filled with highly interesting facts about how your network influences you, and how you can influence your network. The type of network that you have matters too. Are you connected to lots of people who aren’t connected to each other (weak ties), or are you connected to lots of people who are all connected to one another (strong ties)? Each of these types of networks influences you differently. Your behavior within a network probably follows one of three paths: cooperator (you’re willing to help others), free rider (you’re letting others do the heavy lifting), or enforcer (you’re making sure everyone follows the rules). Your behavior is also likely to shift between those roles depending on who you’re interacting with and when.

In short, a fascinating book. I do have a nit to pick with it though. At the end of it all, I have a great set of information about what a strong network would look like. I get a good sense of why I would want to have a strong network. But there’s nothing about how to actually build a strong network, other than making sure my friends are connected with each other and that my friends, and as much as possible their friends and their friends’ friends, are all on a positive path. Right. I’m sure that’s easy to work out. Guidance around this network thing would have been nice.

My own takeaway, be positive, act positive, strive, earn my Trident every day, and, at least according to Connected, that should go out into my network like ripples in a pond. Further, I should see those same ripples coming back, reinforcing my own approaches.

I have no idea how to measure this. Ha!

May 18 2015

Come to Us or We’ll Come to You

No, that’s not a threat. It’s an offer to help out.

Redgate Software is very serious about the efforts we’re putting into creating the tools needed to support your Database Lifecycle Management (DLM) processes. DLM is a vital part of supporting both Agile development methods and moving towards an automated DevOps style of systems management that tightly integrates your software development and deployments with your database development and deployments. I’ve said it before and I’ll say it again: you can get really good at performing manual tasks, or you can get really good at automation. You want to get good at automation, and we’re here to help. Getting a smooth process across source control, continuous integration, continuous delivery, and continuous deployment can be a lot of work, but it’s work with huge benefits. We can make it easier.

How you may ask?

To start with, how about a short consulting engagement where we come to your place of work and perform a detailed assessment of your current database development and deployment processes with the goal of delivering a document explaining how you could move towards a more fully automated DLM methodology? We’re calling that a DLM Health Check. Or, if you’re ready to take the big leap, we have consultants who can come in and help with a long-term engagement to assist you in building out your DLM processes.

Maybe you’re more of a hands-on type who wants to do things for themselves. Not a problem. We have a series of hands-on labs that will teach you what you need to know to set up your own DLM processes. Here’s the schedule of upcoming classes:

  • London, UK (May 20)
  • Philadelphia (June 4)
  • Belfast, Northern Ireland (June 26)
  • London (July 8)
  • San Diego, CA (July 16)
  • Manchester, UK (July 24)
  • Baton Rouge, LA (July 31)

There will be more to come, and, if we get enough requests to deliver a class somewhere, we’ll make that happen too. These classes will teach the skills and methods you need to start implementing your own DLM automation.

As I said, come to us or we’ll come to you. Either way, we can help you create automation around your database deployments so you can move fast but still protect your production servers.

May 15 2015

I Am Grateful For…

I’m almost done reading the book Connected (I will be by the time this is published). It’s fascinating, although, at least so far, it offers little in the way of suggestions for how to take advantage of this information. That however doesn’t change my plans around working on positive thoughts and emotions as a way to hack your own mind and increase productivity. I’m also trying to figure out how to reinforce my network. In short, these are completely self-serving posts.

I am grateful for Thomas LaRock (b|t).

Tom is hard to miss. If you look at the floor all the time, the green shoes might be a giveaway. If you look up a lot, keep looking up because he’s quite tall. Oh yeah, and he’s the President of PASS (for a couple of more weeks) so he’s frequently up on stage. All that’s not why I’m grateful for Tom (though I do think he’s done an outstanding job as PASS President, if you agree you might want to thank him). I’m grateful for Tom because he’s just always been there. When I first hooked into the SQL Server community (on the very 2nd night of SQL Karaoke, before it was called that), Tom was there. He worked for a big investment company when I was working for a big insurance company. We had a lot in common (despite his terrible youth). We were both working on SCOM and presenting on it. We both started working in Azure together. At no point was any of this competition either. He would discover something really cool and before putting it up on his blog or writing an article, he’d call me and walk me through it. I learned how to do things better because of Tom. I also learned, well, still working on this one, how to be better at supporting others because that’s just what he does. Tom got a job as an evangelist, so I had to run and get one too. Tom was on the Board at PASS, so I had to run out and do that too. I’m kind of hoping Tom wins the Lottery because I’m assuming I will shortly after. Tom is blazing trails all the time, but he makes sure that he reaches back and helps others down the same path. That’s pretty special.

Oh yeah, and I should mention, bacon, Jager and a lopsided football helmet.

Nice work Tom. Thanks.

May 11 2015

Is Performance Better With LEFT JOIN or RIGHT JOIN?

I tend to write my queries using LEFT JOIN. Why? Because logically I see it in my head like this:

Give me all the rows from this table and only those rows that match from the other table.

But, wouldn’t this logic work just as well:

Give me only the rows in this table that match the rows from this other table where I’m selecting all of them.

I know. If I worked on it some more I could make that a better sentence, but I’m pretty sure the logic is still sound. Only matching rows from one data set, all the rows from another data set. In short, RIGHT JOIN.

I read recently that we ought to be making everything into a LEFT JOIN because it performs better. I suspect someone had a bad day, wrote the JOIN criteria poorly, and a new bit of cargo cult wisdom was discovered. Because, well, let’s look at the queries:

-- Select list added here to make the fragments runnable; any matching list works
SELECT  soh.SalesOrderID,
        sod.SalesOrderDetailID
FROM    Sales.SalesOrderHeader AS soh
LEFT JOIN Sales.SalesOrderDetail AS sod
        ON sod.SalesOrderID = soh.SalesOrderID;

SELECT  soh.SalesOrderID,
        sod.SalesOrderDetailID
FROM    Sales.SalesOrderDetail AS sod
RIGHT JOIN Sales.SalesOrderHeader AS soh
        ON soh.SalesOrderID = sod.SalesOrderID;

Yeah, I know. Not exactly the height of query optimization and elegance. But, the point is made pretty easily. When these queries run, both return 121,317 rows. They have almost identical performance on my system at 2.8 seconds and 2.7 seconds with identical reads and CPU. One run. You can run it a million times if you want. I’m not going to and here’s why:


Those execution plans are the same. How do I know that for sure, beyond the fact that they look the same? Because they have the same query_plan_hash value: 0x857588F9A4394676.
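If you want to check plan hashes on your own system, they’re exposed in the plan properties and in the DMVs. A rough sketch, assuming the two test queries are still in the plan cache:

SELECT  qs.query_plan_hash,
        st.text
FROM    sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE   st.text LIKE '%SalesOrderHeader%';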

Now, if we changed the logic so that SalesOrderDetail was on the “all” side of the query, of course that would change things, but that changes the logic, not just the JOIN. The query optimizer is able to deal with what is basically just syntactic sugar. No, LEFT JOIN is not better than RIGHT JOIN. It’s just a different way of expressing the same logic, and the optimizer deals with it.

Want to talk query tuning and execution plans? Let’s get together for an all day seminar at Connections this year. Check it out here.

May 05 2015

Effects of Persisted Columns on Performance

I live for questions. And my favorite questions are the ones where I’m not completely sure of the answer. Those are the questions that make me stop presenting in order to take a note so I can try to answer the question later, usually in a blog post. Guess where we are today?

I was asked at SQL Bits in London about the direct impact of the PERSISTED property on calculated columns, for both inserts and selects. I didn’t have a specific answer, so I wrote it down for later (and asked the self-described “persisting Dane” to email me to remind me). He did, so I put together a few tests to try to answer his question.

First, I created three tables:

-- The PersistTestID definitions are assumed; the SELECT tests below filter on that column
CREATE TABLE dbo.PersistTest (
PersistTestID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
RandomValue1 DECIMAL(5,3),
RandomValue2 DECIMAL(5,3),
CalcValue AS (RandomValue1 + RandomValue2),
PersistValue AS (RandomValue1 + RandomValue2) PERSISTED
);

CREATE TABLE dbo.PersistTest2 (
PersistTestID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
RandomValue1 DECIMAL(5,3),
RandomValue2 DECIMAL(5,3),
CalcValue AS (RandomValue1 + RandomValue2)
);

CREATE TABLE dbo.PersistTest3 (
PersistTestID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
RandomValue1 DECIMAL(5,3),
RandomValue2 DECIMAL(5,3),
PersistValue AS (RandomValue1 + RandomValue2) PERSISTED
);

The first table has two calculated columns, one PERSISTED and one not. The second table has a non-persisted column only. The third table has a persisted column only. This way, I can see the behavior across all these tables and compare them to see where PERSISTED adds to, or removes from, performance.

I took a very simple set of queries and ran these a few times on each table to get an average execution time:

INSERT dbo.PersistTest (
RandomValue1,
RandomValue2) VALUES (
42.2 , -- RandomValue1 - decimal
44.4  -- RandomValue2 - decimal
);
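A quick note on measurement before the results: the post doesn’t depend on any particular capture method, but if you want durations in microseconds yourself, an Extended Events session on sql_batch_completed is one way to get them. A sketch, where the session name, file target, and database filter are all my own assumptions:

-- Session name, file name, and database filter are assumptions for the sketch
CREATE EVENT SESSION PersistTestTiming ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (WHERE (sqlserver.database_name = N'PersistTest'))
ADD TARGET package0.event_file (SET filename = N'PersistTestTiming');

ALTER EVENT SESSION PersistTestTiming ON SERVER STATE = START;
-- Run the INSERT and SELECT batches, stop the session, and read the duration
-- column (reported in microseconds) from the captured events.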

The results were:

Persist     Persist 2     Persist 3
3580.5      1626          2260.5        (average insert time, microseconds)

On average, across about five executions, you can see a distinct difference of about 600 microseconds between Persist 2 and Persist 3, and an even larger average on the single Persist table. Let’s try the same thing with a few SELECT statements:

SELECT  pt.CalcValue,
        pt.PersistValue
FROM    dbo.PersistTest AS pt
WHERE   pt.PersistTestID = 3;
SELECT  pt.CalcValue
FROM    dbo.PersistTest2 AS pt
WHERE   pt.PersistTestID = 3;
SELECT  pt.PersistValue
FROM    dbo.PersistTest3 AS pt
WHERE   pt.PersistTestID = 3;

While the results are hardly spectacular:

Persist     Persist 2     Persist 3
115.5       123.5         109.5         (average select time, microseconds)

You can see that making the calculation on the fly is more costly, with both Persist and Persist 2 being higher than Persist 3, which had no calculations to perform in order to retrieve the data.

Now, let’s run the same thing, but, how about, 10,000 times so we can really see what the differences are over time. First, the inserts:

Persist     Persist 2     Persist 3
84854       68783         73817         (cumulative insert time, microseconds)

Still measured in microseconds, you can see that the cumulative totals are higher for the persisted columns than they are for the non-persisted column. So, there is an overhead for storing the extra information. But, what about queries?

             Persist      Persist 2     Persist 3
Average      221.25       222.16        100.54
Cumulative   2203431      2254238       1005469     (select times, microseconds)

I’d say that’s pretty conclusive. When I hit the SELECT statement 10,000 times instead of just five or six, we clearly have differences in behavior. Interestingly enough, the logical reads were identical at 20,000 each. So it really did come down to the time spent calculating the values versus the time spent just retrieving them.
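If you want to reproduce the 10,000-execution runs, the simplest option in SSMS or sqlcmd is the GO n batch-repeat syntax (a sketch; whether the original tests used this exact mechanism isn’t stated):

INSERT dbo.PersistTest (RandomValue1, RandomValue2)
VALUES (42.2, 44.4);
GO 10000 -- repeats the preceding batch 10,000 times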

Caveats. This is a very simple test with simple calculations and small persisted values. Even doing something 10,000 times isn’t the same as doing it one million or one hundred million. In short, your mileage may vary, some assembly required, for queries lasting longer than four hours call your physician, it depends.

You can see that a PERSISTED column is going to negatively impact INSERT performance, but it’s going to positively impact SELECT performance. The question is, which one is giving you pain at the moment?

UPDATE: Transposed the data in the final table between Persist 2 & Persist 3. Fixed it.

May 01 2015

Speaker of the Month: May 2015

I finally got out to some community events last month and got to see a bunch of great speakers. Not that you don’t see great speakers at paid events, you do. It’s just that they’re usually professional speakers and I want to encourage others when I can.

The Speaker of the Month for May 2015 is John Sterrett (b|t) and his presentation Automate Your Daily Checklist With PBM and CMS at SQL Saturday Boston.

The very first thing that impressed me about this presentation was how John entered the room and immediately started interacting with the attendees. You see a lot of speakers hiding behind the lectern, fiddling with cables, nervously sipping water. Not John. He was chatting as he set up, immediately getting control of the room. It’s a great way to start. Then, it was a very informative lecture. He showed how to use Policy Based Management and the Central Management Server together to create really meaningful monitoring and double checks on the servers. It’s a technique I’ve used, so I really enjoyed seeing it presented. Plus, I learned a few things. For example, I hadn’t realized you can run the Best Practices Analyzer from PBM. His slides were clear and informative. A little too much documentation for my taste, but I know some people love that type of deck, so I’m not going to ding people for it. His demos were very clear and illustrative of the points he was making. John handled questions well, and he used Zoomit liberally and carefully to make everything clear. It was a fantastic presentation that was very well delivered.

A couple of things I wasn’t entirely crazy about. John kept checking the slides on the projected screen. Come to find out, it was because he couldn’t see them on his laptop. OK, I forgive that. While he was pretty good at repeating the questions, a few times he forgot. It’s so easy to do when you’re speaking (guilty). Also, John didn’t have a necessary dongle for connecting his machine. Never assume you’ll have VGA/HDMI because one of them will always be wrong. But that’s it.

John’s presentation manner is a lot like talking to him. He’s real calm, looks you right in the eye, listens carefully to questions and then gives direct answers. It was a real pleasure to watch. I suggest checking out his sessions when he presents next.

Apr 27 2015

Benefits for Some, All or Only a Few

As a member of the PASS Board of Directors I attended the PASS Business Analytics Conference (BAC) recently. You can read more about it here and here (as well as here).

Let me start with an important note: I am voicing my opinion here as an individual, not an official stance of the PASS organization.

There is controversy around the BAC because of a whole bunch of things, but one question in particular bothered me. It was suggested that the people attending the BAC were just consuming the worth or value that other people who paid for the Summit generated. At first, I just dismissed this concept. It stuck in the back of my mind though. Suddenly I realized why.

Yes, the BAC was partly paid for by Summit. But the attendees at the BAC were not all people who would have attended Summit. Maybe 1/3 of them have attended Summit, are going to attend Summit, or might attend Summit. That means a majority will not.


Money from Summit is used to support Chapters. Anyone ever canvassed their attendees at a local user group for who has gone or will go to Summit? I have. Most of the time, far less than 1/3. Do we cut funding for Chapters?

Money from Summit is used to support SQL Saturday. Once again, I’ve canvassed several of these for people who were going to be attending Summit. Again, way less than 1/3. No more funding for SQL Saturday?

How about the Virtual Chapters that money from Summit pays for? How many of those people are attending Summit? I don’t know, but I’d be shocked if it’s 100% or anything close to that. Are we cutting Virtual Chapters?

24 Hours of PASS is also paid for by Summit.

You know, everything that PASS does, whether you like it, and attend it, or not, is paid for by Summit. There are good arguments to be made that we should not be doing the BAC (and arguments that we should). Where the money comes from is absolutely not a part of that argument. Otherwise, we must pull funding from anything and everything that is done by PASS that doesn’t translate to 100% benefits for the people who paid for it, Summit attendees.

I believe that we, the members of PASS, should be open and accepting and willing to try new things, both from a technical perspective and from a personal one. Providing training and community is what we do. Let’s focus on that.