Nov 23 2016

PowerShell to Test a Query

So you want to do some tuning, but you’re not sure how to test a query’s performance. Not a problem. Here’s a very rough script that I used for some recent testing.

This script to test a query is post #11 of the #entrylevel #iwanttohelp effort started by Tim Ford (b|t). Read about it here.

The Script

The goal here is to load a bunch of parameter values from one table and then use those values to run a query to test it. To do this I connect up to my SQL Server instance, naturally. Then I retrieve the values I’m interested in. I set up the query I want to test. Finally, I loop through the data set, calling the query once for each value.

[reflection.assembly]::LoadWithPartialName("Microsoft.SqlServer.Smo") | out-null
# Get the connection
$SqlConnection = New-Object System.Data.SqlClient.SqlConnection
$SqlConnection.ConnectionString = 'Server=WIN-3SRG45GBF97\DOJO;Database=WideWorldImporters;trusted_connection=true'

# Retrieve test data
$BillToCustomerCmd = New-Object System.Data.SqlClient.SqlCommand
$BillToCustomerCmd.CommandText = "SELECT  DISTINCT i.BillToCustomerID
FROM Sales.Invoices as i;"
$BillToCustomerCmd.Connection = $SqlConnection
$SqlAdapter = New-Object System.Data.SqlClient.SqlDataAdapter
$SqlAdapter.SelectCommand = $BillToCustomerCmd
$BillToCustomerList = New-Object System.Data.DataSet
$SqlAdapter.Fill($BillToCustomerList)

# Set up test query
$SQLCmd = New-Object System.Data.SqlClient.SqlCommand
$SQLCmd.Connection = $SqlConnection
$SQLCmd.CommandText = "DECLARE @sqlquery NVARCHAR(MAX);
SET @sqlquery
   = N'SELECT si.StockItemName,
   i.InvoiceDate,
   i.SalespersonPersonID
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = @BillToCustomerID;';

DECLARE @parms NVARCHAR(MAX);
SET @parms = '@BillToCustomerID int';

EXEC sys.sp_executesql @stmt = @sqlquery,
   @params = @parms,
   @BillToCustomerID = @btc;"
$SQLCmd.Parameters.Add("@btc",[System.Data.SqlDbType]"Int")

# Run the tests
foreach($row in $BillToCustomerList.Tables[0])
{
    $SqlConnection.Open()
    $SQLCmd.Parameters["@btc"].Value = $row[0]    
    $SQLCmd.ExecuteNonQuery() | Out-Null
    $sqlconnection.Close()
    
}

I’m using ExecuteNonQuery here so I can ignore the result set because, in this case, I don’t care about it. I just want to be able to capture the query metrics (using Extended Events naturally). If I wanted the results to come back I could just use ExecuteReader.
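If you don’t already have a session running to capture those metrics, here’s a minimal sketch of the kind of Extended Events session I’d use. The session and file names are just placeholders, and I’m assuming the WideWorldImporters database from the script above. Because the parameterized call from ADO.NET typically arrives as an RPC, I capture both rpc_completed and sql_batch_completed:

CREATE EVENT SESSION QueryPerformance
ON SERVER
ADD EVENT sqlserver.rpc_completed
    (WHERE (sqlserver.database_name = N'WideWorldImporters')),
ADD EVENT sqlserver.sql_batch_completed
    (WHERE (sqlserver.database_name = N'WideWorldImporters'))
ADD TARGET package0.event_file
    (SET filename = N'QueryPerformance');
GO

-- Start collecting before kicking off the PowerShell loop
ALTER EVENT SESSION QueryPerformance ON SERVER STATE = START;
GO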

Some Explanation

This is a very simple, even simplistic, way to do testing. I’m not providing this as a mechanism for all your tests. I’m not suggesting this should be your primary testing tool. This is just a simple way to do some basic testing.

You can easily mix this up to get more realistic tests or add to the tests. Throw in a command to pull the query out of the cache after each call. Now you’ll see how the compile works. Change the order of the retrieved data to make it random. Toss in other queries. Run a set of other queries on a loop in a different PowerShell script to generate load. The sky is the limit once you start playing with this.
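For example, to see the compile cost on every execution, I might drop something like this into the test query before each call. This is a rough sketch; the LIKE filter is just my guess at uniquely identifying the test statement, so adjust it to match your query:

DECLARE @PlanHandle VARBINARY(64);

SELECT @PlanHandle = deqs.plan_handle
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
WHERE dest.text LIKE N'%BillToCustomerID = @BillToCustomerID%';

IF @PlanHandle IS NOT NULL
BEGIN
    -- Remove just this plan from cache so the next call has to compile again
    DBCC FREEPROCCACHE(@PlanHandle);
END;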

The reason I go to PowerShell for this instead of running all these commands as T-SQL through SSMS is because of the more direct control on behavior I get with PowerShell. The ability to ignore the result set is just one example.

Conclusion

If you really want to do load testing and evaluation, I’d suggest setting up Distributed Replay and putting it to work. I’ve used it very successfully for that kind of thorough and complete testing of a system. If you really just want to know how this one query is going to fare, the PowerShell script above gives you a quick way to run that basic test. Just remember to capture the metrics when you’re doing any kind of test so that you can compare the results.


Want to play some more with execution plans and query tuning? I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, so if you’re interested, sign up here.

Nov 07 2016

sp_executesql Is Not Faster Than an Ad Hoc Query

This requires an immediate caveat. You should absolutely be using sp_executesql over any type of non-parameterized execution of T-SQL. You must parameterize your T-SQL because the lack of parameters in building up and executing strings is a classic SQL Injection attack vector. Using straight ad hoc T-SQL is an extremely poor coding choice because of SQL Injection, not because there is something that makes one method faster than the other.

Yet, I see in performance checklists that you should be using sp_executesql over straight ad hoc T-SQL because it will perform faster. That statement is incorrect.

Some Discussion

Let me reiterate the caveat before we continue. I 100% advocate for the use of sp_executesql. This function is preferred over ad hoc SQL because, used properly (and isn’t that usually one of the main problems, always), you can both build an ad hoc query and use parameters in order to avoid SQL Injection. The security implications of SQL Injection are kind of hard to over-emphasize. SQL Injection has been a primary vector for hacking for close on to twenty years now. We know the best way to avoid it is to use parameterized queries with data validation around the parameters. Why this is continually ignored is hard for me to understand.

However, despite the importance of using sp_executesql, I’m not advocating for its use as a performance improvement mechanism. I’m unclear as to how this comes to be on a performance checklist, with no discussion of taking advantage of Parameter Sniffing and/or plan reuse (possible performance advantages). I can only assume this is yet another example of Cargo Cult Programming. People know that they are supposed to use sp_executesql (and yes, you are supposed to use it), but don’t really understand why, so they start guessing.

The tests are going to run primarily from T-SQL in order to compare a straight EXECUTE of a query string to sp_executesql. However, for the sake of protecting against SQL Injection, let me also mention that, when calling to your database strictly through code, you can use two approaches (well, several, but we’ll focus on two in order to keep this blog post to a minimal size; I can’t caveat and explain every single possible permutation of all possible database access methods while still making anything approaching a coherent point): building up ad hoc T-SQL and executing that against the server directly, or using a mechanism to parameterize your queries. You absolutely should be using the parameterized methods in order to validate your input and avoid SQL Injection.

The Simplest Test

Let’s start with a very simple, and simplified, query in order to illustrate the point:

DECLARE @adhocquery NVARCHAR(max) 
SET @adhocquery = N'SELECT si.StockItemName,
   i.InvoiceDate,
   il.Description
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
ON si.StockItemID = il.StockItemID;'

EXEC (@adhocquery);


DECLARE @sqlquery NVARCHAR(max) 
SET @sqlquery = N'SELECT si.StockItemName,
   i.InvoiceDate,
   il.Description
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
ON si.StockItemID = il.StockItemID;'

EXEC sys.sp_executesql @stmt = @sqlquery;

That’s the same query executed using the two methods in question. The results are an identical execution plan and exactly the same number of reads. If I execute either of them thousands of times then the execution times don’t vary. They have matching query hash and plan hash values. These are identical queries in every possible way. Even if I compare the performance across thousands of executions and include the compile time there is no difference in the outcome.
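If you want to verify that for yourself, a quick check against the DMVs shows the matching hash values for both executions. This is just a sketch; the LIKE filter assumes the query text above:

SELECT dest.text,
       deqs.query_hash,
       deqs.query_plan_hash,
       deqs.execution_count,
       deqs.total_logical_reads
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
WHERE dest.text LIKE N'%StockItemName%';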

At the simplest possible level, these are identical mechanisms for executing a basic query. The only performance difference comes about because of parameters.

Test With Parameters

Instead of just running the query over and over again, I really want to test actual, meaningful, behavior this time. I’m going to load all the values for the BillToCustomerID column of the Invoices table using a PowerShell script. Then, I’ll execute the queries once for each of these values, using the two different execution methods.

To make aggregating the results easier, I put each query into a procedure:

CREATE PROCEDURE dbo.AdHoc (@BillToCustomerID INT)
AS
   DECLARE @Query NVARCHAR(MAX);

   SET @Query
      = N'SELECT si.StockItemName,
   i.InvoiceDate,
   il.Description
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = ' + CAST(@BillToCustomerID AS NVARCHAR(10)) + ';';

   EXEC (@Query);
GO


CREATE PROCEDURE dbo.ExecSQL (@BillToCustomerID INT)
AS
   DECLARE @sqlquery NVARCHAR(MAX);

   SET @sqlquery
      = N'SELECT si.StockItemName,
   i.InvoiceDate,
   il.Description
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = @BillToCustomerID;';

   DECLARE @parms NVARCHAR(MAX);

   SET @parms = '@BillToCustomerID int';

   EXEC sys.sp_executesql @stmt = @sqlquery,
      @params = @parms,
      @BillToCustomerID = @BillToCustomerID;
GO

The results are fun.

Execution Type Average Duration
sp_executesql AVG: 57946.03187251
Ad Hoc AVG: 14788.8924302789

What’s going on? Is the conclusion that, in fact, ad hoc queries are faster than sp_executesql?

Absolutely not.

I cheated.

I intentionally picked a data set with a pretty interesting distribution. Depending on the value passed for BillToCustomerID there is the possibility of one of three different execution plans:

sp_executesql

In fact, the data is such that the first value that would be called is going to generate the worst possible plan for all the other data sets because it leads to the plan that simply consists of three scans. Even if I choose to force one of the other plans first, something I did several times while testing, the fact that the ad hoc queries will always generate the best plan for the data set results in better overall performance for ad hoc, in this instance.

Please don’t mistake me. I could skew the data in another direction in order to make sp_executesql into the better performing mechanism. The initial premise was that you should use sp_executesql over ad hoc because it will be faster. That’s not the case. In fact, it completely depends on a number of factors as to which of these methods will be faster. That said, my preferred mechanism is to use sp_executesql because it creates parameterized queries where I can ensure, with a certainty, that I’m avoiding SQL Injection. To achieve parity on execution times, I could simply add an OPTION (RECOMPILE) hint to the statement and then I would have the same speed as the ad hoc approach while still ensuring my security.

Oh, and to add another wrinkle, you could always turn on ‘optimize for ad hoc workloads’. That shaves a few more milliseconds off the ad hoc approach over the sp_executesql approach in this example.
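If you want to try that setting yourself, it’s a standard server-level configuration change. Note that it affects the whole instance, so test it before flipping it in production:

EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cache only a plan stub the first time an ad hoc query is seen
EXEC sys.sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;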

Conclusion

I know I’ve over-emphasized this throughout this discussion, but I’m going to repeat it again, SQL Injection is dangerous and a purely ad hoc approach to queries leads to unsafe servers. You must validate your inputs and use a querying mechanism that ensures that SQL Injection won’t be an issue. This is accomplished by using parameterized queries, which includes sp_executesql.

However, should you be using sp_executesql over ad hoc queries because performance will improve? No. Clearly that’s not the case. Instead you should be using sp_executesql because it’s a safer, saner approach to writing your queries.


I love talking performance tuning. In fact, I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, so if you’re interested, sign up here.

Nov 01 2016

Stored Procedures Are Not Faster Than Views

A performance tuning tip I saw recently said, “Views don’t perform as well as stored procedures.”

<sigh>

Let’s break this down, just a little.

Definitions

A view is nothing but a query. The definition given by Microsoft is that it’s a virtual table that’s defined by a query. It’s a query that is used to mask data or perform a complex join or similar behaviors. Views are queries that get stored in the database. Views can be easily referred to as if they were tables. That’s it. I’ve written in the past about views, including how they can possibly perform poorly.

A stored procedure is also a query, or a series of queries, or a whole lot more. Microsoft’s definition of a stored procedure basically defines it as a programming object that can accept input through parameters, perform actions, and provide various types of output. Stored procedures are also stored in the database, but that’s about the end of the direct correlations to a view. Heck, you can call views from stored procedures, so I’m really at a loss as to where this tip comes from.

In short, yes, both these objects have in them queries, but these are fundamentally different objects. You can’t really say that using one or the other is faster because they each do different things. Further, you can write code that will perform poorly using either construct.

Test Setup To Compare Performance

Here’s the view definition I’m going to use for the tests:

CREATE VIEW dbo.CustomerDeliveryInfo
AS
SELECT c.CustomerName,
   c.DeliveryRun,
   c.RunPosition,
   dm.DeliveryMethodName,
   cid.CityName AS DeliveryCity,
   cip.CityName AS PostalCity,
   c.CustomerID
FROM Sales.Customers AS c
JOIN Application.DeliveryMethods AS dm
   ON dm.DeliveryMethodID = c.DeliveryMethodID
JOIN Application.Cities AS cid
   ON cid.CityID = c.DeliveryCityID
JOIN Application.Cities AS cip
   ON cip.CityID = c.PostalCityID;

We’re going to compare that with a stored procedure that uses the same query. The procedure also takes advantage of the fact that it is a stored procedure using a parameter for input values:

CREATE PROCEDURE dbo.CustomerDeliveryInformation 
(@CustomerID INT)
AS
BEGIN
   SELECT c.CustomerName,
      c.DeliveryRun,
      c.RunPosition,
      dm.DeliveryMethodName,
      cid.CityName AS DeliveryCity,
      cip.CityName AS PostalCity,
      c.CustomerID
   FROM Sales.Customers AS c
   JOIN Application.DeliveryMethods AS dm
      ON dm.DeliveryMethodID = c.DeliveryMethodID
   JOIN Application.Cities AS cid
      ON cid.CityID = c.DeliveryCityID
   JOIN Application.Cities AS cip
      ON cip.CityID = c.PostalCityID
   WHERE c.CustomerID = @CustomerID;
END;

We’re also going to create another stored procedure that uses the view:

CREATE PROCEDURE dbo.InfoCustomerDelivery 
(@CustomerID INT)
AS
BEGIN
   SELECT * FROM dbo.CustomerDeliveryInfo AS cdi
   WHERE cdi.CustomerID = @CustomerID;
END;

Because stored procedures and views are different, we’ll have to call these different objects in different ways:

SELECT *
FROM dbo.CustomerDeliveryInfo AS cdi
WHERE cdi.CustomerID = 556;

EXEC dbo.CustomerDeliveryInformation @CustomerID = 556;

EXEC dbo.InfoCustomerDelivery @CustomerID = 556;

In this way we can run each of these queries independently and compare the results between them.

Results Comparing Stored Procedures With Views

If you run each of the queries above, you will find that they all create a nearly identical execution plan:

Views exec plan

You can click on that to make it bigger. If we compare all the different plans, one set of details does stand out:

plandifferences

There is a difference in the compile time between the view by itself and the stored procedures (they were almost identical). Let’s look at performance over a few thousand executions:

Query duration
View AVG: 210.431431431431
Stored Proc w/ View AVG: 190.641641641642
Stored Proc AVG: 200.171171171171

This is measured in microseconds, so the variation we’re seeing is likely just some disparity on I/O, CPU or something else, since the differences are trivial at 10mc or 5%. While that may seem like the view is suffering, please note that the view inside the procedure actually ran faster by 5%. Again, this is explained by the fact that we’re only talking about a 10 microsecond difference. I’m not sure if that’s within the margin for error on the Extended Events sql_batch_completed event or not (I couldn’t find documentation stating what it might be), but I’ll bet it’s close. I believe it’s safe to say that the average performance of these queries is identical.

All three queries had 8 logical reads.

What about execution time including compile time, since there is a difference:

Query duration
View AVG: 10089.3226452906
Stored Proc AVG: 9314.38877755511
Stored Proc w/ View AVG: 9938.05410821643

Including compile time, the procedure alone performs about 700mc better on average than the view. That’s an 8% difference. The procedure that uses the view was close behind at 7%.

If we’re just talking compile time then, there is a significant win if we avoid the view. This is no doubt because of the extra work involved in unpacking the view and going through the simplification process within the optimizer. Plus, the view alone in our query was parameterized by the optimizer in order to assist its performance over time (as we saw in the average results without the recompile). All that extra work explains the 8% difference.

Let’s Break It

What if we change the query around a little? I decide that all I want to see right now from the view is the CustomerID:

SELECT cdi.CustomerID
FROM dbo.CustomerDeliveryInfo AS cdi
WHERE cdi.CustomerID = 556;

When I execute this, I get a whole new execution plan:

viewsimple

The execution time drops a little to around 190mc on average and the reads go from 8 to 2. The stored procedure would have to get rewritten to only return CustomerID. Does that mean that views are faster than stored procs? Absolutely not. It just means that there is some degree of flexibility built into the view, as a construct, that’s not there in a stored procedure, as a construct. These are fundamentally different objects.

What if we change the query against the view again:

SELECT *
FROM dbo.CustomerDeliveryInfo AS cdi
WHERE cdi.CustomerName = 'Om Yadav';

Once more the execution plan will change to something different than before:

viewcomplex

Performance drops to about 300mc and we get 10 reads instead of 8. Does that mean that views are slower than stored procedures? No. We’re attempting to compare two different objects that perform two different functions within SQL Server.

Conclusion

Since a stored procedure can actually query a view, suggesting that we use stored procedures instead of views becomes quite problematic. With the exception of the differences in compile time, we see that views actually perform exactly the same as stored procedures, if the query in question is the same. There are reasons to use views as well as reasons to not use them. There are reasons to use stored procedures as well as reasons to not use them. Neither of these objects is preferred above the other because of performance concerns.


Want to play some more with execution plans and query tuning? I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, so if you’re interested, sign up here.

Oct 24 2016

A Sub-Query Does Not Hurt Performance

The things you read on the internet, for example, “don’t use a sub-query because that hurts performance.”

Truly?

Where do people get these things?

Let’s Test It

I’ve written before about the concept of cargo cult data professionals. They see one issue, one time, and consequently extrapolate that to all issues, all the time. It’s the best explanation I have for why someone would suggest that a sub-query is flat out wrong and will hurt performance.

Let me put a caveat up front (which I will reiterate in the conclusion, just so we’re clear), there’s nothing magically good about sub-queries just like there is nothing magically evil about sub-queries. You can absolutely write a sub-query that performs horribly, does horrible things, runs badly, and therefore absolutely screws up your system. Just as you can with any kind of query. I am addressing the bad advice that a sub-query is to be avoided because they will inherently lead to poor performance.

Let’s start with a simple test, just to validate the concept of how a sub-query performs within SQL Server:

SELECT sd.OrderQty,
   pr.Name
FROM
   (SELECT *
    FROM Sales.SalesOrderDetail AS sod
   ) AS sd
JOIN
   (SELECT *
    FROM Production.Product AS p
   ) AS pr
   ON pr.ProductID = sd.ProductID
WHERE sd.SalesOrderID = 52777;

SELECT sod.OrderQty,
   p.Name
FROM Sales.SalesOrderDetail AS sod
JOIN Production.Product AS p
   ON p.ProductID = sod.ProductID
WHERE sod.SalesOrderID = 52777;

If there is something inherently wrong with a sub-query, then there is something twice as wrong with two sub-queries. Here are the resulting execution plans:

sub-query plan matches query plan

Huh. They look, sort of, I don’t know, almost identical. Let’s compare the plans using the new SSMS plan comparison utility:

only slight diffences in sub-query plan

Well, darn. Displayed in pink are the common sets of operations between the two plans. In other words, for these plans, everything except the properties of the SELECT operator are exactly the same. Let’s take a look at those properties:

sub-query SELECT properties

OK. Now we have some interesting differences, and especially, some interesting similarities. Let’s start with the similarities. First of all, we have exactly the same QueryPlanHash value in both plans. In addition, we also have identical estimated rows and costs. In short, the optimizer created two identical execution plans. Now, this is where things get a little bit interesting. See, the optimizer actually worked a little harder to create the first plan than the second. It took an extra tic on the CPU and just a little more CompileMemory and CompileTime. Interesting.

What about execution times? With a few runs on average, the execution times were identical at about 149mc with 11 reads. However, running a query once or twice isn’t testing. Let’s get a few thousand runs of both queries. The average results from the Extended Events sql_batch_completed event were 75.9 microseconds for both queries.

However, what about that extra little bit of compile time in the query that used sub-queries? Let’s add in a statement to free the procedure cache on each run and retry the queries. There is a measurable difference now:

Query duration
Sub-query AVG: 5790.20864172835
Query AVG: 4539.49289857972

More work is done by the optimizer on the sub-query to compile the same execution plan. We’re adding work to the optimizer, requiring it to unpack the, admittedly, silly query written above.  When we refer only to the compile time and not the execution time, there is a performance hit. Once the query is compiled, the performance is identical. Whether or not you get a performance hit from a sub-query then, in part, depends on the degree to which you’re experiencing compiles or recompiles. Without the recompile, there is no performance hit. At least in this example.

Let’s Test It Again, Harder

I firmly believe in the old adage: if you ain’t cheatin’, you ain’t fightin’. It’s time to put the boot in.

Let’s go with much more interesting queries that are more likely to be written than the silly example above. Let’s assume some versioned data like in this article on Simple-Talk. The article shows three ways to express a query that brings back a single version of one of the documents. We’re just going to mess with two of them: one that uses a sub-query, and one that does not:

--no sub-query
SELECT TOP 1 d.DocumentName,
   d.DocumentID,
   v.VersionDescription,
   v.VersionID,
   ROW_NUMBER() OVER (ORDER BY v.VersionID DESC) AS RowNum
FROM dbo.Document d
JOIN dbo.Version v
   ON d.DocumentID = v.DocumentID
WHERE d.DocumentID = 9729;

--sub-query
SELECT  d.[DocumentName],
        d.[DocumentId],
        v.[VersionDescription],
        v.[VersionId]
FROM    dbo.[Document] d
        CROSS APPLY (SELECT TOP (1)
                            v2.VersionId,
                            v2.VersionDescription
                     FROM   dbo.[Version] v2
                     WHERE  v2.DocumentId = d.DocumentId
                     ORDER BY v2.DocumentId,
                            v2.VersionId DESC
                    ) v
WHERE   d.[DocumentId] = 9729;

As per usual, we can run these once and compare results, but that’s not really meaningful. We’ll run them thousands of times. Also, to be sure we’re comparing apples to apples, we’ll force a recompile on every run, just like in the first set of tests. The results this time:

Query duration
Sub-query AVG: 1852.14114114114
Query AVG: 2022.62162162162

You’ll note that, even with the compile on each execution, the query using a sub-query actually out-performed the query that was not using a sub-query. The results are even more dramatic when we take away the compile time:

Query duration
Sub-query AVG: 50.8368368368368
Query AVG: 63.3103103103103

We can also look to the execution plans to get an understanding of how these queries are being resolved:

differentplans

The plan on top is the sub-query plan, and the plan on the bottom is the plan for just the plain query. You can see that the regular query is doing a lot more work to arrive at an identical set of data. The differences are visible in the average execution time, about a 20% improvement.

You could argue that we’re comparing two completely different queries, but that’s not true. Both queries return exactly the same result set. It just so happens that the query using the sub-query performs better overall in this instance. In short, there’s no reason to be scared of using a sub-query.

Sub-Query Conclusion

Is it possible for you to write horrid code inside of a sub-query that seriously negatively impacts performance? Yes. Absolutely. I’m not arguing that you can’t screw up your system with poor coding practices. You absolutely can. The query optimization process within SQL Server deals well with common coding practices. Therefore, the queries you write can be fairly sophisticated before, by nature of that sophistication, you begin to get serious performance degradation.

You need to have a method of validation for some of what you read on the internet. People should provide both the queries they are testing with and the numbers that their tests showed. If you’re just seeing completely unsupported, wildly egregious statements, they’re probably not true.

In conclusion, it’s safe to use sub-queries. Just be careful with them.


If you’re finding any of this useful and you’d like to dig down a little more, you can, because I’ll be putting on an all day seminar on execution plans and query tuning. The event takes place before SQLSaturday Providence in Rhode Island, December 2016, so if you’re interested, sign up here.

Oct 17 2016

SELECT * Does Not Hurt Performance

I read all the time how SELECT * hurts performance. I even see where people have said that you just have to supply a column list instead of SELECT * to get a performance improvement. Let’s test it, because I think this is bunkum.

The Test

I have here two queries:

SELECT *
FROM Warehouse.StockItemTransactions AS sit;

--and

SELECT sit.StockItemTransactionID,
       sit.StockItemID,
       sit.TransactionTypeID,
       sit.CustomerID,
       sit.InvoiceID,
       sit.SupplierID,
       sit.PurchaseOrderID,
       sit.TransactionOccurredWhen,
       sit.Quantity,
       sit.LastEditedBy,
       sit.LastEditedWhen
FROM Warehouse.StockItemTransactions AS sit;

I’m basically going to run this a few hundred times each from PowerShell. I’ll capture the executions using Extended Events and we’ll aggregate the results.
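To do the aggregation, I read the event file back with T-SQL along these lines. This is just a sketch: it assumes a session writing sql_batch_completed events to an event_file target named QueryPerformance, so adjust the name and path to match your own session.

WITH RawEvents AS
(   -- Adjust the path if the files aren't in the default LOG directory
    SELECT CAST(fx.event_data AS XML) AS event_data
    FROM sys.fn_xe_file_target_read_file(N'QueryPerformance*.xel', NULL, NULL, NULL) AS fx
),
ParsedEvents AS
(   SELECT re.event_data.value('(event/data[@name="batch_text"]/value)[1]', 'NVARCHAR(4000)') AS batch_text,
           re.event_data.value('(event/data[@name="duration"]/value)[1]', 'BIGINT') AS duration,
           re.event_data.value('(event/data[@name="logical_reads"]/value)[1]', 'BIGINT') AS logical_reads
    FROM RawEvents AS re
)
SELECT pe.batch_text,
       AVG(pe.duration) AS avg_duration,        -- duration is reported in microseconds
       AVG(pe.logical_reads) AS avg_reads
FROM ParsedEvents AS pe
GROUP BY pe.batch_text;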

The Results

I ran the test multiple times because, funny enough, I kept seeing some disparity in the results. One test would show a clear bias for one method, another test would show the opposite. However, averaging the averages we see that things broke down as follows:

SELECT * AVG: 167.247ms
Column List AVG: 165.500ms

That’s after about 2000 separate executions of each query. There’s a 2ms bias towards the Column List query as opposed to the *. That’s an improvement, if you want to call it that, of 1%. It’s hardly worth the bother, assuming that with more testing this continued to hold true. In multiple tests, the SELECT * ran faster. I just feel honor bound to put up the full results. They show an improvement, but not one I’d get excited about. Oh, and the reads, the execution plan, everything else… identical.

SELECT * Conclusion

Don’t get me wrong, there are lots of reasons to not use SELECT *. Yes, performance is one of the reasons to not use SELECT *. However, when most people suggest that maybe using SELECT * is a bad idea for performance reasons, what they’re saying is you ought to only move the columns you need and the data you are actually using, not everything. I’m not aware of anyone with experience and knowledge suggesting that using the complete column list instead of SELECT * is faster. As we can see in the tests above, it isn’t (or is by so small a margin, who cares).


I love talking performance tuning. In fact, I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, so if you’re interested, sign up here.

Oct 03 2016

Correlated Datetime Columns

SQL Server is a deep and complex product. There’s always more to learn. For example, I had never heard of Correlated Datetime Columns. They were evidently introduced as a database option in SQL Server 2005 to help support data warehousing style queries (frequently using dates and times as join criteria or filter criteria). You can read up on the concept here from this older article from 2008 on MSDN. However, doing a search online I didn’t find much else explaining how this  stuff worked (one article here, that didn’t break this down in a way I could easily understand). Time for me to get my learn on.

The concept is simple: turning this on for your database tells the optimizer that certain datetime columns have a relationship. The example from MSDN uses OrderDate and DueDate from the Purchasing.PurchaseOrderHeader and Purchasing.PurchaseOrderDetail tables respectively. Clearly, the DueDate would be near the OrderDate in general terms. Cool. Makes sense. Let’s see how it works, if it works.

Correlated Datetime Columns At Work

I’m going to use the example code from the Microsoft article, partly because there are aspects of it I don’t understand and exploring it will help me learn. Let’s start with the (slightly modified) base query that we hope will benefit from the Correlated Datetime setting:

SELECT *
FROM Purchasing.PurchaseOrderHeader AS poh
JOIN Purchasing.PurchaseOrderDetail AS pod
   ON poh.PurchaseOrderID = pod.PurchaseOrderID
WHERE poh.OrderDate
BETWEEN '20130801' AND '20130901';

The changes I made were the aliases and the values in the WHERE clause. I left everything else the same. Performance on this was 115ms on average with 59 reads. This is the execution plan:

ExecPlanNoIndexNoSetting

Click to embiggen. The missing index suggestion is on OrderDate using an INCLUDE of all the columns (the joy of SELECT * and Missing Index Hints, I should post a rant on Missing Index Hints at some point). The plan itself uses two Clustered Index Scan operations to retrieve the data. Since the scans are Ordered (check the properties in each operator to validate this, but the lack of a Sort operator is also a good hint), a Merge Join was used to put the data from the two tables together.

The example code drops and recreates the clustered index on the PurchaseOrderDetail table because one of the tables must have a clustered index with a datetime column as the first (or only) column in the key for Correlated Datetime Columns. However, I like to see how things behave. Let’s first create a non-clustered index on PurchaseOrderDetail.

CREATE INDEX IX_PurchaseOrderDetail_DueDate
ON Purchasing.PurchaseOrderDetail(DueDate);
GO

I don’t expect this to do anything for the above query, but I want to test each step. The non-clustered index doesn’t change anything. The execution plan and performance are the same as above. Fine. Let’s enable Correlated Datetime Columns:

ALTER DATABASE AdventureWorks2014
   SET DATE_CORRELATION_OPTIMIZATION ON;
GO

Running the query again, I can’t detect any differences. The execution plans are identical and the performance is the same. In case the plan wasn’t removed from cache because of the database setting change (I sure would expect it to be, but I could be wrong), I’m going to remove it from the cache myself:

DECLARE @PlanHandle VARBINARY(64);

SELECT  @PlanHandle = deqs.plan_handle
FROM    sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
WHERE   dest.text LIKE 'SELECT *
FROM Purchasing.PurchaseOrderHeader AS poh%';

IF @PlanHandle IS NOT NULL
    BEGIN
        DBCC FREEPROCCACHE(@PlanHandle);
    END
GO

Nope. It’s not because of a cached plan. The Correlated Datetime Column setting just isn’t digging how I’ve got the clustered index set up. However, let’s also test this. I’ll create an index on the PurchaseOrderHeader.OrderDate as well:

CREATE INDEX IX_PurchaseOrderHeader_OrderDate
ON Purchasing.PurchaseOrderHeader (OrderDate);
GO

Nope. No joy. All right. Clean up the test indexes and follow Microsoft’s lead:

DROP INDEX IX_PurchaseOrderDetail_DueDate
ON Purchasing.PurchaseOrderDetail;
GO
DROP INDEX IX_PurchaseOrderHeader_OrderDate
ON Purchasing.PurchaseOrderHeader;
GO
CREATE UNIQUE NONCLUSTERED INDEX
IX_PurchaseOrderDetail_PurchaseOrderID_PurchaseOrderDetailID 
ON Purchasing.PurchaseOrderDetail(PurchaseOrderID,PurchaseOrderDetailID);
GO
ALTER TABLE Purchasing.PurchaseOrderDetail
DROP CONSTRAINT PK_PurchaseOrderDetail_PurchaseOrderID_PurchaseOrderDetailID;
GO
CREATE CLUSTERED INDEX IX_PurchaseOrderDetail_DueDate
ON Purchasing.PurchaseOrderDetail(DueDate);
GO

Rerunning the original query I immediately see a difference. The reads dropped from 59 to 57. The execution time is a little better at 105ms. The execution plan is very different:

ExecPlanClusteredIndexSetting

We’re making use of that clustered index and it’s changed to a Hash Join. Let’s see the predicate for the Seek operation:

Seek Keys[1]: Start: [AdventureWorks2014].[Purchasing].[PurchaseOrderDetail].DueDate >= Scalar Operator(‘2013-07-07 00:00:00.000’), End: [AdventureWorks2014].[Purchasing].[PurchaseOrderDetail].DueDate < Scalar Operator(‘2013-10-05 00:00:00.000’)

So, the optimizer has picked a range of values that are near to the values that have been passed in, assuming a correlation between the data. According to the documentation the optimizer also ensures that using this correlation will still result in the same data being returned.

The original scan of PurchaseOrderHeader is still the same. In fact, we can see that when comparing the plans:

ExecPlanCompare

You’ll note that I have the similarities highlighted, so you can tell that the Scan is the same. Yes, its cost percentage within the plan has changed, but it’s still the same basic operation.

The magic comes from the optimizer creating and maintaining statistics, in the form of materialized views, around the data in tables that qualify for the Correlated Datetime Columns behavior. You can even see these views in your database. They have a common naming standard as outlined in the documentation: _MPStats_Sys_<constraint_object_id>_<GUID>_<FK_constraint_name> (no word on whether MP is anything as crazy as the WA of system generated statistics).
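For example, a query along these lines should list them out. This is just a sketch, assuming the naming pattern above:

SELECT v.name,
       OBJECT_DEFINITION(v.object_id) AS view_definition
FROM sys.views AS v
WHERE v.name LIKE N'\_MPStats\_Sys\_%' ESCAPE N'\';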

In short, Correlated Datetime Columns worked… Or did it? We’re comparing apples to hammers at the moment. I’ve got a new clustered index on one of the tables. That changes all the choices, whether or not I’ve changed some database setting. Let’s remove Correlated Datetime Columns, pull the plan from cache and then rerun the query.

Oops. Performance is now at 145ms on average with 118 reads. The execution plan has changed yet again:

ExecPlanClusteredIndexNoSetting

In short, it works.

I could go farther and look to replace the clustered index on the PurchaseOrderHeader table, but you get the idea.

Conclusion

Correlated Datetime Columns works. Clearly it’s not something you’re going to enable on all your databases. Probably most of your databases don’t have clustered indexes on datetime columns, let alone enough tables with correlation between the data stored in them. However, when you do have that type of data correlation, enabling Correlated Datetime Columns and ensuring you have a clustered index on the datetime column is a viable tuning mechanism. Further, this is a mechanism that has been around since 2005. Just so you know, I did all my testing in SQL Server 2016, so this is something that anyone in the right situation can take advantage of. Just remember that TANSTAAFL always applies. Maintaining the statistics needed for the Correlated Datetime Columns is done through materialized views that are automatically created through the optimization process. You can see the views in SSMS and in any queries against the system catalog. You’ll need to take this into account during your statistics maintenance. However, if Correlated Datetime Columns is something you need, this is really going to help with this, fairly narrow, aspect of query tuning.

Sep 06 2016

Query Store, Force Plan and “Better” Plans

I am endlessly fascinated by how the Query Store works. I love teaching it at every opportunity too. Plus, almost every time I teach it, I get a new question about the behavior that makes me delve into the Query Store just a little bit more, enabling me to better understand how it works. I received just such a question at SQLSaturday Norway:

If you are forcing a plan, and the physical structure changes such that a “better” plan is possible, what happens with plan forcing?

Let’s answer a different question first. What happens when the plan gets invalidated, when the index being used gets dropped or some other structural change occurs so that the plan is no longer valid? I answered that question in this blog post. The plan being forced, after the object is dropped, becomes invalid, so that plan can no longer be used. The Query Store still attempts to apply the plan during any recompile or compile event of the query in question, but it fails and a proper plan is used. All this means, I think, the Query Store is going to ignore the new index, since a new index doesn’t invalidate an existing plan. A new index just makes new plans possible. However, when I was asked this question, this wasn’t something I had tested, so I gave a speculative, best guess, answer with plenty of caveats and the promise to provide a tested answer ASAP. Here we go.

I’ll start with the same sample query:

SELECT  sit.Quantity,
        sit.TransactionOccurredWhen,
        i.InvoiceDate,
        si.StockItemName
FROM    Warehouse.StockItemTransactions AS sit
JOIN    Sales.Invoices AS i
        ON i.InvoiceID = sit.InvoiceID
JOIN    Warehouse.StockItems AS si
        ON si.StockItemID = sit.StockItemID
WHERE   sit.TransactionOccurredWhen BETWEEN '3/1/2015'
                                    AND     '3/5/2015';

The results are returned in about 53ms with 4850 reads and this execution plan:

PlanWithoutIndex

As you can see, there’s the suggestion of a possible missing index. We’ll apply that in a moment. First though, I’m going to get the plan and query identifiers from the Query Store:

SELECT  qsq.query_id,
        qsp.plan_id,
        CAST(qsp.query_plan AS XML)
FROM    sys.query_store_query AS qsq
JOIN    sys.query_store_query_text AS qsqt
        ON qsqt.query_text_id = qsq.query_text_id
JOIN    sys.query_store_plan AS qsp
        ON qsp.query_id = qsq.query_id
WHERE   qsqt.query_sql_text LIKE 'SELECT  sit.Quantity,
        sit.TransactionOccurredWhen,%';

With this information, I’ll use Force Plan to ensure this is the plan used going forward.

EXEC sys.sp_query_store_force_plan 42995,487;

With that done, I’ll create the index:

CREATE INDEX TransactionOccurredWhen
ON Warehouse.StockItemTransactions
(TransactionOccurredWhen)
INCLUDE (StockItemID,InvoiceID,Quantity);

When I do this same set of operations (run the query, identify a missing index, create a new index, rerun the query) without any plan forcing in place, the execution plan above is replaced with one that uses the index and results in about 32ms execution time with 2942 reads, a significant improvement. You get a recompile event because the schema involved with the query has changed. With the change, a new index is available, so the recompile event uses that new index. What happens when you force the plan?

The recompile event after running CREATE INDEX still occurs. However, because we have elected to force a plan, that plan is what is used. In this instance, a recompile to a new plan would result in a faster query using fewer resources. However, as long as we’re forcing a plan and that plan stays valid, the plan will be forced.

In short, the behavior is exactly as I expected. Choosing to force a plan in the Query Store results in that plan being forced. While I think that the Query Store and plan forcing are wonderful new tools in our tool box, I am concerned that plan forcing will become far too easy a thing to implement. I worry that people will implement it without thinking through the implications and potential impacts.

It gets worse. If I change the query, let’s say I make it into a stored procedure and parameterize the query, and, instead of a very limited date range, I send in a whole month, the execution plan is quite different (with or without the index). Forcing the plan that is expecting less than 1,000 rows onto a query that is retrieving 10,000 rows results in pretty horrific performance. We really are going to have to be careful about using plan forcing appropriately because, as in so much of the rest of SQL Server, and in all of programming for that matter, the code is going to do exactly what we tell it to do, whether that’s what we really want it to do or not.
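On the plus side, if a forced plan does turn out to be the wrong choice, backing it out is a single call. A sketch, using the query and plan identifiers from the example above:

EXEC sys.sp_query_store_unforce_plan @query_id = 42995, @plan_id = 487;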

Aug 15 2016

Query Store, Force Plan and Dropped Objects

I love the Query Store. Seriously. It’s a huge leap forward in the capabilities of Azure SQL Database and SQL Server in support of performance monitoring and query optimization. One of my favorite aspects of the Query Store is the ability to force plans. Frankly though, it’s also the scariest part of the Query Store. I do believe that plan forcing will be one of the most ill-used functions in SQL Server since the multi-statement table-valued user-defined function (don’t get me started). However, unlike the UDF, this ill-use will be because of poor understanding on the part of the user, not a fundamental design issue. No, plan forcing and the Query Store are very well constructed. Let me give you an example of just how well constructed they are.

Let’s imagine that you have a situation, such as bad parameter sniffing, where you’ve determined that, of the several possible execution plans for a query against a table, there is a preferred plan. Enabling plan forcing to ensure that plan gets used is a no-brainer. Let’s further imagine that you have a junior DBA who is… let’s just say overly aggressive in their duties such that they do silly things occasionally. What happens when your pretty plan, which uses a particular index, meets your junior DBA who just dropped that index?

Here’s the setup. We’re using the WideWorldImporters database and we have this query:

SELECT  *
FROM    Warehouse.StockItemTransactions AS sit
WHERE   sit.TransactionOccurredWhen BETWEEN '9/9/2015'
                                    AND     '9/11/2015';

This query, with the default configuration, will scan the existing table, so I’ll add an index:

CREATE INDEX TransactionOccurredWhenNCI
ON Warehouse.StockItemTransactions
(TransactionOccurredWhen);

For a limited range such as the one I’m passing above, I’ll get a plan with a key lookup operation which runs faster than the scan, so I’m happy. For a broader range, I’m likely to see a scan again, but since most of my queries have a very narrow range, I’d sure like to be able to force the plan to always compile to the seek and key lookup. To do this I need to find the query_id and plan_id from the Query Store (assuming I’m not using the GUI):

SELECT  qsp.plan_id,
        qsp.query_id,
		qsqt.query_sql_text,
		qsp.count_compiles
FROM    sys.query_store_plan AS qsp
JOIN    sys.query_store_query AS qsq
        ON qsq.query_id = qsp.query_id
JOIN    sys.query_store_query_text AS qsqt
        ON qsqt.query_text_id = qsq.query_text_id
WHERE   qsqt.query_sql_text LIKE 'SELECT  *
FROM    Warehouse.StockItemTransactions AS sit%';

With those values, I can force the execution plan so that it will always use the plan I want:

EXEC sys.sp_query_store_force_plan 42460,463;

That’s it. I’m happy because I’m going to see the execution plan used over and over, despite any values passed during a recompile.
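If I want to double-check that the forcing took, the plan catalog view exposes it directly. A quick sketch, using the plan identifiers from above:

SELECT qsp.plan_id,
       qsp.query_id,
       qsp.is_forced_plan,
       qsp.force_failure_count,
       qsp.last_force_failure_reason_desc
FROM sys.query_store_plan AS qsp
WHERE qsp.query_id = 42460;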

Then…

Along comes our aggressive junior DBA who decides that there are “too many” indexes on the server. No, I don’t know what that means either, but they evidently read it on the internet or something so they drop the index we created before:

DROP INDEX TransactionOccurredWhenNCI ON Warehouse.StockItemTransactions;

What now happens to our lovely execution plan and the plan forcing? We’ll take a look at two events in Extended Events, sql_statement_recompile and query_store_plan_forcing_failed. Nothing happens immediately on dropping the index. The plans associated with that object, if any, are marked as invalid in the cache. The next time we call the query it’s going to recompile and we can see the event:

recompile_event

The most important part of the event is the recompile_cause which is “Schema changed”. However, I would also note the attach_activity_id.guid. I’ve chosen to enable causality tracking in this Extended Event session. This will cause all events associated with a common activity to get a GUID and then a sequence. This is interesting because, after the recompile event, we get the query_store_plan_forcing_failed event:

plan_forcing_failed

The guid value is the same as the event above and the *.seq number is now 2, showing that, for these events, the recompile event occurred and then this event occurred. That makes perfect sense. The plan is marked for recompile, so, it’s going to be recompiled. I have enabled plan forcing though, so I have a particular plan that I want the optimizer to use. However, thanks to my “helpful” junior DBA, the plan is now invalid. You even get the description of what happened in the message field for the event:

Index ‘WideWorldImporters.Warehouse.StockItemTransactions.TransactionOccurredWhenNCI’, specified in the USE PLAN hint, does not exist. Specify an existing index, or create an index with the specified name.

The first question now is, what happens with this query and the execution plan? Does the new plan generated now that the index is missing get stored in cache? Yes, it does. We can validate that by querying the cache, or, when capturing the actual execution plan, checking the “Retrieved from cache” property.

Because plan forcing is enabled, do we see a recompile every time this query is called? The answer to that question is slightly complex. Under normal circumstances, no. As long as that plan remains in cache, it’s simply reused. No other recompiles occur. A normal recompile event will cause another attempt at applying the invalid execution plan and we would see yet another query_store_plan_forcing_failed event for each recompile on the query. However, during testing, Joey D’Antoni (who was helping me play with this when we discussed what would happen when a plan was made invalid) had severe memory pressure on his server. He saw intermittent recompiles with a cause message that said plan forcing had failed. So if your server is under extreme stress and you cause this issue, you might see different messages. Just remember, the cause of the recompiles was not the plan forcing, but the memory pressure.

The fun thing is, as long as I don’t remove the plan forcing or take the query and plan out of the Query Store manually, if I recreate the index on my table with the same name and definition as that expected by the plan, the Query Store will simply reapply the plan and then successfully force it during any subsequent recompile situation. This is because Query Store is persisted with the database and barring outside activity, the information there will remain, just like the rest of the data in the database.

All of this means that Query Store works exactly the way we would expect, not forcing additional recompiles when you, or your junior DBA, inadvertently invalidate a plan. It also works as expected in that forcing a plan is stored with your database so that, assuming you don’t remove that plan from the Query Store, it will simply be reapplied after you fix the problem. It’s fun to see the thought that went behind the design of the behavior of Query Store. However, please, use plan forcing judiciously.

Jul 25 2016

Monitor Query Performance

Blog post #7 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel. Read about it here.

Sooner or later when you’re working with SQL Server, someone is going to complain that the server is slow. I already pointed out the first place you should look when this comes up. But what if they’re more precise? What if, you know, or at least suspect, you have a problem with a query? How do you get information about how queries are behaving in SQL Server?

Choices For Query Metrics

It’s not enough to know that you have a slow query or queries. You need to know exactly how slow they are. You must measure. You need to know how long they take to run and you need to know how many resources are used while they run. You need these numbers so that, after you do something to try to help the query, you can determine whether or not you’ve improved performance. To measure the performance of queries, you have a number of choices. Each choice has positives and negatives associated with it. I’m going to run through my preferred mechanisms for measuring query performance and outline why. I’ll also list some of the other mechanisms you have available and tell you why I don’t like them. Let’s get started.

Dynamic Management Views

Since SQL Server 2005, Dynamic Management Views (DMV) and Functions (DMF) have been available for accessing all sorts of information about the server. Specifically, there are a few DMVs that are focused on queries and query performance. If you go back through my blog, you can find tons of examples where I illustrate their use. You can also see them at work in commercial tools and free tools. Adam Machanic’s sp_WhoIsActive, a free tool, makes extensive use of DMVs. To learn more about DMVs, you can download a free book, Performance Tuning with SQL Server Dynamic Management Views. DMVs are available in Azure SQL Database, Azure SQL Data Warehouse, and all editions of SQL Server.

The information captured by DMVs is an aggregation of all the times the query has been run. This means you can’t find how long the query ran at 3PM yesterday. You can though see the minimum and maximum time the query took as well as the average. The ability to see this information is what makes DMVs useful. However, another important point about DMVs is that they only collect information while a query is in memory. As soon as it leaves the cache (the area of memory it is stored in), so does all the aggregated information about the query in the DMVs.

You use the DMVs for a general understanding of how a query is behaving. They’re not meant for detailed, long-term collection of information about queries. For that we use other tools.
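As a simple example, something along these lines pulls the average, minimum and maximum duration for whatever is currently in cache. This is just a sketch; the durations reported by this DMV are in microseconds:

SELECT TOP (10)
       dest.text,
       deqs.execution_count,
       deqs.total_elapsed_time / deqs.execution_count AS avg_elapsed_time,
       deqs.min_elapsed_time,
       deqs.max_elapsed_time,
       deqs.total_logical_reads / deqs.execution_count AS avg_logical_reads
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
ORDER BY avg_elapsed_time DESC;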

Extended Events

Introduced in SQL Server 2008, Extended Events (ExEvents) are a mechanism for capturing detailed information about SQL Server and the processes within. One of those processes is how queries behave. I have multiple examples on this blog on using ExEvents. You can’t go wrong reading about them on Jonathan Kehayias’ blog. Extended events are available in Azure SQL Database and all editions of SQL Server.

When you need to know every query against a database, or each time a particular query is called, and all the details associated with the query (reads, writes, duration), ExEvents are the way to go. ExEvents are very lightweight on the server (but not free) and can be filtered so that you capture just the information you need. The information is detailed and not aggregated. Instead it’s raw. The real issue with capturing this data is the amount of data you’ll be capturing. Testing and careful filtering to ensure you’re not dealing with too much information are important. Prior to SQL Server 2012, there was no graphical user interface for reading ExEvent data, so you would have been forced to run queries against the XML that the information is captured within. With the tools available in SQL Server Management Studio, this is no longer the case.

You use ExEvents when you need specific and detailed information about a query. ExEvents are not so good for generalized monitoring.

Query Store

Introduced in Azure SQL Database, and first released in SQL Server with 2016, Query Store is another mechanism for capturing aggregated information about queries. As before, I have examples on how to work with Query Store on my blog. You can also find quite a bit on it over at Simple-Talk. Query Store is pretty specialized still and only available in Azure and SQL Server 2016, but it is in all editions of SQL Server 2016.

Query Store captures information similar to what is available in the DMVs. However, unlike the DMVs, the information that Query Store captures is kept around, even after a query ages out or is removed from cache. This persistence makes Query Store very exciting. You do have to choose to turn it on for each database you wish to capture queries for. It’s not automatic like DMVs. The capture processes are asynchronous, so they should be relatively light weight for most databases.

You use the Query Store when you need to capture query metrics over the long term, but you don’t need detailed information, and aggregations work well for you.
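Assuming you’ve turned it on for the database, a first look at the aggregated metrics is just a join across the Query Store catalog views. A rough sketch (the database name is just an example):

-- Enable Query Store per database; it isn't on by default
ALTER DATABASE WideWorldImporters SET QUERY_STORE = ON;
GO

SELECT qsqt.query_sql_text,
       qsrs.count_executions,
       qsrs.avg_duration,
       qsrs.avg_logical_io_reads,
       qsrs.last_execution_time
FROM sys.query_store_query_text AS qsqt
JOIN sys.query_store_query AS qsq
    ON qsq.query_text_id = qsqt.query_text_id
JOIN sys.query_store_plan AS qsp
    ON qsp.query_id = qsq.query_id
JOIN sys.query_store_runtime_stats AS qsrs
    ON qsrs.plan_id = qsp.plan_id
ORDER BY qsrs.avg_duration DESC;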

Others

There are other ways to measure query performance. You can use the Profiler GUI, but that actually seriously negatively impacts the server. You can bring a server down by using it, so it should be avoided. Profiler can also generate server-side trace scripts, which can be used to monitor your server. However, traces have a much higher impact than ExEvents and they’re on the deprecation list. Microsoft is not adding new trace events for new functionality, so they’re becoming less and less useful with each release. You also can’t use trace against Azure. If you’re writing a query and you just want to see how long it takes to run, you can use SET STATISTICS TIME ON to capture the execution time. This is a handy way to quickly measure performance. There is also the ability to capture reads and writes using SET STATISTICS IO ON, but, while this does capture the metrics we need, it adds considerable overhead to the query, skewing performance measurement. This is why I stick to ExEvents when I need an accurate measure.
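Still, for completeness, the quick-and-dirty approach looks something like this (using one of the queries from earlier posts; remember the caveat above about the overhead of STATISTICS IO):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT *
FROM Warehouse.StockItemTransactions AS sit;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;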

Conclusion

Honest people can disagree about the best way to capture query performance. I have my preferences as you can see. However, I’m fairly certain that everyone would agree that it’s important to know how to capture performance metrics in order to be able to assert that performance has increased or decreased in a measured fashion. You don’t want to guess at query performance, you want to know.

Jul 18 2016

Common Table Expression, Just a Name

The Common Table Expression (CTE) is a great tool in T-SQL. The CTE provides a mechanism to define a query that can be easily reused over and over within another query. The CTE also provides a mechanism for recursion which, though a little dangerous and overused, is extremely handy for certain types of queries. However, the CTE has a very unfortunate name. Over and over I’ve had to walk people back from the “Table” in Common Table Expression. The CTE is just a query. It’s not a table. It’s not providing a temporary storage space like a table variable or a temporary table. It’s just a query. Think of it more like a temporary view, which is also just a query.

Every time I explain this, there are people who don’t believe me. They point to the “Table” in the name, “See. Says so right there. It’s a table.”

It’s not and I can prove it. Let’s create a relatively simple CTE and use it in a query:

WITH    MyCTE
          AS (SELECT    c.CustomerName,
                        cc.CustomerCategoryName
              FROM      Sales.Customers AS c
              JOIN      Sales.CustomerCategories AS cc
              ON        cc.CustomerCategoryID = c.CustomerCategoryID
              WHERE     c.CustomerCategoryID = 4)
    SELECT  *
    FROM    MyCTE;

Now, I’m going to run the query within the CTE and the CTE together as two statements in a batch and capture the execution plans:

ExecPlans

On the top, the CTE, on the bottom, the query. You’ll note that the execution plans are identical. They each have the exact same Query Plan Hash value in the properties, 0x88EFD2B7C165E667, even though they have different Query Hash values, 0x192FFC125A08CC35 and 0xFEB7F2BCAC853CD5, respectively. Further, if I capture the query metrics using extended events, I get identical reads and, on average, identical execution times:

duration

This is because there is no table being created. The data is not treated differently. A CTE is just a query, not some type of temporary storage.
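In fact, the optimizer treats the CTE the same way it treats the equivalent derived table. This rewrite, a sketch of the same logic written inline, should produce exactly the same plan as the CTE above:

SELECT x.CustomerName,
       x.CustomerCategoryName
FROM (SELECT c.CustomerName,
             cc.CustomerCategoryName
      FROM Sales.Customers AS c
      JOIN Sales.CustomerCategories AS cc
          ON cc.CustomerCategoryID = c.CustomerCategoryID
      WHERE c.CustomerCategoryID = 4) AS x;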

Heck, let’s do one more thing. Let’s use the latest SSMS plan comparison tool and highlight one of the operators to see what differences there are internally in the plan:

 

plancompare

I don’t see a lot of differences. In fact, I don’t see any. That’s because the optimizer recognizes these two queries as identical. If it was loading data into temporary storage, you would see differences in something. We don’t. This is because, despite the somewhat unfortunate emphasis that gets placed on the Table portion of the name, the emphasis of the name, Common Table Expression, should be on the word Expression.

I will point out an interesting difference, especially useful for those who plug in CTEs everywhere, whether it’s needed or not. Let’s look at the properties of the two plans:

peroperties

You can see the similarities and differences that I pointed out earlier in the Statement, Query Hash and Query Plan Hash, as well as the Estimated Subtree Cost and others. What’s truly interesting is that the CompileCPU, CompileMemory and CompileTime for the CTE are higher than for the regular query. While the CTE is just a query, it’s a query that adds a non-zero overhead when used, and therefore, should only be used where appropriate (good gosh, I’ve seen people put it EVERYWHERE, on every single query, don’t do that).

Hopefully, this is enough to establish, truly, completely, and thoroughly, that the Common Table Expression is an expression, not a table.

Yeah, I did this before, but it keeps coming up, so I tried a different approach. Let’s see if the word gets out. Your Common Table Expression is not a table.


I love talking about execution plans and query tuning. I’ll be doing this at an all day pre-con at the SQLServer Geeks Annual Summit in Bangalore, India.

Don’t miss your chance to attend an all day training course on execution plans before SQL Saturday Oslo in September.