Category: SQL Server 2005

Dec 05 2016

Presentations for SQL Server Beginners

#iwanttohelp

Tired from all those blog posts

For my final blog post in the #gettingstarted, #iwanttohelp series, I decided to curate a set of presentations from the PASS Virtual Chapters. This content is available online. It’s free. Most importantly for the person just getting started as a SQL Server data pro, it’s good. I’m going to marry each of the presentations with my eleven blog posts in this series.

  1. The Importance of a Full Backup in SQL Server
    For this one I’m going to recommend Tim Radney’s session Understanding SQL Server Backup and Restore. I know Tim personally and guarantee this is a good session.
  2. Why Is The Server Slow
    Jes Borland is a very close personal friend and an absolutely outstanding presenter (and person). She has a session all about getting you started on wait statistics Beyond CXPacket: Understanding Wait Statistics.
  3. SQL Server Backups Are a Business Decision
    Paul Randal is, well, Paul Randal. If you’re not familiar with his work, you really should be. He presented a session called Building the Right Backup Strategy that should coincide nicely with my blog post.
  4. A View Is Not A Table
    OK, I couldn’t find a single session talking about views, so I’m going to come at this from the side, just a little (and I might start working on a presentation on views & other structures). This session by Kenneth Ureña, who I’ve talked with several times, is about structures and performance. It should do the trick. Indexing Strategies and Good Physical Designs for Performance Tuning.
  5. Use The Correct Data Type
    Silent Killers Lurking in Your Schema by (yes, a good friend) Mickey Steuwe is absolutely going to cover data types, among other things.
  6. Choosing the Right SQL Server Edition
    John Martin (yeah, another friend, what can I say, I try to stay on the good side of smart & capable people) has a presentation that’s a bit more focused on SQL Server internals but will absolutely address the edition of SQL Server as part of the work. Get SQL Server Set Up Right! The First Time.
    I’m also going to add a session by Jes Borland, again, that introduces Azure, especially talking about Azure in a hybrid scenario. Azure and SQL Server. Plus, I just want to share more Azure links. If you’re just getting started with the Microsoft Data Platform, start with Azure.
  7. Monitor Query Performance
    There are tons of choices here. I’m going with Kicking and Screaming: Replacing Profiler With Extended Events because I know that most people learn Profiler first. I also know that we need to get people to start using Extended Events instead. I will further add that Erin Stellato really knows here stuff on this topic, as well as many others. She is also a wonderful presenter (and a friend).
  8. Azure SQL Database For Your First Database
    As I said, I really believe in Azure as an entry into the land of Data Platform. It’s easy. It’s inexpensive. It’s available almost anywhere. Tim Radney has a session called Azure SQL Database for the Production DBA that will help to get you started.
  9. The Clustered Index is Vital To Your Database Design
    The clustered index is so important that everyone presents on it in one fashion or another. I’ve met Ayman El-Ghazali a few times. We’re not friends yet, but we get along (he’s very smart, see above). His session Indexing Fundamentals seems ideal for this slot.
  10. Statistics Are Vital For Performance
    Erin Stellato has a session called Statistics Starters. It is so important to understand what statistics are within SQL Server, how to use them, how to maintain them, it’s hard to over-emphasize it.
  11. PowerShell To Test a Query
    If you’re just getting going with Powershell, you’re going to need tons of help. Luckily it’s out there. I could have picked any number of excellent sessions by amazing people (many of them friends). I’m going with the PowerShell Tips and Tricks for SQL Server Administration by Mike Fal. Mike (a friend) recently helped me with some of my own PowerShell scripts, so he’s kind of top of mind at the moment.

Conclusion

For my final post I wanted to do three things. First, a recap of the last year’s worth of #getttingstarted blog posts. Believe it or not, I put a little work into each of these and it’s nice to see them all collected like this. Second, I wanted to illustrate the vast quantity of high quality learning that is available through PASS and the PASS Virtual Chapters. I pulled these sessions from a bunch of different chapters. There are alternatives I could have picked for every topic (except for views, something to work on). Which leads to my third, and final reason for doing my last post in the series like this, I wanted to keep helping. You now know a great resource to go to and search through for more information and learning.

Before we go, I want launch a small challenge of my own. If you’re a blogger, pick a topic (or a collection of topics like I did), and put together a blog post that curates the content using only PASS resources. Link back to this blog. Use the hash tag, #PASScurated. Let’s see what other information can be put together.

 

 

Nov 23 2016

PowerShell to Test a Query

powershellSo you want to do some tuning, but you’re not sure how to test a query on it’s performance. Not a problem. Here’s a very rough script that I use to do some recent testing.

This script to test a query is post #11 of the #enterylevel #iwanttohelp effort started by Tim Ford (b|t). Read about it here.

The Script

The goal here is to load a bunch of parameter values from one table and then use those values to run a query to test it. To do this I connect up to my SQL Server instance, naturally. Then I retrieve the values I’m interested in. I set up the query I want to test. Finally a loop through the data set, calling the query once for each value.

[reflection.assembly]::LoadWithPartialName("Microsoft.SqlServer.Smo") | out-null
# Get the connection
$SqlConnection = New-Object System.Data.SqlClient.SqlConnection
$SqlConnection.ConnectionString = 'Server=WIN-3SRG45GBF97\DOJO;Database=WideWorldImporters;trusted_connection=true'

# Retrieve test data
$BillToCustomerCmd = New-Object System.Data.SqlClient.SqlCommand
$BillToCustomerCmd.CommandText = "SELECT  DISTINCT i.BillToCustomerID
FROM Sales.Invoices as i;"
$BillToCustomerCmd.Connection = $SqlConnection
$SqlAdapter = New-Object System.Data.SqlClient.SqlDataAdapter
$SqlAdapter.SelectCommand = $BillToCustomerCmd
$BillToCustomerList = New-Object System.Data.DataSet
$SqlAdapter.Fill($BillToCustomerList)

# Set up test query
$SQLCmd = New-Object System.Data.SqlClient.SqlCommand
$SQLCmd.Connection = $SqlConnection
$SQLCmd.CommandText = "DECLARE @sqlquery NVARCHAR(MAX);
SET @sqlquery
   = N'SELECT si.StockItemName,
   i.InvoiceDate,
   i.SalespersonPersonID
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = @BillToCustomerID;';

DECLARE @parms NVARCHAR(MAX);
SET @parms = '@BillToCustomerID int';

EXEC sys.sp_executesql @stmt = @sqlquery,
   @params = @parms,
   @BillToCustomerID = @btc;"
$SQLCmd.Parameters.Add("@btc",[System.Data.SqlDbType]"Int")

# Run the tests
foreach($row in $BillToCustomerList.Tables[0])
{
    $SqlConnection.Open()
    $SQLCmd.Parameters["@btc"].Value = $row[0]    
    $SQLCmd.ExecuteNonQuery() | Out-Null
    $sqlconnection.Close()
    
}

I’m using ExecuteNonQuery here so I can ignore the result set because, in this case, I don’t care about it. I just want to be able to capture the query metrics (using Extended Events naturally). If I wanted the results to come back I could just use ExecuteQuery.

Some Explanation

This is a very simple and simplistic way to do testing. I’m not providing this as a mechanism for all your tests. I’m not suggesting this should be your primary testing tool. This is just a simple way to do some basic testing.

You can easily mix this up to get more realistic tests or add to the tests. Throw in a command to pull the query out of the cache after each call. Now you’ll see how the compile works. Change the order of the retrieved data to make it random. Toss in other queries. Run a set of other queries on a loop in a different PowerShell script to generate load. The sky is the limit once you start playing with this.

The reason I go to PowerShell for this instead of running all these commands as T-SQL through SSMS is because of the more direct control on behavior I get with PowerShell. The ability to ignore the result set is just one example.

Conclusion

If you really want to do load testing and evaluation, I’d suggest setting up Distributed Replay and putting it to work. I’ve used it very successfully for that kind of thorough and complete testing of a system. If you really just want to know how this one query is going to fare, the PowerShell script above will enable you to test a query through this basic test. Just remember to capture the metrics when you’re doing any kind of test so that you can compare the results.


Want to play some more with execution plans and query tuning? I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, therefore, if you’re interested, sign up here.

Nov 07 2016

sp_executesql Is Not Faster Than an Ad Hoc Query

This requires an immediate caveat. You should absolutely be using sp_executesql over any type of non-parameterized execution of T-SQL. You must parameterize your T-SQL because the lack of parameters in building up and executing strings is a classic SQL Injection attack vector. Using straight ad hoc T-SQL is an extremely poor coding choice because of SQL Injection, not because there is something that makes one method faster than the other.

Yet, I see in performance checklists that you should be using sp_executesql over straight ad hoc T-SQL because it will perform faster. That statement is incorrect.

Some Discussion

Let me reiterate the caveat before we continue. I 100% advocate for the use of sp_executesql. This function is preferred over ad hoc SQL because, used properly (and isn’t that usually one of the main problems, always), you can both build an ad hoc query and use parameters in order to avoid SQL Injection. The security implications of SQL Injection are kind of hard to over-emphasize. SQL Injection has been a primary vector for hacking for close on to twenty years now. We know the best way to avoid it is to use parameterized queries with data validation around the parameters. Why this is continually ignored is hard for me to understand.

However, despite the importance of using sp_executesql, I’m not advocating for it’s use as a performance improvement mechanism. I’m unclear as to how this comes to be on a performance checklist, with no discussion of taking advantage of Parameter Sniffing and/or plan reuse (possible performance advantages). I can only assume this is yet another example of Cargo Cult Programming. People know that they are supposed to use sp_executesql (and yes, you are supposed to use it), but don’t really understand why, so they start guessing.

The tests are going to run primarily from T-SQL in order to compare a straight EXECUTE of a query string to sp_executesql. However, for the sake of protecting against SQL Injection, let me also mention that calling to your database strictly through code, you can use two approaches (well, several, but we’ll focus on two in order to keep this blog post to a minimal size, I can’t caveat and explain every single possible permutation of all possible database access methods while still making anything approaching a coherent point), building up ad hoc T-SQL and executing that against the server directly, or, using a mechanism to parameterize your queries. You absolutely should be using the parameterized methods in order to validate your input and avoid SQL Injection.

The Simplest Test

Let’s start with a very simple, and simplified, query in order to illustrate the point:

DECLARE @adhocquery NVARCHAR(max) 
SET @adhocquery = N'SELECT si.StockItemName,
   i.InvoiceDate,
   il.Description
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
ON si.StockItemID = il.StockItemID;'

EXEC (@adhocquery);


DECLARE @sqlquery NVARCHAR(max) 
SET @sqlquery = N'SELECT si.StockItemName,
   i.InvoiceDate,
   il.Description
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
ON si.StockItemID = il.StockItemID;'

EXEC sys.sp_executesql @stmt = @sqlquery;

That’s the same query executed using the two methods in question. The results are an identical execution plan and exactly the same number of reads. If I execute either of them thousands of times then the execution times don’t vary. They have matching query hash and plan hash values. These are identical queries in every possible way. Even if I compare the performance across thousands of executions and include the compile time there is no difference in the outcome.

At the simplest possible level, these are identical mechanisms for executing a basic query. The only performance difference comes about because of parameters.

Test With Parameters

Instead of just running the query over and over again, I really want to test actual, meaningful, behavior this time. I’m going to load all the values for the BillToCustomerID column of the Invoices table using a PowerShell script. Then, I’ll execute the queries once for each of these values, using the two different execution methods.

To make aggregating the results easier, I put each query into a procedure:

CREATE PROCEDURE dbo.AdHoc (@BillToCustomerID INT)
AS
   DECLARE @Query NVARCHAR(MAX);

   SET @Query
      = N'SELECT si.StockItemName,
   i.InvoiceDate,
   il.Description
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = ' + CAST(@BillToCustomerID AS NVARCHAR(10)) + ';';

   EXEC (@Query);
GO


CREATE PROCEDURE dbo.ExecSQL (@BillToCustomerID INT)
AS
   DECLARE @sqlquery NVARCHAR(MAX);

   SET @sqlquery
      = N'SELECT si.StockItemName,
   i.InvoiceDate,
   i.SalespersonPersonID
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = @BillToCustomerID;';

   DECLARE @parms NVARCHAR(MAX);

   SET @parms = '@BillToCustomerID int';

   EXEC sys.sp_executesql @stmt = @sqlquery,
      @params = @parms,
      @BillToCustomerID = @BillToCustomerID;
GO

The results are fun.

Execution Type Average Duration
sp_executesql AVG: 57946.03187251
Ad Hoc AVG: 14788.8924302789

What’s going on? Is the conclusion that, in fact, ad hoc queries are faster than sp_executesql?

Absolutely not.

I cheated.

I intentionally picked a data set with a pretty interesting distribution. Depending on the value passed for BillToCustomerID there is the possibility of one of three different execution plans:

sp_executesql

In fact, the data is such that the first value that would be called is going to generate the worst possible plan for all the other data sets because it leads to the plan that simply consists of three scans. Even if I choose to force one of the other plans first, something I did several times while testing, the fact that the ad hoc queries will always generate the best plan for the data set results in better overall performance for ad hoc, in this instance.

Please don’t mistake me. I could skew the data in another direction in order to make sp_executesql into the better performing mechanism. The initial premise was that you should use sp_executesql over ad hoc because it will be faster. That’s not the case. In fact, it completely depends on a number of factors as to which of these methods will be faster. That said, my preferred mechanism is to use sp_executesql because it creates parameterized queries where I can ensure, with a certainty, that I’m avoiding SQL Injection. To achieve parity on execution times, I could simply include the WITH RECOMPILE hint and then I would have the same speed as the ad hoc approach while still ensuring my security.

Oh, and to add another wrinkle, you could always turn on ‘Optimize For Ad Hoc’. That shaves a few more milliseconds off the ad hoc approach over the sp_executesql approach in this example.

Conclusion

I know I’ve over-emphasized this throughout this discussion, but I’m going to repeat it again, SQL Injection is dangerous and a purely ad hoc approach to queries leads to unsafe servers. You must validate your inputs and use a querying mechanism that ensures that SQL Injection won’t be an issue. This is accomplished by using parameterized queries, which includes sp_executesql.

However, should you be using sp_executesql over ad hoc queries because performance will improve? No. Clearly that’s not the case. Instead you should be using sp_executesql because it’s a safer, saner approach to writing your queries.


I love talking performance tuning. In fact, I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, therefore, if you’re interested, sign up here.

Nov 03 2016

Reinforcing the Importance of Statistics on Row Estimate

I recently wrote an introductory post about the importance of statistics. I just received a reinforcement on how important they are during my own work.

Bad Estimate

I hit a weird problem while I was setting up a query to illustrate a point (blog to be published next week). Let’s take the basis of the problem and explain it. I wanted data with distribution skew, so I ran this query to find out if there was a wide disparity between the top and bottom of the range:

SELECT i.BillToCustomerID,
   COUNT(i.BillToCustomerID) AS TestCount
FROM Sales.Invoices AS i
GROUP BY i.BillToCustomerID
ORDER BY TestCount ASC;

Sure enough, the bottom of the range returned three (3) rows and the top returned 21,551. If I then run a query to retrieve just a few rows like this:

SELECT *
FROM Sales.Invoices AS i
WHERE i.BillToCustomerID = 1048;

I get the following execution plan:

SimplePlan

I’m happy because this is the plan I expected. With this plan in hand, I don’t bother looking at anything else.

Creating a Problem

I expand out the query initially as follows:

SELECT i.InvoiceID,
   il.InvoiceLineID,
   si.StockItemName
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = 1048;

The execution plan now looks like this:

complexplan

Frankly, I’m puzzled. Why on earth did we go from a key lookup operation to a scan on the Invoices table? I rebuild the query a couple of times and it keeps going to a scan. Finally, I pause a moment and look at the row estimate (you know, like I should have done the first moment I was puzzled):

row estimate

258 rows? Wait, that’s wrong. The number of rows for this value is three. Why on earth would it be showing 258? There’s no reason. I haven’t done any kinds of calculations on the columns. I double check the structures. No hidden views or constraints, or anything that would explain why the estimate was so wrong. However, it’s clear that the estimate of 258.181 is causing the loops join and key lookup to go away in favor of a hash join and scan when I add complexity to the row estimate needed by the optimizer.

After thinking about it a while, I finally ran DBCC SHOW_STATISTICS:

statistics

Note the highest point on the histogram, 1047. Yet I’m passing in 1048.

So, what’s happening?

While the number of rows for 1048 was the lowest, at 3, unfortunately it seems that the 1048 values were added to the table after the statistics for the index had been updated. Instead of using something from the histogram, my value fell outside the values in the histogram. When the value is outside histogram the Cardinality Estimator uses the average value across the entire histogram, 258.181 (at least for any database that’s in SQL Server 2014 or greater and not running in a compatibility mode), as the row estimate.

I then change the query to use the value 1047, the execution plan then changed to look like this:

cleanplan

The new plan reflects the behavior I was going for when I was setting up the test. The row estimates are now accurate, and small, therefore I get a key lookup operation instead of a scan.

Conclusion

Statistics drive the decisions made by the optimizer. The very first moment you’re looking at an execution plan and you’re seeing a scan where you thought, for sure, you should have seen a seek, check the row estimates (OK, not the first moment, it could be a coding issue, structural issue, etc.). It could be that your statistics are off. I just received my own reminder to pay more attention to the row estimates and the statistics.


I love playing with statistics and execution plans and queries. As a result, I also like teaching how to do this stuff. If you’re interested, I’m putting on a class in Rhode Island, December 2016. Sign up here.

Nov 01 2016

Stored Procedures Are Not Faster Than Views

A performance tuning tip I saw recently said, “Views don’t perform as well as stored procedures.”

<sigh>

Let’s break this down, just a little.

Definitions

A view is nothing but a query. The definition given by Microsoft is that it’s a virtual table that’s defined by a query. It’s a query that is used to mask data or perform a complex join or similar behaviors. Views are queries that get stored in the database. Views can be easily referred to as if they were a tables. That’s it. I’ve written in the past about views, including how they can possibly perform poorly.

A stored procedure is also a query, or a series of queries, or, a whole lot more. Microsoft’s definition of a stored procedure basically defines it as programming object that can accept input through parameters, perform actions, and provide various types of output. Stored procedures are also stored in the database, but that’s about the end of the direct correlations to a view. Heck, you can call views from stored procedures, so I’m really at a loss as to where this tip comes from.

In short, yes, both these objects have in them queries, but these are fundamentally different objects. You can’t really say that using one or the other is faster because they each do different things. Further, you can write code that will perform poorly using either construct.

Test Setup To Compare Performance

Here’s the view definition I’m going to use for the tests:

CREATE VIEW dbo.CustomerDeliveryInfo
AS
SELECT c.CustomerName,
   c.DeliveryRun,
   c.RunPosition,
   dm.DeliveryMethodName,
   cid.CityName AS DeliveryCity,
   cip.CityName AS PostalCity,
   c.CustomerID
FROM Sales.Customers AS c
JOIN Application.DeliveryMethods AS dm
   ON dm.DeliveryMethodID = c.DeliveryMethodID
JOIN Application.Cities AS cid
   ON cid.CityID = c.DeliveryCityID
JOIN Application.Cities AS cip
   ON cip.CityID = c.PostalCityID;

We’re going to compare that with a stored procedure that uses the same query. The procedure also takes advantage of the fact that it is a stored procedure using a parameter for input values:

CREATE PROCEDURE dbo.CustomerDeliveryInformation 
(@CustomerID INT)
AS
BEGIN
   SELECT c.CustomerName,
      c.DeliveryRun,
      c.RunPosition,
      dm.DeliveryMethodName,
      cid.CityName AS DeliveryCity,
      cip.CityName AS PostalCity,
      c.CustomerID
   FROM Sales.Customers AS c
   JOIN Application.DeliveryMethods AS dm
      ON dm.DeliveryMethodID = c.DeliveryMethodID
   JOIN Application.Cities AS cid
      ON cid.CityID = c.DeliveryCityID
   JOIN Application.Cities AS cip
      ON cip.CityID = c.PostalCityID
   WHERE c.CustomerID = @CustomerID;
END;

We’re also going to create another stored procedure that uses the view:

CREATE PROCEDURE dbo.InfoCustomerDelivery 
(@CustomerID INT)
AS
BEGIN
   SELECT * FROM dbo.CustomerDeliveryInfo AS cdi
   WHERE cdi.CustomerID = @CustomerID;
END;

Because stored procedures and views are different, we’ll have to call these different objects in different ways:

SELECT *
FROM dbo.CustomerDeliveryInfo AS cdi
WHERE cdi.CustomerID = 556;

EXEC dbo.CustomerDeliveryInformation @CustomerID = 556;

EXEC dbo.InfoCustomerDelivery @CustomerID = 556;

In this way we can run each of these queries independently and compare the results between them.

Results Comparing Stored Procedures With Views

If you run each of the queries above, you will find that they all create a nearly identical execution plan:

Views exec plan

You can click on that to make it bigger. If we compare all the different plans, one set of details does stand out:

plandifferences

There is a difference in the compile time between the view by itself and the stored procedures (they were almost identical). Let’s look at performance over a few thousand executions:

Query duration
View AVG: 210.431431431431
Stored Proc w/ View AVG: 190.641641641642
Stored Proc AVG: 200.171171171171

This is measured in microsends, so the variation we’re seeing is likely just some disparity on I/O, CPU or something else since the differences are trivial at 10mc or 5%. While that may seem like the view is suffering, please note that the view inside the procedure actually ran faster by 5%. Again, this is explained by the fact that we’re only talking about a 10 microsecond difference. I’m not sure if that’s within the margin for error on the Extended Event sql_batch_complete or not (I couldn’t find documentation stating what it might be), but I’ll bet it’s close. I believe it’s safe to say that the average performance of these queries is identical.

All three queries had 8 logical reads.

What about execution time including compile time, since there is a difference:

Query duration
View AVG: 10089.3226452906
Stored Proc AVG: 9314.38877755511
Stored Proc w/ View AVG: 9938.05410821643

The difference in the performance including compile time for the procedure alone is 700mc better on average than the view. That’s an 8% difference. It was almost that high for the view that used the procedure at 7%.

If we’re just talking compile time then, there is a significant win if we avoid the view. This is no doubt because of the extra work involved in unpacking the view and going through the simplification process within the optimizer. Plus, the view alone in our query was parameterized by the optimizer in order to assist it’s performance over time (as we saw in the average results without the recompile). All that extra work explains the 8% difference.

Let’s Break It

What if we change the query around a little. I decide that all I want to see right now from the view is the CustomerID:

SELECT cdi.CustomerID
FROM dbo.CustomerDeliveryInfo AS cdi
WHERE cdi.CustomerID = 556;

When I execute this, I get a whole new execution plan:

viewsimple

The execution time drops a little to around 190mc on average and the reads go from 8 to 2. The stored procedure would have to get rewritten to only return CustomerID. Does that mean that views are faster than stored procs? Absolutely not. It just means that there is some degree of flexibility built into the view, as a construct, that’s not there in a stored procedure, as a construct. These are fundamentally different objects.

What if we change the query against the view again:

SELECT *
FROM dbo.CustomerDeliveryInfo AS cdi
WHERE cdi.CustomerName = 'Om Yadav';

Once more the execution plan will change to something different than before:

viewcomplex

Performance drops to about 300mc and we get 10 reads instead of 8. Does that mean that views are slower than stored procedures? No. We’re attempting to compare two different objects that perform two different functions within SQL Server.

Conclusion

Since a stored procedure can actually query a view, suggesting that we use stored procedures instead of views becomes quite problematic. With the exception of the differences in compile time, we see that views actually perform exactly the same as stored procedures, if the query in question is the same. There are reasons to use views as well as reasons to not use them. There are reasons to use stored procedures as well as reasons to not use them. Neither of these objects is preferred above the other because of performance concerns.


Want to play some more with execution plans and query tuning? I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, therefore, if you’re interested, sign up here.

Oct 24 2016

A Sub-Query Does Not Hurt Performance

The things you read on the internet, for example, “don’t use a sub-query because that hurts performance.”

Truly?

Where do people get these things?

Let’s Test It

I’ve written before about the concept of cargo cult data professionals. They see one issue, one time, and consequently extrapolate that to all issues, all the time. It’s the best explanation I have for why someone would suggest that a sub-query is flat out wrong and will hurt performance.

Let me put a caveat up front (which I will reiterate in the conclusion, just so we’re clear), there’s nothing magically good about sub-queries just like there is nothing magically evil about sub-queries. You can absolutely write a sub-query that performs horribly, does horrible things, runs badly, and therefore absolutely screws up your system. Just as you can with any kind of query. I am addressing the bad advice that a sub-query is to be avoided because they will inherently lead to poor performance.

Let’s start with a simple test, just to validate the concept of how a sub-query performs within SQL Server:

SELECT sd.OrderQty,
   pr.Name
FROM
   (SELECT *
    FROM Sales.SalesOrderDetail AS sod
   ) AS sd
JOIN
   (SELECT *
    FROM Production.Product AS p
   ) AS pr
   ON pr.ProductID = sd.ProductID
WHERE sd.SalesOrderID = 52777;

SELECT sod.OrderQty,
   p.Name
FROM Sales.SalesOrderDetail AS sod
JOIN Production.Product AS p
   ON p.ProductID = sod.ProductID
WHERE sod.SalesOrderID = 52777;

If there is something inherently wrong with a sub-query, then there is something twice as wrong with two sub-queries. Here are the resulting execution plans:

sub-query plan matches query plan

Huh, look sort of, I don’t know, almost identical. Let’s compare the plans using the new SSMS plan comparison utility:

only slight diffences in sub-query plan

Well, darn. Displayed in pink are the common sets of operations between the two plans. In other words, for these plans, everything except the properties of the SELECT operator are exactly the same. Let’s take a look at those properties:

sub-query SELECT properties

OK. Now we have some interesting differences, and especially, some interesting similarities. Let’s start with the similarities. First of all, we have exactly the same QueryPlanHash value in both plans. In addition, we also have identical estimated rows and costs. In short, the optimizer created two identical execution plans. Now, this is where things get a little bit interesting. See, the optimizer actually worked a little harder to create the first plan than the second. It took an extra tic on the CPU and just a little more CompileMemory and CompileTime. Interesting.

What about execution times? With a few runs on average, the execution times were identical at about 149mc with 11 reads. However, running a query once or twice isn’t testing. Let’s get a few thousand runs of both queries. The average results from the Extended Events sql_batch_completed event were 75.9 microseconds for both queries.

However, what about that extra little bit of compile time in the query that used sub-queries? Let’s add in a statement to free the procedure cache on each run and retry the queries. There is a measurable difference now:

Query duration
Sub-query AVG: 5790.20864172835
Query AVG: 4539.49289857972

More work is done by the optimizer on the sub-query to compile the same execution plan. We’re adding work to the optimizer, requiring it to unpack the, admittedly, silly query written above.  When we refer only to the compile time and not the execution time, there is a performance hit. Once the query is compiled, the performance is identical. Whether or not you get a performance hit from a sub-query then, in part, depends on the degree to which you’re experiencing compiles or recompiles. Without the recompile, there is no performance hit. At least in this example.

Let’s Test It Again, Harder

I firmly believe in the old adage; if you ain’t cheatin’, you ain’t fightin’. It’s time to put the boot in.

Let’s go with much more interesting queries that are more likely to be written than the silly example above. Let’s assume some versioned data like in this article on Simple-Talk. We could express a query to bring back a single version of one of the documents in one of three ways from the article. We’re just going to mess with two of them. One that uses a sub-query, and one that does not:

--no sub-query
SELECT TOP 1 d.DocumentName,
   d.DocumentID,
   v.VersionDescription,
   v.VersionID,
   ROW_NUMBER() OVER (ORDER BY v.VersionID DESC) AS RowNum
FROM dbo.Document d
JOIN dbo.Version v
   ON d.DocumentID = v.DocumentID
WHERE d.DocumentID = 9729;

--sub-query
SELECT  d.[DocumentName],
        d.[DocumentId],
        v.[VersionDescription],
        v.[VersionId]
FROM    dbo.[Document] d
        CROSS APPLY (SELECT TOP (1)
                            v2.VersionId,
                            v2.VersionDescription
                     FROM   dbo.[Version] v2
                     WHERE  v2.DocumentId = d.DocumentId
                     ORDER BY v2.DocumentId,
                            v2.VersionId DESC
                    ) v
WHERE   d.[DocumentId] = 9729;

As per usual, we can run these once and compare results, but that’s not really meaningful. We’ll run them thousands of times. Also, to be sure we’re comparing apples to apples, we’ll force a recompile on every run, just like in the first set of tests. The results this time:

Query duration
Sub-query AVG: 1852.14114114114
Query AVG: 2022.62162162162

You’ll note that, even with the compile on each execution, the query using a sub-query actually out-performed the query that was not using a sub-query. The results are even more dramatic when we take away the compile time:

Query duration
Sub-query AVG: 50.8368368368368
Query AVG: 63.3103103103103

We can also look to the execution plans to get an understanding of how these queries are being resolved:

differentplans

The plan on top is the sub-query plan, and the plan on the bottom is the plan for just the plain query. You can see that the regular query is doing a lot more work to arrive at an identical set of data. The differences are visible in the average execution time, about a 20% improvement.

You could argue that we’re comparing two completely different queries, but that’s not true. Both queries return exactly the same result set. It just so happens that the query using the sub-query performs better overall in this instance. In short, there’s no reason to be scared of using a sub-query.

Sub-Query Conclusion

Is it possible for you to write horrid code inside of a sub-query that seriously negatively impacts performance? Yes. Absolutely. I’m not arguing that you can’t screw up your system with poor coding practices. You absolutely can. The query optimization process within SQL Server deals well with common coding practices. Therefore, the queries you write can be fairly sophisticated before, by nature of that sophistication, you begin to get serious performance degradation.

You need to have a method of validation for some of what you read on the internet. People should provide both the queries they are testing with and the numbers that their tests showed. If you’re just seeing completely unsupported, wildly egregious statements, they’re probably not true.

In conclusion, it’s safe to use sub-queries. Just be careful with them.


If you’re finding any of this useful and you’d like to dig down a little more, you can, because I’ll be putting on an all day seminar on execution plans and query tuning. The event takes place before SQLSaturday Providence in Rhode Island, December 2016, therefore, if you’re interested, sign up here.

Oct 17 2016

SELECT * Does Not Hurt Performance

SELECT *I read all the time how SELECT * hurts performance. I even see where people have said that you just have to supply a column list instead of SELECT * to get a performance improvement. Let’s test it, because I think this is bunkum.

The Test

I have here two queries:

SELECT *
FROM Warehouse.StockItemTransactions AS sit;

--and

SELECT sit.StockItemTransactionID,
       sit.StockItemID,
       sit.TransactionTypeID,
       sit.CustomerID,
       sit.InvoiceID,
       sit.SupplierID,
       sit.PurchaseOrderID,
       sit.TransactionOccurredWhen,
       sit.Quantity,
       sit.LastEditedBy,
       sit.LastEditedWhen
FROM Warehouse.StockItemTransactions AS sit;

I’m basically going to run this a few hundred times each from PowerShell. I’ll capture the executions using Extended Events and we’ll aggregate the results.

The Results

I ran the test multiple times because, funny enough, I kept seeing some disparity in the results. One test would show a clear bias for one method, another test would show the opposite. However, averaging the averages we see that things broke down as follows:

* 167.247ms
Column List 165.500ms

That’s after about 2000 separate executions of each query. There’s a 2ms bias towards the Column List query as opposed to the *. That’s an improvement, if you want to call it that, of 1%. It’s hardly worth the bother, assuming that with more testing this continued to hold true. In multiple tests, the SELECT * ran faster. I just feel honor bound to put up the full results. They show an improvement, but not one I’d get excited about. Oh, and the reads, the execution plan, everything else… identical.

SELECT * Conclusion

Don’t get me wrong, there are lots of reasons to not use SELECT *. Yes, performance is one of the reasons to not use SELECT *. However, when most people suggest that maybe using SELECT * is a bad idea for performance reasons, what they’re saying is you ought to only move the columns you need and the data you are actually using, not everything. I’m not aware of anyone with experience and knowledge suggesting that using the complete column list instead of SELECT * is faster. As we can see in the tests above, it isn’t (or is by so small a margin, who cares).


I love talking performance tuning. In fact, I’ll be doing an all day seminar on execution plans and query tuning before SQLSaturday Providence in Rhode Island, December 2016, therefore, if you’re interested, sign up here.

Oct 10 2016

Statistics Are Vital For Query Performance

This is post #10 supporting  Tim Ford’s (b|t) initiative on #iwanttohelp, #entrylevel. Read about it here.

When you send a query to your SQL Server database (and this applies to Azure SQL Database, APS, and Azure SQL Data Warehouse), that query is going to go through a process known as query optimization. The query optimization process figures out if you can use indexes to assist the query, whether or not it can seek against those indexes or has to use a scan, and a whole bunch of other stuff. The primary driving force in making these decisions are the statistics available on the indexes and on your tables.

What Are Statistics

Statistics are a mathematical construct to represent the data in your tables. Instead of scanning through the data each and every time to determine how many possible rows are going to come back for the query you provided, SQL Server uses statistics, which are gathered and calculated automatically by default, to determine the likely number of rows. I cover the details behind statistics in this article on Simple-Talk. The important thing to remember is that statistics are what drives the optimizer. Without statistics, you would only ever see table scans and performance would be horrible most of the time. Statistics show how many rows are likely to be returned by a given value. Statistics are created on indexes automatically. They are also created on columns that do not have indexes, by default, when that column is referenced in a WHERE, ON or HAVING clause of a query.

It can’t be over-emphasized that you must have statistics for the optimizer to make good choices. You can find out more about statistics in this other article on Simple-Talk.

Automatic Settings For Statistics

Statistics are so important that SQL Server will automatically create them for you. There is a setting that allows you to partially change this behavior. AUTO_CREATE_STATISTICS is enabled by default on each database. You can disable it using an ALTER DATABASE command. What will then happen is that the only statistics you’ll have in your system are those that are created for indexes. For the overwhelming vast majority of databases out there, turning off the automatic creation of statistics is a very bad idea. You’re taking away the ability of the optimizer to determine the likely rows being returned by a query. That means the optimizer is likely to make poor choices.

Statistics are also automatically maintained by SQL Server. You can read about the automatic processes in this documentation at MSDN. There are a number of modifications you can make to this automated behavior, and you can turn off automatic statistic maintenance by changing the AUTO_UPDATE_STATISTICS setting on a database. Just like the automatic creation of statistics, most systems are benefitting from the automatic maintenance of statistics. Don’t turn this setting off unless you are ready to take direct control over your statistics. Interestingly enough though, the automatic statistics maintenance may not be enough for many systems. You might need to augment statistics maintenance.

Maintaining Your Statistics

For most systems, start with leaving the AUTO_UPDATE_STATISTICS setting to on. Then, as you find you might need to manually update statistics, you can do one of two things. An easy way to update your statistics is to run sp_updatestats. This command will update the statistics across an entire database. You can somewhat control how it behaves. Read more about it here on MSDN. If you want to take direct control over how any given set of statistics are maintained, you use UPDATE STATISTICS. This gives the maximum amount of control.

In addition to writing your own code to maintain statistics, you might look to third party choices. One very common option is to use the scripts provided by Ola Hollengren. Not only do they manage and maintain statistics, but they also help with index fragmentation. A newer and much more interesting choice is to look to Minion Reindex by the Midnight DBA team. This tool offers quite a lot of control and functionality that is not present in Ola’s scripts. I wrote a review of an earlier version of the tool here on SQL Server Central.

Conclusion

This blog post just skims the surface of statistics. For lots more detail, please follow and read the links mentioned above. You must ensure that your SQL Server databases have statistics. You must also ensure that your SQL Server databases maintain their statistics. Statistics are a vital part of query performance that ignoring will hurt.

Oct 03 2016

Correlated Datetime Columns

SQL Server is a deep and complex product. There’s always more to learn. For example, I had never heard of Correlated Datetime Columns. They were evidently introduced as a database option in SQL Server 2005 to help support data warehousing style queries (frequently using dates and times as join criteria or filter criteria). You can read up on the concept here from this older article from 2008 on MSDN. However, doing a search online I didn’t find much else explaining how this  stuff worked (one article here, that didn’t break this down in a way I could easily understand). Time for me to get my learn on.

The concept is simple, turning this on for your database means that dates which have a relationship, the example from MSDN uses OrderDate and DueDate from the Purchasing.PurchaseOrderHeader and Purchasing.PurchaseOrderDetail tables respectively. Clearly, the DueDate would be near the OrderDate in general terms. Cool. Makes sense. Let’s see how it works, if it works.

Correlated Datetime Columns At Work

I’m going to use the example code from the Microsoft article, partly because there are aspects of it I don’t understand and exploring it will help me learn. Let’s start with the (slightly modified) base query that we hope will benefit from the Correlated Datetime setting:

SELECT *
FROM Purchasing.PurchaseOrderHeader AS poh
JOIN Purchasing.PurchaseOrderDetail AS pod
   ON poh.PurchaseOrderID = pod.PurchaseOrderID
WHERE poh.OrderDate
BETWEEN '20130801' AND '20130901';

The changes I made were the aliases and the values in the WHERE clause. I left everything else the same. Performance on this was 115ms on average with 59 reads. This is the execution plan:

ExecPlanNoIndexNoSetting

Click to embiggen. The missing index suggestion is on OrderDate using an INCLUDE of all the columns (the joy of SELECT * and Missing Index Hints, I should post a rant on Missing Index Hints at some point). The plan itself uses two Clustered Index Scan operations to retrieve the data. Since the scans are Ordered (check the properties in each operator to validate this, but the lack of a Sort operator is also a good hint), a Merge Join was used to put the data from the two tables together.

The example code drops and recreates the clustered index on the PurchaseOrderDetail table because one of the tables must have a clustered index with a datetime column as the first (or only) column in the key for Correlated Datetime Columns. However, I like to see how things behave. Let’s first create a non-clustered index on PurchaseOrderDetail.

CREATE INDEX IX_PurchaseOrderDetail_DueDate
ON Purchasing.PurchaseOrderDetail(DueDate);
GO

I don’t expect this to do anything for the above query, but I want to test each step. The non-clustered index doesn’t change anything. The execution plan and performance are the same as above. Fine. Let’s enable Correlated Datetime Columns:

ALTER DATABASE AdventureWorks2014
   SET DATE_CORRELATION_OPTIMIZATION ON;
GO

Running the query again, I can’t detect any differences. The execution plans are identical and the performance is the same. In case the plan wasn’t removed from cache because of the database setting change, I sure would expect it to be, but I could be wrong, I’m going to remove it from the cache:

DECLARE @PlanHandle VARBINARY(64);

SELECT  @PlanHandle = deps.plan_handle
FROM    sys.dm_exec_procedure_stats AS deps
CROSS APPLY sys.dm_exec_sql_text(deps.sql_handle) AS dest
WHERE   dest.text LIKE 'SELECT *
FROM Purchasing.PurchaseOrderHeader AS poh%';

IF @PlanHandle IS NOT NULL
    BEGIN
        DBCC FREEPROCCACHE(@PlanHandle);
    END
GO

Nope. It’s not because of a cached plan. The Correlated Datetime Column setting just isn’t digging how I’ve got the clustered index set up. However, let’s also test this. I’ll create an index on the PurchaseOrderHeader.OrderDate as well:

CREATE INDEX IX_PurchaseOrderHeader_OrderDate
ON Purchasing.PurchaseOrderHeader (OrderDate);
GO

Nope. No joy. All right. Clean up the test indexes and follow Microsoft’s lead:

DROP INDEX IX_PurchaseOrderDetail_DueDate
ON Purchasing.PurchaseOrderDetail;
GO
DROP INDEX IX_PurchaseOrderHeader_OrderDate
ON Purchasing.PurchaseOrderHeader;
GO
CREATE UNIQUE NONCLUSTERED INDEX
IX_PurchaseOrderDetail_PurchaseOrderID_PurchaseOrderDetailID 
ON Purchasing.PurchaseOrderDetail(PurchaseOrderID,PurchaseOrderDetailID);
GO
ALTER TABLE Purchasing.PurchaseOrderDetail
DROP CONSTRAINT PK_PurchaseOrderDetail_PurchaseOrderID_PurchaseOrderDetailID;
GO
CREATE CLUSTERED INDEX IX_PurchaseOrderDetail_DueDate
ON Purchasing.PurchaseOrderDetail(DueDate);
GO

Rerunning the original query I immediately see a difference. The reads dropped from 59 to 57. The execution time is a little better at 105ms. The execution plan is very different:

ExecPlanClusteredIndexSetting

We’re making use of that clustered index and it’s changed to a Hash Join. Let’s see the predicate for the Seek operation:

Seek Keys[1]: Start: [AdventureWorks2014].[Purchasing].[PurchaseOrderDetail].DueDate >= Scalar Operator(‘2013-07-07 00:00:00.000’), End: [AdventureWorks2014].[Purchasing].[PurchaseOrderDetail].DueDate < Scalar Operator(‘2013-10-05 00:00:00.000’)

So, the optimizer has picked a range of values that are near to the values that have been passed in, assuming a correlation between the data. According to the documentation the optimizer also ensures that using this correlation will still result in the same data being returned.

The original scan of PurchaseOrderHeader is still the same. In fact, we can see that when comparing the plans:

ExecPlanCompare

You’ll note that I have the similarities highlighted, so you can tell that the Scan is the same. Yes, it’s cost percentage within the plan has changed, but it’s still the same basic operation.

The magic comes from the optimizer creating statistics in the form of a materialized view that are created and maintained around the data in tables that qualify for the behavior of the Correlated Datetime Columns. You can even see these in your views. They have a common naming standard as outlined in the documentation: _MPStats_Sys_<constraint_object_id>_<GUID>_<FK_constraint_name> (no word on if MP is anything as crazy as the WA of system generated statistics).

In short, Correlated Datetime Columns worked… Or did it. We’re comparing apples to hammers at the moment. I’ve got a new clustered index on one of the tables. That changes all the choices, whether or not I’ve changed some database setting. Let’s remove Correlated Datetime Columns, pull the plan from cache and then rerun the query.

Oops. Performance is now at 145ms on average with 118 reads. The execution plan has changed yet again:

ExecPlanClusteredIndexNoSetting

In short, it works.

I could go farther and look to replace the clustered index on the PurchaseOrderHeader table, but you get the idea.

Conclusion

Correlated Datetime Columns works. Clearly it’s not something you’re going to enable on all your databases. Probably most of your databases don’t have clustered indexes on datetime columns let alone enough tables with correlation between the data stored in them. However, when you do have that type of data correlation, enabling Correlated Datetime Columns and ensuring you have a clustered index on the datetime column is a viable tuning mechanism. Further, this is a mechanism that has been around since 2005. Just so you know, I did all my testing in SQL Server 2016, so this something that anyone in the right situation can take advantage of. Just remember that TANSTAAFL always applies. Maintaining the statistics needed for the Correlated Datetime Columns is done through materialized views that are automatically created through the optimization process. You can see the views in SSMS and any queries against the objects. You’ll need to take this into account during your statistics maintenance. However, if Correlated Datetime Columns is something you need, this is really going to help with this, fairly narrow, aspect of query tuning.

Sep 12 2016

The Clustered Index Is Vital To Your Database Design

This is post #9 supporting  Tim Ford’s (b|t) initiative on #iwanttohelp, #entrylevel. Read about it here.

You get one clustered index per table.

That bears repeating, you get one clustered index per table.

Choosing a clustered index is an extremely important and fundamental aspect of all your SQL Server work. The one clustered index that you get determines how the data in your table is stored. Because the clustered index determines how your data is stored, it also determines how your data is retrieved. Much of SQL Server is engineered around the clustered index because it is such a foundational object for the rest of all behavior. Without a clustered index, the data in your table is stored in what is called a heap. It is essentially a pile, a heap, of data, stored without internal structure. The data in a heap can be found fairly quickly when only a single value is needed, but as soon as you start to retrieve ranges of data, more than one row, heaps become extremely problematic. Meanwhile, the clustered index stores all your data at the bottom of a B+Tree, structured storage, that is used to retrieve this data through the key, or keys, that you give your clustered index. Further, each non-clustered index also uses the key value from your clustered index as it’s means of retrieving data from the clustered index.

All this means that the vast majority of your tables should have a clustered index. Yes, there are exceptions, but, exceptions should be exceptional.

The standard advice, and it’s not wrong, is to make the clustered key as narrow as possible. Next, clustered indexes are designed to support monotonically increasing values extremely well. Clustered indexes must be unique. You either have to create them on a unique value, or, internally you’ll get a UNIQUIFIER applied to the key to ensure that each value is unique. All this taken together means that they work nicely on IDENTITY columns. Because of this advice, you’ll frequently see all the clustered indexes in a database on the IDENTITY columns. However, this is a bit of a trap. You need to remember that the clustered index is not just about storage, but about retrieval. If the most common access path to your data is through, for example, a foreign key, that value might be a better choice for the clustered key. If you always retrieve the data through a date range, that might be a better choice for the clustered key. Keep in mind the standard advice for clustered keys; narrow, monotonically increase, unique. However, also remember, they are where your data is stored. It is frequently better to sacrifice one or more of the standard practices in order to be able to use your clustered index to retrieve the data in most of your queries.

Think through where and how to apply your clustered indexes, and don’t get stuck into using them in a single manner, everywhere, within your system. You get one clustered index on a table. Be sure you put it somewhere that it will work well and somewhere that it will be well used, both.