Oct 22 2012

Clustered Indexes Have Statistics Too

It may seem obvious, but I’ve heard more than one person suggest to me that statistics on a clustered index just don’t matter. That if the clustered index can satisfy a given query, it’s going to get selected. That just didn’t make any sense to me, but I haven’t seen anyone set up a test that shows how it might work one way or the other. Here you go.

First, I’m going to create a table and load it up with data. I’m intentionally using strings because I don’t want to confuse the ease of management of integers within indexes. I also went for one column that would have a very attractive set of statistics and one that would have a very ugly set. Also, because we’re only dealing with two columns at any given juncture, either a clustered or a non-clustered index would be a covering index. Finally, I didn’t mark the clustered index as unique because I wanted the non-selective clustered index and the highly selective clustered index to both have to deal with that extra bit of processing. Here’s how I set up the table and the data:

CREATE TABLE dbo.IndexTest (
SelectiveString VARCHAR(50),
NonSelectiveString VARCHAR(2))

WITH Nums
AS (SELECT TOP (100000)
ROW_NUMBER() OVER (ORDER BY (SELECT 1
)) AS n
FROM master.sys.all_columns AS ac
CROSS JOIN master.sys.all_columns AS ac2
)
INSERT INTO dbo.IndexTest
(SelectiveString,
NonSelectiveString
)
SELECT n,
CASE WHEN n % 3 = 0 THEN 'ab'
WHEN n % 5 = 0 THEN 'ac'
WHEN n % 7 = 0 THEN 'bd'
ELSE 'aa'
END
FROM Nums;

From there I created the first two indexes:

CREATE CLUSTERED INDEX ClusteredSelective ON dbo.IndexTest
(SelectiveString);

CREATE NONCLUSTERED INDEX NonClusteredNonSelective ON dbo.IndexTest
(NonSelectiveString);

Then I ran each of these queries, both of which are actually going after fairly selective bits of data, although largely relatively speaking in terms of the second query:

SELECT * FROM dbo.IndexTest AS it
WHERE SelectiveString = '2323';

SELECT * FROM dbo.IndexTest AS it
WHERE NonSelectiveString = 'aa';

This resulted in the following two execution plans:

As you can see, the clustered index was used in the first query. It makes sense because we’re querying against the clustered key and it’s a very highly selective key. The second query, despite being against a fairly non-selective key, 48,000 rows out of 100,000, used the non-clustered index. If I drop the non-clustered index and use just the cluster for the second query, the number of reads goes from 110 to 299 despite the fact that the same data is being returned. Clearly there’s a huge advantage to how data is ordered. Also, clearly, the fact that the statistics suggest that the cluster can’t immediately satisfy the query makes the optimizer choose other options. But, what happens if we change the indexes like this:

CREATE NONCLUSTERED INDEX NonClusteredSelective ON dbo.IndexTest
(SelectiveString);

CREATE CLUSTERED INDEX ClusteredNonSelective ON dbo.IndexTest
(NonSelectiveString);

Then, when I rerun my queries, I get these execution plans:

At least to my mind, it’s pretty clear. The statistics for the cluster clearly help the optimizer decide if that index is useful. Yeah, if I drop the nonclustered indexes and then run the queries the clustered index is always used, but that’s not because the cluster is selective or not, it’s because the cluster is the table.

I’m not sure where this concept that the statistics of a clustered do not matter, but, from these tests, it seems that they do.

And remember, in just a couple of weeks I’ll be doing 7 hours of query performance tuning instruction at the PASS Summit. You can sign up, as far as I know, right up to the day of the event. The name of the session is Query Performance Tuning: Start to Finish. I cover gathering metrics, understanding the optimizer, reading execution plans, and tuning queries. It’s a beginner’s level to intermediate course. It should be a lot of fun. Go here to register. I hope to see you there.

Oct 16 2012

RML Utilities and SQL Server 2012

I’m working through some code that I haven’t touched recently and I’m running it for the first time on a SQL Server 2012 server. The code is a way to load information into the RML utilities and I started hitting errors. First, I hit an error that my server couldn’t be connected to, but thanks to Erin Stellato (blog|twitter), I was able to quickly fix that. Then I hit this:

Number of processors: 2
Active proc mask: 0x00000003
Architecture: 9
Page size: 4096
Highest node: 0
Package mask: 0x00000001
Processor(s): 0x00000001 Function units: Separated
Package mask: 0x00000002
Processor(s): 0x00000002 Function units: Separated
Processors: 0x00000003 assigned to Numa node: 0
-Ic:\performancetuning\rml.trc
-oc:\bu
-SDOJO\RANDORI
Using language id (LCID): 1024 [English_United States.1252] for character formatting with NLS: 0x00060101 and Defined: 0x00060101
Attempting to cleanup existing RML files from previous execution
Using extended RowsetFastload synchronization
Establishing initial database connection:
Server: DOJO\RANDORI
Database: PerfAnalysis
Authentication: Windows
Using SQL Client version 10
Creating or clearing the performance database
The major version number (11) in the trace file header is not a supported file version.
At this time only Microsoft SQL Server 2000, 2005 and 2008 trace files are supported. The current trace version (11) is not supported.
ERROR: Read of file header for file c:\performancetuning\rml.trc failed with operating system error 0x8007000D (The data is invalid)
*** ERROR: Attempt to initialize trace file reader failed with operating system error 0x8007000D (The data is invalid)
Reads completed - Global Error Status 0xfffffffe
Shutting down the worker thread message queues. ...

That’s right, the trace header shows the version it was captured with and the 2012 version doesn’t work with RML Utilities. I’ve tried several different ways of defining the trace collection, but the header doesn’t seem to be something you can control. I’ve also tried opening it in an editor and finding the right bit of code to change (it’s just a trace file after all) but no luck there either.

<sigh>

Firing up some older hardware now in order to get at a 2008R2 or earlier server. I’ve already upgraded all my servers & virtuals.

Sep 07 2012

SQL Server vs. Oracle

Just so we’re clear, I use SQL Server. I like SQL Server. But, this doesn’t mean I have anything against Oracle. It’s fine. It’s good. But, I know very little about it. However, throughout my career I’ve found myself needing to understand it better. Either because I’m trying to train Oracle people to better use SQL Server and I need to be able to speak a little of their language to facilitate translation. Or, because I’m defending SQL Server on some technical point that the Oracle people don’t completely understand. Or, because I’ve said something stupid about Oracle in my ignorance.

Now, you know how busy you are, and I know how busy I am, so I doubt either of us has the time we really need to learn Oracle much. So, what do you do? Well, Red Gate Software, who straddles the worlds between Oracle & SQL Server like the Bifrost between Midgard & Asgard, has started a series of conversations between two people who know something about each platform, Jonathan Lewis (blog) and me.

We had our first conversation talking about clustered indexes. We covered how they work in both platforms (not that differently) and they’re used and abused. Interestingly enough, according to Jonathan, clustered indexes just aren’t used that much within Oracle, despite the fact that they really do behave mostly the same way as they do within SQL Server, where we use them on most every table (or at least so I maintain you should). It was a great discussion (NOTE: not a fight, no one was nasty or mean, we talked).

We’re going to have another discussion. We’re going to be talking about temporary tables. Again, I don’t know much about Oracle, so please, this is not an attack, but apparently they don’t have the same concept of temporary tables as we do in SQL Server. We’re going to cover a lot of the myths and misperceptions surrounding temp tables on both Oracle and SQL Server, how they work and how they affect performance. I learned a lot during the last conversation and I don’t doubt I’ll learn a lot during this one. If you’re interested, please go to this web page and register.

And, I’d be remiss if I didn’t mention again, if you like learning about performance in SQL Server that you should consider attending the PASS Summit 2012. If you register now, you save $500, which is just enough to pay for my pre-conference seminar, Query Performance Tuning: Start to Finish. I’ll be covering all aspects of performance tuning from gathering metrics to understand which queries are running slow, to reading execution plans to understand why, to addressing the issues to fix the performance and make your queries hum. Please consider taking part. It’ll be a lot of fun and I’ll try like crazy to make it useful.

 

May 31 2012

Never, Ever Use Clustered Indexes

This whole concept of the clustered index as a foundational structure within SQL Server is just plain nuts. Sure, I get the concept that if a table has a clustered index, then that index actually becomes the table. When you create a clustered index on a table, the data is now stored at the leaf level of the Balanced Tree (b-tree) page distribution for that index, and I understand that retrieving the data using a seek on that index is going be extremely fast because no additional reads are necessary. Unlike what would happen with a non-clustered index on a heap table.

Yes, I get that if I store my data in a heap, the only way to access the data is through the Index Allocation Mapping (IAM)  pages that define extents and this means that I don’t get the double-linked list of pages that occur within clustered indexes. I know that having to read the IAM leads to additional reads for a heap to look up within the IAM in order to find the locations of the data on the disk.

I realize that updating or deleting a clustered index is helped by being able to use the index itself to find the exact row that needs to be modified or removed. I’ve also seen the tests that show that clustered indexes work faster on inserts in the overwhelming majority of situations within SQL Server. But I still want you to stop using clustered indexes on all your tables within SQL Server. Why? Because that’s how Oracle databases are mostly designed.

I hope you’ve figured out by now that I’m joking about tossing out clustered indexes within your SQL Server databases. I do believe that, unless you have a very thoroughly tested exception, every table within SQL Server should have a clustered index for some of the reasons that I’ve listed above, as well as several others. But Oracle DBAs design their systems differently.

When I see a vendor that makes a product that is exactly the same on Oracle, SQL Server, and possibly DB2 or MySQL, I have to ask myself, just how well is that system going to perform. When I hear someone tell me to design the system using lowest common denominator T-SQL because “we don’t want to be locked into a particular vendor” I have to wonder, again, how are we going to make this system perform. Because, if Oracle likes heaps, but SQL Server likes clusters, how do you design for both? I’d say you can’t.

In fact, I’d argue that you need to design precisely for specific relational database management systems because, let’s face it, they don’t implement the fundamentals in the same way. If you mess up the fundamentals, you’ve just messed up your entire design.

May 07 2012

Which SELECT * Is Better?

The short answer is, of course, none of them, but testing is the only way to be sure.

I was asked, what happens when you run ‘SELECT *’ against a clustered index, a non-clustered index, and a columnstore index. The answer is somewhat dependent on whether or not you have a WHERE clause and whether or not the indexes are selective (well, the clustered & non-clustered indexes, columnstore is a little different).

Let’s start with the simplest:

SELECT    *
  FROM    Production.ProductListPriceHistory AS plph;

This query results in a clustered index scan and 5 logical reads. To do the same thing with a non-clustered index… well, we’ll have to cheat and it’ll look silly, but let’s be fair. Here’s my new index:

CREATE NONCLUSTERED INDEX TestIndex
  ON Production.ProductListPriceHistory
(ProductID,StartDate,EndDate,ListPrice,ModifiedDate);

When I rerun the query it results in an index scan, non-clustered, with 5 logical reads. Granted, this is stupid. Instead, let’s do this. We’ll create a meaningful non-clustered index and then force it’s use:

CREATE NONCLUSTERED INDEX TestIndex
  ON Production.ProductListPriceHistory
(ListPrice);

Then run this:

SELECT    *
  FROM    Production.ProductListPriceHistory AS plph
WITH (INDEX(TestIndex));

Now that’s using a non-clustered index in a ‘SELECT *’ situation. It results in 794 logical reads and this execution plan:

image

Clearly, this is not an improvement. Finally, let’s get rid of the non-clustered index and put this columnstore index in place:

CREATE NONCLUSTERED columnstore INDEX xtest
  ON production.ProductListPriceHistory
(productid,startdate,enddate,listprice,modifieddate);
GO

Now when I run the ‘SELECT *’ query I have 37 reads and this execution plan:

image

This means the columnstore index is being used, but, if you look at the properties, you’ll see that the execution mode on this one is Rows, which is not the preferred use you’ll want out of a columnstore index. You want to see the execution mode be Batch.

If we stopped here, the answer is simple, a clustered index scan is better. But what if we add filtering? Let’s modify the query to look like this:

SELECT    *
  FROM    Production.ProductListPriceHistory AS plph
WHERE    plph.ListPrice = 23.5481;

Now, I’ll go back and set up the table so that it has just the clustered index, a non-clustered index, or the columnstore index. The results for the clustered index are identical. Since the column, ListPrice, is not part of the clustered key, a scan is necessary and the results are 5 reads and an execution time of about 1ms (the data is cached). The non-clustered index resulted in the same execution plan as before, but only 8 reads. But, the execution time was 42ms, so the added processing of getting the data put together from the key lookup was a little costly. Finally, the columnstore index results in 42 reads and an execution time of 4ms. The execution mode of the columnstore index was still Row.

What’s all this mean? Not much since you shouldn’t be using ‘SELECT *’ anyway, but the main takeaway I’d suggest is that columnstore indexes are not magic. They don’t replace traditional indexing. Further, if you’re going to use them, be sure that you’re really using them correctly. Just because you see the columnstore operator in the execution plan doesn’t mean you’re taking advantage of all the fantastic benefits they offer. Dive down into the properties and check the execution mode to ensure you’re getting a Batch execution. Then you’ll know that you’re benefiting from the columnstore index.

Feb 07 2012

Avoiding Bad Query Performance

There’s a very old saying, “When you find yourself in a hole, stop digging.”

And my evidence today is:

Hairy

That’s certainly not the hairiest execution plan I’ve seen. In some ways, it’s not all that horrible. But it sure is evidence that someone was down in a hole and they were working that shovel hard.

If you’re interested, most of the operators are scans against a table variable that’s 11 million rows deep. There are also table spools chugging away in there. And the select statement only returns 1500 rows.

Please, stop digging.

Jan 30 2012

Execution Plans, What Do I Look At?

lookThe question came up, what are the top 3-5 things that you look at in a query. I realized then that I hadn’t really written them down. There are some areas that let me know pretty much immediately where problems lie within a plan. You don’t necessarily know the answer from looking at these things, but you know you have a problem. Being me, I couldn’t stick to 5, so here are 6:

  1. Immediately check the properties on the first operator (SELECT/DELETE/INSERT). There’s tons of information in here and people frequently just ignore it. You can see if the plan is based on full optimization or not. That immediately tells me if I’m working on the optimizer’s best estimate at a plan or I’m looking at a timeout. If it’s a timeout, I know I can’t count on this plan being good. Also I get the parameter compile time & run time values to help determine parameter sniffing issues in the properties.
  2. Warnings. If you see no join predicate warnings, that should jump up and poke you in the eye like some jumping eye-poking little monster. Same goes with missing statistics. The new warnings in plans in 2012 are equally important to know about. These are quick pieces of information that should immediately point you in a direction of inquiry within the plan.
  3. The most costly operations. Yes, I know you can’t trust these values because they are just estimates. Yes, the estimated operator cost is the same in both estimated and actual plans. No measurements of actual cost are taken by an execution plan. But these are the numbers available, so I use them. They’re accurate more often than not and quickly lead you to the possible source of the problem.
  4. Fat pipes. Now really, these are usually just an indication of volume and knowing that you’re moving lots of rows helps you read a plan (umpty-million rows joining umpty-million rows through a Loop might be an issue). But the real alarm bells go off when you see big fat pipes going to little skinny ones or skinny ones to big fat ones or even skinny-fat-skinny. That’s a huge indicator of something
  5. Extra operators. This is like that old statement about pornography “I can’t give you a precise definition, but I know it when I see it.” It’s looking for stuff that doesn’t belong. For example, you don’t have a single ORDER BY statement, but there sits a Sort operation. Why? That’s my “extra operator” indicator telling me to dig deeper.
  6. Scans. Scans are not necessarily bad and Seeks are not necessarily good. In general terms, with smaller data sets, you usually would expect to see a Seek over a Scan. Scans can be the right, good, and best choice, especially for very large data sets and in other situations, but they are an indicator of potential issues.

After that, you have a whole slew of things you can get worked up about. Table Spools in SELECT statements are usually not good. Look for indications of multi-statement UDF’s (Scan’s with zero cost). Loop joins when a Merge makes more sense, Merges where you ought to see a Hash, missing index information, mismatch between estimated & actual, blah, blah, blah… You get the point. There’s just tons & tons of information within execution plans. But that list of six are usually the first things I look for.

Oct 05 2011

Extended Events and Performance Tuning Knowledge

I’m working on updating my book, Query Performance Tuning Distilled, so that it reflects the new things available in SQL Server vNext:Denali. I’m going through the first chapters that are all about gathering information about your systems. Performance tuning is all about building up knowledge of how the system is working in order to understand what you need to change in order to improve it. I’m surprised by how much hasn’t changed. But some of the changes are fundamental and huge. Let’s talk huge. Extended Events is huge.

Extended Events came out in SQL Server 2008, but very few people, myself included, paid much attention. Those who did found the implementation awkward and confusing. Only a few people persevered enough to discover just how powerful and amazing these things are. Which is why most anyone who wants to learn about extended events should plan on starting at one place, Jonathan Kehayias’ blog. Yeah, the Books Online help get you started, but Jonathan really makes it all take off.

But why is this huge? I’m not going into a low-level discussion here, because frankly, I’m incapable, for that go back to Jonathan. What I will say is that you have a mechanism that captures events at their source instead of after the fact. That makes it faster and causes less impact on your system. Further, you can put filters in place at the event level that make it so that you only capture the information you want, when you want it. In short, it’s about control, reduced impact and increased knowledge.

So why write about it now? Because Denali has finally provided a GUI that simplifies the process of setting up extended events. There’s a wizard for building extended events sessions, and a session management GUI that you can use to manage them on your own. Basically, you have a way to get past the confusing and odd TSQL implementation and start using these things.

Check it out:

image

It really does provide everything you need to get going with extended events… or does it? There’s still the issue of knowing how these things work and what they do. The GUI above enables you, but it doesn’t provide much guidance. For example, there’s a very enticing little action (stuff collected from events are called actions or fields) called sql_text. When setting up my first, simple, session (the one you see above), I saw that action and thought, ah-ha, there’s my collection point for getting the sql batch and the procedure call. But then, on testing, it didn’t work. After a plaintive call for help on Twitter (hash tag #sqlhelp for those who don’t know), Jonathan rode to my rescue and informed me that despite the enticing name, that’s not what I wanted.

Instead, each event has event fields that are simply a part of the event. Global Fields (actions, whatever, told you extended events were confusing), have an added overhead, so you have to think twice about using them. Plus, sql_text didn’t return any data for sql batch calls. No, instead, for the rpc_completed, I needed to read the field, statement. Excellent! All set. Well, no. There’s still another wrinkle. For sql_batch_completed there is no statement field. Instead, I need to collect sql_batch_text. Ah, good…. but wait, now I have two fields that get returned and I have to figure out which of these I use instead of just going to a single source like I used to with trace…grrr… Why does Microsoft hate me? And you?

Anyway, I’ll be weaving extended event sessions throughout the book. You can use these things for looking at waits, deadlocks, ooh, all kinds of stuff. I did say powerful and huge right. Keep an eye out for a few more blog posts.

Also, if you want to talk to me about gathering knowledge for your performance tuning efforts, then you should make your way to SQL In the City: Los Angeles. I’m going to be there with a slew of other MVPs on October 28th. It’s free. Click here now to register. My session is called Performance Tuning With Knowledge, and it’s all about gathering the information you need to make your performance tuning decisions.

Sep 22 2011

SQL Server Execution Plans

PASS_2011_SpeakingButton_180x180-blackI write quite frequently about SQL Server Execution Plans. I started in that area just because that’s how you figure out what a query is doing and sooner or later, we all have to tune a query. I found I was doing it sooner and more frequently. When the opportunity came up to write a book , I jumped on it. Now I find myself presenting, rather frequently, on execution plans.

One of the people I’ve learned from over the last several years is Gail Shaw (blog|twitter). I first saw Gail on stage at the PASS Summit, I think it was 2007. A co-worker of mine was picked, along with Gail, to go on stage for the Quiz Bowl. Gail was answering all the questions. If you go over to SQL Server Central, Gail doesn’t answer all the questions, but she’s involved most of the important discussions. If you read her blog and articles, Gail has also been deeply involved in query tuning and execution plans for a long time.

Why am I telling you all this?

In just a little over three weeks, Gail and I will be presenting an all day seminar on SQL Server Execution Plans. We’re going to cover the optimizer and the plan cache and we’re going to show you how to read a ton of execution plans. We’re going to go over how to spot problems and how to fix them. We’ll be examining plans from people in the audience (yeah, bring your problem plans) live on stage. You’ll get the chance not just to hear us talk, but to ask us questions, to get clarification on topics and meet other people who are learning the same stuff you are.

In short, we’re going to be having a really good time and talking a lot about Execution Plans. This is your chance. Register for the Summit and, please, register for our pre-conference seminar. You won’t be disappointed.

Sep 20 2011

Statistics in Execution Plans

I was presenting on execution plans when another question came up that I didn’t know the answer to immediately. Yes, I know you’ve seen that phrase before on this blog. I love presenting because you get exactly the kinds of questions that make you think and make you learn. I’m presenting, in part, to learn, just as much as I am to teach. It was the same with kenpo. The more I taught, the better I learned the art. Wait, this isn’t supposed to be a blog post about learning. This one is about statistics.

The question was, does the execution plan have the statistics that were used by the optimizer to decide on the execution plan. And no, what was meant, was not does it show the estimated rows, which come from the statistics, but specifically does it show that it used a set of statistics named X? The answer to that is yes & no. Or, to be more DBAish about it, it depends.

Let’s take a really simple query run against a freshly installed copy of AdventureWorks2008R2:

SELECT p.BusinessEntityID,
p.FirstName
FROM Person.Person AS p
WHERE p.FirstName LIKE 'Toni%';

This query generates this execution plan:

Stats

Yes, a very sophisticated and hard to understand execution plan. Now, here’s the deal, there were two sets of statistics used to make this plan, but only one of them can be seen in the plan. See any statistics there? Sure you do. IX_Person_LastName_FirstName_MiddleName. That is one of the two sets of statistics that were used to make this execution plan. Where’s the second set? Not in the execution plan.

[sourcecode language=”sql”]sp_helpstats N’Person.Person’, ‘ALL’;[/sourcecode]

The results are here:

statslist

There is the second set of statistics used for this query, right at the top. Because I was searching the FirstName column, the optimizer found that it did not have the statistics it needed, so they were created, on the fly, and then, were not a part of the execution plan. Further, I probably looked at other statistics such as the PK_Person_BusinessEntityID because that’s the clustered index for the table. It could have scanned that to get the list of values just as easily as the other index. But, that other index is probably smaller, which means fewer pages scanned.

So, back to the question, can you see the statistics used by the optimizer inside the execution plan? Some of them, yes, but not all of them.

Please, if I’m presenting, ask questions. I’ll know a few of the answers, right off the top of my head. Others will make me go and learn so that I can answer the next person who asks the same question. I don’t mind losing at a game of Stump the Chump, so let’s play.