Clustered Indexes Have Statistics Too

It may seem obvious, but I’ve heard more than one person suggest to me that statistics on a clustered index just don’t matter. That if the clustered index can satisfy a given query, it’s going to get selected. That just didn’t make any sense to me, but I haven’t seen anyone set up a test that shows how it might work one way or the other. Here you go.

First, I’m going to create a table and load it up with data. I’m intentionally using strings because I don’t want to confuse the ease of management of integers within indexes. I also went for one column that would have a very attractive set of statistics and one that would have a very ugly set. Also, because we’re only dealing with two columns at any given juncture, either a clustered or a non-clustered index would be a covering index. Finally, I didn’t mark the clustered index as unique because I wanted the non-selective clustered index and the highly selective clustered index to both have to deal with that extra bit of processing. Here’s how I set up the table and the data:

CREATE TABLE dbo.IndexTest (
SelectiveString VARCHAR(50),
NonSelectiveString VARCHAR(2))

WITH Nums
AS (SELECT TOP (100000)
ROW_NUMBER() OVER (ORDER BY (SELECT 1
)) AS n
FROM master.sys.all_columns AS ac
CROSS JOIN master.sys.all_columns AS ac2
)
INSERT INTO dbo.IndexTest
(SelectiveString,
NonSelectiveString
)
SELECT n,
CASE WHEN n % 3 = 0 THEN 'ab'
WHEN n % 5 = 0 THEN 'ac'
WHEN n % 7 = 0 THEN 'bd'
ELSE 'aa'
END
FROM Nums;

From there I created the first two indexes:

CREATE CLUSTERED INDEX ClusteredSelective ON dbo.IndexTest
(SelectiveString);

CREATE NONCLUSTERED INDEX NonClusteredNonSelective ON dbo.IndexTest
(NonSelectiveString);

Then I ran each of these queries, both of which are actually going after fairly selective bits of data, although largely relatively speaking in terms of the second query:

SELECT * FROM dbo.IndexTest AS it
WHERE SelectiveString = '2323';

SELECT * FROM dbo.IndexTest AS it
WHERE NonSelectiveString = 'aa';

This resulted in the following two execution plans:

As you can see, the clustered index was used in the first query. It makes sense because we’re querying against the clustered key and it’s a very highly selective key. The second query, despite being against a fairly non-selective key, 48,000 rows out of 100,000, used the non-clustered index. If I drop the non-clustered index and use just the cluster for the second query, the number of reads goes from 110 to 299 despite the fact that the same data is being returned. Clearly there’s a huge advantage to how data is ordered. Also, clearly, the fact that the statistics suggest that the cluster can’t immediately satisfy the query makes the optimizer choose other options. But, what happens if we change the indexes like this:

CREATE NONCLUSTERED INDEX NonClusteredSelective ON dbo.IndexTest
(SelectiveString);

CREATE CLUSTERED INDEX ClusteredNonSelective ON dbo.IndexTest
(NonSelectiveString);

Then, when I rerun my queries, I get these execution plans:

At least to my mind, it’s pretty clear. The statistics for the cluster clearly help the optimizer decide if that index is useful. Yeah, if I drop the nonclustered indexes and then run the queries the clustered index is always used, but that’s not because the cluster is selective or not, it’s because the cluster is the table.

I’m not sure where this concept that the statistics of a clustered do not matter, but, from these tests, it seems that they do.

And remember, in just a couple of weeks I’ll be doing 7 hours of query performance tuning instruction at the PASS Summit. You can sign up, as far as I know, right up to the day of the event. The name of the session is Query Performance Tuning: Start to Finish. I cover gathering metrics, understanding the optimizer, reading execution plans, and tuning queries. It’s a beginner’s level to intermediate course. It should be a lot of fun. Go here to register. I hope to see you there.

4 thoughts on “Clustered Indexes Have Statistics Too

  • ASIT KAUSHIK

    Hi Grant,
    I have been reading your blogs,articles, session pass free session. I just wanted to ask if i am analyzing an execution plan the most costly block has a Clustered seek and that cost is around 50-60 % is that query performing at the optimal and does not require any further change. IF No what other factor should one look for. Hope this question doent suckssss

  • You can’t really determine if a query is running well just by seeing the costs. Those are estimates. You do use them to figure out which operators to look at, but it’s not a measure of performance. For that you capture execution time and the I/O.

  • Asit kaushik

    Hi Grant,
    Again in the execution plan i have a complex query and my question/problems are

    1) in the where clause as shown in the below code snippet the @variable is passed as a parameter to the Stored Procedure

    @variable = 0
    or
    @variable = and (select…)

    my @variable passed is 0 but in the execution plan i see the cost taken up by the other select query , i cant understand why..

    2) In the execution plan i see that some operation say index seek has been executed = 0 then also i am seeing a high cost .

    Please guide me if i my understanding is faulty.

    Regards
    Asit

  • Hello Asit,

    It’s hard to say what the issue might be without seeing the plan. But usually when the costs are different than you expect, either higher or lower, it goes back to the estimates generated from the statistics. If your statistics are out of date, or the values passed to the parameter when then query was compiled referenced different statistics, then you may see the costs being different.

Please let me know what you think about this article or any questions:

This site uses Akismet to reduce spam. Learn how your comment data is processed.