03 Nov 2016 by Grant Fritchey

I recently wrote an introductory post about the importance of statistics. My own work just handed me a reminder of how important they are.

Bad Estimate

I hit a weird problem while I was setting up a query to illustrate a point (blog to be published next week). Let’s take the basis of the problem and explain it. I wanted data with distribution skew, so I ran this query to find out if there was a wide disparity between the top and bottom of the range:

SELECT i.BillToCustomerID,
   COUNT(i.BillToCustomerID) AS TestCount
FROM Sales.Invoices AS i
GROUP BY i.BillToCustomerID
ORDER BY TestCount ASC;

Sure enough, the bottom of the range returned three (3) rows and the top returned 21,551. If I then run a query to retrieve just a few rows like this:

SELECT *
FROM Sales.Invoices AS i
WHERE i.BillToCustomerID = 1048;

I get the following execution plan:

I’m happy because this is the plan I expected: a seek with a key lookup. With this plan in hand, I don’t bother looking at anything else.

Creating a Problem

I expand out the query initially as follows:

SELECT i.InvoiceID,
   il.InvoiceLineID,
   si.StockItemName
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = 1048;

The execution plan now looks like this:

Frankly, I’m puzzled. Why on earth did we go from a key lookup operation to a scan on the Invoices table? I rebuild the query a couple of times and it keeps going to a scan. Finally, I pause a moment and look at the row estimate (you know, like I should have done the first moment I was puzzled):

258 rows? Wait, that’s wrong. The number of rows for this value is three. Why on earth would it be showing 258? There’s no reason. I haven’t done any kind of calculation on the columns. I double check the structures. No hidden views or constraints, or anything that would explain why the estimate was so wrong. However, it’s clear that the estimate of 258.181 is causing the loops join and key lookup to go away in favor of a hash join and scan once I add the joins and the optimizer has to carry that row estimate through a more complex query.

After thinking about it a while, I finally ran DBCC SHOW_STATISTICS:

Note the highest point on the histogram, 1047. Yet I’m passing in 1048.
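For anyone following along without the screenshot, here is a minimal sketch of how the statistics could be inspected. The first query finds statistics on Sales.Invoices that lead with BillToCustomerID and shows when they were last updated and how many modifications have happened since. The second shows the histogram itself. The statistics name FK_Sales_Invoices_BillToCustomerID is an assumption, not something taken from the output above, so substitute the name the first query returns.

```sql
-- Find statistics whose leading column is BillToCustomerID, along with
-- their last update time and the number of changes since that update.
SELECT s.name AS StatsName,
   sp.last_updated,
   sp.rows,
   sp.rows_sampled,
   sp.modification_counter
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
   ON sc.object_id = s.object_id
   AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
   ON c.object_id = sc.object_id
   AND c.column_id = sc.column_id
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'Sales.Invoices')
   AND c.name = N'BillToCustomerID'
   AND sc.stats_column_id = 1;

-- Show only the histogram. The statistics name is an ASSUMPTION;
-- replace it with the StatsName returned by the query above.
DBCC SHOW_STATISTICS(N'Sales.Invoices', N'FK_Sales_Invoices_BillToCustomerID')
WITH HISTOGRAM;
```

The last row of the histogram output is the highest RANGE_HI_KEY the statistics know about. A non-zero modification_counter alongside an old last_updated value is a hint that rows have arrived since the histogram was built.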

So, what’s happening?

While the number of rows for 1048 was the lowest, at 3, unfortunately it seems that the 1048 values were added to the table after the statistics for the index had last been updated. Instead of matching something in the histogram, my value fell outside the range the histogram covers. When the value is outside the histogram, the Cardinality Estimator uses the average number of rows per value across the entire histogram, 258.181 (at least for any database that’s in SQL Server 2014 or greater and not running in an older compatibility mode), as the row estimate.
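To make that fallback number less mysterious, here is a rough sketch of where an average like this comes from. The first query computes the average rows per distinct BillToCustomerID from the table as it stands today, so it will only match the optimizer's figure if the data has not changed since the statistics were last updated. The second statement reads the header and density vector stored in the statistics. As before, the statistics name FK_Sales_Invoices_BillToCustomerID is an assumption, so use whatever name exists on your system.

```sql
-- Average number of rows per distinct BillToCustomerID in the current data.
-- Multiplying by 1.0 forces decimal division instead of integer division.
SELECT COUNT(*) * 1.0 / COUNT(DISTINCT i.BillToCustomerID) AS AvgRowsPerValue
FROM Sales.Invoices AS i;

-- Header and density vector as stored in the statistics.
-- The statistics name is an ASSUMPTION; substitute your own.
DBCC SHOW_STATISTICS(N'Sales.Invoices', N'FK_Sales_Invoices_BillToCustomerID')
WITH STAT_HEADER, DENSITY_VECTOR;
```

In the second result, "All density" is one divided by the number of distinct values, so multiplying it by Rows from the header gives the average rows per value as of the last statistics update.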

I then change the query to use the value 1047, and the execution plan changes to look like this:

The new plan reflects the behavior I was going for when I was setting up the test. The row estimates are now accurate and small, so I get a key lookup operation instead of a scan.

Conclusion

Statistics drive the decisions made by the optimizer. The very first moment you’re looking at an execution plan and you’re seeing a scan where you thought, for sure, you should have seen a seek, check the row estimates (OK, not the first moment, it could be a coding issue, structural issue, etc.). It could be that your statistics are off. I just received my own reminder to pay more attention to the row estimates and the statistics.
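If the estimates do point at stale statistics, refreshing them is the usual next step. The following is a sketch only, not a step taken in the walkthrough above. It assumes the statistics object is named FK_Sales_Invoices_BillToCustomerID (an assumed name) and uses a full scan, which can be expensive on large tables.

```sql
-- Rebuild the statistics so the histogram covers the newest values.
-- The statistics name is an ASSUMPTION; substitute your own.
UPDATE STATISTICS Sales.Invoices (FK_Sales_Invoices_BillToCustomerID)
WITH FULLSCAN;

-- Re-run the original query and compare estimated rows to actual rows.
SELECT i.InvoiceID,
   il.InvoiceLineID,
   si.StockItemName
FROM Sales.Invoices AS i
JOIN Sales.InvoiceLines AS il
   ON il.InvoiceID = i.InvoiceID
JOIN Warehouse.StockItems AS si
   ON si.StockItemID = il.StockItemID
WHERE i.BillToCustomerID = 1048;
```

Once 1048 is inside the histogram, the estimate for it should move much closer to the actual row count, and the plan choice should follow.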

I love playing with statistics and execution plans and queries. As a result, I also like teaching how to do this stuff. If you’re interested, I’m putting on a class in Rhode Island, December 2016. Sign up here.

Reinforcing the Importance of Statistics on Row Estimate