Nov 30 2010

Regressions

http://www.flickr.com/photos/danar/223598560/sizes/s/in/photostream/

Hannah Dustin, Upset about Regression

One of the most important take-aways from David Dewitt’s presentation at the PASS Summit was the level of fear within the Query Processing team at Microsoft caused by regressions. If you missed Dr. Dewitt’s presentation, I tried to capture as much of it as I could here, and it will be available within the DVDs from PASS.

A regression is when something moves backwards to a less perfect state. When we’re talking about the optimizer in SQL Server, a regression is when a query that used to run fast in SQL Server 2000 or 2005 suddenly, after the upgrade to 2005 or 2008, generates a different execution plan and runs slowly. Now do you know why the Query Processing team fears these things? Yeah, you’re angry. You just went through the process of upgrading, with the expectation that everything would get better, not worse. Instead, here you are with your previously functional query and it’s a steaming pile. Don’t you think a substantial percentage of you call Microsoft and vent?

Why does this occur? Because building a query optimizer that takes in hundreds, thousands, or tens of thousands of possible plans and finds one that is good enough in under 50ms is really, really difficult. Almost any time they touch the optimizer it must be an experiment in terror. Yet they continue to work on ways to make the optimizer run faster, and they continue to work to incorporate all the new T-SQL functionality that gets introduced with each new release into the optimizer. If you’re the unlucky person whose query gets swatted by a regression, you might not care, but you ought to understand.

For what it’s worth, when I’ve seen a regression occur, it has usually been in a query that shouldn’t have run fast in 2000 in the first place. Instead, people got lucky and found a small hole in the optimizer that let bad or questionable code not only run, but run well. Most of the time, but not all the time, examining the query and rewriting it in a more optimal fashion fixes the issue.

If you do think you’ve hit a regression, before you start lifting scalps at Microsoft, take a peek again at the query. Would you say it’s been written in as optimal a fashion as possible? If not, try tuning it. If so, I’d start with reporting the issue on Connect instead of a scalping expedition through the halls of Redmond.

Nov 24 2010

FreeCon

I got my FreeCon. What’s more, I did it in public and didn’t get in trouble.

FreeCon is the brainchild of Brent Ozar (blog|twitter). Basically, Brent gathered together a few bloggers & writers from the SQL Server community, some extremely well known, like Tom LaRock (blog|twitter), some in the middle, like me, and others that are clearly up & coming, like David Stein (blog|twitter). He jammed us all in a room and made us talk to one another. OK, that’s a lie. He invited us to a lovely little poetry space in Seattle the day before the PASS Summit was due to start, where we imbibed good food & coffee and had the opportunity to share a lot of great information with each other.

The basic concept was to get people together to network, improve our writing, work on our brands, and help Brent test drive some new presentations. Personally, I’d say it worked extremely well. I came out of the session with some plans, for my blog, for my brand, and, well, for my life. Sitting and listening to people like Karen Lopez (blog|twitter) or Jorge Segarra (blog|twitter) talk about their experiences with writing, blogging, looking for work, etc. was, quite simply, an amazing opportunity. And then there was the information that Brent was putting out in his presentations. I really did pick up some specific information that’s going to change things. Just to share, here are some things from my notes, and yes, they go all over the place because the conversation jumped around, a lot. You have to remember, these were not simply amazing, smart people I was very honored to be with, but they’re vocal. Very. Vocal.  

Notes (minus a bunch of info that, frankly, I don’t want to share because it’s either private notes & plans or comments from others that shouldn’t go out to the world):

  • Read press releases: get the marketing contact: that’s the person to talk to
  • Play your own game (same advice offered by Steve Jones (blog|twitter) who was also there)
  • Look up article “Juggling writing and a job… figure it the <blank> out”
  • Get the book “Secrets of Consulting” by Jerry Weinberg (the red cover one, not the new one)
  • The more they pay you, the more they respect you
  • Price to win the gig is bad, set a high rate
  • Get yourself 50,000 foot goals (high level, important stuff, life, finance, parenting, that sort of stuff)
  • Get book “Escaping Flatland” by David Allen
  • PowerPoint is just a delivery mechanism (that’s OK, I hate PowerPoint, I try to deliver in code as much as possible)
  • Create a brand wheel, three words that describe your brand
    • I posted mine earlier, but here they are again
      • Direct
      • Instructive
      • Useful
  • Manage your brain better (right…)

For me, FreeCon helped define some things that have been running around in my brain for the last 18 months or so. It also inspired me to pump up my blogging, to try to improve my game and my brand. I’ve taken a large number of actions in the two weeks since FreeCon went down. Many of them are already bearing fruit; for example, I’m now hosted on my own domain. Others may bear fruit in the near term, and I have EXTREMELY high hopes for those. Still more are the high-level goals that I’ve started to define that will likely take me years to deliver.

Why am I posting all this for you? Because FreeCon is now a thing. The attendees of the original FreeCon have been granted the ability to hold their own, within a set of parameters, not least of which is a quorum of FreeCon grads. Expect to see these occurring in and around other conventions & summits like Connections, Tech-Ed, maybe a SQL Saturday or two, SQL Rally, and certainly the Summit. If you get the invite to go, jump on it. If you haven’t received an invite, blog more, write articles more, get your name out there and you may then receive that invitation.

Kind of weird to consider that sitting around with a bunch of friends, online acquaintances, and downright strangers can result in life changing events, but, for me, that’s what happened. Thanks Brent. I’m so glad I made the list. And an extremely heartfelt thanks to all the Freecs at FreeCon. You guys are really wonderful people.

Nov 22 2010

Do Foreign Key Constraints Help Performance?

Most people are very aware of the fact that having a foreign key constraint in place on your tables adds overhead to inserts and deletes. And many people believe that there is no benefit to foreign keys beyond referential integrity (which, the application can handle perfectly well, right?). But is that an accurate statement? Here’s the basis for our investigation, a query in AdventureWorks2008R2:

[sourcecode language="sql"]SELECT p.LastName + ', ' + p.FirstName AS 'PersonName'
FROM Person.Address AS a
JOIN Person.BusinessEntityAddress AS bea
ON a.AddressID = bea.AddressID
JOIN Person.BusinessEntity AS be
ON bea.BusinessEntityID = be.BusinessEntityID
JOIN Person.Person AS p
ON be.BusinessEntityID = p.BusinessEntityID;[/sourcecode]

This query results in the following execution plan:

ForeignKeyExecPlan

I know that is an ugly query and an ugly query plan, but bear with me for a moment. Do you notice anything about the query and the plan at this point? Count the tables and the access operators. That’s four tables and two access operators. That means that, despite the fact that I listed four tables in the query, the optimizer was smart enough to figure out that it only needed to pull data from two of them and completely ignored the others, because the key values in the Person.Person table and the Person.BusinessEntityAddress table were the same. It didn’t even bother with the Address table, since nothing from that table is in the WHERE or SELECT, and it skipped the BusinessEntity table because the foreign key relationships ensure that the data in the other tables can be trusted.

Don’t believe me? For those trying this at home, run this script in AdventureWorks2008R2:

[sourcecode language="sql"]SELECT *
INTO dbo.MyAddress
FROM Person.Address;

SELECT *
INTO dbo.MyBusinessEntityAddress
FROM Person.BusinessEntityAddress;

SELECT *
INTO dbo.MyBusinessEntity
FROM Person.BusinessEntity;

SELECT *
INTO dbo.MyPerson
FROM Person.Person;[/sourcecode]

Then modify the original query so that it looks like this:

[sourcecode language="sql"]SELECT p.LastName + ', ' + p.FirstName AS 'PersonName'
FROM dbo.MyAddress AS a
JOIN dbo.MyBusinessEntityAddress AS bea
ON a.AddressID = bea.AddressID
JOIN dbo.MyBusinessEntity AS be
ON bea.BusinessEntityID = be.BusinessEntityID
JOIN dbo.MyPerson AS p
ON be.BusinessEntityID = p.BusinessEntityID[/sourcecode]

This query will produce the following execution plan:

NoFKExecPlan

That is the exact same query running against exact copies of the tables from the original, yet now, instead of two operators and one join operation, we’re looking at four operators and three join operations. That, my friends, is the optimizer taking advantage of the fact that there are foreign keys in place that ensure trust in the data, which enables it to eliminate unnecessary tables from the plan. Not only do the execution plans differ, but the second query also ran consistently about 1/6 slower, since it did all kinds of reads against the two tables that were eliminated in the original query plan.

I know some of you are saying, “Well, we can just put indexes on the tables to fix that problem, we still don’t need constraints.” OK. Let’s check it out. Here’s a script to put indexes in place, which will surely fix the heinous execution plan above:

[sourcecode language="sql"]ALTER TABLE dbo.MyAddress ADD  CONSTRAINT PK_MyAddress_AddressID PRIMARY KEY CLUSTERED
(
    AddressID ASC
)

CREATE NONCLUSTERED INDEX IX_MyBusinessEntityAddress_AddressID ON dbo.MyBusinessEntityAddress
(
    AddressID ASC
)

ALTER TABLE dbo.MyBusinessEntityAddress ADD  CONSTRAINT PK_MyBusinessEntityAddress_BusinessEntityID_AddressID_AddressTypeID PRIMARY KEY CLUSTERED
(
    BusinessEntityID ASC,
    AddressID ASC,
    AddressTypeID ASC
)

ALTER TABLE dbo.MyBusinessEntity ADD  CONSTRAINT PK_MyBusinessEntity_BusinessEntityID PRIMARY KEY CLUSTERED
(
    BusinessEntityID ASC
)

ALTER TABLE dbo.MyPerson ADD  CONSTRAINT PK_Person_BusinessEntityID PRIMARY KEY CLUSTERED
(
    BusinessEntityID ASC
)[/sourcecode]

After building out the indexes, we get this execution plan:

NoFkIndexesAddedExecPlan

Oops. Except for the fact that we have Clustered Index Scan instead of Table Scan for most operators, the only real change to this is the inclusion of the Index Scan operation against the new index on the MyBusinessEntityAddress table. In short, no improvement at all.

OK, you might be thinking to yourself, what if this were a proper query with a WHERE clause? Let’s modify the queries:

[sourcecode language="sql"]SELECT p.LastName + ', ' + p.FirstName AS 'PersonName'
FROM Person.Address AS a
JOIN Person.BusinessEntityAddress AS bea
ON a.AddressID = bea.AddressID
JOIN Person.BusinessEntity AS be
ON bea.BusinessEntityID = be.BusinessEntityID
JOIN Person.Person AS p
ON be.BusinessEntityID = p.BusinessEntityID
WHERE p.LastName LIKE 'Ran%'

SELECT p.LastName + ', ' + p.FirstName AS 'PersonName'
FROM dbo.MyAddress AS a
JOIN dbo.MyBusinessEntityAddress AS bea
ON a.AddressID = bea.AddressID
JOIN dbo.MyBusinessEntity AS be
ON bea.BusinessEntityID = be.BusinessEntityID
JOIN dbo.MyPerson AS p
ON be.BusinessEntityID = p.BusinessEntityID
WHERE p.LastName LIKE 'Ran%'[/sourcecode]

The first query resulted in this execution plan:

ForeignKeyWhereExecPlan

Now clearly, more tuning is probably possible here, since we’re still looking at an Index Scan, but the point is not whether or not the query is tuned; the point is that the optimizer can eliminate tables because of the trust created by the foreign key constraints. And what did the other query produce?

NoFkIndexesAddedWhereExecPlan

Okay… the less said the better. Once again, tuning opportunities do exist, but we’re still accessing tables where it is absolutely not necessary, as demonstrated by the original query and its execution plan. The data returned has been the same, each and every time.
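
If you want to see the elimination come back without returning to the original tables, you can put the same foreign keys on the copies. This is a sketch of my own, not part of the tests above: the constraint names are made up, it assumes the primary keys from the earlier script are already in place, and the constraints are created WITH CHECK (the default) so that the optimizer can trust them. With trusted keys in place, re-running the modified query should produce the two-table plan again.

[sourcecode language="sql"]-- Hypothetical constraint names; WITH CHECK validates existing rows so the
-- constraints are trusted and the optimizer can use them for join elimination.
ALTER TABLE dbo.MyBusinessEntityAddress WITH CHECK
ADD CONSTRAINT FK_MyBEA_MyAddress FOREIGN KEY (AddressID)
REFERENCES dbo.MyAddress (AddressID);

ALTER TABLE dbo.MyBusinessEntityAddress WITH CHECK
ADD CONSTRAINT FK_MyBEA_MyBusinessEntity FOREIGN KEY (BusinessEntityID)
REFERENCES dbo.MyBusinessEntity (BusinessEntityID);

ALTER TABLE dbo.MyPerson WITH CHECK
ADD CONSTRAINT FK_MyPerson_MyBusinessEntity FOREIGN KEY (BusinessEntityID)
REFERENCES dbo.MyBusinessEntity (BusinessEntityID);[/sourcecode]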

Back to the question: do foreign key constraints help performance? Let me ask you one in return. Have you ever looked to see which is done more in your OLTP system, reads or writes? If you haven’t, take a look. I’ll bet the answer surprises you. My answer to the question: yes, they can. Not that they will, but that they can.
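
If you want a quick way to check, here’s a rough sketch of the kind of query I’d use (my addition, not something from the tests above). It compares reads to writes using sys.dm_db_index_usage_stats for the current database; the counters reset when the instance restarts, so treat the numbers as indicative rather than definitive.

[sourcecode language="sql"]-- Rough read/write comparison per table for the current database.
SELECT OBJECT_NAME(ius.object_id) AS TableName,
SUM(ius.user_seeks + ius.user_scans + ius.user_lookups) AS Reads,
SUM(ius.user_updates) AS Writes
FROM sys.dm_db_index_usage_stats AS ius
WHERE ius.database_id = DB_ID()
GROUP BY ius.object_id
ORDER BY Reads DESC;[/sourcecode]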

Nov 19 2010

Un-SQL Friday: Branding?

Following along with the new blog fest topic from Midnight DBA’s more interesting half.

My branding? I don’t have a clue. There, I said it. Up until very, very recently, I’ve just been doing stuff. I haven’t really even tried to think about what the plan was. Write a book? Sure, sounds like a cool challenge. Write another one? Well, the last one didn’t kill me, OK. Start a blog. Jump on Twitter. Pick a Name. Could I pretty please present at PASS?

I’ve been very focused on building my name, but I honestly never thought about my brand. It’s just not something that occurred to me. Yes, I am that slow. Now recently, stuff has been happening and I’m suddenly concerned about it. I too went through FreeCon with Brent Ozar (blog|twitter). And that got me thinking. Yeah, I’ve been doing all this stuff, but to what point? MVP? Yes, absolutely, but even that, why? I honestly don’t know. I’ve just kept going, running as hard as I could, winning a few, losing a few (let’s not talk about SQL Standard, ok?) and building the name… but I haven’t been building the brand. What’s more, I haven’t had a target that I was doing all this for (except the obvious). But, now I’m sort of there. I do sort of have this brand, Scary DBA, kilts, execution plans, MVP, query tuning… Cool! Now what?

I’ve been putting thought into it ever since FreeCon (and I have a blog post percolating about FreeCon, still need time to assimilate & type) and I’m still not entirely sure what my plans are, but I have some plans now. We’ll see how things go with them. Oh, and my words for my brand wheel:

  • Direct
  • Instructive (gone through three words here, still not happy)
  • Useful

Am I? I’m not sure. But those are the goals I’ve got and the approach I try to take.

Nov 18 2010

Reason for Early Termination of Statement

Wouldn’t you like to know why the optimizer stopped tuning the execution plan you’re looking at? It’s actually possible and simple to get this information. I talked about this a little more than a year ago, but I left out some information that might be useful.

Let’s take a very simple query:

[sourcecode language="sql"]SELECT * FROM Person.Address AS a;[/sourcecode]

This generates a simple execution plan, in fact a trivial plan:

Trivial Execution Plan

I know that this is a trivial plan in one of two ways. The hard way is to look at the XML (BTW, this is the type of thing you can’t see in STATISTICS PROFILE, which I understand a lot of people are using). Right at the top is the SELECT element. You can see that the value for the StatementOptmLevel property is equal to “TRIVIAL.” But reading XML is a pain. Right clicking on the SELECT operator above and selecting Properties will result in this property sheet:

Trivial Plan Properties Sheet

If you look at the properties here, just six up from the bottom is “Optimization Level,” the human readable equivalent to StatementOptmLevel. You can see that it’s set to TRIVIAL.

I know exactly what you’re thinking, “Hey, what the heck happened to early termination & stuff?”

Hang on. I’ll get you there. The fact is, not every plan has a reason for early termination, like the one above. When a plan is identified as TRIVIAL, it doesn’t go through full, cost-based optimization at all. This is why it doesn’t display a reason for early termination. It didn’t terminate early; the process just stopped. The trick is to get a more complicated query so that we get a more complicated plan:

[sourcecode language="sql"]SELECT p.LastName + ', ' + p.FirstName AS FullName,
a.AddressLine1,
a.City,
a.PostalCode
FROM Person.Address AS a
JOIN Person.BusinessEntityAddress AS bea
ON a.AddressID = bea.AddressID
JOIN Person.Person AS p
ON bea.BusinessEntityID = p.BusinessEntityID
WHERE a.AddressID = 252[/sourcecode]

This query will not return a trivial plan because the plan, despite the simplicity of the query, is too complex, due to the joins, etc. Regardless of the details about the trivial plan, the interesting point here is to be found, again, in the properties of the SELECT operator:

With this set of properties, look first at the “Optimization Level.” Instead of TRIVIAL, we have a FULL optimization, something to shoot for. Looking down near the bottom of the properties you can see “Reason For Early Termination.” The value is “Good Enough Plan Found,” which means that the optimizer feels, based on the statistics, that it has found a pretty good plan, not necessarily the best possible plan, but a plan that is good enough. Which is exactly what the optimizer is supposed to do. You can also see the termination reason in the XML for the plan, in the StatementOptmEarlyAbortReason property.
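
If you’d rather not click through plans one at a time, you can also pull these properties out of the plan cache. Here’s a minimal sketch of the sort of query I mean (mine, not part of the original post); it reads the showplan XML from sys.dm_exec_query_plan and picks out the two attributes for any cached plan that recorded an early termination reason.

[sourcecode language="sql"]-- Both attributes live on the StmtSimple element of the plan XML.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT st.text AS QueryText,
qp.query_plan.value('(//StmtSimple/@StatementOptmLevel)[1]', 'VARCHAR(25)') AS OptimizationLevel,
qp.query_plan.value('(//StmtSimple/@StatementOptmEarlyAbortReason)[1]', 'VARCHAR(50)') AS EarlyAbortReason
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE qp.query_plan.exist('//StmtSimple[@StatementOptmEarlyAbortReason]') = 1;[/sourcecode]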

The trouble comes in when the queries are truly complex; nested views, where views join to views, are a good example. With a view, the optimizer can look at the query being run against it and determine whether all the tables are needed to satisfy the query. It’s possible that some tables are not needed. But when you start nesting views and joining views to views, well, things change. This is just an example (yes, I’m using *, I wouldn’t do this in production, I just need to get an ugly plan, and this is ugly) that uses three views from AdventureWorks2008R2:

[sourcecode language="sql"]SELECT *
FROM HumanResources.vEmployee AS ve
JOIN Sales.vSalesPerson AS vsp
ON ve.BusinessEntityID = vsp.BusinessEntityID
JOIN Sales.vSalesPersonSalesByFiscalYears AS vspsbfy
ON vspsbfy.SalesPersonID = vsp.BusinessEntityID[/sourcecode]

This, seemingly simple, query produces a pretty heinous execution plan. I’ve zoomed and shrunk it to see it all at once:

And while this seems quite complex, the fact is, these views, and this execution plan, are relatively well written. My query against them is fairly out of line, but imagine what the plan might look like if someone who would write such an out of line query also wrote the views. Yes, these execution plans can get quite ugly. If we go on to examine the properties on the SELECT statement we’ll see something different:

This time you see that the “Optimization Level” is FULL, but the “Reason For Early Termination” is “Time Out.” This means that the optimizer, which doesn’t really measure time but rather the number of attempts it makes at finding an optimal plan, has run through all its attempts and arrived at something less than “good enough.” This time it simply stopped trying. In this case, it takes the plan that is currently the best, which might be a good plan… or it might not be, and applies that plan to the query. While it’s guaranteed to return the data correctly, it may or may not return the data efficiently, and looking at the execution plan in this case, that’s unlikely.

Why is this? Well, the easiest place to start is pretty simple, look at the number of tables in that execution plan. If you take into account the data that could meet the criteria of the query, in this case, most of it, and the number of tables, you can see why the optimizer might be taxed. After that, things get complicated. You have to begin to look at how you’re putting your query together, the types of functions and methods you’re using and, in general, the overall complexity of the process. My respect for the developers who have created the optimizer is just about boundless. These are some amazing people to have put all this together. But, the problem they’re solving is incredibly difficult, and they are not working miracles (even though it sometimes feels that way). The best thing you can do is to understand how the optimizer works and try to help it along in places where it might need it.

There is one more reason for early termination of the optimizer, and you may see it. I can’t reproduce it at will in a sample database, but it’s “Memory Limit Exceeded.” That one is largely self-explanatory and, unlike the other two values, it represents a serious problem. In this case, you need to do something about the available memory on your server. You can increase memory or decrease memory pressure by taking load off the server. Those are really your only options. You’ll still get a functional execution plan on this, but, like the timeout, it’s very likely this is not an optimal plan and certainly not “good enough.”

Nov 16 2010

SQL Server Team-Based Development

The new book is available in a free download from Red Gate. I had a lot of fun working on the three chapters that I did for this book. The topics I received are near & dear to me because of all the time I’ve spent working on getting databases into source control and automating (as much as possible) the deployment of databases out of source control. Everything I go over in the book might not be 100% applicable to your situation, but I’m pretty sure almost anyone involved in database development and deployment can find some useful information there. For those who are interested, I don’t just cover Red Gate tools in the book either. There’s quite a bit of time spent describing how to automate deployments using Visual Studio Team System 2010. I also go through ways you can collect and manage code snippets (the best being to use SQL Prompt, but there are others).

I’ve been reading the rest of the book and it’s filled with excellent information too. Yeah, you may already have a naming standard or you might already feel that you’ve got your schema well in hand and you might not see the utility of testing databases. But, read through this book and I’ll bet you pick up one or two things in almost every chapter. The guys who worked on this, Phil Factor (blog|twitter), Alex Kuznetsov (blog) and Mladen Prajdic (blog|twitter), are extremely smart and very informed on all the topics they tackled. There really is something in this book for everyone… unless you’re that guy that is sitting all alone and does everything for the company. Most of the rest of us work on teams, even if the teams are small. Small or large, that’s who this book is written for.

If you want to move your database development game up another notch, I strongly recommend taking a look.

Nov 15 2010

Switching Off Parameter Sniffing

Or, another way to put it, in most cases, shooting yourself in the foot.

I was not aware that the cumulative update for SQL Server 2008 back in June included a switch that allows you to turn parameter sniffing off within SQL Server. Thanks to Kendra Little (blog|twitter) for letting me know about it (although she let me know by “stumping the chump” during my lightning talk at the Summit, thanks Kendra!).

When I first saw the switch, I thought about the places where switching off parameter sniffing could be helpful. But the more I thought about it, the more I realized that this is an extremely dangerous switch. Why? Because most people only ever hear about parameter sniffing when they run into a problem. Someone says “parameter sniffing” and you see people cringe. Too many people will take this information in and go, “Hey, I can just switch parameter sniffing off and I’ll have a much faster system, all the time.” But even when you’re not hitting a problem with parameter sniffing, you’re still getting parameter sniffing. Here is where I see a problem. Let’s discuss what parameter sniffing is.

Parameter sniffing applies to stored procedures and parameterized queries. When a value is passed to a parameter, the optimizer has the ability to read, or “sniff,” the value of that parameter. It can do this because it knows exactly what the value is when the proc or query is called. This does not apply to local variables, because the optimizer can’t really know what those values might be, whereas it knows exactly what the values of parameters are going in. Why does it do this? One word: statistics. Statistics are what the optimizer uses to determine how queries will be executed. If the optimizer is given a specific value, it can compare that value to the statistics on the index or table in question and get as good an answer as is possible from those statistics as to how selective that value may be. That information determines how the optimizer will run the query, and because it is using specific values, it’s looking at specific information within the stats. If the parameters are not sniffed, the statistics are sampled and a generic value is assumed, which can result in a different execution plan.
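
To see the sniffing in action, here’s a quick sketch against AdventureWorks2008R2 (my example; the procedure name is made up and isn’t from the original discussion). Run the procedure with the actual execution plan enabled and look at the Parameter List in the SELECT operator’s properties: the compiled value shown there is the sniffed value the plan was built for.

[sourcecode language="sql"]-- Hypothetical demonstration procedure.
CREATE PROCEDURE dbo.OrderDetailByProduct
@ProductID INT
AS
SELECT sod.SalesOrderID,
sod.OrderQty,
sod.LineTotal
FROM Sales.SalesOrderDetail AS sod
WHERE sod.ProductID = @ProductID;
GO

-- The value passed here is what gets sniffed when the plan is compiled;
-- any valid ProductID will do.
EXEC dbo.OrderDetailByProduct @ProductID = 897;[/sourcecode]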

The problem with parameter sniffing occurs when you have out of date statistics or data skew (certain values which return a wildly different set of results compared to the rest of the data within the table). The bad statistics or skew can result in an execution plan that is not consistent with most of the data that the stats represent. However, for most of us, this is an edge case.

Most of the time we’re going to benefit from parameter sniffing because the specific values lead to more accurate, not less accurate, execution plans. Sampled data, basically an average of the data in the statistics, can lead to a more stable execution plan, but a less accurate one. Switching parameter sniffing off means that all queries will use sampled data, which can seriously hurt performance. Most of the time, most of us are benefiting wildly from the strengths of parameter sniffing and only occasionally are we seeing the problems.
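
When you do hit one of those occasional problem queries, a targeted fix is usually a safer bet than a server-wide switch. One option, sketched here against the hypothetical procedure above, is the OPTIMIZE FOR UNKNOWN hint (available from SQL Server 2008 on), which makes just that one statement use the sampled, averaged statistics while every other query on the system keeps the benefit of sniffing.

[sourcecode language="sql"]ALTER PROCEDURE dbo.OrderDetailByProduct
@ProductID INT
AS
SELECT sod.SalesOrderID,
sod.OrderQty,
sod.LineTotal
FROM Sales.SalesOrderDetail AS sod
WHERE sod.ProductID = @ProductID
-- Only this statement gives up the sniffed value in favor of averaged statistics.
OPTION (OPTIMIZE FOR UNKNOWN);[/sourcecode]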

Unless you know, and I mean know, not suspect, that your system has major and systematic issues with parameter sniffing, leave this switch alone and let the optimizer make these choices for you. If you don’t, it’s very likely that you’ll see a performance hit on your system.

Nov 15 2010

ScaryDBA Blog Reboot

After all the conversations at FreeCon last week (more on that later, when I’ve assimilated it better), I finally decided to make some changes. The list of things I’m hitting is extremely long, but I’m going to be going through them a few at a time. First, and easiest, was getting moved off the WordPress domain to my own, hosted, domain.

You’re seeing the results. This is the new ScaryDBA blog. I suspect the theme may change once or twice over the next week, but basically, all the functionality that has been added, better feeds, conversation tracking, other stuff, will remain.

Thank you for visiting in the past. I intend to provide you with plenty of reasons to visit in the future.

Nov 12 2010

The New Path To MCM

Microsoft has announced changes to the MCM program. This page shows all the ways that you can become an MCM without having to spend a month at Microsoft.

I’m interested in this for two reasons. First, I’d love to have the opportunity to try to become an MCM. Making it easier to make the attempt, hopefully without dumbing down the curriculum, is a great idea. Second, scroll down to the books section. That’s right, my book is one of the suggested books for learning enough so that you can pass the MCM test. I’m gob-smacked and honored and jazzed and I’ll stop gushing now.

Nov 11 2010

PASS Summit 2010, Day 3 Key Note

Today is Dr. Dewitt.

The ballroom, where the keynotes are held, is filled with extra chairs. The Summit organizers expect extra attendance today, and well they should. Dr. Dewitt was amazing last year. I suspect this year will be more of the same.

Rick Heiges is introducing the day (waiting for Dr. Dewitt). Lynda Rab is leaving the board. Sad. I started volunteering for the PASS organization working for Lynda. She’s great. The new board members are Douglas McDowell, Andy Warren and Allen Kinsel.

The spring SQL Rally event was announced. I’ll be presenting a full day session on query performance, Query Performance Tuning, Start to Finish. Look for (a lot) more blog posts on this. The Summit next year has been moved to mid-October. WHOOP! This is great because I was going to miss it next year. Oct 11-15 will be the dates in 2011. Of course, it’ll be at Seattle.

Dr. Dewitt is finally on stage. From this point forward, I’ll be just posting his words & some comments. This is my best attempt to capture the information. There will be typos.

Query optimization is a really hard problem. Dr. Dewitt says, “I’m running out of ideas.” Yeah, right. His “Impress Index” is basically an arrow going down. He’s cracking jokes about his delivery, asking, How Can I Possibly Impress You. He’s showing this strange picture that has 240 separate colors that each represent an exec plan in the optimizer. We’ll be back to that. This session was voted on. I’m glad optimization won. They live in fear of regression, talking about the optimizer developers.

The 100,000 foot view: magic happens. He’s working off of TPC-H benchmark query 8. There are 22 million ways of executing this query. The optimizer has to spend a few seconds to pick the correct plan from this full set. It’s still possible to pick bad plans. Cost-based optimization came from System R and a lady named Pat Selinger at IBM. Optimization is the hardest part of building a DBMS, even after 30 years. The situation is further complicated by advances in hardware and functionality within the DBMS.

The goal of the optimizer is to transform SQL queries into an efficient execution plan. The parser turns out a logical operator tree, which then goes to the optimizer, and a physical operator tree is sent to the execution engine. He’s showing a simple table, based on movie reviews. The query is a SELECT with AVG. Two possible plans. A scan occurs first, then a filter is applied to pull out the right movie and then an aggregate occurs. With this you’ll get a scan, meaning I/O corresponds to the number of pages in the table. Plan 2 uses an index to pull pages from the non-clustered index. This means random disk access that will look up the movies and then pass that on to the aggregate. The optimizer then has to figure out which is faster. The optimizer estimates the cost based on the statistics it has in hand. It has to estimate how many movies there are. So it estimates the selectivity of the predicate, then it calculates the cost of the plans in terms of CPU and I/O time.

So there are equivalence rules, such as for the select & join operators. Join operators are associative and commutative, meaning the order in which tables are joined can be changed without changing the results. The select operator distributes over joins, and there are multiple ways of getting back the same information, all evaluated by the optimizer.

With a more complicated query, it could start with a selection of customers, then a selection of reviews, join them together, then join to the movies table, and then project out the columns wanted in the select. But with equivalence rules, you can get other plans. The selects-distribute-over-joins rule gets a different plan, or the selects-commute rule can change the plan. He showed five different plans, then four more plans, and said he could have done another 20. For this simple query, he came up with 9 logically equivalent plans. All nine will produce the same data. For each of the 9 plans there is a large number of alternate physical plans that the optimizer can choose.

Assume the optimizer has three join strategies: nested loops, sort-merge & hash. He’s also assuming two selection strategies: sequential scan or index scan. Obviously, this is simplified. So, using these three join methods & two select methods, there are 36 possible physical alternatives for one logical plan. So with 9 logical plans there are 9*36 = 324 possible physical plans. And that’s for a VERY simple query.

Selectivity estimation is the task of estimating how many rows can satisfy a predicate like MoviesId = 932. Plan quality is highly dependent on the quality of the estimates that the optimizer makes.

I just sent in a question.

So the histogram shows the distribution of the data within the table. There isn’t enough space within the db to store detailed statistical info, so the solution is histograms. You can have different kinds. The equi-width histogram divides the rows into equal-sized buckets and then figures out how many values match each range of values. So, for an actual value, it might be .059 selectivity, but the estimated value is actually .050. That’s extremely close. But another value he shows has .011 actual but in the histogram is .082, which is a HUGE error. Hello, bad execution plan.

Another approach is equi-height histograms. These divide the ranges so that all buckets contain roughly the same number of rows, as opposed to an equal distribution of values. In equi-height, the second example is .033 instead of .082. Which is pretty good, but still skewed. He’s basically showing that errors can be introduced all over the place. The first example is .167.

Histograms are the critical tool for estimating selectivity factors for selection predicates. But errors still occur. The deal is, there’s just a limited amount of space for these. Other statistics are rows, pages, etc.
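
(A side note of my own, not from Dr. Dewitt’s talk: if you want to see SQL Server’s version of what he’s describing, DBCC SHOW_STATISTICS dumps the header, density vector and histogram for a statistics object. The index below is one from AdventureWorks2008R2.)

[sourcecode language="sql"]-- Shows the statistics header, the density vector and the up-to-200-step histogram.
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName_FirstName_MiddleName');[/sourcecode]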

In estimating costs, the optimizer considers I/O time and CPU time. Actual values are highly dependent on the CPU and I/O subsystem on which the query will be run. For a parallel database system, such as PDW (plug), the problem also includes network traffic. So back to the two alternative physical plans… You have to determine which plan is cheaper. Assuming that the optimizer gets it right, we know that there are 100 rows out of 100k pages. These are sorted on date, but we’re going for MovieID, so random reads. The optimizer doesn’t know what system it’s on, but it makes a guess that a scan will take 8 seconds. The filter will work at .1 microseconds/row and the aggregate at .1 microseconds/row, for .00001 seconds, for a total of 9 seconds. Plan two will use the index. Since the rows are sorted on date, random seeks are going to occur: .003 seconds/seek, for a total time of .3 seconds, and the same time for the aggregate. This means plan two is the winner.

But what if the estimates are wrong? On a log plot, you start to see how each plan performs as the number of rows returned changes. More rows will make plan 1 better, but fewer will make plan 2 better.

That was just to get the data out of a table. When you add in JOIN costs, things get worse. The first example is a sort-merge join. This sorts each data set being returned, and then merges the results through a simple scan. The cost is 5R + 5M for I/O. A nested loop works by scanning one table and, row by row, scanning the other table. The cost is R + R * M, where R is rows and M is pages.

With the example, you can see that with an index in place and a highly selective predicate, loop joins can be cheap. But it’s the cardinalities that affect things, so getting the histogram right is the key trick. With a log plot, again, you see how the various operations vary over time. For a sort-merge, it’s very expensive at a low number of rows, but at a large number of rows it still returns in about the same amount of time. So as large sets of data are accessed, merge gets good. But at lower numbers of rows, the nested loop works better. So if the cardinality estimate is off, you could get a huge error in performance, especially at the larger sets of data. The optimizer has to pick the right join method, and this is based on the number of rows in each set of data being joined.

He then moves on to talk about how much space these things take up. The space depends on the “shape” of the query. He shows a type called a “star” join and a type called a “chain” join. Whoa! As you increase tables, the likely number of plans increases a lot. I knew this, but I haven’t seen it written down like this. But these shapes are extremes.

Every query optimizer starts off with a left-deep plan first, instead of bushy plans. For the example, a bushy tree would have 645k equivalents for the star join as opposed to 10k for left-deep plans. With 3 join methods and n joins in a query, there will be 3 to the power of n possible physical plans. Uh… wow. Instead, the optimizer uses dynamic programming. Sometimes heuristics will cause the best plan to be missed.

One method of optimization is bottom-up. Optimization is performed in N passes (if N relations are joined). First pass: find the best 1-relation plan for each relation. Pass 2: find the best way to join the result of each 1-relation plan to another relation to generate all 2-relation plans. Pass n: find the best join result… can’t see it. It keeps the lowest cost plans and interesting row orders. In spite of pruning the plan space, this approach is still exponential in the number of tables. Costs are calculated, then pruning occurs. I’ve stopped taking notes on this part. You’ll have to see how this works in the slide deck (I’ll post the location at the end).

So that’s the theory. But the problem is, bad plans can be picked: the statistics are missing or out of date, cardinality estimates are made against skewed data, attribute values are correlated, and then there’s regression, where hardware changes mess stuff up.

Opportunities to improve: Jayant Haritsa has the Picasso Project. Bing this: Picasso Haritsa. There’s actually software there that helps improve values. He’s back to TPC-H Query 8, and using the tool, it will show the plan space for the query; this is the painting of the cool picture at the start of the talk. With this, you can see how sensitive plan generation is to the input parameters. So the cardinality estimates are the key.

This animation shows how the estimated costs for a query start low, peak, and then, instead of continuing up, go back down. And the optimizer team doesn’t know why. This is his example of how QO is indeed harder than rocket science.

What can you do better? Well, indexed nested loops look good, but they’re not stable across the range of selectivity factors. If the optimizer went conservative and always picked sort-merge, it would be more stable. So picking slower operations could make things more stable, just slower. Robustness is tied to the number of plans. And he says the QO team doesn’t fully understand this.

At QO time, have the QO annotate compiled query plans with statistics and check operators. Then you can see how this stuff works. They use this in two ways: a learning optimizer and dynamic reoptimization. In the learning optimizer, the observed stats go back to a statistics tracker and then feed back into the catalog, and the next query will be better. Dynamic reoptimization takes the idea that when the actual stats don’t match the estimated stats, you truncate the operation, pause the execution, write the intermediate results out to tempdb, store them, and then use them with the rest of the query to re-optimize using real values. Cool!

Key points: Query optimization is harder than rocket science. Three phases of QO: enumeration of the logical plan space, enumeration of alternate physical plans, and selectivity estimation. The QO team of every DB vendor lives in fear of regressions, but it’s going to happen, so cut the optimizer some slack.

“Microsoft Jim Gray Systems Lab” on FaceBook is the source for the slides. Available here.