Oct 17 2011

PASS Summit 2011: Wrap-Up

DSC06049Another Summit done gone by and I’m exhausted. You know you did the Summit correctly if you’re crawling onto the plane to leave because you can no longer stand. That’s where I was on Saturday.

I’m going to go ahead and write down a few thoughts about the Summit, but I want to point you over to Andy Leonard’s (blog|twitter) blog for what is a truly great summary. I won’t blame you if you read that & skip this.

Still here?

This year the Summit was amazing. I think it’s probably the best run Summit I’ve attended. From registration on Sunday night, to the final sessions on Friday, everything was dealt with in a timely and professional fashion that made the place a joy to attend. Huge kudos to the staff and the volunteers who organized and ran it.

Want to know what PASS is about? Connect, Share, Learn. Yeah, that’s the new SQL Chuck Norris at the end. More on that in another blog post. But that’s it. It’s the people. You make connections with marvelous individuals at the Summit. Then, you share with them. They talk to you. You talk to them. You both learn. It’s a glory. If you’re not meeting people at the PASS Summit, you’re doing it wrong. Other conventions may be all about the vendor floor or the sessions, but PASS is all about the people (and the vendor floor and the sessions).

My own experience started with presenting a pre-con with Gail Shaw (blog|twitter). All I’ll say is, Gail & I had glorious fun. The MVP party was Monday night and… wow! Thanks Microsoft. Tuesday I attended some sessions away from the Summit. Tuesday night I went to the First Timer’s event because I wanted to see what Tom LaRock (blog|twitter) had cooked up. It was great. The big sister/brother program at the beginning was slightly rocky, but from there it took off. The reception was great and the Quiz Bowl rocked, again (although I missed half of it because I was sent off to get beers for people on stage, proof of my love for them, Chuck, Andy…) Then it was down to the Red Gate/SQL Server Central party to see a dear friend receive the Exceptional DBA of the Year award. Jeff Moden should probably get an award for the Decade, but that hasn’t been invented yet.

I was invited to sit at the bloggers table (which I’ve now dubbed “the problem children’s table” we’ll see if the moniker sticks) for all three key notes. I live-blogged every one and you can read about them here. Dr. Dewitt (blog) stole the show… again. It actually sounds like he might be back next year too. Very exciting. And yes, that’s a picture of me with Dr. Dewitt. The restraining order hasn’t been issued yet.

Over the week I only attended a few sessions, two-three a day. It’s just hard to stop talking to people. Plus, this year, I worked the Red Gate booth as a vendor as much as possible (I had to leave several times to do other things, like attend the Women In Technology lunch for the first time). That was incredibly hard work. My hat’s off to all the vendors. You guys have no clue just how much they’re busting their humps in there. We announced a little contest on Wednesday. Looks like fun.

I want to point out one individual who just, for want of a better term, assaulted the Summit. This was her first PASS Summit. She organized and led the first ever #sqlrun (a bunch of people got together at 7AM and ran two miles, although I cut out early to shower before the key note… you’re welcome), presented a session at her first Summit, was the host for Women in Technology, met just about every single person at the Summit, and was just a glorious little bundle of energy through the whole thing. Jes Borland (blog|twitter) is one to watch.

During the week, the book SQL Server Deep Dives 2 was released. All proceeds from the book are going to Operation Smile. If you attended the summit you can download a free sample of this book and Volume 1 from Manning.

Taken as a whole, the Summit was just simply wonderful. This community just doesn’t stop giving, and sharing. I’m so glad I get to take part in it.

Oct 14 2011

PASS Keynote Day #3: Dr. Dewitt

And we’re off. We opened with a video of people saying “Connect, Share, Learn” and “This, is Community”

Rob Farley & Buck Woody came out with a song about long running queries.

[8:20]Wayne Snyder has been working with the PASS organization since 1999. He spoke at the first PASS Summit and he’s been on the board forever. He has finally hit the point as immediate-past president where he has to leave the board. We’ve got a great little thank you for him from all sorts of people. Wayne is a magnificent guy, seriously. If you see him, thank him for his service.

[8:28]We have a new executive committee, Bill Graziano as President, WHOOP, Douglas McDowell is Executive Vice-President and Thomas LaRock is VP of Marketing and finally, Rushabh Mehta is now the immediate past president.

SQL Rally Nordic is taking place in Sweden and has completely sold out. SQL Rally Dallas will be in May. We have tons of SQL Saturday’s coming up.

You can get a registration, including 2 full days of Pre-Con between now & Nov 15, for $1395.

[8:33]Dr. David Dewitt, Big Data, What’s the Big Deal?

I got to meet Dr. Dewitt earlier in the week. I’m very excited about this presentation.

He’s going to be presenting with a co-presenter, Rimma Nehme is going to be on stage helping out. His presentations are magnificent.

[8:38]Dr. Dewitt, despite being smarter than the whole room, is really funny. He’s opening up with some great slides and some good humor.

And then we’re off. We’re talking big data. Petabytes. Typically housed on large clusters of low-cost commodity hardware. He’s also talking Zetabytes. Uh, wow.

Why are things growing so much? More and more things are picking up data. There are sensors from phone location & others. There are web clicks and page views. Data has been determined to be too valuable to delete. Cost of storage has dropped.

[8:42]Managing “Big Data” The old system is to use a parallel database system. like Ebay with 10PB on 256 nodes. New stuff is a NoSQL System like Facebook, 20PB on 2700 nodes. Bing is 150PB on 40K nodes. Wow!

NoSQL is not meant to be that SQL is dead, but that there are things in addition to SQL.

Why do people love NoSQL? There is more data model flexibility. Relaxed consistency models such as eventual consistency. Low upfront software costs. Never learned anything but C/Java in school, so maybe they’re not smart enough. Finally, introducing time to insight.

[8:45]Time to insight. The idea is not to load data into a system, but instead of schema first, they just want to get it. No cleansing, No ETL, No load, analyze data where it stands. Schema first, vs. Schema later.

Major Types of NoSQL Systems. Key/Value Stores like MongoDB or Cassandra, Usually have a data model such as JSON, records are sharded across nodes in a cluster by hashing on key, single record retrieval. The other kind is Hadoop, scalable fault tolerant framework for storing and processing massive data sets, no data model, records are stored in a file system. The first is NOSQL OLTP and the second is NOSQL Datawarehouse.

[8:50]The new reality is that we’re really going to see two universes as the new reality, Structures and Unstructured data or relational systems and NoSQL systems. Relational databsae systems provide maturity, stability, consistency. The noSQL systems are all about flexibility.

Why is Dr. Dewit talking? Because the world has changed. Relational db systems no longer the only game in town. As SQL people we must accept this new reality and understand how best to deploy technologies. This is not a paradigm shift. RDBMS will continjue to dominate transactoin processing and all small to medium sized data warehouse. But many Businesses will end up with data in both universes.

[8:52] Hadoop all started at Google. They needed to manage massive amounts of click stream data. It had to e scalable, fault tolerant, easy to program against.

What does Hadoop Offer? Ability to analyze massive amounts of data. Scalable. Easy to program, low upfront costs, think big data warehousing

The stack is a HDFS at the bottom, then MapReduce, then Hive & Pig, on the size is Sqoop and then there are other management parts.

[8:55] Underpinnings of the entire Hadoop ecosystem. HDFS. Traditional hierarchical file system. Written in Java so it’s highly portable.

File splits are done through 64mb chunks and then the blocks are stored around the cluster. Each block is stored as a separate file.

Disk placement. A replication factor is set. Assuming a set of 3, it uses triple replication. Then you can survive two failures.

There is a name node, which is one instance per cluster, which is a single point of failure. There’s a backup node, which backs up the name node? And then there’s a series of data nodes.

Giant file comes in, a bunch of blocks are created, then the namenode receives messages with the blocks and the namenode moves it into appropriate datanodes, but the client does the writes, the namenode just balances and replicates.

[9:02]Reads go the other way, the namenode tells the client where the data is stored and reads it back out that way.

Failures can occur through disk errors, datanode failures, switch/rack failures, namenode failures, data center failures.

Datanode failures is handled by the namenode which always manages the datanodes, tracking what’s stored where and which datanodes are available and which are not. When there’s a failure of a datanode, the namenode will identify which blocks were stored on that datanode and will replicate them to other nodes. Further, it will balance stuff out as datanodes come back online.

[9:07]This means you get something that is highly scalable. No use of mirroring or RAID, which reduces cost. It uses a single mechanism (triply replicated blocks) t deal with a variety of failure types rathern thatn multiple different mechanisms. Negatives, block locations and record placement invisible to higher level software. Makes it impossible to employ many optimizations successfully employed by parallel DB systems.

So to improve performance they use MapReduce. Takes a large problem and divides it into small problems. Perform the same function on all sub-problems. Combine the output from all sub-problems. The first is the map, the last is the reduce.

MapReduce is done by a Master, JobTracker, and a set of Slaves, TestTracker. The JobTracker watches failures, etc.

It all works with HDFS. On each node there is a TaskTracker and a DataNode and the JobTracker is on the server with the NameNode.

[9:15]Seeing the data come out of the MapReduce mechanism, but then you see that the data can’t be grouped in mechanisms other than how they’re stored.

Reduce Phase basically takes each mapper and reads from them to get the information out of the reducer. Each reducer works with a mapper and the reducer is the thing that applies the function that actually fixes the data coming out of the mappers.

Yeah, I’m starting to get a bit lost.

Actual number of Map tasks, M is generally much larger than the number of nodes used. This heps deal with data skew and failure. Skew with reducers is still aproblem.

Failures, like HDFS, MapReduce framework is fault tolerant & other stuff.

Beauty of this stuff is that it is highly fault tolerant, relativey easy to write arbitrary distributed computations, mr framework removes burden of dealing with failures from programmer.

Cons are Schema embedded in application code, which means that sharing data between apps is really hard. Also, performance tuning is difficult.

Keeping up with this stuff as fast as I can. We’re drinking from a really big fire hose in here.

[9:24]Hive and Pig. MapReduce can’t really do joins. Developers can spend days writing apps to analyze data like what we can do with a query in the relational systems (although I know people that take days to write a TSQL query). Declarative query languages are not going away. It’s still efficient for what it does.

Hive and HiveQL s the mechanism used to put in a query language. Hive has tables. Richer column types than in SQL. You get the primitive types, but you also get stuff like associative arrays, lists, structs.Hive tables have to be partitioned. It’s still using HDFS files.

[9:28]All the files are stored in chunks. If there’s no filtering, it will go against all files. This thing could seriously thrash disks, especially when you consider the fact that data is not relational at all.

HiveQL Optimization and Execution. There is very little statistics. Uses simple heuristics of pushing sections below joins, output of … something. slide went by too quick.

PDW vs. Hive. Testing using 600gb from TPC benchmarks. On small data sets, for straight forward queries, it was pretty radically different. Then when you complicate it, hive was about 4000 seconds, pdw is about 1000, and then pdw-p is a factor of 10 faster. That’s because of how parallel data systems can work.

Hive vs. PDW. Partitioning the hive tables provides no benefit since there is no way t control where HDFS places the blocks. Different for PDW.

We’re going to have to connect the two universes. Increasingly the data first lands in unstructured universe. MapReduce is an excellent big data ETL tool. Sqoop provides a command line load utility.

Some analyses are easier to do in a procedural language. Sqoop provides querycapability to pull data from RDBMS using SQL, but you can’t get good performance.

Some applications need data from both universes. Only option is unstructured universe as unstructured data can’t go into structured. Sqoop moves it over to there.

This means that there are some types of queries that are never going to perform well with this data.

[9:40]And I just got lost. Sqoop is really complicated. It basically moves the data in & out of the two universes, which scans the entire table (yes, scans) a table N+1 times.

There has to be a better way!

Moving data is so 20th Century. Why not build a database system that understands both systems. It can have the expressive power of a language like HiveQL. He’s trying to build an “Enterprise Data Manager” which his partner hates (name, open to suggestions).

Dr. Dewitt asserts that SQL Server PDW just needs to understand unstructured data. It needs improved scalability.

Jenn McCown of Midnight DBA suggested TARDIS because it can move between universes. I like that. Let’s lobby.

Remember, this is not a paradigm shift. These things are designed to meet different problems. RDBMS only or HADOOP only is not going to be the default.

Send ideas for next year to dewitt@microsoft.com We want this man to come back guys.

This was another great presentation from Dr. Dewitt!

[9:48]And now for the Q&A.Bing David Dewitt and you can get a link to the PASS Talks. The slide deck is available.

What features can we expect to see in SQL Server that manage private cloud. He can’t answer.

What are impacts of big data on scientific community? Dr. Dewitt talks about how the Sloan digital sky survey data was managed by Jim Gray (before he was lost at sea). They are working on building database systems for scientific data that allows for declarative languages. he says that the science community just doesn’t use anything but files. They’re trying.

Here are his slide decks

[9:55]Can you elaborate more on the importance of supporting Hadoop? he does believe that there are two universes, so hadoop is out there and running next to SQL Server, today. The world has spoken. The two things are being used. We need to embrace it. We should not bury our heads in the sand.

As a DBA working primarily with relational databases, what should I do to be better prepared for this new universe? Dr. Dewitt says download and play with the code. Get started.

Oct 12 2011

PASS Summit 2011 Day #1 Post #1

Hello again. The PASS Organization has once more allowed me to sit at the bloggers table for the key note. I’ll be posting updates as the information unfolds.

[8:06] The crowd is filling in. this looks like one of the biggest crowds I’ve ever seen at the summit. The organizers have done a great job so far. Thanks for all the hard work guys.

[8:10] SQUEEEEE! I just met Dr. David Dewitt and got a picture with him. I’ll post it after I get home and find the cable.

[8:22] Watching the videos of different people on the big screen talking about what they like about the PASS Summit is really cool.

[8:26] Rushahb Mehta is introducing the board and the Partners of the PASS board. We now have a new additional structure which is three people from different regions of the world, Rob Farley Raoul Illyes and James Rowland-Jones.

[8:28] The organization is up to 430k training hours this year if you count up all the various ways the organization reaches out to people through the Summit, 24 Hours of PASS, Rally, Virtual Chapters, SQL Saturday and the PASS Chapters. The organization really has grown a lot. Plus, they’re working with SQLBits and SQL Relay in the UK.

[8:32] They have a bunch of the bloggers and twitterers up on the screen. Nice to see everyone. It’s also nice to say that I know most of them and I’m friends with most of them. This is an amazing community.

[8:36] The different ways that we have for networking are excellent. The First Timers method for introducing people to each other has really taken off from last year. Then there are all the receptions and the Women in Technology lunch. There are just some great ways to make connections within this community and those connections really will help you in your job and your live.

[8:41] Yes, please visit the vendors and thank them, even if you don’t buy anything. And fill out your eval forms, it’s the one give you can give back to your speaker. If they did something wrong, let them know what it was. If you didn’t like their presentation, let them know why.

[8:42] Key note speaker is Ted Kummert from Microsoft. His topic is The New World of Data

[8:44] No other way to describe this, but we’re going down the marketing path now. The appliance choices in SQL Server are expanding and becoming more and more attractive to the enterprise, no question. SQL Server has a number of versions and service packs released over the last year. Azure is still a choice that Microsoft is making. Whether or not others are making that choice is still very much up in the air. Marketing or not, Microsoft is clearly committing 100% to this. Reporting and cubes are being set out there too.

[8:49]Denali: It’s coming. There are lots of new mechanisms and structures, especially supporting BI and knowledge management.

[8:52]First bit of news, Project Crescent will be released as Power View.

[8:53] We have a new name people, no longer referred to as Denali, but now referred to as SQL Server 2012. It’s official. Further, the release date is the “first half of 2012”

[8:55]Microsoft’s plan is all about “any data, any size, anywhere.” Concept is fine. Implementation is the question though isn’t it. Microsoft is starting to talk about Big Data. No definition yet on what that means. Ah, they’re going to support Hadoop as part of the product suite. Cool.

[9:00] Announcements, they’re going to support hadoop, odb driver & add-in in Excel for Apache Hive and a JavaScript framework for Hadoop. They’re going to partner up with HortonWorks as part of getting Apache Hadoop going.

[9:02] Now talking Eric Baldeschwieler, the CEO of hortonworks.

[9:04] Denny Lee, Microsoft developer is now coming out to show us how to move hundreds of millions of rows using Hadoop and Azure. Now, I’m excited. Denny Lee is a good, excited, informed presenter. This is what we like. Nerds excited about technology sharing the knowledge and passion.

He’s starting with the Hive ODBC driver in PowerPivot to connect to over 300 files. The ODBC driver will be available next month, so PowerPivot and Hadoop are connecting.

He’s pulling from a multi-node hadoop cluster to an excel workbook. He’s showing reports through Excel into hadoop data. It’s nice. No question.

[9:16] They’re working on capturing everyone’s data that is going to allow people explore everyone’s data. It’s called Data Explorer and will be part of Azure labs later.

[9:18]Introducing Contoso Frozen Yogurts, the self-service yogurt company… Anyway, it’s a sample set of data showing locations & performance scores of the various stores. They’re using that to relate to other global data available. So the machine is going to start learning what data might be interesting to you.

[9:28]Unfortunately, these demos, while somewhat interesting, were pretty dull, that, and the fact that I’m sitting with a bunch of witty people with ADD, we’re suddenly paying no attention.

Look, it’s interesting, but there is a TON of code behind the scenes to make this stuff happen. And, all that code has to be on Azure, and once it’s put together, you can pull it down to Excel.

[9:34] Power View is all about getting data out to the business people. That’s good. Amir Netz, Technical Fellow at Microsoft, is up to do a demo.

[9:36] I’m taking off a little early today. There’s a big announcement down in the expo hall from Red Gate Software at 10:05 this morning. please keep an eye out for it. It’s going to be HUGE!

Oct 03 2011

Networking

No, I’m not talking about hubs and switches. I’m talking about people. Networking is a major component of an event like the PASS Summit. Whether you’re networking with individuals or with organizations, this is an opportunity to build direct, personal connections with people that can, and will, help you in your professional career. With the economy behaving as it is, having a better network gives you an edge over your peers. You’re going to be better and faster at solving the tough problems, not because you’re smarter, but because you have contacts that have already solved that problem (which, actually, means you are smarter) and you can go to them for the solution.

I’ve been “that guy.” (I can’t be be “that gal” even if I tried). You know the person. They show up having flown in the morning of the event, they hit sessions all day, sit in a corner at lunch not making eye contact, repeat for 3 full days with time off in the hotel room, then fly home. Afterwards, they talk about, “Well, it was OK, but I didn’t get that much out of it.” Really? No kidding? Maybe it’s because you didn’t put anything into it. Just a guess.

Maybe you’re like me and you’ve spent a lot more time trying to figure out how to work on computer than you have people. Not a problem. I’ve got the solution for you. Don Gabor (blog) is back at the PASS Summit for his third year in a row. He has a two hour mini-seminar called Networking to Build Business Contacts. If that doesn’t sound perfect for the Summit, then you’re not paying attention.

I’ve taken the class with Don two years ago and it was really worth it to me. Further, I’ve read one of Don’s books and I was privileged to publish an article he wrote for our short-lived attempt at reviving the SQL Standard. His subject matter is serious and important, but Don’s approach is light, fun, approachable and incredibly useful.

If you’re that guy/gal, stop. You don’t have to be. If you need that leg-up on how to get involved, then you’re absolutely in luck. Take Don’s session. You won’t be disappointed. Then network your behind off at the Summit and get that extra edge you need for your career.

Sep 22 2011

SQL Server Execution Plans

PASS_2011_SpeakingButton_180x180-blackI write quite frequently about SQL Server Execution Plans. I started in that area just because that’s how you figure out what a query is doing and sooner or later, we all have to tune a query. I found I was doing it sooner and more frequently. When the opportunity came up to write a book , I jumped on it. Now I find myself presenting, rather frequently, on execution plans.

One of the people I’ve learned from over the last several years is Gail Shaw (blog|twitter). I first saw Gail on stage at the PASS Summit, I think it was 2007. A co-worker of mine was picked, along with Gail, to go on stage for the Quiz Bowl. Gail was answering all the questions. If you go over to SQL Server Central, Gail doesn’t answer all the questions, but she’s involved most of the important discussions. If you read her blog and articles, Gail has also been deeply involved in query tuning and execution plans for a long time.

Why am I telling you all this?

In just a little over three weeks, Gail and I will be presenting an all day seminar on SQL Server Execution Plans. We’re going to cover the optimizer and the plan cache and we’re going to show you how to read a ton of execution plans. We’re going to go over how to spot problems and how to fix them. We’ll be examining plans from people in the audience (yeah, bring your problem plans) live on stage. You’ll get the chance not just to hear us talk, but to ask us questions, to get clarification on topics and meet other people who are learning the same stuff you are.

In short, we’re going to be having a really good time and talking a lot about Execution Plans. This is your chance. Register for the Summit and, please, register for our pre-conference seminar. You won’t be disappointed.

Sep 06 2011

Meme Monday: Favorite PASS Memory

Have to complete my assignment from the Rockstar this month since I’ve missed the last several (he gets all weepy, it’s not pretty). The question this time is to describe your favorite PASS moment… Wow! And this is Tom’s idea of an easy assignment. How do you choose? Lots of stuff flashes through my head, a football helmet with antlers & the song Funky Cold Medina, learning Policy Based Management from Buck Woody (blog|twitter) while simultaneously laughing my ass off, a pile of neatly folded clothes in a bush, friends old & new, speaking for the first time and running out of material 15 minutes early, sessions with great information from really smart and helpful people, Dr. Dewitt nuff said, sitting down to breakfast with people that wrote books I’d read, a couch in the Men’s room, kilts, showing my boss a solution that I picked up at the Summit… I could keep going. I love the PASS Summit. It’s fun and educational and inspirational.

My favorite memory is rather long & involved. I went to the PASS Summit for the first time in 2005 in Dallas. I was not any kind of insider. I didn’t know anyone there except the guy from my company that attended with me. We did the usual for the first two days, hit the sessions, hit the vendors, collect swag, head back to the hotel room (and no, you should not do this). But, we scored an invite to a vendor party, my first. It was OK. Free food and free beer was the reason I went, but then I saw these people that all knew each other. They were laughing and clearly having a good time. I asked who they were, thinking a company or MVP’s or something, and got the word PASS volunteers. Hmmm…

I volunteered.

I spent the next year on phone calls and exchanging emails in my first volunteer assignment with PASS on the Book Review Committee. It was some extra work for me during the year, but nothing horrible. Plus, I got to get a couple of free books as long as I wrote reviews. It was good and I made friends with a couple of other volunteers. Then, time came around for the Summit. I went. But this year, it was different. I started getting introduced to people because of the contacts I had already made. I didn’t go back to my hotel right after the last session the first night. Instead I was out with the volunteers and I was meeting other people, MVP’s, vendors, authors, speakers. And it kept going. It was the best conference I had ever been to. Yes, the sessions were great, but so was all the time spent talking to people.

I made contacts that I still treasure to this day. I made friends that I treasure even more. I’ve gone on to volunteer multiple times with PASS in several different roles (although I’ve been off the volunteer grid mostly this year, I need to change that next year). I’ve spoken at  the Summit the last three years and I’m speaking again this year. I helped organize a local user group. I got involved.

That involvement, that’s my favorite memory from PASS. I learned a lot, kind of late in life let’s be honest, and gained a lot just by staying out of my hotel room for one evening in Dallas.

Aug 03 2011

All About Execution Plans

If you’re attending the PASS Summit this October, I’d like to make a suggestion. The Summit itself is only three days long, Wednesday to Friday. But, if you have to travel any distance to get there, you’re going to miss work on Tuesday as well. Why not take a whole week away and spend Monday with Gail Shaw (blog|twitter) and me?

The reason I ask is because Gail and I are putting together 7 hours of information all about execution plans. We’re going to be presenting this information on Monday at the Summit as a pre-conference seminar. Yes, you’ll have to pay extra to attend this session. But if you register for the Summit now, there’s still a discount, which you can put towards to the seminar. And, seriously now, didn’t you want to learn more about execution plans? This is your chance.

Gail is an acknowledged SQL Server expert who writes and speaks regularly about execution plans and query performance tuning. If you read this blog at all, you might notice the occasional post about execution plans. Plus there’s the book on execution plans and the one on query tuning . Come on! You really do want to attend our session.

Even if you’ve already registered for the PASS Summit, you can still add a pre-conference seminar to your registration. If you don’t like learning about execution plans, there are lots of other good seminars to pick from (I’d recommend considering Rob Farley’s (blog|twitter) session even though he can be slightly backwards when it comes to execution plans [kidding]).

Nov 11 2010

PASS Summit 2010, Day 3 Key Note

Today is Dr. Dewitt.

The ballroom, where the keynotes are held, is filled with extra chairs. The Summit organizers expect extra attendance today, and well they should. Dr. Dewitt was amazing last year. I suspect this year will be more of the same.

Rick Heiges is introducing the day (waiting for Dr. Dewitt). Lynda Rab is leaving the board. Sad. I started volunteering for the PASS organization working for Lynda. She’s great. The new board members are Douglas McDowell, Andy Warren and Allen Kinsel.

The spring SQL Rally event was announced. I’ll be presenting a full day session on query performance, Query Performance Tuning, Start to Finish. Look for (a lot) more blog posts on this. The Summit next year has been moved to mid-October. WHOOP! This is great because I was going to miss it next year. Oct 11-15 will be the dates in 2011. Of course, it’ll be at Seattle.

Dr. Dewitt is finally on stage. From this point forward, I’ll be just posting his words & some comments. This is my best attempt to capture the information. There will be typos.

Query optimization is a really hard problem. Dr. Dewitt, says “I’m running out of ideas.” Yeah, right. His “Impress Index” is basically an arrow going down. He’s cracking jokes about his delivery, asking, How Can I Possibly Impress You. He’s showing this strange picture that has 240 seperate colors that each represent an exec plan in the optimizer. We’ll be back to that. This session was voted on. I’m glad optimization won. They live in fear of regression, talking about the optmizer developers.

The 100,000 foot view, magic happens. He’s working off of TPC-H benchmark, query 8. There are 22 million ways of executing this query. The optimizer has to spend a few seconds to pick the correct plan from this full set. It’s still possible to pick bad plans. Cost Based optimization came from System R & a lady named Pat Selinger at IBM. Optimization is the hardest part of building a DBMS, after 30 years. Situation is fruther complicated by advances in hardware and functionality within the DBMS.

The goal of the optimizer is to transform sQL queries into an efficient execution plan. The parser turns out a logical operator gtree, which then goes to the optmizer and a physical operator tree is sent to the execution engine. He’s showing a simple table, based on movie reviews. The query is a SELECT with AVG. Two possible plans. A scan occurs first, then a filter is applied to pull out the right movie and then an aggregate occurs. With this you’ll get a scan, meaning I/O corresponds to the number of pages on the table. Plan 2 uses an index to pull pages from the non-clustered index. This means random disk access that will look up the movies and then pass that on to the aggregate. The optmizer then has to figure out which is faster. The optimizer estimates the cost based on the statistics it has in hand. It has to estimate how many movies there are. So it estimates the selectivity of the predicate, then it calculates the cost of the plans in terms of CPU and I/O time.

So there are equivalence rules, such as select & join operators. Join operators are associative, meaning that the results from multiple tables are associated. Select operator distributes over joins and there are multiple ways of getting back the same information, all evaluated by the optimizer.

With a more complicated query, it could start with seelction of customers, then a selection of reviews, join them together, then join to the movies table and then project out the select out the columns wanted. But with equivalence rules, you can get other plans. Selects distribute over joins rule gets a different plan, or selects commute rule can change the plan. He showed five different plans, then four more plans & said he could have done another 20. For this simple query, he came up with 9 logically equivalent plans. All nine will produce the same data. For each of the 9 plans there is a large number of alternate physical plans that the optimizer can choose.

Assuming the optimizer has three joing strategies, nested loops, sort-merge & hash. He’s also assuming two selection strategies, sequential scan or index scan. Obviously, this is simplified.So, using these three joins & two select methods, there are 36 possible physical alternatives, for one logical plan. So with 9 logical plans there are 9*36 = 324 possible physical plans. And that’s for a VERY simple query.

Selectivity estimation, is the task of estimating how many rows can satisfy a predicate like MoviesId = 932. Plan quality is highly dependent on quality of the estimates that the optimizer makes.

I just sent in a question.

So the Histogram is the distribution of the data within the table. So there isn’t enough space within the db to store detailed statistical info. The solution is histograms. You can different kinds. The equi-widthy histogram divides the rows into equal sized buckets and then figures out how many values match each range of values. So, for an actual value, it might be .059 selectivity, but the estimated value is actually .050. That’s extremely close. But, another value he shows has .011 actual but in the histogram is .082, which is a HUGE error. Hello bad execution plan.

Another approach is equi-height histograms. These divide the ranges so that all buckets contain roughly the same number of rows, as opposed to an equal distribution of values. In equi-height, the second example is .033 instead of .082. Which is pretty good, but still skewed. He’s basically showing that errors can be introduced all over the place. The first example is .167.

Histograms are the critical tool for estimating selectiviy factors for selection predicates. But errors still occur. The deal is, there’s just a limited amount of space for these. other statistics are rows, pages, etc.

Estimating costs the optimzer considers I/O time and CPU time. Actual values are highly dependent on CPU and I/O subsystem on which the query will be run. For a parallel database system, such as PDW, plug, the problem focuses also on network traffic. So back to the two alternative physical plans… You have to determine which plan is cheaper. Assuming that the optimizer gets is right, we know that there are 100 rows out of 100k pages. These are sorted on date, but we’re going for MovieID, random reads. The optimizer doesn’t know system it’s on, but it makes a guess that a scan will take 8 seconds. The Filter will work on .1 microsecond/row & aggregate will be .1micrsec/row, for .00001 seconds, for a total of 9 seconds. Plan two will use the index. Since the rows are sorted on date, random seeks are going to occur. .003 seconds / seek, then  total time .3 seconds and same time for the aggregate. This means plan two is the winner.

But, what if the estimates are wrong. On a log plot, you start to see how, as the number of rows returned, each plan will perform better, based on the rows returned. More will make plan 1 better, but less will make plan 2 better.

That was just to get the data out of a table. To add in JOIN costs, things get worse. First example is to take a sort-merge join. This sorts each data set being returned, and then merges the results through a simple scan. Cost is 5r + 5m for I/O. A nested loop works on scanning one table and row-by-row, scanning the other table. The cost is R + R * M. R is rows M is pages.

With the example, you can see that with an indexin place, highly selective, loop joins can be cheap. But it’s the cardinalities that affect things. So, getting the histogram right is the key trick. With a log plot, again, you see how the various operations vary over time. So for a sort merge, it’s very expensive at a low number of rows, but at a large number of rows, it still returns in about the same amount of time. So as large sets of data are accessed, merge gets good. But at lower numbers of rows, the nested loop works better. So if the cardinality estimate is off, you could get a huge error in performance, especially at the larger sets of data. The optimizer has to pick the right join method. This is based on the number of rows in each set of data being joined.

He then moves on talk about how much space these things take up. The space depends on the “shape” of the query. He shows a type called a “start” join and a type called a “chain” join. Whoa! as you increase tables, the likely numbers of plans increases a lot. I knew this, but I haven’t seen it written down like this. But these shapes are extremes.

Every query optimizer starts off with a left deep plan, first, instead of bushy plans. For the example, a bushy tree would have 645k equivalents for the Star Join as opposed to 10k for left deep plans. With 3 joins methods and n number of joins in a query, there will be 3 to the power of n possible physical plans. Uh… wow. Instead, the optimizer uses dynamic programming. Sometimes heuristics will cause the best plan to be missed.

One method of optimization is Bottom Up. Optimiztion is performed in N passes (if N relations are joined). First pass, find the best 1-relation plan for each relation. Pass 2, find the best way to join the result of each 1-relation plan to another relation to generate all 2-relation plans. Pass n, find the best join result… can’t see it. Gets the lowest cost plans & interesting order rows. In spite of pruning plan space, this approach is still exponential in the # of tables. Costs are done, then pruning occurs. I’ve stopped taking notes on this part. You’ll have to see how this works in the slide deck (I’ll post the location at the end).

So that’s the theory. But the problem is, bad plans can be picked. If the statics are missing or out of date, cardinaltiy estimates are against skewed data, attribute values are correlated, and regression, hardware changes mess stuff up.

Opportunities to improve. Jayan Haritsa, has the Picasso Project. Bing this: Picasso Haritsa. There are actually software there that helps improve values. He’s back to TPC-H Query 8, and using the tool, it will show the plan space for the query, this is the painting of the cool picture at the start of the talk. With this, you can see how sensitive input parameters are to plan generation. So the cardinalities estmates are the key.

This animation shows how the estimated costs for a query start low, peak, and then, instead of continuing up, goes back down. And the optimizer team doesn’t know why. This is his example of how QO is indeed, harder than rocket science.

What can you do better? Well, Indexed Nested Loops looks good, but they’re not stable across the range of selectivity factors. If they went conservative and always picked sort-merge, it would be more stable. So, picking slower operations could make things more stable, just slower. Robustness is tied to the number of plans. And he says the QO team doesn’t understand.

At QO time, have the QO annotate compiled queryu plans with statistics and check operators. Then, you can see how this stuff works. They use this in two ways, a learning optimizer and dynamic reoptimization. The optimizer observed stats go back to a statistics tracker and then, feed that back through to the catalog, and the next query will be better. The dynamic reoptimization takes the idea that actual stats note the estimated stats and when there are differences, truncate the operation, pause the execution, output the query back to tempdb, stores that, and then uses that with the rest of the query to re-optimize using real values. Cool!

Key points: Query optimization is harder than rocket science. Three phases of QO: Enumeration of the logial plan space, enumeration of alternate physical plans selectivity estimates. The QO team of every DB vendor lives in fear of regressions, but it’s going to happen, so cut the optimizer some slack.

“Microsoft Jim Gray Systems Lab” on FaceBook is the source for the slides. Available here.

Nov 10 2010

PASS Summit: Day 2 Keynote

Today is Kilt Day at the PASS Summit. We’re going to try to arrange a group photo at lunch time.

The network connection is extremely slow. I suspect the tweeting about the kilts.

Bill Graziano is leading the key note and he started off with having all the kilted stand. Only about 12-15 of us, but that’s five times better than last year. Then it was time for the volunteers to stand up. It was excellent to see so many people. The Outstanding Volunteer of the Year was Lorie Edwards. The PASSion award went to Wendy Pastrick, who really earned it.

Unfortunately the next segment was on governance… blech! But necessary. Everyone here is a member, so they should know how the money is spent. Luckily Bill is not digging in a lot. He’s covering the things he has to. Yes, it’s a boring topic, but this is a not-for-profit organization and it needs to be transparent. I’ve always been happy to see the numbers, even when it bored the heck out of me.

An X-Box Kinnect was given out to a lucky winner. Cool! I was too busy yesterday to take advantage of the contests… ah well.

Today is also the Women In Technology Luncheon.

The first speaker of the day is Quentin Clark of Microsoft. Mr. Clark is introducing Denali. Today we should get some meat. The goal is shifting user expectations and shifting business expectations. Sadly, I was extremely excited about this presentation, but, instead of getting into the product, we got quite a lot of sales pitch. I do want to see what they think is the most important functionality, but I want to see it, not hear about it. That’s important. I think vendors frequently don’t think about the audience. The Twitter stream started to get pretty abusive, just like last year during the “I can’t mention the major hardware vendor that supports PASS because we really appreciate it” presentation.

Finally, after 40 years in the wilderness, we got a demo of SQL Server Always On. He started right into Management Studio, which is the first time I’ve seen it in the last two days during any of the Denali demos. That’s an indication of something. This is pretty neat. Automatic failover with multiple secondaries, so you can have more than one data center, around the country and have synchronous data in multiple sites. THAT will be useful. This without shared disk. Yes, you can still use it, but now you don’t have to. That’s a huge improvement over what we’ve had in the past. And, he got an ovation during the demo. When you have a collection of nerds as big as this clapping for you, you did something right. Thank you Microsoft. The data synchs occur in near real time, behind the scenes, with HA set ups that you can put together, for individual databases or groups of databases, in about five minutes. Huzzah! Oh, and the secondaries can be set to be readable and you can move your backups to the secondary… WOW! Again, thank you Microsoft.

The break down of the goals is the same as outlined yesterday, of course, Mission Critical, which they just showed, then IT Pro & Developer Productivity and Pervasive Insight. Then Mr. Clark mentioned DAC and there was a low rumble around the blogger table. That is not a popular set of functionality. There’s going to be enhancements in spatial within Denali, modifying the abilities to run queries and moving all the way through the BI stack. We’re finally getting Sequence Generator and Paging and enhanced Error Handling.

FileTable, a whole new integration of FileStream technology is being demo’ed next. This should be good too. The Key Take Away is “Every windows application that generates files, can now store files within sQL Server without a single modification to the app.” I’m not so sure this is a good thing, and what about SharePoint? Still, technology is cool and I’m a geek enough. I’m going to enjoy it. So, to a degree, this works like FileStream, but it’s file management through the database, but, the demo showed a set of files getting inserted into SQL Server management through a command prompt. Oooh… That’s cool. The demo is impressive. You can update the documents from the file system or from the database. That’s pretty neat. I’m just not sure exactly where this goes within the enterprise. I’ll have to read some more about it.

The next set of functionality is Project Juneau. I’ve heard a lot about this. It’s likely to hurt some of the 3rd party tools. We went right to the Demo this time. Thanks. We’re in the VS 2010 Shell now, along with BIDS and everything else. They’re not retiring SSMS, but it’s clear that it’s on the way out, must be. I like the improved TSQL completion. The table designer is good too. Because you can sync the visuals & tsql as you create the table. That’s great! I think I said this yesterday, but there are a lot of people that will not enjoy moving to Visual Studio. I’m a fan, but others will not like it. Still, it looks good. It’s working better than it ever did, and that’s a good thing.

Nov 09 2010

PASS Summit: Day 1 Keynote, Part 3

Ted Kummert is still talking.

For the cloud, of course, they’re talking about SQL Azure. Microsoft really is throwing themselves into the cloud, completely. The emphasis is that they offer both a cloud and an on-premises solution. I don’t mind saying, I’m still trying to get the full business proposition for an old school, fat, business like the one I work for. What should we be doing with the cloud. I just haven’t seen the magic. I see where smaller businesses, or start-ups, or temporary surge capacity for businesses that may have that type of thing can use the cloud, but… traditional work, it just doesn’t seem to jive yet.

We’re going to see some made-up scenarios for how Azure can manage Contoso Bikes. He shows how the report can pull data from the cloud and deploy reports from the cloud, in order to deliver to people on the road. But, we can do that already in other ways. The ability to link your data with the Data Market data is pretty cool. I can see that being useful. You will have to purchase access to these data sets. You can query against them, but, similar to the PDW demo, we’re not in SSMS any more. I wonder what Microsoft’s long term plans are for SSMS based on the ways we’re seeing it being bypassed.

What’s next for SQL Server? Denali. The CTP is getting handed out tomorrow after the keynote tomorrow. We’ll be seeing the demo on Denali tomorrow. The idea that Mr. Kummert is communicating is that Denali represents client requests. They targets are Mission Critical, IT Pro & Developer Productivity, and Pervasive Insight. They’ve focused on manageability and upgrade capacity. That should be good. They’re going to work on performance, which is interesting. They’re unifying the experience into Visual Studio… I’m OK with this, but I know that a LOT of DBAs are not OK with this. It’ll be interesting to see how it breaks out. Denali is the largest release of integration services ever. Full life cycle development on SSIS. That will be good. They’re also talking about expansion on the PowerPivot type of work. Project Crescent is a new reporting tool that is coming out with Denali, which is a new way of showing business information. Sounds good. Finally, a demo. We’re seeing the 100 million row demo, again. I’d like to see the new stuff, please. So, they pulled the data out of Excel and directly into Analysis Services. That’s good. Showing how it’s working within VS, which gives you source control, etc., and then you also get to use the server, which is better than the memory limits within PowerPivot. And he’s showing how over 2 billion. This is a great demo. We’re seeing a trillion rows per minute, filtered & reported on. It’s very slick. This is good. Same technology is also in the database engine. We’re seeing fantastic performance. I might be out of a job. It’s based on the columnar data store technology. It’s a very good thing.

Come back for more tomorrow!