Category: SQL Server 2014

Jun 28 2016

CASE Statement in GROUP BY

Set-based operations mean you should put everything into a single statement, right?

Well, not really. People seem to think that having two queries is really bad, so when faced with logical gaps, they just cram them into the query they have. This is partly because SQL Server and T-SQL let you do this, and partly because it looks like a logical extension of code reuse to arrive at a query structure that supports multiple logic chains. However, let’s explore what happens when you do this in one particular situation: a CASE statement in a GROUP BY clause.

You see this a lot because a given set of data may be needed in slightly different contexts by different groups within the company. Like many of my example queries, this could be better written; also like many of my example queries, it mirrors what I see in the wild (and for those following along at home, I’m using the WideWorldImporters database for tests now):

CREATE PROCEDURE dbo.InvoiceGrouping (@x INT)
AS
SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY CASE WHEN @x = 7 THEN i.ContactPersonID
              WHEN @x = 15 THEN i.AccountsPersonID
              ELSE i.SalespersonPersonID
         END;
GO
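
To see the behavior, executing the procedure with a few different values is enough to compare the plans (the third value is arbitrary; anything other than 7 or 15 falls through to the ELSE branch):

EXEC dbo.InvoiceGrouping @x = 7;   -- groups by ContactPersonID
EXEC dbo.InvoiceGrouping @x = 15;  -- groups by AccountsPersonID
EXEC dbo.InvoiceGrouping @x = 42;  -- any other value groups by SalespersonPersonID
GO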

Running this for any of the values above, 7, 15, or anything else, you’ll get the same execution plan, regardless of the column used in the GROUP BY. However, Parameter Sniffing is still something of a factor. When you group this data by SalespersonPersonID, you only get 10 rows back. That will be shown as the estimated number of rows returned if some value other than 7 or 15 is used as the parameter. Regardless, this is always the plan:

[Execution plan image: coreplan]

You can click on that to expand it into something readable. We can eliminate the Parameter Sniffing from the equation if we want to by modifying the query thus:

CREATE PROCEDURE dbo.InvoiceGrouping_NoSniff (@x INT)
AS
DECLARE @x2 INT;
SET @x2 = @x;

SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY CASE WHEN @x2 = 7 THEN i.ContactPersonID
              WHEN @x2 = 15 THEN i.AccountsPersonID
              ELSE i.SalespersonPersonID
         END;
GO

However, except for some deviation on the estimated rows (since it’s averaging the rows returned), the execution plan is the same.

What’s the big deal right? Well, let’s break down the code into three different procedures:

CREATE PROCEDURE dbo.InvoiceGrouping_Contact
AS
SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.ContactPersonID;
GO

CREATE PROCEDURE dbo.InvoiceGrouping_Sales
AS
SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.SalespersonPersonID;
GO

CREATE PROCEDURE dbo.InvoiceGrouping_Account
AS
SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.AccountsPersonID;
GO

Interestingly enough, these three queries produce nearly identical execution plans. The one big difference is that the Compute Scalar operator used to generate a value for the Hash Match Aggregate is no longer in the plan:

[Execution plan image: specificplan]

Each plan uses the same basic set of structures, scans against both tables, to arrive at the data. The cost estimates between the two approaches are very different though, with the targeted queries having a much lower estimated cost.

Performance-wise, interestingly enough, the first query, only returning the 10 rows, runs in 157ms on average, while the query grouping directly on SalespersonPersonID averages about 190ms. The reads tell a slightly different story, though, with 17,428 on the generic query and 5,721 on the specific query. So maybe a server under load would see a significant performance improvement from the specific queries. However, let’s deal with what we have in front of us and say that, at least for these tests, the catch-all GROUP BY query performs well.
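
If you want to repeat these measurements, one simple approach is STATISTICS IO and STATISTICS TIME (an Extended Events session would also work):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

EXEC dbo.InvoiceGrouping @x = 42;
EXEC dbo.InvoiceGrouping_Sales;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;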

Now let’s change the paradigm slightly. Let’s add an index:

CREATE INDEX TestingGroupBy ON Sales.Invoices (SalespersonPersonID);

Frankly, this isn’t a very useful index. However, after adding it, the execution plan for the InvoiceGrouping_Sales query changes. Instead of scanning the table, it’s now scanning the index. Despite recompiles and attempts to force it using hints, the original InvoiceGrouping query will not use this index. Duration of the InvoiceGrouping_Sales query drops to 140ms on average, and the reads drop a little further, to 5,021. Getting an 11% improvement in performance is a win.
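
The sorts of attempts I mean look something like this; none of them gets the generic procedure onto the new index:

-- Force a new plan for the procedure
EXEC sp_recompile N'dbo.InvoiceGrouping';
EXEC dbo.InvoiceGrouping @x = 42 WITH RECOMPILE;

-- An index hint would mean editing the procedure itself, e.g.:
-- FROM Sales.Invoices AS i WITH (INDEX (TestingGroupBy))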

This is a pretty simplified example; however, making the CASE statement more complex won’t improve performance or help the optimizer make better choices. Instead of trying to cram multiple different logical groupings into a single query, a better approach is to create the three procedures I did above and turn the original InvoiceGrouping procedure into a wrapper that chooses which of the individual procedures to call, as sketched below. That way, if you do add indexes in support of each of the different possible groupings, you’ll realize a positive outcome in your performance.
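
A minimal sketch of that wrapper (the procedure name and body here are mine, but the parameter logic matches the original):

CREATE PROCEDURE dbo.InvoiceGrouping_Wrapper (@x INT)
AS
IF @x = 7
    EXEC dbo.InvoiceGrouping_Contact;
ELSE IF @x = 15
    EXEC dbo.InvoiceGrouping_Account;
ELSE
    EXEC dbo.InvoiceGrouping_Sales;
GO

Each of the individual procedures gets its own execution plan, so each can take advantage of indexes built for its particular grouping.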


Want to talk more about execution plans and query tuning? In August, I’ll be doing an all-day pre-con at the SQLServer Geeks Annual Summit in Bangalore, India.

I’m also going to be in Oslo, Norway, for a pre-con before SQL Saturday Oslo in September.

May 12 2016

Use The Correct Data Type

Blog post #5 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel. Read about it here.

Saying that you should use the correct data type seems like something that should be very straightforward. Unfortunately, it’s very easy for things to get confusing. Let’s take a simple example from AdventureWorks. If I run this query:

SELECT  a.ModifiedDate
FROM    Person.Address AS a
WHERE   a.AddressID = 42;

The output looks like this:

2009-01-20 00:00:00.000

Normal, right? You see the year, the month, and the day, followed by the time in hours, minutes, and seconds as a decimal. Ah, but there is an issue. This query is supposed to be for the reporting system, and the business only cares about the date that the values in the Person.Address table were modified, so they don’t want to see the time. Also, the person in charge is kind of picky. They really don’t like seeing the date formatted that way. They prefer to see “January 20, 2009.”

Far too often, then, the easy answer is to just change the column to VARCHAR. You can trim the time and output the date in exactly the format needed by the business. Problem solved, and it was easy…

Well, until someone inputs “Janry 20, 2009”, slightly mangling the spelling, and suddenly your report looks all messed up. Or they ask you to start filtering on just the last two weeks, regardless of when the report was run, and you now can’t easily do date math on the column. Even after you get past that problem with a little formatting using CAST (along with ISDATE to try to catch all the other typos that are now in the system), you notice that performance is really slow, so you go to put an index on the column. Now you have an index key that is 50 bytes wide instead of the 3 bytes the DATE data type would have been, making the index less efficient (not to mention that sorting the data is going to put February ahead of January, which means more formatting).

The list goes on and on. You’re going to hit issue after issue, and all you tried to do was a little formatting, making the data look pretty by using the VARCHAR data type in place of DATETIME or DATE. This kind of thing happens all the time, especially to people just getting started. The rule should be very simple: use the correct data type for the data you’re collecting. However, it’s easy to get distracted. It’s easy to get pushed. The business says “put it in this format” and the only way you can do that is by changing the data type, so…

Don’t do it. Take the time to understand the implications of mangling the data types. It’s not just dates and times that cause issues, either. Lots and lots of stuff can easily be stuffed into a VARCHAR field that more properly should be an INT, a DECIMAL, or some other data type. Be able to communicate with the business so that they understand you’re not simply being difficult; you’re trying to protect the functionality and integrity of the data. Show how formatting can be done using client tools or at query time (a quick example follows). Explain all the downstream issues that this decision will have. Then, use the correct data type for the information being stored.
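
As one example for that conversation, the picky formatting requirement from earlier can be met at query or report time while the column stays a date type; something along these lines works on SQL Server 2012 and later:

SELECT  FORMAT(a.ModifiedDate, 'MMMM d, yyyy') AS ModifiedDate
FROM    Person.Address AS a
WHERE   a.AddressID = 42;

That returns “January 20, 2009” for the report, while the underlying column keeps a date/time data type for filtering, date math, and indexing.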

May 03 2016

Implicit Conversion and Performance

Letting SQL Server change data types automatically can seriously impact performance in a negative way. Because a conversion has to be run against the column, you can’t get an index seek; instead, you’re forced to use a scan. I can demonstrate this pretty simply. Here’s a script that sets up a test table with three columns and three indexes and loads it with a million rows:

CREATE TABLE dbo.ConvertTest (
     BigIntColumn BIGINT NOT NULL,
     IntColumn INT NOT NULL,
     DateColumn VARCHAR(30) -- deliberately VARCHAR, not a date/time type, for this demo
    );

CREATE INDEX BigIntIndex 
ON dbo.ConvertTest 
(BigIntColumn);
CREATE INDEX IntIndex 
ON dbo.ConvertTest 
(IntColumn);
CREATE INDEX DateIndex 
ON dbo.ConvertTest
(DateColumn);

WITH    Nums
          AS (SELECT TOP (1000000)
                        ROW_NUMBER() OVER (ORDER BY (SELECT 1
                                                    )) AS n
              FROM      master.sys.all_columns ac1
              CROSS JOIN master.sys.all_columns ac2
             )
    INSERT  INTO dbo.ConvertTest
            (BigIntColumn,
             IntColumn,
             DateColumn
            )
    SELECT  Nums.n,
            Nums.n,
            DATEADD(HOUR, Nums.n, '1/1/1900')
    FROM    Nums;

If I run a very simple query like this:

DECLARE @param DATETIME;

SET @param = '3/15/1963 8:00:00';

SELECT  ct.DateColumn
FROM    dbo.ConvertTest AS ct
WHERE   ct.DateColumn = @param;

I get an execution plan that looks like this:

[Execution plan image: convert_scan]

The issue is, while I put dates into the DateColumn, the data type there is VARCHAR, not DATETIME. Passing a DATETIME value results in an implicit conversion that we can see in the Index Scan Predicate property:

[Predicate property image: ConverImplicitHurtsPerformance]

This is a function applied to the column, and that is what results in the scan. If I modify the code as follows:

DECLARE @param VARCHAR(30);

SET @param = '3/15/1963 8:00:00';

SELECT  ct.DateColumn
FROM    dbo.ConvertTest AS ct
WHERE   ct.DateColumn = @param;

I get an execution plan that looks like this:

[Execution plan image: convert_seek]

No change in data type is needed, so the index can be used. It’s a very simple demonstration of how implicit conversions hurt performance. Now, what about the integer columns? What happens when I do this with the code:

DECLARE @param INT;

SET @param = 650323;

SELECT  ct.BigIntColumn
FROM    dbo.ConvertTest AS ct
WHERE   ct.BigIntColumn = @param;

Here’s the execution plan:

[Execution plan image: convert_intseek]

And here is the predicate:

[Predicate property image: convert_predicate]

In this case, we’re still getting an implicit conversion of the data type, but SQL Server is smart enough to realize that, hey, it’s still just an integer, so it can convert the @param instead of the column, resulting in a seek. If I change the test so that it runs a query against the INT column passing a BIGINT value, it doesn’t even do a conversion of any kind.
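
That last test, for reference, looks something like this:

DECLARE @param BIGINT;

SET @param = 650323;

SELECT  ct.IntColumn
FROM    dbo.ConvertTest AS ct
WHERE   ct.IntColumn = @param;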

In short, not all implicit conversions are a problem. However, it’s still a best practice to use the right data type for columns. You should also use the same data type for your parameters and local variables as the column data type. Taking these steps avoids issues with implicit conversion.


For lots more information on query tuning, in August, I’ll be doing an all-day pre-con at the SQLServer Geeks Annual Summit in Bangalore, India.

I’m also going to be doing an all day training course before SQL Saturday Oslo in September.

Apr 12 2016

A View Is Not A Table

Blog post #4 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel

In SQL Server, in the T-SQL you use to query it, a view looks just like a table (I’m using the AdventureWorks2014 database for all these examples):

SELECT  *
FROM    Production.vProductAndDescription AS vpad;

 

SELECT  vpad.Name,
        vpad.Description,
        vpmi.Instructions
FROM    Production.vProductAndDescription AS vpad
JOIN    Production.Product AS p
        ON p.ProductID = vpad.ProductID
JOIN    Production.vProductModelInstructions AS vpmi
        ON vpmi.ProductModelID = p.ProductModelID
WHERE   vpad.ProductID = 891
        AND vpad.CultureID = 'fr';

The above query actually combines two views and a table. This is what is commonly referred to as a “code smell”. A code smell is a coding practice that works, but that can lead to problems; in this case, we’re talking about performance problems. The performance problems that arise when you use views to join to tables and other views as if they were real tables come about because a standard view is not a table. It’s a query. For example, the second view introduced, vProductModelInstructions, looks like this:

ALTER VIEW [Production].[vProductModelInstructions] 
AS 
SELECT 
    [ProductModelID] 
    ,[Name] 
    ,[Instructions].value(N'declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ProductModelManuInstructions"; 
        (/root/text())[1]', 'nvarchar(max)') AS [Instructions] 
    ,[MfgInstructions].ref.value('@LocationID[1]', 'int') AS [LocationID] 
    ,[MfgInstructions].ref.value('@SetupHours[1]', 'decimal(9, 4)') AS [SetupHours] 
    ,[MfgInstructions].ref.value('@MachineHours[1]', 'decimal(9, 4)') AS [MachineHours] 
    ,[MfgInstructions].ref.value('@LaborHours[1]', 'decimal(9, 4)') AS [LaborHours] 
    ,[MfgInstructions].ref.value('@LotSize[1]', 'int') AS [LotSize] 
    ,[Steps].ref.value('string(.)[1]', 'nvarchar(1024)') AS [Step] 
    ,[rowguid] 
    ,[ModifiedDate]
FROM [Production].[ProductModel] 
CROSS APPLY [Instructions].nodes(N'declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ProductModelManuInstructions"; 
    /root/Location') MfgInstructions(ref)
CROSS APPLY [MfgInstructions].ref.nodes('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ProductModelManuInstructions"; 
    step') Steps(ref);

GO

That’s a query against the XML stored in the ProductModel table. The view was created to mask the complexity of the necessary XPath code, while providing a mechanism for retrieving the data from the XML. This is a common use of views. However, when we then treat the view as a table, and join it to other tables and views, we present a problem for the optimizer. Because a view is not a table, but is instead a query, the optimizer has to resolve this query in combination with any other views or tables to arrive at an execution plan for the whole combined mess. While the optimizer is very good at what it does, the additional, unnecessary work of figuring out which parts of the view are not needed to satisfy the query adds complexity, and the optimizer can make poor choices. That can result in poor performance.

If I were to rewrite the query, it would look something like this:

SELECT  p.Name,
        pd.Description,
        pm.Instructions.value(N'declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ProductModelManuInstructions"; 
        (/root/text())[1]', 'nvarchar(max)') AS Instructions
FROM    Production.Product AS p
JOIN    Production.ProductModelProductDescriptionCulture AS pmpdc
        ON pmpdc.ProductModelID = p.ProductModelID
JOIN    Production.ProductDescription AS pd
        ON pd.ProductDescriptionID = pmpdc.ProductDescriptionID
JOIN    Production.ProductModel AS pm
        ON pm.ProductModelID = p.ProductModelID
CROSS APPLY Instructions.nodes(N'declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ProductModelManuInstructions"; 
    /root/Location') MfgInstructions (ref)
CROSS APPLY MfgInstructions.ref.nodes('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ProductModelManuInstructions"; 
    step') Steps (ref)
WHERE   p.ProductID = 891
        AND pmpdc.CultureID = 'fr';

That’s a lot more complex than the query we had above that only referenced three objects and had only two JOIN operations. However, if you capture the I/O and the execution time for these queries, you’ll see a different story.

I used two methods for measuring performance. For one set of tests, I set STATISTICS IO and STATISTICS TIME to ON for the queries. For another set, I used Extended Events. Consistently, the execution time for the query with the views was around 110ms, while the query that didn’t reference any views was around 37ms. The reads were 155 for the query with the views, but only 109 for the query without. If you look at the individual table I/O, you can start to see where the differences come from. These are the results from the query with the views:

Table ‘ProductDescription’. Scan count 0, logical reads 56
Table ‘ProductModelProductDescriptionCulture’. Scan count 28, logical reads 56
Table ‘xml_index_nodes_418100530_256001’. Scan count 13, logical reads 37
Table ‘ProductModel’. Scan count 0, logical reads 2
Table ‘Product’. Scan count 0, logical reads 4

These are the results for the query without the view:

Table ‘ProductDescription’. Scan count 0, logical reads 56
Table ‘xml_index_nodes_418100530_256001’. Scan count 13, logical reads 37
Table ‘ProductModelProductDescriptionCulture’. Scan count 6, logical reads 12
Table ‘ProductModel’. Scan count 0, logical reads 2
Table ‘Product’. Scan count 0, logical reads 2

You can see the differences in both ProductModelProductDescriptionCulture and Product. This is because of the differences in the execution plans caused by differences in the choices made by the optimizer.

A standard view is not a table. There is such a thing as a materialized, or indexed, view, which does persist data like a table, but that’s not what we’re talking about here. While you can use a view as if it were a table, don’t mistake it for one. A view is just a mask in front of a query. It shouldn’t be treated as an object that simply saves you from rewriting the same JOINs; that will lead to issues for the optimizer, as this simple set of examples showed. Don’t shy away from using views, just understand what their real behavior is. A view is a query, not a table.

Apr 05 2016

Views and Simplification

I’ve been getting lots of questions on views lately. Must be something in the water.

Because SQL Server allows you to treat a view as if it were a table, lots of people pretty much assume that it is a table, since they get to treat it that way. The thing is, a view is not a table. It’s a query. Let’s explore this just a little bit. Here’s a relatively straightforward view:

CREATE VIEW dbo.PersonInfo
AS
SELECT  a.AddressLine1,
        a.City,
        a.PostalCode,
        a.SpatialLocation,
        p.FirstName,
        p.LastName,
        be.BusinessEntityID,
        bea.AddressID,
        bea.AddressTypeID
FROM    Person.Address AS a
JOIN    Person.BusinessEntityAddress AS bea
        ON a.AddressID = bea.AddressID
JOIN    Person.BusinessEntity AS be
        ON bea.BusinessEntityID = be.BusinessEntityID
JOIN    Person.Person AS p
        ON be.BusinessEntityID = p.BusinessEntityID;
GO

I can query this view like this:

SELECT  *
FROM    dbo.PersonInfo AS pni
WHERE   pni.LastName LIKE 'Ran%';

The resulting execution plan looks like this:

[Execution plan image: viewSimple1]

You don’t even need to expand it for what I’m about to show. Now let’s modify the query against our view so that it only references a few of the view’s columns.
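
Something along these lines will do; the key point is that nothing from the Address or BusinessEntity tables is actually needed:

SELECT  pni.FirstName,
        pni.LastName,
        pni.BusinessEntityID
FROM    dbo.PersonInfo AS pni
WHERE   pni.LastName LIKE 'Ran%';

The resulting execution plan looks like this: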

[Execution plan image: viewSimple2]

Again, you can expand these, but you don’t need to. Notice that the first plan had four tables being referenced, representing the four tables from the view. The second plan only has two tables. This is because the optimizer looked at the query that the view represents, not simply the query that I used to call the view. It then recognized that simplification could be used to eliminate unnecessary JOIN operations from the execution plan, and still return the same data, because of the foreign key constraints on the tables.

The important point to note is that the optimizer is absolutely not treating the view like a table. The optimizer is treating the view like a query, which is all it is. This has both positive and negative impacts when it comes to tuning queries that use this view. You could spend all sorts of time “tuning” the view, only to find all that work tossed out the window when a query doesn’t reference a column in the view and the optimizer rearranges the plan as a result. I don’t want to convey that this is an issue. It’s not. I’m just trying to emphasize the point that a view is just a query.

Now, when we get into treating a view exactly like a table in JOINs or calling a view from a view (known as nesting), then we’re talking about issues. I’ll put up another post on a JOIN and views.


For lots more information on query tuning, I’m presenting an all-day pre-con at SQL Day in Wroclaw, Poland, on May 16.

Mar 07 2016

SQL Server Backups Are A Business Decision

Blog post #3 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel

Read more about Tim’s challenge here.

It’s very easy to think of SQL Server backups as a technical problem. You have so much stuff going on: BACKUP DATABASE commands, recovery models, BACKUP LOG commands, differential backups. Getting them all into the correct order and automating the processes sure seems like a technical problem. It isn’t. It’s all about the business. If you’re taking on the duties of a DBA, whether you’re an accidental DBA, a reluctant DBA, or you were voluntold into the DBA position, you need to plan to sit down with responsible parties from the business and come to an understanding with them regarding RPO and RTO.

RPO is a TLA for Recovery Point Objective. The easiest way to describe RPO is to ask, “In terms of time, how much data are we willing to lose?” The immediate answer is always going to be zero. Here is where we have to be honest. You won’t be able to guarantee zero data loss (yeah, there are probably ways to do this, but #entrylevel). Talk with the business. Most of the time, you’ll find that they’d actually be OK with 15 minutes, or maybe 5 minutes, or even an hour of lost data. It really varies, not only from business to business, but from database to database within the business (allow for this flexibility). You need to establish this number. RPO is going to help you figure out how to set up your backups, your recovery model, your logs and their backups. All that stuff that seemed so technical, it’s all based on this extremely important number that you’re going to work with the business to arrive at.

Oh, but we’re not done. Once you’ve managed to get the business comfortable (as comfortable as they can be) with the idea that they could lose data, you also have to prepare them for the idea that, in the event of a disaster, restoring the database from backups is not going to be instant. It’s going to take time. This is where we have to define the RTO, or Recovery Time Objective. This is our goal for how quickly we can restore the database. RTO is not so much a negotiation with the business as it is an education for the business. You see, you can only restore so fast on your hardware. Further, the RESTORE DATABASE process is dependent on the size of the backups. Even further, it’s dependent on the types of restore operations we’re running and whether or not we use WITH RECOVERY in the RESTORE operations. You may have to test a few restores to get an idea of how fast things are with your system. Regardless, the RTO has to be arrived at and agreed on. You may also have to readdress the RTO as the number, size, and volume of your databases change over time. Be prepared for this as well.

With the RPO established, you can now decide on the recovery model. Let’s take an example. If the business says that they can afford a day of data loss, then, depending on the size of your database, you can put this database into SIMPLE recovery, run a full backup once a day, and walk away a winner. Another example: the business decides that it could live with up to 15 minutes’ worth of data loss. Now you have to go to FULL recovery, and you have to set up log backups in addition to your full backups. Then you bring the RTO into the mix. Let’s say your outage occurs at 8PM and you run your full backups at midnight. You now have to restore 20 hours’ worth of log backups. That can take a long time. So, in order to make the RTO as short as possible, you toss in a differential backup every day at noon. Now you’ll only ever have to restore at most 12 hours’ worth of log backups, so you can define a rough RTO for the business.
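
To make that concrete, here’s a rough sketch of the 15-minute-RPO setup in T-SQL (the database name and paths are placeholders, and in practice each backup would be scheduled through SQL Agent or a similar tool):

-- Switch to the FULL recovery model so the log can be backed up
ALTER DATABASE MyDatabase SET RECOVERY FULL;

-- Nightly full backup, at midnight
BACKUP DATABASE MyDatabase
TO DISK = N'D:\Backups\MyDatabase_Full.bak';

-- Daily differential backup, at noon, to shorten the RTO
BACKUP DATABASE MyDatabase
TO DISK = N'D:\Backups\MyDatabase_Diff.bak'
WITH DIFFERENTIAL;

-- Log backup every 15 minutes, to meet the RPO
BACKUP LOG MyDatabase
TO DISK = N'D:\Backups\MyDatabase_Log.trn';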

These simplified, and somewhat simplistic, examples are just the start of the process of figuring out how best to do your backups. However, that’s the technical part of the problem. The fundamental definitions that you have to have in order to start solving this technical issue are business decisions that you must get your business people involved with. Define the RPO and RTO, then start defining your recovery strategy.

Feb 22 2016

Why Is The Server Slow?

This is blog post #2 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel.

If you haven’t been working in SQL Server for very long, you may not have got this phone call yet, but you will:

Hi, yeah, the server is slow. Thanks. Bye.

Let’s pretend for a moment that you know which server they’re referring to (because just finding out that piece of information can be a challenge). Now what?

The list of tools and mechanisms within SQL Server for gathering metrics is extremely long:

Performance Monitor
Dynamic Management Views & Functions
System Views
Extended Events
Trace Events
Activity Monitor
Data Collector
Execution Plans
3rd Party Tools

I’m leaving out lots of stuff in that list. So where do you start when you get this phone call? Where is the server slow?

The best place to start is by looking at the wait statistics.

A simple, and simplified, explanation for how processes work within SQL Server is that each process gets access to the various resources for a little while, then has to pass off access to another process, each of them working together to get the work done. But some processes take longer than others. When a process has to wait for anything in SQL Server, this information gets logged. Depending on the system you’re working with, you can access these wait statistics from one of three locations:

sys.dm_os_wait_stats: for SQL Server
sys.dm_db_wait_stats: for Azure SQL Database
sys.dm_pdw_nodes_os_wait_stats: for Azure SQL Data Warehouse

Getting information from these system views is extremely simple. Here’s an example query:

SELECT  *
FROM    sys.dm_os_wait_stats AS dows
ORDER BY dows.wait_time_ms DESC;

The results of this query look like this:

[Query results image: waits]

The first column lists the wait types. The second column provides a count of the tasks that have had that wait type. The third column, wait_time_ms, is the cumulative amount of time, in milliseconds, that tasks have spent waiting within the system. Next is the maximum wait time that any one task has waited, max_wait_time_ms. Finally, signal_wait_time_ms shows the time that tasks have had to wait for access to the CPU (also known as time spent on the runnable queue). The importance of a wait is not simply the time spent waiting, but also the number of tasks and the max time. Using all these values together gives meaning to the individual wait.

Here is where things actually get difficult. The wait types are arcane, difficult to understand, and difficult to interpret. Further, a lot of the wait types actually don’t mean anything at all; those waits are not indicative of an issue. So, while the query above is simple, the results it provides are very weak. Instead, I strongly recommend you use the query provided by Paul Randal, located here. It will filter out the wait statistics that you shouldn’t care about.

Now, you have a meaningful list of wait statistics that will tell you exactly why, if not where, your server is running slow. Unfortunately, these waits still need to be interpreted. If you read further on Paul’s blog, you’ll see he has a number of waits and their causes documented. That’s your best bet to start understanding what’s happening on your system (although, I hear, Paul might be creating a more complete database of wait stats. I’ll update this blog post should that become available).

One other thing to consider: these waits are cumulative since the last time the server was started (or failed over, or the database was failed over in Azure, or the values were reset). This means that simply looking at the list doesn’t necessarily give you enough information. Instead, capturing this data more than once during the day can show you what’s been slow over time by comparing the data sets; a sketch of one way to do that follows. You can also use sys.dm_exec_session_wait_stats if you want to know what a given session, user, or process is experiencing. Don’t just look at the list and think you’re done. The best thing to do is get to a point of proactive monitoring (monitoring tools make this easier).
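
A minimal way to capture those data sets for comparison, assuming a snapshot table created just for this purpose, might look like this:

CREATE TABLE dbo.WaitStatsSnapshot (
     capture_time DATETIME2 NOT NULL,
     wait_type NVARCHAR(60) NOT NULL,
     waiting_tasks_count BIGINT NOT NULL,
     wait_time_ms BIGINT NOT NULL,
     signal_wait_time_ms BIGINT NOT NULL
    );

-- Run this periodically throughout the day
INSERT  INTO dbo.WaitStatsSnapshot
SELECT  SYSDATETIME(),
        dows.wait_type,
        dows.waiting_tasks_count,
        dows.wait_time_ms,
        dows.signal_wait_time_ms
FROM    sys.dm_os_wait_stats AS dows;

Comparing any two capture_time values then shows you how much each wait grew during that window.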

So, when you get the phone call that says the server is slow, you know how to get started understanding exactly why that may be the truth.

Jan 27 2016

The Importance of a Full Backup in SQL Server

This is the first of 12 posts this year in support of Tim Ford’s #iwanttohelp initiative. These will be completely 100-level, introductory blog posts meant to help people who are just getting started as data professionals. I’ll try to cover several different topics over the year, but felt I should start with what I think is the most important: backups.

It is impossible to overstate the importance of getting a good backup of your SQL Server databases. A backup is the most fundamental of protections for the information on which your business is dependent. Since SQL Server is a service, it manages its own files. Because of this, you can’t just copy the *.mdf file where your data is stored. Instead, you must run a process, usually through the BACKUP command within SQL Server. That link goes to the MSDN documentation at Microsoft (which is always your first source for information about SQL Server). The first command outlined there is BACKUP DATABASE. This will create a complete, page-by-page copy of your database (everything is stored within the database file in a construct called a page), including every single object within the database and all the data.
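
At its simplest, that command looks something like this (the database name and path are placeholders):

BACKUP DATABASE MyDatabase
TO DISK = N'D:\Backups\MyDatabase_Full.bak';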

Well, you think, I’ve got that bit. I’ll take one of those full backup thingies and I’m all set. Ooh, look, there are all kinds of other backups too: log, differential, file, filegroup, and options like COPY_ONLY and MIRROR… Not only does this get complicated quickly, but it’s very easy to lose sight of the most fundamental aspect of all this: backups are not the main thing you’re going for here. The main thing you’re going for is the ability to RESTORE your database. That link goes to the MSDN entry on RESTORE.

As you explore all the aspects of backups, you must always remember that the most important thing is that you can restore those backups. The most important aspect of the RESTORE operation, whether you’re restoring a full backup, a differential backup, a file backup, a filegroup backup, or a log backup, is that every single one of them starts with a restore of the full backup. If you lose your full backup, that complete copy created by the BACKUP DATABASE command, you can’t run the first restore needed. Yes, taking log backups means that you can restore to a point in time, but that starts with a restore of the database from a full backup. Yes, taking differential backups can be faster because they copy fewer pages to create a backup, but restoring a differential is only possible when you have the full backup that was taken before that differential. The same goes for the rest of the backups (with some exceptions that are WAY beyond entry level). You must have that full backup before you can do the rest. This simple fact is why the full backup is so important.
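
To illustrate, a point-in-time restore might look something like this sketch (names, paths, and the stop time are placeholders); notice that, whatever else is involved, the full backup comes first:

RESTORE DATABASE MyDatabase
FROM DISK = N'D:\Backups\MyDatabase_Full.bak'
WITH NORECOVERY;

RESTORE LOG MyDatabase
FROM DISK = N'D:\Backups\MyDatabase_Log.trn'
WITH STOPAT = '2016-01-27 14:30:00', RECOVERY;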

So please, if you’re just getting started working with SQL Server, then get your backups in order, but make darned sure that you know exactly where your backups are stored and that you can get to them when the time comes to restore a database. Without that full backup, you’re in serious trouble.

I have a video of exactly how to take a full backup of your database using T-SQL to help you get started. Just remember, don’t lose that backup.

Jan 12 2016

“Applies To…” in the MSDN Documentation

Quick little post. I just wanted to share how happy I am with the new “THIS TOPIC APPLIES TO” infographic. An example here:

[Infographic image: appliesto]

I think it makes things much more clear when you’re attempting to figure out what’s up with some T-SQL syntax.

Well done Microsoft and thank you.

Side note, this only exists in documentation that has been updated recently. I first saw it in some documentation that was updated January 11, 2016. It’s not there in another piece of documentation I saw that was updated October 15, 2015. Here’s hoping it gets put everywhere. It works.

Oct 20 2015

Statistics for the New Data Pro

Next week at the PASS Summit I’ll be presenting a session called Statistics for the New Data Pro.

You can read the abstract at the link. I just want to emphasize that this is a beginner-level session. I think way too many people who are just starting out with SQL Server don’t understand the role that statistics play in determining how your queries are going to behave. What’s more, too many people don’t know how to get and read statistics to understand how it is that the optimizer thinks you have X number of rows in your database that match a given value. I’m going to make darned sure that the people who attend this session come out with a full understanding of how to read the statistics. This includes the good information available in the header, how the density graph is used, and a complete understanding of the monstrosity that is the histogram (it’s not that bad).
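
If you want a head start, the header, density information, and histogram all come from DBCC SHOW_STATISTICS; for example, against the primary key statistics on one of the AdventureWorks tables:

DBCC SHOW_STATISTICS ('Person.Address', 'PK_Address_AddressID');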

We’re also going to talk about statistics maintenance. You need to know how stats get updated because they absolutely impact how your queries are going to behave. There are a couple of ways that statistics are automatically maintained and I really want to be sure that you know what those are. We’ll also talk about manual maintenance, which must be done.
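
At its simplest, that manual maintenance looks something like this (the table is just an example):

-- Update the statistics on a single table, reading every row
UPDATE STATISTICS Person.Address WITH FULLSCAN;

-- Or update statistics across the whole database
EXEC sys.sp_updatestats;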

Overall, this is meant as a session for those who have stumbled into a position at work that requires them to be a Data Professional. You can call it accidental DBA, or reluctant database programmer, or just the poor individual who was left standing when everyone else stepped backwards one pace. If you already feel you have a firm grasp of statistics, awesome. You might not want to attend this session. It’s Summit. There are others during the same time slot. But, I have a request. Could you please direct your junior DBA to this? If you know someone who’s just getting started writing T-SQL code, maintaining a server, developing applications and reports, I think this session would do them good, so please send them on over.

Also, on the topic of statistics, I’d like to suggest that you go to Erin Stellato’s session, Statistics and Query Plans. That’s going to add a bunch of information in addition to what I’m covering. Plus, she’s a great speaker. It takes place the day before my session, but that’s OK. We’re covering different information. They’re very complementary sessions.