Jan 23 2012

Deadlock Monitoring

There are four different ways you can get information about deadlocks in your system. These are:

  • traceflag 1204
  • traceflag 1222
  • trace events
  • extended events

For years I’ve been pushing traceflag 1222 as the best of the lot. Well, that’s over. I’ve been learning more and more about extended events and I’m currently in love with xml_deadlock_report event. Why? Simple, it has everything that traceflag 1222 has, but there are two glorious things about it.

First, it’s not going to be filling up my error log with, for the error log, noise. Seriously. As much as I liked the information displayed from traceflag 1222, I didn’t like what it did to the log, but I saw it as a necessary evil.

Second, it’s XML baby! That means you can set up queries to pull information out if you need to. Yeah, I know you can hop through hoops to set up the text from the error log for querying, but, that’s nothing like pointing to the file and referencing the nodes. Way too cool.

On top of that, for simple deadlocks, the graphical version of the deadlock graph gives you most of what you need to quickly identify the issue. You absolutely don’t get that with the traceflag.

Nope. I’m sold more than ever on extended events. If you haven’t started exploring them, I strongly recommend you do. Especially with SQL Server 2012 just around the corner. The functionality around extended events there completely makes these things accessible in ways they weren’t before.

Oct 17 2011

Extended Events Data

I’ve been working quite a bit over the last week or so with extended events in Denali. The sheer magnitude of what you can do with extended events is just becoming clear to me. The interesting thing though is how much the basics are similar to trace. Similar mind you, not the same. For example, the best way to gather trace data is to output it to a file and then read the file into a table for later querying. It’s the same with extended events. There’s even a function that acts as a table:

SELECT *
FROM sys.fn_xe_file_target_read_file ('C:\APath\Query Performance Tuning*.xel', NULL, NULL, NULL);

This can take advantage of roll-over files just like the old function used for traces. You can also provide offsets to read a sub-set of the file, which is cool. The event fields come back within an XML column. Here’s a sample of how I’m dealing with that from the new version of my Query Performance Tuning book:

WITH xEvents AS
(SELECT object_name AS xEventName,
CAST (event_data AS xml) AS xEventData
FROM sys.fn_xe_file_target_read_file
('C:\Apath\Query Performance Tuning*.xel', NULL, NULL, NULL))

SELECT xEventName,
xEventData.value('(/event/data[@name=''duration'']/value)[1]','bigint') Duration,
xEventData.value('(/event/data[@name=''physical_reads'']/value)[1]','bigint') PhysicalReads,
xEventData.value('(/event/data[@name=''logical_reads'']/value)[1]','bigint') LogicalReads,
xEventData.value('(/event/data[@name=''cpu_time'']/value)[1]','bigint') CpuTime,
CASE xEventName WHEN 'sql_batch_completed' THEN
xEventData.value('(/event/data[@name=''batch_text'']/value)[1]','varchar(max)')
WHEN 'rpc_completed' THEN
xEventData.value('(/event/data[@name=''statement'']/value)[1]','varchar(max)')
END AS SQLText,
xEventData.value('(/event/data[@name=''query_plan_hash'']/value)[1]','binary(8)') QueryPlanHash
FROM xEvents;

yeah, XQuery isn’t my strong point. However, you get the idea. Oh, and the CASE statement is necessary because the rpc_complete code is visible in the statement field, but the sql_batch_complete code is visible in the batch_text field. That’s just Microsoft keeping us on our toes.

Want to talk about this and other ways of gathering performance metrics? Please come and visit during SQL In The City:Los Angeles on October 28th.

Dec 13 2010

Life/Work Balance

Apple iPad - Work Life Balance ToolTechnology, especially information technology, is the greatest thing to ever happen to mankind, freeing us from toil and drudgery. Technology, especially information technology, is a pernicious evil taking over our lives forcing us to work harder and longer. Depending on the time of day, the day of the week, my mood, my wife’s mood, or the direction the wind is blowing, either of these statements could be true.

The fact is, I love technology and I do have to wrestle with keeping it from taking over my life, but only because I have so much fun with the toys that technology brings. You want to know how much I love toys, ask me about my Droid sometime. Pull up a chair. We’re going to be here a while. The trick is, finding that sweet spot, where you use the tools presented to you in order to enhance your life while enhancing your work. Just enough of each and you can be a hero at home and on the job and have a blast doing both.

The one thing I really hate about being a DBA is being on call. I’m not sure why but most systems fail one of three times, right when I’m going to sleep, so I get to stay up another 1-3 hours fixing the issue; around 3AM, so I can spend about 1/2 an hour figuring out how to log into the network before I spend 1-3 hours fixing the issue; or, when I’m half way up a mountain with the Scouts, in which case, I just have to call the boss and get someone else engaged (and yes, I do prefer these last failures to the others). The real trick here is, to get your systems set up so that you don’t have constant emergencies, regardless of the time of day. How do you do this? Proactive monitoring.

Red Gate handed me 10 iPad’s along with 10 licenses for SQL Monitor, their new monitoring tool. I’m to give these 10 devices away to the best responses in the comment section of this post to the question I’m going to put to you shortly. That’s right, you can get out in front of the issues you’re running into and avoid whenever it is that you get called from work and get an awesome toy at the same time.

The goal is life/work balance. Notice which one I put first. That’s the priority. Here’s your question:

What do you think the most common cause of server outages is, why, and how would being able to monitor your systems remotely help solve this issue, thereby improving the quality of your life?

The contest runs from now until 11:59 PM, December 17th, 2010. Please reply below, but keep it pithy. Don’t publish your version of War & Peace in the comments (I might delete it). I’m the sole judge and arbiter (which means, I probably will delete anything resembling War & Peace). One entry only. Make sure there’s a means of contacting you in the post, or I’ll give your iPad to someone else. Remember, pithy is our watch word. You can answer this question in three well constructed sentences. If you win, I’ll want to get a picture of you using the iPad to monitor your systems remotely. Plan on sending me that picture by January 31st. An interesting picture. Something with you sitting in your cube at work just won’t fly.

That’s it. I’ll announce the winners in a new post on the blog at the end of the week. Here are the official rules:

  1. The contest is open to professionals with SQL Server monitoring responsibility. Entrants must be 18 years old or over.
  2. Entries must be received by Friday, December 17, 2010. The contest organizers accept no responsibility for corrupted or delayed entries.
  3. Employees of Red Gate, the contest organizers and their family members are not eligible to participate in the contest.
  4. Entries are limited to one per person across the three simultaneous contests hosted on SQLServerCentral.Com, BrentOzar.Com, and ScaryDba.Com.
  5. The organizers reserve the right, within their sole discretion, to disqualify nominations.
  6. The organizers’ decisions are final.
  7. Red Gate Software and those involved in the organization, promotion, and operation of the contest and in the awarding of prizes explicitly make no representations or warranties whatsoever as to the quality, suitability, merchantability, or fitness for a particular purpose of the prizes awarded and they hereby disclaim all liability for any loss or damage of any kind, including personal injury, suffered while participating in the contest or utilizing any prizes awarded. 
May 28 2010

Buy vs. Build

It’s the classic question faced by everyone in Information Services. I know how to do this and I could build software to do it, but I’m a lazy b_____d so I’d rather just pick up a piece of software that does it for me. I love working for large companies because I can usually get them to purchase stuff so that I can loll around stuffing my face all day instead of doing actual work. Unfortunately, not everyone can afford to pick up Microsoft’s Operations Manager or Idera’s Diagnostic Manager. But you still need to monitor your servers. With buy eliminated, that leaves build.

Which, is where this excellent blog post by Laerte Junior comes in. He lays out how to build a wrapper around calls to get Performan Counter information using PowerShell. It’s a pretty slick and worth a read. Becuase the thing is, when you need to build your monitoring products, you want to use a language that you know. Since everyone is learning Powershell (right?) this provides a good foundation for beginning your monitoring project.

Apr 16 2010

Confio Ignite: Part II

I’m continuing to evaluate Confio’s Ignite database monitoring tool. I’ve had it collecting data on a couple of production servers for about a week now. Based on what I’ve seen so far, it’s looking like a pretty good piece of software.

Breaking with the usual tradition, I’m going to talk about the things I’m not crazy about with the software, before I start singing its praises. The first thing, that I thought was missing, but is actually just hard to find, is the ability to look at the query information that Ignite collects, broken down by database. It looks like you should be able to get to it by looking at the Databases tab, but instead you have to first drill down into a time-period, then select specific databases within that time period, which will show you the queries by database. I know that in my environment, it’s this database listing of queries that is probably going to get used a lot. Tracking it down required help from the Confio team, which they quickly provided. They also showed me a way I could run queries to get this from their database so that I could create a report.

Speaking of reports, because they support Oracle & DB2 as well as SQL Server through their interface, no Reporting Services. OK, I recognize that I’m more than a bit Microsoft-centric when I say this, but I’d like to see SSRS reports so that I can manipulate things the way I want to. Again, not the end of the world, but it’s just something I don’t like. Because the data store is available though, I can get in there and create my own reports to my heart’s content and, if I like their report, I can always run a trace to capture the query and see how they built it so that I can build one of my own that mirrors it. I think they provide a mechanism for customizing reports by building some XML that will add it to the interface, which is a bit of a pain, but shows they’re on top of things.

One other thing that bothered me as I worked with Ignite is that, in the real time monitoring section, it was hard to find my way to the list of locks. It had the list, but it just wasn’t obvious how to get to them and this is something I’m used to looking at when I’m worrying about real time performance issues.

Right, enough talking about things I don’t like. See this? I love this:

That’s a break down, by database, by day, of the cumulative waits on the system. Yeah, that little pink database is causing all sorts of problems on one of my production systems. I actually already knew this was a problematic database, but I wasn’t that precisely aware of when and how it had issues. You can drill down to see the same thing for a given day:

And it’s not just the pretty pictures either, showing that most of our production load is in the morning, but that there are some odd spikes at midnight, 4AM & 7PM, but there’s data available behind the graph. If you drill down and hover over the graphs, pop-ups like this one appear:

(Names have been blacked out to protect my job)

And it’s the focus on wait times and types provided by the trending views and reports that make this a very strong tool. Ignite also collects pretty standard performance metrics, buffer cache hit ratio, memory paging, etc. And you can customize which of those metrics it collects and which ones you display, all on your own. But almost any decent monitoring tool does that. I use a tool that does that, and might even do that a bit better. No, what seperates this tool from the pack is that ability to identify the wait states and the query that is causing them. That’s what will get you excited when you’re using Ignite.

It also has a little ping mechanism that shows response time on the server, a helpful and friendly little reminder that all performance isn’t around what’s happening on SQL Server, but what’s happening across the enterprise.

Big chunks of the interface are customizable and you can add or drop things using the little icons on the right side of the picture above.

I can keep going with all the stuff that’s good, because it’s a long list, but I’ll summarize what I like so far. I like the focus on wait states, a lot. I like the focus on query performance. Between those two I feel like I’m getting the majority of what I need from the tool, and more than I get from other, similar, monitoring tools. The fact that it does all the other standard data collection on top feels like gravy. The problems I have with the software so far are either personal issues, Reporting Services, or fairly minor problems. I have a monitoring product running against the same servers as Ignite and I haven’t seen any impact from Ignite’s monitoring showing up there, so it looks like this is a benign monitoring software.

For my final installment, I’m going to look up the skirt of Ignite and see what the underlying data structure looks like. The ability to gather and present data is all well and good, but more important is the ability to take that data and make serious use of it through reports or by building other data structures out of it for long term storage and reporting.

Apr 06 2010

Confio Ignite: First Impressions

One of the national sponsors for SQL Satuday is Confio Software. Just because they sponsor SQL Saturday, it’s worth checking out their product, but they also host the speaker’s dinner at these events. So, having eaten off their dime, I felt obligated to take a look. I’m glad I did.

Confio Ignite is a monitoring software that keeps real-time and historical track of the performance of your SQL Server (and Oracle and DB2) database servers. It’s focus is on wait states and queues, a very common method for troubleshooting performance.

You can get a trial download from their web site to run for a couple of weeks yourself. Everyone’s first impression of a software, after the web site of course, is when go to do the install. The install routine for Ignite was very easy. It’s also a bit of a shock. It actually uses web pages for configuration. This makes sense when you consider that the software originally comes from Oracle and Unix systems. Anyway, despite the bit of weirdness of using a web page for configuring software, it’s a very easy install. Once the install is complete, it’s also really simple to add servers for the software to monitor.

I put it to work on a number of development servers to start (I’m not plugging some unknown software into Production without a shake-down. I may be stupid, but I’m not crazy). It’s agentless and it began running queries against the systems immediately. It’s storing the data to my local desktop. That data is available in real-time, a historical view, or a trending view, showing how performance is changing over time. It’s all through a web site, not dissimilar to Quest’s Foglight. You can see a list of servers, their immediate state, action choices and some other general info.

Clicking a server, after it’s had time to gather data, put’s you into the Trend screen. It’s laid out very nicely. You get to break down performance by SQL, Waits, Programs, Databases, Machines, or Users. Then it shows information by Total Wait, Average Wait or Typical Day. The chart gives you a visual representation of queries running against the system, by default numbered and assigned by an internal hashing system that Ignite uses (no, I don’t know what the internals look like, I’m inferring that from the information presented). Down at the bottom of the screen you can see Top Query Problems, which shows you queries that are running slow or using too many resources according configurable settings, or Resources, your standard breakdown of CPU, Memory Paging, Disk Queue and the rest, or SQL Text, literally the queries that have been running on your system.

Almost everything has a drill down that takes you to a lot more information, and there are pop-ups that show you bits of detail.

I’ve only been running it for a few days, but I can already see how it could be very useful. I’m very happy with the focus on waits and on queries. So often when you’re looking at these tools queries are either an after thought or ignored completely. You frequently have to spend time monitoring your queries on your own, while that very expensive monitoring software is tracking page life expectancy and buffer cache hit ratio, both useful measures, but you still need to know which query is causing the problems. Ignite will get to that information right away and still collect the other as well.

So far, I’m pretty pleased with the software. I’m going to switch this over to a production system in the next few days and see what I can see with Ignite running against a real system. More to come.

Sep 21 2009

Identifying Frequently Running Queries

It’s not enough to look at the longest running query. What if you have two queries, one runs for 5 minutes and the other runs for 5 seconds. Which do you tune? The first one, of course. But, let’s add a little information to our thought experiment. The one that runs for 5 minutes is called 2AM, your local time, when there are almost no users on the system and the one that runs for 5 seconds is called 40 times a minute all day long. Now, which one do you tune? That’s right, suddenly that 5 second query takes on a HUGE significance based on how often it’s called.

But how do you know how often a query is called? The easiest way to gather that information is not necessarily the best in terms of a long term methodology, a server side trace that captures RPC and SQL Batch completion events. The beauty of the server side trace is how easy it is to set up and how much information you can gather, so quickly. You can immediately start getting every query run on the server along with how long they took to run, how much CPU each used and various other metrics. You can run this for short periods or long and then gather the data together into a table in a database and run aggregation queries against it to generate reports showing the most frequently called query. Simple, quick easy. But there are a couple of problems with it. First, while you can run it 24×7, you’re going to see quite a bit of data generated from the events, depending on the load on your system, of course. Second, the queries that come through will likely have all kinds of stuff associated with them, various paramter values and what not, that will make aggregation hard. You’ll have to put together a means of cleaning the data and maintaining it or purchase one of the many third party tools for doing this. You’ll have to store and manage large amounts of data as it comes into the system. It’s going to be a lot of work.

Another option, one that will require a degree of setup, is a little less accurate, it’s unlikely to capture every single execution of every single query, but it will get the majority, sys.dm_exec_query_stats. This handy little dynamic management view carries aggregate data about all the queries currently in the cache. It will show how much accumulated processing time a query has taken, the last time, the longest time, the shortest time and lots of other information, but the one interesting value we would be most interested in here, is the usage count. If you just run a SELECT statement against sys.dm_exec_query_stats, you’ll get interesting data back, but you won’t see your query or your execution plan. To make the information available within the DMV, you’ll need to combine it with other DMV’s. Here’s a sample query that pulls back information that could be useful for identifying badly performing queries on your system including how often they’re called.

SELECT dest.text
,deqp.query_plan
,deqs.execution_count
,deqs.creation_time
,deqs.query_hash
,deqs.query_plan_hash
,deqs.total_elapsed_time
,deqs.total_clr_time
,deqs.total_physical_reads
,deqs.total_worker_time
FROM sys.dm_exec_query_stats deqs
CROSS APPLY ( SELECT *
           FROM sys.dm_exec_sql_text(deqs.sql_handle)) dest
CROSS APPLY ( SELECT *
           FROM sys.dm_exec_query_plan(deqs.plan_handle)) deqp

There’s a lot you could do with this query, but the real power is going to come from gathering this data on a regular basis. It’s going to depend on how volatile the cache is on your system. If most of your queries sit in the cache for weeks (and that’s possible), then you could run a query once a week to gather the information. If your cache is getting flushed clean once an hour, you’d need to run this query more often. Regardless of how often you run the query, understand that, because you’re dependent on the cache, it is possible to miss data. Once you get the query running on your system, you can use the information to determine which of the queries needs to be tuned, not just on how long it runs, but also how often it runs, or even using other combinations of the information returned.

One other point, you may see ad hoc queries on your system. Here’s a simple example:

SELECT TOP (5) sod.*
FROM Sales.SalesOrderDetail sod
WHERE sod.ProductID = 772
ORDER BY sod.SalesOrderID ASC ;

If the value passed to sod.ProductID is changed to 773 or any other value, you will get a different entry in cache for this procedure. That’s where you can put the query_hash or the query_plan_hash values to work. These values show a hash that SQL Server creates for each query that it stores in cache. You could group by these values to aggregate ad hoc queries that would, if they were stored procedures, be identical executions and arrive at an aggregated count.

So, if you need precise information, you can use trace events, but, pay for that with a lot of data that you have to manage.

EDIT: Changed deqs.total_elapsed_date to deqs.total_elapsed_time as per comments below.

Sep 14 2009

Editorial on SQL Server Central

My first one over there. It’s discussing whether or not you should do two things, build your own monitoring tool, come out in particular favor of one tool or suite of tools from a single vendor. Please read it and watch the video. And, even more importantly, leave a comment in the discussion.

Sep 09 2008

Query Fingerprints and Plan Fingerprints

SQL Server 2008 has been released, but the full capabilities of the product are still be discovered. This isn’t a case of discovery so much as Microsoft, in the person of Bart Duncan, has shown us something new. Read the article. This is one of the most exciting things I’ve seen in a long time. It’s especially exciting as we see databases becoming less and less manageable through the implementation of tools like nHibernate.

As I noted in a previous post, nHibernate will create a lot of execution plans. With the capabilities here, we’ll be able to easily and quickly aggregate some of those plans to identify the costly queries coming out of nHibernate without having to resort to 24 hour Profiler monitoring.

Great stuff.

Apr 01 2008

Performance Studio

I just found out about some new functionality coming out in SQL Server 2008 called Performance Studio. It’s actually largely a framework around which you can build performance monitoring routines for an entire enterprise. This sounds terrific. I’m going to dig into a bit and make it my presentation for the Heroes {Community} Launch event at SNESSUG next week. Here’s a Technet webcast on the topic. Here’s a very nice blog entry over at SQLTeam (I suppose I should ad them to my blog roll) discussing the function of the Data Collector, the foundation for this new framework. Performance Studio only works with 2008 systems though, so that’s something to take into account. Although I see an interview with Brad McGehee that says it’s not enterprise ready. Another something to take into account.