Monitoring for Timeouts

The question came up at SQL Rally, “Can you use Extended Events to monitor for query timeouts?”

My immediate response was yes… and then I stood there trying to think of exactly how I’d do it. Nothing came quickly to mind. So, I promised to track down the answer and post it to the blog.

My first thought is to use the Causality Tracking feature to find all the places where you have a sql_batch_starting without a sql_batch_completed (or the same thing with rpc calls). And you know what, that would work. But, before I got too deep in trying to write the query that would find all the mismatched attach_activity_id values that have a sequence of 1, but not one of 2, I did some additional reading. Seems there’s another little trick that works even better (sort of). It’s to set up an Event Pairing Target.

This is pretty slick. Here’s the code I used to set up for “timeouts” in SQL Server Management Studio:

CREATE EVENT SESSION [Timeouts] ON SERVER
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.session_id)
    WHERE ([result] <> (2))),
ADD EVENT sqlserver.sql_batch_starting (
    ACTION (sqlserver.session_id))
ADD TARGET package0.pair_matching (SET begin_event = N'sqlserver.sql_batch_starting',
                                   begin_matching_actions = N'sqlserver.session_id',
                                   begin_matching_columns = N'batch_text',
                                   end_event = N'sqlserver.sql_batch_completed',
                                   end_matching_actions = N'sqlserver.session_id',
                                   end_matching_columns = N'batch_text')
WITH (MAX_MEMORY = 4096 KB,
      EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
      MAX_DISPATCH_LATENCY = 30 SECONDS,
      MAX_EVENT_SIZE = 0 KB,
      MEMORY_PARTITION_MODE = NONE,
      TRACK_CAUSALITY = ON,
      STARTUP_STATE = OFF)
GO

A couple of notes. You see the “timeout” in quotes above. That’s because I wasn’t getting a timeout in SSMS, despite setting one. Rather than muck around (I’ll figure that out later), I just set a long running query and then stopped it, but, I put a filter in place to not capture batches that terminated from an Abort command, [result] <> 2. But that did it. As soon as I aborted, I could use the query supplied at the link for Event Pairing Target to immediately see my unmatched pair. That’s quick and easy timeout monitoring using extended events. Output here:

<event name="sql_batch_starting" package="sqlserver" timestamp="2015-03-10T17:25:14.089Z">
    <data name="batch_text">
      <type name="unicode_string" package="package0" />
      <value>SELECT  *
FROM    Production.ProductCostHistory AS pch,
        Production.BillOfMaterials AS bom,
        Person.ContactType AS ct;</value>
    </data>
    <action name="session_id" package="sqlserver">
      <type name="uint16" package="package0" />
      <value>63</value>
    </action>
    <action name="attach_activity_id_xfer" package="package0">
      <type name="activity_id_xfer" package="package0" />
      <value>1BF4899F-4B74-434B-9CDE-2B646A560FB1-0</value>
    </action>
    <action name="attach_activity_id" package="package0">
      <type name="activity_id" package="package0" />
      <value>4B92C757-F780-4D8C-B1DA-F95C206A32EE-1</value>
    </action>
  </event>

So, here’s the biggest question, would I actually use this in production?

I’m a little nervous about using the batch_text as a matching point. With rpc_completed you could just use session_id and object_name or object_id. That’d be great. So, if I were going to use this in production, I’d probably only monitor for certain databases, putting tighter filtering in place for batch commands. But, other than that, I can query the target and pull out any timeout objects, so this is actually pretty slick and works well.

The only shortcoming I see is that you can’t add two pair matching targets in a session, so you would need one session for batch timeouts and another for rpc timeouts. Not the end of the world, but worth noting.

 

 

13 thoughts on “Monitoring for Timeouts

  • Cody

    Unfortunately I’m not savvy enough with extended events and the warnings and additional work given at the end. That’s a problem because now I’m not comfortable using the method.

    Maybe this is targeted at a level just above me.

  • Which part? I’m happy to help. For actual timeouts you don’t need the filter that I added. That was just so I could capture canceled calls. Then you just need the query shown for Event Pairing, and you’re done. You’ve got timeout monitoring in place.

    Thinking about it, I think it might work to have it match on the query hash. It’s an action that can be added pretty easily.

  • Cody

    Hi, I’m back, 6 months later. I was dealing with timeouts the past few months and had to experiment with XE to capture more detail when applications were swallowing it and just throwing generic error messages.

    I ended up not using any fancy pairing; simply an rpc/batch completed with a result 0 condition. To confirm it would work I experimented with aborting SSMS queries, and also simulating .NET timeouts in PowerShell.

    It picked up everything I wanted and now it’s part of the toolbox. I was curious though why you didn’t use this method. If there was some edge case that it doesn’t pick up (or false alarm) I’d still recommend it, given that it has helped me quickly and easily identify and resolve so many troubling problems lately.

    Now I’m more focused on finding out why some queries run fine and then start timing out later; using this, plus query fingerprints to identify multiple execution plans. I realised I didn’t fully understand how parts of queries get split in the plan cache, which is why I started reading your book, and how I came across this old post.

  • Is there some reason, why simply tracking the “sqlserver.sql_batch_completed” events with the result = “Abort” wouldn’t be enough?

    As far as I can tell, even if query times out, sql_batch_completed is emitted, with result = “Abort”. Basically pair_matching seems overkill for this (with potential memory pressure ect).

    • Not all commands actually get a completed event when there’s a disconnect from the other side. That’s why you’d need to do something like this. I’d only do this if I was under stress or I had a very targeted & specific set of functionality that I was attempting to troubleshoot. I wouldn’t recommend just setting this up and letting it run for the reasons you mention.

Please let me know what you think about this article or any questions:

This site uses Akismet to reduce spam. Learn how your comment data is processed.