Distributed Replay: The Little Engine That Almost Could

Honestly, sincerely, no kidding, I love Distributed Replay. Yes, I get it. Proof positive I’m an idiot. As if we needed proof. To be a little fair to me, I love what Distributed Replay could have been, with a little more love. However, the fact is, it’s on the deprecation list for SQL Server 2022. Which means whatever minimal amount of love, if any, Microsoft was giving to it is gone, forever. Unlike the Little Engine That Could, it turns out that Distributed Replay was the Little Engine That Almost Could, But Didn’t. Really Didn’t. Let’s discuss it a bit.

Distributed Replay

The concept is wonderful. Capture a bunch of queries from your production system. Replay them on a non-production system for testing. Add in the ability to chain together multiple machines to make it all happen and, suddenly, you’ve got a full-blown load testing and performance testing environment that is going to very accurately mirror behaviors in production. Perfection.

Setup and administration seemed straightforward. You have a target server, of course. You have an administration tool (command line) and a controller. The controller orchestrates one or more clients (and it really, really should be more than one to make this useful), which are set up on multiple different machines. Easy enough.

Go to production. There’s a ready-made Trace you can use to capture the queries in the format needed for Distributed Replay. Back up your database. Actually, best get it set up so you can restore to a point in time.

Put it all together, perfect test environment.

So what happened?

I Think I Can’t

Before we talk about all the pain, let me say, I’ve done this, successfully, multiple times with multiple systems. It really can be done. With that out of the way, let’s talk about the experience.

First, Trace. I don’t like it. I don’t use it. Distributed Replay didn’t support Extended Events directly. There was a way to capture the queries using Extended Events and then convert them to the format Distributed Replay needed, but the conversion process was hard and long, it added to the pain, and no one really wanted to do it. So, Trace.

Also, quick one, you’re taking raw, possibly GDPR-violating data out of production and into a test system. My hackles are raised. How about yours?

Next, you really do have to get your database restored to a point in time that directly correlates to your replay. In fact, the only way I ever got it to work was to restore to the EXACT point in time when my Trace started. Otherwise, you miss one transaction, and then it snowballs. That transaction changed data that affects the next three. They changed data that affects the next 107. Etc. You get the point. Instead of a whole bunch of queries running, you just get a sea of red, like a Biblical plague.
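Since that restore has to be repeated exactly every time a replay goes sideways, it helps to generate the script rather than retype it. Here is a minimal sketch in Python, not a definitive implementation. It assumes one full backup plus a chain of log backups, and that the last log backup in the list covers the moment the Trace started. The database name, file paths, and timestamp are all hypothetical. It only builds standard RESTORE ... WITH STOPAT statements as text; running them against the test server is left to whatever tool you already use.

```python
from datetime import datetime


def build_restore_script(database, full_backup, log_backups, trace_start):
    """Return T-SQL that restores `database` to the moment the trace began.

    Assumption: log_backups is in backup order and the LAST file contains
    trace_start. Names and paths are supplied by the caller.
    """
    if not log_backups:
        raise ValueError("a point-in-time restore needs at least one log backup")

    stop_at = trace_start.strftime("%Y-%m-%dT%H:%M:%S")

    # Restore the full backup but leave the database in the restoring state.
    statements = [
        f"RESTORE DATABASE [{database}] FROM DISK = N'{full_backup}' "
        "WITH NORECOVERY, REPLACE;"
    ]

    # Apply every log backup except the last one in full.
    for path in log_backups[:-1]:
        statements.append(
            f"RESTORE LOG [{database}] FROM DISK = N'{path}' WITH NORECOVERY;"
        )

    # Stop the final log restore at the trace start time and bring it online.
    statements.append(
        f"RESTORE LOG [{database}] FROM DISK = N'{log_backups[-1]}' "
        f"WITH STOPAT = N'{stop_at}', RECOVERY;"
    )
    return "\n".join(statements)


# Hypothetical values for illustration only.
script = build_restore_script(
    database="Sales",
    full_backup=r"D:\backup\Sales_full.bak",
    log_backups=[r"D:\backup\Sales_log1.trn", r"D:\backup\Sales_log2.trn"],
    trace_start=datetime(2022, 3, 1, 9, 15, 0),
)
print(script)
```

Save the output next to the Trace file so the restore and the capture always travel together.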

Client setup is easy. Connectivity to the controller is spotty and weird, and the best way to make it work is to make everything, everywhere, admin.

The controller, hoo boy. All through a command line interface. Documentation on it is… less than ideal. However, I wrote up a simplified approach in my query tuning book (actively getting rewritten, from scratch, for 2022, and won’t include a chapter on Distributed Replay). There were others online. Still, it made it all difficult.

Let’s say you get it set up perfectly, first time (HA!!!). It just works, right? Yeah, maybe. Unless one of the clients hiccups and gets out of sync. Then, sea of red time. Oh, and reset time. Restore the database to that perfect point in time (you did save the script, right?) and back into the Controller command line.

Look, getting it to work just sucked. So people gave up. Which means no one is using the feature, requesting functions, reporting bugs, etc. Why would Microsoft continue investment? They wouldn’t.

Conclusion

I completely understand why Distributed Replay was deprecated. I still don’t like the fact. When I did get it set up, despite the real pain involved, the results were spectacular. I could accurately predict exactly how a new index, a code change, all sorts of stuff, would directly impact production. But I had the relatively unique support of a situation that let me take the time to learn how to implement this monster of a process (and it really is painful). So, it is with sadness that I bid farewell to the Little Engine That Couldn’t.
