Mar 04 2013

Pro SQL Server 2012 Practices: Chapter 17

I jumped at the opportunity to write a chapter for this book, Pro SQL Server 2012 Practices, because of all the fantastic people writing for it and the topics covered. I reviewed Chapter 12 a few weeks ago, and I’ve been meaning to get this chapter, Big Data for the SQL Server DBA, read and reviewed, but I’ve just been sidetracked. I finally noticed it on my Trello board, weeks out of date, and decided to get the job done.

Carlos Bossy (b|t) is a great guy and a part of my SQL Family. We run into each other all the time at SQL Saturdays and other events. I’ve never heard him present, and now, after reading his chapter in the book, I’m very sorry I haven’t. I’ll fix that at my first opportunity.

Carlos starts off nicely, getting a reasonable definition of “big data.” It’s not merely about size, although that has to play a part. He breaks it down into the three “Vs”: volume, variety, and velocity. It makes sense. Just having a big database is not the same thing as dealing with big data. He takes the time to differentiate between the structures of a relational system, which work just fine for what they do, and the needs of big data, which just don’t fit well in relational storage. It’s as clear a delineation as I’ve seen on this topic.

The big data management system that Carlos then focuses on is Hadoop, which makes sense again. If you’re already working within the Microsoft stack, Microsoft is moving on Hadoop with the HDInsight servers and services on Azure. So learning how the Hadoop Distributed File System (HDFS) and MapReduce work together to provide you with massive data movement and storage keeps you within the set of tools with which you’re already familiar (although there is a ton of learning to do here). Carlos covers MapReduce in quite a lot of detail because it really is the driving force behind Hadoop and how you’re going to get your big data into the system. He covers the topic by explaining in general terms how things work, then walks you through an example with additional detail, and then walks you through another example with a lot more detail. The cumulative effect works well. The core concepts generally start to stick.
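To make the map/shuffle/reduce flow concrete, here’s a minimal word-count sketch of the same idea as a local shell pipeline. This is my own illustration, not from the chapter: the first awk plays the mapper, sort stands in for the shuffle, and the second awk plays the reducer.

```shell
# Local word-count sketch of the MapReduce pattern:
#   map     -> emit "word<TAB>1" for every word
#   shuffle -> sort brings identical keys together
#   reduce  -> sum the counts per key
printf 'big data\nbig deal\n' \
  | tr ' ' '\n' \
  | awk '{print $0 "\t1"}' \
  | sort \
  | awk -F'\t' '{c[$1] += $2} END {for (w in c) print w "\t" c[w]}' \
  | sort
# -> big 2, data 1, deal 1 (one word per line, tab-separated)
```

On a real cluster the map and reduce steps run in parallel across nodes and the framework handles the shuffle, but the shape of the computation is the same.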

I really like how Carlos addresses the technology from a business solution perspective. It’s not that any one technology is a glorious thing, let’s go use it. It’s that certain business problems require certain tools in order to solve them most efficiently. He makes an excellent case for DBAs understanding technologies like Hadoop in order to implement them where they are appropriate. It’s the kind of argument I’ve always tried to support. The concepts for managing big data are both simple and really complex, and Carlos attacks them from both ends in order to educate. I’ve learned a lot from reading this chapter.

He then walks through the different kinds of work you’ll need to support as a DBA using a Hadoop server. There’s going to be some pretty major work around import/export of the data: into and out of Hadoop, yes, but also into and out of other types of storage, such as good old-fashioned relational SQL Server, data warehouses, cubes, and Excel/SharePoint. Carlos takes us through a detailed examination of the primary tool for this function within the HDInsight/Hadoop infrastructure in Azure, Sqoop. It’s a great way to learn how this stuff works.
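For reference, a Sqoop import from SQL Server into HDFS looks roughly like this. The server, database, and table names are made up for illustration, and the snippet only builds and echoes the command so you can see the shape without a configured cluster:

```shell
# Sketch of a Sqoop import pulling a SQL Server table into HDFS.
# Hostname, database, user, and table are hypothetical placeholders.
SQOOP_CMD="sqoop import \
  --connect 'jdbc:sqlserver://dbserver:1433;databaseName=Sales' \
  --username hdp_user -P \
  --table Orders \
  --target-dir /user/demo/orders"
echo "$SQOOP_CMD"
```

Export back out of HDFS into a table uses the mirror-image `sqoop export` with the same kind of connection string.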

Finally, Carlos goes over some pieces of the future of big data, again focusing primarily on the Microsoft/SQL Server stack and how it relates to Hadoop, but also SSRS and all the tools with which you’re already familiar.

Overall, it’s a great chapter. It’s a very solid introduction to a topic that I’m just starting to really wrestle with myself, so I’m happy to have it as a reference.

Dec 03 2012

HDInsight, Finally

See this:

That’s right. The install worked. All I had to do was get a completely clean server set up. No domain controller. Now to get my learn on.

Microsoft has a web site with a number of introductory samples. I’ll start there and work through them. The very first example gets me set up with some data that it builds by running a PowerShell script, importdata.ps1. But I’m not going to just blindly follow along. I want to see what the heck is happening so I can start understanding this stuff. By the way, thank you, Microsoft, for making the samples in PowerShell and not forcing me to relearn Python or something else. That would have been frustrating.

The script is really simple. It takes one of two scenarios, w3c or RunTraffic. They just change directory and run another PowerShell script, import.ps1, from two different directories. I’ll bet the scripts are different. I’m running the w3c scenario, so let’s see what that script is doing.
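As I read it, the dispatch logic amounts to something like this, re-sketched here as a POSIX shell function. The scenario names come from the sample; the function name and the import.sh stand-in for import.ps1 are my own:

```shell
# Rough shape of importdata.ps1's scenario dispatch, as a shell sketch.
# Each scenario is just a directory carrying its own import script.
import_data() {
  case "$1" in
    w3c|RunTraffic)
      # run the scenario's own import script from its directory
      ( cd "$1" && ./import.sh )
      ;;
    *)
      echo "usage: import_data {w3c|RunTraffic}" >&2
      return 1
      ;;
  esac
}
```

So all of the interesting work lives in the per-scenario import script, which is why the next step is cracking that open.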

Ah, now things are getting interesting. There are two functions: one for data generation, which uses an executable to make up test data; the other, a mechanism for calling Hadoop. Basically it uses two objects, System.Diagnostics.ProcessStartInfo and System.Diagnostics.Process. The ProcessStartInfo is for defining startup information for a process that you then define using the Process class. In this case it’s setting the location of Hadoop:

$pinfo.FileName = $env:HADOOP_HOME + "\bin\hadoop.cmd";

Then it sets up arguments, if any. The actual calls to this from the code use a command, dfs, which has different settings, -mkdir and -copyFromLocal. From what I can tell, it’s creating a storage location within Hadoop and then moving the generated data over. I’m good with all the scripts I can see except for knowing where this dfs thing comes from.
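For the record, dfs turns out to be a subcommand of the hadoop launcher itself: its file-system shell, which speaks to HDFS. So the PowerShell wrapper is ultimately running commands like the ones below. The hadoop function here is a stub that only echoes, since I’m sketching this without a live cluster, and the path and file name are hypothetical:

```shell
# "dfs" is the hadoop launcher's file-system shell subcommand.
# Stubbed so the sketch runs without a cluster; drop the stub on a real install.
hadoop() { echo "hadoop $*"; }

hadoop dfs -mkdir /w3c/input                      # carve out a spot in HDFS
hadoop dfs -copyFromLocal w3c_data.txt /w3c/input # push the generated data up
```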

Data load ran just fine:

Data loaded, time to test out a MapReduce job. Again, there’s a PowerShell script included for running a simple job, so I check it out. First run fails. Great. More stuff to troubleshoot in order to see this work. This is not going to be easy.

Stepping through and running the scripts might not be the best way to learn this. So, I’m going to now start reading the Big Data Jumpstart Guide. I’ll post more as I learn it.


Nov 29 2012

HDInsight: Trying again

And I thought this would be so easy.

Latest attempt. I just did an uninstall of HDInsight. Then I found all the Hadoop folders on the drive and removed them. I validated that there were no services left running or anything else that might interfere with a fresh install. Then I rebooted the server (have you tried turning it off and on again?). After all that, I tried rerunning the HDInsight Community Preview installer. This time I ran it as an administrator, even though I’m already logged in as one. Paranoia at every step is my new approach, since there have been no errors until I try to turn the silly thing on.

And the install went perfectly. I can see two folders, Hadoop and HadoopFeaturePackSetup. Last time there was a third folder, HadoopInstall or something like that. This doesn’t bode well. Let me check the services. Nope. I can’t see anything added. Crud. So, I try the PowerShell command anyway, just to see what I’ll get. It’s a new error, which is nice:

Join-Path : Cannot bind argument to parameter 'Path' because it is null.
At C:\hadoop\start-onebox.ps1:2 char:23
+ $start_cmd = Join-Path <<<< $env:ISOTOPEJS_HOME "bin/start_daemons.cmd";
+ CategoryInfo : InvalidData: (:) [Join-Path], ParameterBindingValidationException
+ FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,Microsoft.PowerShell.Commands.JoinPathCom

Crud. My assumption is, I’m doing something wrong. I just can’t figure out what. Lots of searching later, I think I may have found the issue. I’ve been trying to install this on my local server that I use for lots of my tests because it’s a physical box that I can access on the road. It’s also a domain controller. I finally tracked down this thread on Microsoft’s forums. Great. OK. Next step, setting up a virtual server. May as well make it Windows 2012 while I’m at it. More to come when I get past this hurdle. Oy.

Nov 28 2012

HDInsight: First Pass

I installed HDInsight with no errors. The install is so brain-dead, I couldn’t see how there could be errors. But when I fired up the service to start playing with it, that’s when I hit the errors. First up, I was getting logon errors for the Hadoop service itself. I tried resolving them manually, but I was just digging a hole. So, I uninstalled and reinstalled, figuring I had missed an error message somewhere along the line. After this install, I didn’t get logon errors. Instead I got internal errors. Some more investigation and it appears that the service was marked for deletion. The uninstall worked, but somehow wasn’t complete. Yay! So, I did another uninstall after rebooting the server. Install #4 coming up. And we’re back to logon errors. Specifically:

Starting IsotopeJS services
Starting isotopejs
System error 1069 has occurred.

The service did not start due to a logon failure.

Let’s check the logs, if any. Nothing in the c:\Hadoop directory. Checking the system logs to see if there’s anything there. Nothing I can see. No joy. I’ll go through it all again tonight.