Category: HDInsight

Mar 04 2013

Pro SQL Server 2012 Practices: Chapter 17

A9781430247708-3d_1I jumped at the opportunity to write a chapter for this book, Pro SQL Server 2012 Practices, because of all the fantastic people writing and the topics that were covered. I reviewed Chapter 12 a few weeks ago and I’ve been meaning to get this chapter, Big Data for the SQL Server DBA, read & reviewed but I’ve just been sidetracked. I finally noticed it on my Trello board, weeks out of date, and decided to finally get the job done.

Carlos Bossy (b|t) is a great guy and a part of my SQL Family. We run into each other all the time at SQL Saturday’s and other events. I’ve never listened to him present before, and now, after reading his chapter in the book, I’m very sorry. I’ll fix that at my first opportunity.

Carlos starts off nicely, getting a reasonable definition of “big data.” It’s not merely about size, although that has to play a part. He breaks it down to the three “Vs” volume, variety and velocity. It makes sense. Just having a big database is not the same thing as dealing with big data. He takes the time to differentiate between the structures of a relational system, that work just fine, and the needs of big data, which just don’t fit well in relational storage. It’s as clear a delineation as I’ve seen on this topic.

The big data management system that Carlos then focuses on is Hadoop. Which makes sense again. If you’re already working within the Microsoft stack, they’re moving on Hadoop with the HDInsight servers and stuff on Azure. So learning how the Hadoop File System (HDFS) and MapReduce work together to provide you with massive data movement and storage keeps you within the set of tools with which you’re already familiar (although there is a ton of learning to do here). Carlos covers MapReduce in quite a lot of detail because it really is the driving force behind Hadoop and how you’re going to get your big data into the system. He covers the topic by explaining in general terms how things will work, then walks you through an example with additional detail and then walks you through another example with a lot more detail. The accumulative effect works well. The core concepts generally start to stick.

I really like how Carlos addresses the technology from a business solution perspective. It’s not that any one technology is a glorious thing, let’s go use it. It’s that certain business problems require certain tools in order to solve them most efficiently. He makes an excellent case for DBAs understanding technologies like Hadoop in order to implement them where they are appropriate. It’s the kind of arguments I’ve always tried to support. The concepts for management of big data are both simple and really complex and Carlos attacks them both in order to educate. I’ve learned a lot from reading this chapter.

He then walks through different aspects of the type of work you’ll need to support as a DBA using a Hadoop server. There’s going to be some pretty major work around import/export of the data, into and out of Hadoop yes, but into and out of other types of storage, such as good old fashioned relational SQL Server, data warehouses, cubes, and Excel/Sharepoint. Carlos takes us through a detailed examination of the primary tool for this function within the HDInsight/Hadoop infrastructure in Azure, squoop. It’s a great way to learn how this stuff works.

Finally, Carlos goes over some pieces of the future of big data, again focusing primarily on the Microsoft/SQL Server stack and how it relates to Hadoop, but also SSRS and all the tools with which you’re already familiar.

Overall, it’s a great chapter. It’s very solid introduction to a topic that I’m just starting to really wrestle with myself, so I’m happy to have it as a reference.

Dec 03 2012

HDInsight, Finally

See this:

That’s right. The install worked. All I had to do was get a completely clean server set up. No domain controller. Now to get my learn on.

Microsoft has a web site with a number of introductory samples. I’ll start there and work through them. The very first example gets me set up with some data that it builds by running a Powershell script, importdata.ps1. But I’m not going to just blindly follow along. I want to see what the heck is happening so I can start understanding this stuff. By the way, thank you Microsoft for making the samples in PowerShell and not forcing me to relearn Python or something else. That would have been frustrating.

The script is really simple. It has two scenarios you can pass it, w3c or RunTraffic. They just change directory and run another PowerShell script, import.ps1, from two different directories. I’ll be the scripts are different. I’m running the w3c scenario, so let’s see what that script is doing.

Ah, now things are getting interesting. There are two functions, one for data generation which uses an executable to make up test data. The other a mechanism for calling Hadoop. Basically it uses two objects, System.Diagnostics.ProcessStartInfo and System.Diagnostics.Process. The ProcessStartInfo is for defining startup information for a process that you then define using the Process command. In this case it’s setting the location of hadoop:

$pinfo.FileName = $env:HADOOP_HOME + "\bin\hadoop.cmd";

Then it sets up arguments, if any. The actual calls to this from the code use a command, dfs, which has different settings -mkdir and -copyFromLocal. From what I can tell, it’s creating a storage location within Hadoop and then moving the data generated over. I’m good with all the scripts I can see except knowing where this dfs thing comes from.

Data load ran just fine:

Data loaded, time to test out a Map/Reduce job. Again there’s a powershell script included for running a simple job, so I check it out. First run, fails. Great. More stuff to try to troubleshoot in order to be able to see this work. This is not going to be easy.

Stepping through and running the scripts might not be the best way to learn this. So, I’m going to now start reading the Big Data Jumpstart Guide. I’ll post more as I learn it.