Mar 04 2013

Pro SQL Server 2012 Practices: Chapter 17

A9781430247708-3d_1I jumped at the opportunity to write a chapter for this book, Pro SQL Server 2012 Practices, because of all the fantastic people writing and the topics that were covered. I reviewed Chapter 12 a few weeks ago and I’ve been meaning to get this chapter, Big Data for the SQL Server DBA, read & reviewed but I’ve just been sidetracked. I finally noticed it on my Trello board, weeks out of date, and decided to finally get the job done.

Carlos Bossy (b|t) is a great guy and a part of my SQL Family. We run into each other all the time at SQL Saturday’s and other events. I’ve never listened to him present before, and now, after reading his chapter in the book, I’m very sorry. I’ll fix that at my first opportunity.

Carlos starts off nicely, getting a reasonable definition of “big data.” It’s not merely about size, although that has to play a part. He breaks it down to the three “Vs” volume, variety and velocity. It makes sense. Just having a big database is not the same thing as dealing with big data. He takes the time to differentiate between the structures of a relational system, that work just fine, and the needs of big data, which just don’t fit well in relational storage. It’s as clear a delineation as I’ve seen on this topic.

The big data management system that Carlos then focuses on is Hadoop. Which makes sense again. If you’re already working within the Microsoft stack, they’re moving on Hadoop with the HDInsight servers and stuff on Azure. So learning how the Hadoop File System (HDFS) and MapReduce work together to provide you with massive data movement and storage keeps you within the set of tools with which you’re already familiar (although there is a ton of learning to do here). Carlos covers MapReduce in quite a lot of detail because it really is the driving force behind Hadoop and how you’re going to get your big data into the system. He covers the topic by explaining in general terms how things will work, then walks you through an example with additional detail and then walks you through another example with a lot more detail. The accumulative effect works well. The core concepts generally start to stick.

I really like how Carlos addresses the technology from a business solution perspective. It’s not that any one technology is a glorious thing, let’s go use it. It’s that certain business problems require certain tools in order to solve them most efficiently. He makes an excellent case for DBAs understanding technologies like Hadoop in order to implement them where they are appropriate. It’s the kind of arguments I’ve always tried to support. The concepts for management of big data are both simple and really complex and Carlos attacks them both in order to educate. I’ve learned a lot from reading this chapter.

He then walks through different aspects of the type of work you’ll need to support as a DBA using a Hadoop server. There’s going to be some pretty major work around import/export of the data, into and out of Hadoop yes, but into and out of other types of storage, such as good old fashioned relational SQL Server, data warehouses, cubes, and Excel/Sharepoint. Carlos takes us through a detailed examination of the primary tool for this function within the HDInsight/Hadoop infrastructure in Azure, squoop. It’s a great way to learn how this stuff works.

Finally, Carlos goes over some pieces of the future of big data, again focusing primarily on the Microsoft/SQL Server stack and how it relates to Hadoop, but also SSRS and all the tools with which you’re already familiar.

Overall, it’s a great chapter. It’s very solid introduction to a topic that I’m just starting to really wrestle with myself, so I’m happy to have it as a reference.

Dec 03 2012

HDInsight, Finally

See this:

That’s right. The install worked. All I had to do was get a completely clean server set up. No domain controller. Now to get my learn on.

Microsoft has a web site with a number of introductory samples. I’ll start there and work through them. The very first example gets me set up with some data that it builds by running a Powershell script, importdata.ps1. But I’m not going to just blindly follow along. I want to see what the heck is happening so I can start understanding this stuff. By the way, thank you Microsoft for making the samples in PowerShell and not forcing me to relearn Python or something else. That would have been frustrating.

The script is really simple. It has two scenarios you can pass it, w3c or RunTraffic. They just change directory and run another PowerShell script, import.ps1, from two different directories. I’ll be the scripts are different. I’m running the w3c scenario, so let’s see what that script is doing.

Ah, now things are getting interesting. There are two functions, one for data generation which uses an executable to make up test data. The other a mechanism for calling Hadoop. Basically it uses two objects, System.Diagnostics.ProcessStartInfo and System.Diagnostics.Process. The ProcessStartInfo is for defining startup information for a process that you then define using the Process command. In this case it’s setting the location of hadoop:

$pinfo.FileName = $env:HADOOP_HOME + "\bin\hadoop.cmd";

Then it sets up arguments, if any. The actual calls to this from the code use a command, dfs, which has different settings -mkdir and -copyFromLocal. From what I can tell, it’s creating a storage location within Hadoop and then moving the data generated over. I’m good with all the scripts I can see except knowing where this dfs thing comes from.

Data load ran just fine:

Data loaded, time to test out a Map/Reduce job. Again there’s a powershell script included for running a simple job, so I check it out. First run, fails. Great. More stuff to try to troubleshoot in order to be able to see this work. This is not going to be easy.

Stepping through and running the scripts might not be the best way to learn this. So, I’m going to now start reading the Big Data Jumpstart Guide. I’ll post more as I learn it.


Nov 29 2012

HDInsight: Trying again

And I thought this would be so easy.

Latest attempt. I just did an uninstall of HDInsight. Then I found all the Hadoop folders on the drive and removed them. I validated that there were no services left running or anything else that might interfere with a fresh install. Then, I rebooted the server (have you tried turning it off and on again?). After all that, I tried rerunning the HDInsight Community Preview. This time I ran it as an administrator on the system, even though I already am logged in as one. Paranoia at every step is my new approach since there have been no errors until I try to turn the silly thing on.

And the install went perfectly. I can see two folders, Hadoop and HadoopFeaturePackSetup. Last time there was a third folder HadoopInstall or something like that. This doesn’t bode well. Let me check the services. Nope. I can’t see anything added. Crud. So, I try the powershell command anyway, just to see what I’ll get. It’s a new error, which is nice:

Join-Path : Cannot bind argument to parameter ‘Path’ because it is null.
At C:\hadoop\start-onebox.ps1:2 char:23
+ $start_cmd = Join-Path <<<< $env:ISOTOPEJS_HOME “bin/start_daemons.cmd”;
+ CategoryInfo : InvalidData: (:) [Join-Path], ParameterBindingValidationException
+ FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,Microsoft.PowerShell.Commands.JoinPathCom

Crud. My assumption is, I’m doing something wrong. I just can’t figure out what. Lots of searching later, I think I may have found the issue. I’ve been trying to install this on my local server that I use for lots of my tests because it’s a physical box that I can access on the road. It’s also a domain controller. I finally tracked down this thread on Microsoft’s forums. Great. OK. Next step, setting up a virtual server. May as well make it Windows 2012 while I’m at it. More to come when I get past this hurdle. Oy.

Nov 28 2012

HDInsight: First Pass

I installed HDInsight with no errors. The install is so brain-dead, I couldn’t see how there could be errors. But, I decided to fire up the service to start playing with it. That’s when I hit the errors. First up, I was getting logon errors for the Hadoop service itself. I tried resolving them manually, but was digging a hole. So, I uninstalled, and reinstalled, figuring I had missed an error message somewhere along the line. After this install, I didn’t get logon errors. Instead I got internal errors. Some more investigation and it appears that the service was marked for deletion. The uninstall worked, but somehow wasn’t complete. Yay! So, I did another uninstall after rebooting the server. Install #4 coming up. ¬†And we’re back to logon errors. Specifically:

Starting IsotopeJS services
Starting isotopejs
System error 1069 has occurred.

The service did not start due to a logon failure.

Let’s check the logs, if any. Nothing in the c:\Hadoop directory. Checking the system logs to see if there’s anything there. Nothing I can see. No joy. I’ll go through it all again tonight.

Nov 09 2012

PASS Summit Day 3: Dr. David Dewitt

Two quick points, I’m putting this blog together using the Surface.. ooh… and this isn’t a keynote, but a spotlight session at the Summit. Still, I thought I would live blog my thoughts because I’ve done it for every time Dr. Dewitt has spoken at the Summit.

Right off, he has a slide with a little brain character representing himself.

But, we’re talking PolyBase, and futures. This is basically a way to combine hadoop unstructured nosql data with structured storage within SQL Server. Mostly this is within the new Parallel Datawarehouse. But it’s coming to all of SQL Server, so we need to learn this. The information ties directly back to what was presented at yesterday’s keynote.

HDFS is the file system. On top of that a framework for executing distributed fault-tolerant algorithms. Hive & Pig are the SQL languages. Sqooop is the package for moving data and Dr Dewitt says it’s awful and he’s going to tell us why.

HDFS was based on a google file system. It supports 1000s of nodes and it assumes hardware failure. It’s aimed at small numbers of large files. Write once, read multiple times. The limitations on it are caused by the replication of the files which makes querying the information from a datawarehouse more difficult. He covers all the types of nodes that manage HDFS.

MapReduce is used as a framework for accessing the data. It splits the big problem into several small problems. It puts the work out into the nodes. That’s Map, Then the partial results from all the nodes is combined back together through Reduce. MapReduce uses a master, JobTracker and slaves, multiple TaskTrackers.

Hive, a datawarehouse solution for Hadoop. Supports SQL-like queries.It has somewhat performant queries. By somewhat he says that the PDW is 10 times faster.

Sqoop is the library and framework for moving data between HDFS and a relational DBMS. It seriealizes access to hadoop. That’s the purpose of PolyBase to get parallel execution access all the Hadoop hdfs. Sqoop breaks up a query through Map process. Then Sqooop runs two queries a count, and then reworks the query into a pretty scary query including an ORDER BY statement. This causes multiple scans against the tables.

Dr. Dewitt talks through the choices for figuring out how to put together the two data sets, structured and unstructured. The approach taken by Polybase is to work directly into HDFS, ignoring where the nodes are stored. Because it’s all going through their own code, they’re also setting up to text and other data streams.

They’re parallelizing access to HDFS and supporting multiple file types. Further, putting “structure” on “unstructured data”

By the way, I’m trying to capture some of this information, but I have to pay attention. This is great stuff.

How the DMS,the stuff used by Microsoft to manage the jump between HDFS and SQL Server is just flat out complicated. But the goal was to address the issues above and it does it.

He’s showing the direction that they’re heading in. You can create nodes and objects within the nodes through sql-like syntax. Same thing with the queries. They’ll be using the PDW optimizer. Phase 2 modifies the methods used.

I’m frankly having a little trouble keeping up.

It’s pretty clear that the PDW in combination with the HDFS allows for throwing lots and lots of machines at the problem. If I was in the situation of needing to collect & process seriously huge data, I’d be checking this out. The concepts are to use MapReduce directly, but without requiring the user to do that work, but instead using TSQL. It’s seriously slick.

By the way, this is also making yesterday’s keynote more exciting. That did get a bad rap yesterday, but I’m convinced it was a great presentation spoiled by some weak presentation skills.

All the work in Phase 1 is done on PDW. Phase 2 moves the work, optionally, to HDFS directly, but still allows for that to be through a query.

Dr. Dewitt’s explanation of how the queries are moved in and out of PDW and HDFS are almost understandable, not because he[s not explaining it well, but because I’m not understanding it well. But seeing how the structures are logically handling the information does make me more comfortable with what’s going on over there in HDFS.

I’m starting to wonder if I’m making buggy whips and this is an automobile driving by. The problem is, how on earth do you get your hands on PDW to start learning this?