PASS Keynote Day #3: Dr. Dewitt

And we’re off. We opened with a video of people saying “Connect, Share, Learn” and “This, is Community”

Rob Farley & Buck Woody came out with a song about long running queries.

[8:20]Wayne Snyder has been working with the PASS organization since 1999. He spoke at the first PASS Summit and he’s been on the board forever. He has finally hit the point as immediate-past president where he has to leave the board. We’ve got a great little thank you for him from all sorts of people. Wayne is a magnificent guy, seriously. If you see him, thank him for his service.

[8:28] We have a new executive committee: Bill Graziano is President (WHOOP), Douglas McDowell is Executive Vice-President, Thomas LaRock is VP of Marketing, and finally, Rushabh Mehta is now the immediate past president.

SQL Rally Nordic is taking place in Sweden and has completely sold out. SQL Rally Dallas will be in May. We have tons of SQL Saturdays coming up.

Between now and Nov 15, you can get a registration, including 2 full days of pre-cons, for $1395.

[8:33] Dr. David Dewitt: Big Data, What’s the Big Deal?

I got to meet Dr. Dewitt earlier in the week. I’m very excited about this presentation.

He’s going to be presenting with a co-presenter, Rimma Nehme, who will be on stage helping out. His presentations are magnificent.

[8:38]Dr. Dewitt, despite being smarter than the whole room, is really funny. He’s opening up with some great slides and some good humor.

And then we’re off. We’re talking big data. Petabytes, typically housed on large clusters of low-cost commodity hardware. He’s also talking zettabytes. Uh, wow.

Why are things growing so much? More and more things are generating data. There are sensors, like phone location and others. There are web clicks and page views. Data has been determined to be too valuable to delete. The cost of storage has dropped.

[8:42] Managing “Big Data”. The old approach is to use a parallel database system, like eBay with 10PB on 256 nodes. The new approach is a NoSQL system, like Facebook with 20PB on 2700 nodes. Bing is 150PB on 40K nodes. Wow!

NoSQL is not meant to say that SQL is dead, but that there are things in addition to SQL.

Why do people love NoSQL? There is more data model flexibility. Relaxed consistency models, such as eventual consistency. Low upfront software costs. They never learned anything but C/Java in school, so maybe they’re not smart enough. And finally, time to insight.

[8:45] Time to insight. The idea is not to load data into a system first; instead of schema first, they just want to get at the data. No cleansing, no ETL, no load; analyze the data where it stands. Schema first vs. schema later.
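To make the schema-later idea concrete, here’s a tiny sketch (my own, with made-up JSON click records, not anything from the talk): the structure is only interpreted when you analyze the data, so there’s no load or ETL step, and nothing guarantees every record has every field.

```python
import json

# Made-up raw click-stream log: one JSON record per line, no schema declared
# anywhere up front. The "schema" only exists in the analysis code below.
raw_lines = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": 340}',
    '{"user": "a", "page": "/cart"}',             # a missing field is tolerated
]

# Schema-on-read: interpret each record at analysis time, right where it sits.
total_ms = sum(json.loads(line).get("ms", 0) for line in raw_lines)
print(total_ms)   # 460
```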

Major types of NoSQL systems: key/value stores like MongoDB or Cassandra usually have a data model such as JSON, records are sharded across the nodes in a cluster by hashing on a key, and retrieval is single-record. The other kind is Hadoop, a scalable, fault-tolerant framework for storing and processing massive data sets; there’s no data model, and records are stored in a file system. The first is NoSQL OLTP and the second is the NoSQL data warehouse.
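Here’s a rough sketch of the “shard by hashing on a key” idea (illustrative only; the node names and modulo scheme are mine, and real systems like Cassandra use consistent hashing rather than a simple modulo):

```python
import hashlib

# Toy cluster: four nodes, records assigned by hashing the key.
NODES = ["node0", "node1", "node2", "node3"]

def node_for_key(key: str) -> str:
    # Hash the key and use it to pick the node that owns this record.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Single-record retrieval only ever has to talk to one node.
print(node_for_key("customer:42"))
```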

[8:50] The new reality is that we’re really going to see two universes: structured and unstructured data, or relational systems and NoSQL systems. Relational database systems provide maturity, stability, and consistency. The NoSQL systems are all about flexibility.

Why is Dr. Dewitt talking about this? Because the world has changed. Relational database systems are no longer the only game in town. As SQL people we must accept this new reality and understand how best to deploy the technologies. This is not a paradigm shift. RDBMSs will continue to dominate transaction processing and all small to medium sized data warehouses. But many businesses will end up with data in both universes.

[8:52] Hadoop all started at Google. They needed to manage massive amounts of click stream data. It had to be scalable, fault tolerant, and easy to program against.

What does Hadoop offer? The ability to analyze massive amounts of data. It’s scalable, easy to program, and has low upfront costs. Think big data warehousing.

The stack is HDFS at the bottom, then MapReduce, then Hive & Pig; off to the side is Sqoop, and then there are other management parts.

[8:55] HDFS is the underpinning of the entire Hadoop ecosystem: a traditional hierarchical file system, written in Java so it’s highly portable.

Files are split into 64MB chunks and the blocks are spread around the cluster. Each block is stored as a separate file.

Disk placement: a replication factor is set. Assuming a factor of 3, it uses triple replication, so you can survive two failures.
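A toy sketch of the splitting-and-placement idea as I understood it (the 64MB block size and replication factor of 3 are from the talk; the round-robin placement is just for illustration, since real HDFS picks replicas with rack awareness):

```python
BLOCK_SIZE = 64 * 1024 * 1024      # 64MB blocks
REPLICATION = 3                    # triple replication: survive two failures
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def place_blocks(file_size_bytes: int):
    """Split a file into 64MB blocks and pick 3 datanodes for each block."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [DATANODES[(b + r) % len(DATANODES)]
                        for r in range(REPLICATION)]
    return placement

# A 200MB file becomes 4 blocks, each living on 3 different datanodes.
print(place_blocks(200 * 1024 * 1024))
```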

There is a namenode, one instance per cluster, which is a single point of failure. There’s a backup node, which backs up the namenode (I think), and then there’s a series of datanodes.

A giant file comes in, a bunch of blocks are created, and the namenode receives messages about the blocks. The namenode assigns them to appropriate datanodes, but the client does the writes; the namenode just balances and replicates.

[9:02] Reads go the other way: the namenode tells the client where the data is stored, and the client reads it back out from the datanodes directly.

Failures can occur through disk errors, datanode failures, switch/rack failures, namenode failures, data center failures.

Datanode failures are handled by the namenode, which always manages the datanodes, tracking what’s stored where and which datanodes are available and which are not. When a datanode fails, the namenode identifies which blocks were stored on that datanode and replicates them to other nodes. Further, it balances things out as datanodes come back online.
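A simplified model of that failure handling, as I understood it (the block map and node names are made up; real HDFS has a lot more going on):

```python
REPLICATION = 3

# The namenode's bookkeeping: which datanodes hold a replica of each block.
block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}

def handle_datanode_failure(dead_node, live_nodes):
    for block, holders in block_map.items():
        holders.discard(dead_node)
        # Re-replicate any block that has fallen below the replication factor.
        while len(holders) < REPLICATION:
            target = next(iter(live_nodes - holders))
            holders.add(target)

handle_datanode_failure("dn2", {"dn1", "dn3", "dn4", "dn5"})
print(block_map)   # every block is back up to 3 replicas
```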

[9:07] This means you get something that is highly scalable. There’s no use of mirroring or RAID, which reduces cost. It uses a single mechanism (triply replicated blocks) to deal with a variety of failure types rather than multiple different mechanisms. The negatives: block locations and record placement are invisible to higher-level software, which makes it impossible to employ many optimizations successfully used by parallel DB systems.

So to improve performance they use MapReduce. It takes a large problem and divides it into small sub-problems, performs the same function on all the sub-problems, and combines the output from all of them. The first step is the map, the last is the reduce.
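The canonical example is word count. Here’s a minimal single-process sketch of the idea in plain Python (no Hadoop involved, just the map / shuffle / reduce shape):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (key, value) pair for each word in the record.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Reduce: combine all the values that share a key.
    return (key, sum(values))

lines = ["PASS Summit keynote", "keynote day three", "PASS keynote"]

# Shuffle: group every emitted pair by key before reducing.
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

print([reduce_phase(k, v) for k, v in groups.items()])
# [('pass', 2), ('summit', 1), ('keynote', 3), ('day', 1), ('three', 1)]
```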

MapReduce is run by a master, the JobTracker, and a set of slaves, the TaskTrackers. The JobTracker watches for failures, etc.

It all works with HDFS: on each node there is a TaskTracker and a DataNode, and the JobTracker sits on the same server as the NameNode.

[9:15] We’re seeing data come out of the map step, but then you see that each mapper can only group the data the way it happens to be stored in its own block; the grouping across blocks has to happen later.

The reduce phase basically reads the output from each mapper to get the information into the reducers. Each reducer pulls its share of output from the mappers, and the reducer is the thing that applies the function that actually combines the data coming out of the mappers.
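As far as I can tell, the routing works by hash-partitioning the map output on the key (that’s Hadoop’s default partitioner, if I have it right), so every pair with a given key ends up at the same reducer no matter which mapper emitted it. A tiny sketch:

```python
NUM_REDUCERS = 4   # assumed, just for the sketch

def reducer_for(key):
    # Hash-partition on the key: same key, same reducer, regardless of mapper.
    return hash(key) % NUM_REDUCERS

map_output = [("keynote", 1), ("pass", 1), ("keynote", 1)]
for key, value in map_output:
    print(key, "-> reducer", reducer_for(key))
```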

Yeah, I’m starting to get a bit lost.

The actual number of map tasks, M, is generally much larger than the number of nodes used. This helps deal with data skew and failure. Skew with reducers is still a problem.

Failures: like HDFS, the MapReduce framework is fault tolerant; failed tasks just get rescheduled.

The beauty of this stuff is that it is highly fault tolerant, it’s relatively easy to write arbitrary distributed computations, and the MapReduce framework removes the burden of dealing with failures from the programmer.

The cons: the schema is embedded in application code, which means that sharing data between apps is really hard. Also, performance tuning is difficult.

Keeping up with this stuff as fast as I can. We’re drinking from a really big fire hose in here.

[9:24] Hive and Pig. MapReduce can’t really do joins. Developers can spend days writing apps to analyze data that we could handle with a single query in a relational system (although I know people who take days to write a T-SQL query). Declarative query languages are not going away. It’s still efficient for what it does.

Hive and HiveQL are the mechanism used to put a query language on top. Hive has tables, with richer column types than in SQL: you get the primitive types, but you also get stuff like associative arrays, lists, and structs. Hive tables can be partitioned. It’s still using HDFS files.

[9:28] All the files are stored in chunks. If there’s no filtering, a query will go against all the files. This thing could seriously thrash disks, especially when you consider the fact that the data is not relational at all.
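My understanding of why the partition filter matters, sketched with a made-up Hive-style directory layout (partitions stored as dt=... subdirectories): with a filter on the partition column, whole directories can be skipped; without one, every file gets read.

```python
# Toy partition layout: partition value -> the files inside that partition.
partitions = {
    "dt=2011-10-11": ["part-0000", "part-0001"],
    "dt=2011-10-12": ["part-0000"],
    "dt=2011-10-13": ["part-0000", "part-0001", "part-0002"],
}

def files_to_scan(filter_dt=None):
    if filter_dt is None:
        # No filter on the partition column: every file in every partition.
        return [f for files in partitions.values() for f in files]
    # Filtered: only the matching partition's files are touched.
    return partitions.get(f"dt={filter_dt}", [])

print(len(files_to_scan()))              # 6 files scanned
print(len(files_to_scan("2011-10-13")))  # 3 files scanned
```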

HiveQL optimization and execution: there are very few statistics. It uses simple heuristics, like pushing selections below joins, output of … something. The slide went by too quick.

PDW vs. Hive. Testing used 600GB from TPC benchmarks. On small data sets, for straightforward queries, it was pretty radically different. Then when you complicate things, Hive was about 4000 seconds, PDW was about 1000, and PDW-P was a factor of 10 faster still. That’s because of how parallel data systems can work.

Hive vs. PDW: partitioning the Hive tables provides no benefit since there is no way to control where HDFS places the blocks. It’s different for PDW.

We’re going to have to connect the two universes. Increasingly the data first lands in the unstructured universe. MapReduce is an excellent big data ETL tool. Sqoop provides a command-line load utility.

Some analyses are easier to do in a procedural language. Sqoop provides query capability to pull data from an RDBMS using SQL, but you can’t get good performance.

Some applications need data from both universes. The only option is to do the work in the unstructured universe, since unstructured data can’t go into the structured side, so Sqoop moves the structured data over there.

This means that there are some types of queries that are never going to perform well with this data.

[9:40] And I just got lost. Sqoop is really complicated. It basically moves data in and out of the two universes, and pulling a table out of the RDBMS ends up scanning the entire table (yes, scanning it) N+1 times.
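If I followed the argument, the N+1 comes from parallelizing the extract: one query to size up the table, then one slice query per map task, and if the RDBMS can only answer each slice with a full scan (an OFFSET-style split with no usable index), the same table gets scanned N+1 times. A rough reconstruction (illustrative SQL, not Sqoop’s actual statements):

```python
N = 4                      # number of map tasks (assumed)
rows = 1_000_000           # pretend this came back from COUNT(*): scan #1
chunk = rows // N

# Scans #2 .. #N+1: each mapper pulls its own slice of the table.
for i in range(N):
    print(f"SELECT * FROM orders LIMIT {chunk} OFFSET {i * chunk}")
```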

There has to be a better way!

Moving data is so 20th century. Why not build a database system that understands both universes? It can have the expressive power of a language like HiveQL. He’s trying to build an “Enterprise Data Manager,” a name his partner hates (it’s open to suggestions).

Dr. Dewitt asserts that SQL Server PDW just needs to understand unstructured data. It needs improved scalability.

Jenn McCown of Midnight DBA suggested TARDIS because it can move between universes. I like that. Let’s lobby.

Remember, this is not a paradigm shift. These things are designed to meet different problems. RDBMS-only or Hadoop-only is not going to be the default.

Send ideas for next year to dewitt@microsoft.com. We want this man to come back, guys.

This was another great presentation from Dr. Dewitt!

[9:48] And now for the Q&A. Bing “David Dewitt” and you can get a link to the PASS talks. The slide deck is available.

What features can we expect to see in SQL Server to manage a private cloud? He can’t answer.

What are the impacts of big data on the scientific community? Dr. Dewitt talks about how the Sloan Digital Sky Survey data was managed by Jim Gray (before he was lost at sea). They are working on building database systems for scientific data that allow for declarative languages. He says that the science community just doesn’t use anything but files. They’re trying.

Here are his slide decks

[9:55] Can you elaborate more on the importance of supporting Hadoop? He does believe that there are two universes, so Hadoop is out there and running next to SQL Server today. The world has spoken. The two things are being used. We need to embrace it. We should not bury our heads in the sand.

As a DBA working primarily with relational databases, what should I do to be better prepared for this new universe? Dr. Dewitt says download and play with the code. Get started.
