There’s a lot of buzz around Hadoop. If you’re like most people, you’ve looked into Hadoop and found a bewildering array of products, terms, and technology surrounding Hadoop – Hive, Pig, HDFS, MapReduce, HBase, co-processors, ZooKeeper, etc. Knowing where to start can be daunting. Trying to make sense of the Hadoop world is certainly possible, but takes a lot of time. Thankfully, I’ve gone ahead and done the work for you.
This is an update of last year’s presentation Hadoop Basics for DBAs.
What is Hadoop?
Hadoop is a group of tools to help developers create bigger tools. More specifically, Hadoop is a basic set of tools that help developers create applications spread across multiple CPU cores on multiple servers – it’s parallelism taken to an extreme. Although Hadoop is a set of tools and libraries, there are a number of products that are lumped into the same bucket as Hadoop and, frequently, they’re all referred to as Hadoop. Instead of describing every piece of Hadoop in detail, I’m going to focus on the functionality that’s of the most interest to SQL Server professionals.
Data Warehousing in Hadoop
If you need to work with big data, Hadoop is becoming the _de facto_ answer. But once your data is in Hadoop, how do you query it?
If you need big data warehousing, look no further than Hive. Hive is a data warehouse built on top of Hadoop. Hive is a mature tool – it was developed at Facebook to handle their data warehouse needs. It’s best to think of Hive as an enterprise data warehouse (EDW) – Hive can be used to research complex interactions across your company’s entire history of data, in exchange for that power you have to understand that queries will return in minutes. Unlike traditional EDWs, Hive is spread across tens, hundreds, or even thousands of commodity grade servers.
Hive was designed to be easy for SQL professionals to use. Rather than write Java, developers write queries using HiveQL (based on ANSI SQL) and receive results as a table. As you’d expect from an EDW, Hive queries will take a long time to run; results are frequently pushed into tables to be consumed by reporting or business intelligence tools. It’s not uncommon to see Hive being used to pre-process data that will be pushed into a data mart or processed into a cube.
While Hive can operate on large volumes of data, it’s not the most efficient tool: Impala seeks to overcome some of the limitations of Hive by making better use of CPU, memory, and disk resources. Impala operates more like SQL Server – data is cached in memory to improve query performance. Although Impala uses a separate query engine than hive, it uses data that’s already in Hadoop, making it easy to query massive amounts of data without having to store your data twice.
Both Impala and Hive are great for businesses querying of amounts of data while avoiding expense of massively parallel EDW solutions like Microsoft SQL Server PDW or Oracle Exadata. Hive is in a stable release cycle and, although Impala is still a beta product, many organizations are deploying one or both solutions to tackle their largest workloads.
SQL Server professionals are familiar with using SQL Server Integration Services (SSIS) to move data around their organization. SSIS provides a rich set of functionality for manipulating, but it’s difficult to make SSIS operations run across multiple CPU cores, much less multiple servers.
Pig is a tool for creating parallel data workflows. Pig takes advantage of the Hadoop tools to provide rich functionality across huge amounts of data. Pig makes it easy to perform step-by-step data manipulation over large data sources using a combination of different tools and functionality. There are a number of great reasons to use Pig (parallel processing, sampling, and loose schema requirements), but it’s safe to say that Pig is a great tool for processing data with Hadoop.
Deep Analysis of Data
SQL Server professionals are used to having analytic insight available, either through SQL Server’s windowing functions or through SQL Server Analysis Services. Although Hadoop doesn’t natively provide tools for OLAP style cubes or for windowing functions, it’s possible to gain insight from your data using Hadoop. Unfortunately, deep analytics are not Hadoop’s strong suit out of the box. Teams looking to take advantage of large scale data analytics will be doing a lot of heavy lifting themselves.
Mahout is a set of libraries that can be used to distribute analytics around a cluster, but there are limitations to the flexibility and interactivity of Mahout. Developers looking for the ad hoc interactive capabilities of SQL Server Analysis Services (or even of a relational data warehouse) will be disappointed. Bulk computation can be performed disconnected from users, but Mahout and Hadoop don’t provide any kind of ad hoc querying capability.
Real Time Querying
So far, all of the use cases we’ve explored have been based on distributed batch processes and large scale querying. Even though Impala is a vast performance improvement over Hive, Impala is still responding in a matter of several seconds to several minutes – hardly fast enough for interactive querying. Databases are used for more than running massive reports, and this is where HBase comes in to play.
HBase is a real time, random access, read-write database built on top of Hadoop. This isn’t a database like SQL Server with tables and joins; HBase is a NoSQL database that’s loosely based on Google’s BigTable database. There are tables, there are columns, but the schema isn’t as rigid as a relational database. Developers will be able to solve many problems with HBase, but there will be a bit of a learning curve as they understand the data model and update their data structures to work effectively with HBase. Data stored in HBase can even be queried through Hive or Impala making it possible to combine transactional and reporting data in the same Hadoop cluster – the scale and redundancy of Hadoop make it easier to reduce load on any single system and avoid many problems associated with reporting from a transactional data source.
When Should You Use Hadoop?
Ultimately, you’re looking for an answer to the question “When should I use Hadoop?” This is a difficult question to answer. Hadoop may make sense for part of a workload, or even for all of it. The best way is to start by looking at your environment and asking questions like:
- Can I keep my data on a single instance?
- Can I keep my data on a single instance and doing it cheaply?
- Are my queries running fast enough?
- Do I need complex, interactive, ad hoc analytics?
- What type of latency is acceptable between data arrival, analysis, and queryability?
Understanding your workload is critical to determining if you’ll be able to use Hadoop to meet your needs. Having realistic expectations of Hadoop is equally critical. No part of Hadoop will solve all of the problems an organization is facing. Hadoop can mitigate some problems, but it presents a different set of challenges – being comfortable with the limitations of Hadoop will go a long way toward having a successful implementation.