
Microsoft has made announcements about bringing Hadoop to Windows, but what does that mean for SQL Server? In this talk, Jeremiah Peschka will cover the Hadoop ecosystem at a high level and discuss specific use cases for Hadoop. By the end of this talk, you should have a general idea of how different parts of Hadoop can be used to augment SQL Server’s rich functionality. This talk is for database professionals who are looking for ways to extend the capabilities of SQL Server with different pieces of the Hadoop ecosystem.

Hadoop Basics

Getting Started with Hadoop – Downloadable Virtual Machines

Data Warehousing with Hadoop and Hive

Hungry for more? Take a look at Hadoop Revisited – a survey of the Hadoop landscape at the close of 2012.

What is Hadoop?

Hadoop is a group of tools to help developers create bigger tools. More specifically, Hadoop is a basic set of tools that help developers create applications spread across multiple CPU cores on multiple servers – it’s parallelism taken to an extreme. Although Hadoop is a set of tools and libraries, there are a number of products that are lumped into the same bucket as Hadoop and, frequently, they’re all referred to as Hadoop. Instead of describing every piece of Hadoop in detail, I’m going to focus on the functionality that’s of the most interest to SQL Server professionals.
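
To make “parallelism taken to an extreme” a little more concrete, here’s a minimal sketch of the canonical MapReduce example – word count – written against Hadoop’s Java API. Hadoop runs a copy of the map function on every chunk of the input, across every server in the cluster, then groups the output by key and hands it to the reduce function:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: every node tokenizes its own chunk of the input in parallel.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit (word, 1)
      }
    }
  }
}

// Reduce phase: Hadoop groups the emitted pairs by word; each reducer
// sums the counts for the words it receives.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));
  }
}
```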

Data Warehousing in Hadoop

If you need to work with big data, Hadoop is becoming the _de facto_ answer. But once your data is in Hadoop, how do you query it?

If you need big data warehousing, look no further than Hive. Hive is a data warehouse built on top of Hadoop. Hive is a mature tool – it was developed at Facebook to handle their data warehousing needs. It’s best to think of Hive as an enterprise data warehouse (EDW): Hive can be used to research complex interactions across your company’s entire history of data, but in exchange for that power you have to accept that queries will return in minutes, not milliseconds. Unlike traditional EDWs, Hive is spread across tens, hundreds, or even thousands of commodity-grade servers.

Hive was designed to be easy for SQL professionals to use. Rather than write Java, developers write queries using HiveQL (based on ANSI SQL) and receive results as a table. As you’d expect from an EDW, Hive queries will take a long time to run; results are frequently pushed into tables to be consumed by reporting or business intelligence tools. It’s not uncommon to see Hive being used to pre-process data that will be pushed into a data mart or processed into a cube.
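
As a rough sketch of what this looks like from client code, the example below runs a HiveQL query over JDBC. It assumes a HiveServer2 endpoint, and the host name, page_views table, and column names are all illustrative rather than anything standard:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Assumes a HiveServer2 endpoint; swap in your own host and database.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-gateway:10000/default");
         Statement stmt = conn.createStatement()) {

      // HiveQL reads like ANSI SQL, but behind the scenes Hive compiles
      // this into MapReduce jobs that scan the whole table.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS views " +
          "FROM page_views " +
          "WHERE view_date >= '2012-01-01' " +
          "GROUP BY page " +
          "ORDER BY views DESC " +
          "LIMIT 20");

      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```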

While Hive can operate on large volumes of data, it’s not the most efficient tool: Impala seeks to overcome some of Hive’s limitations by making better use of CPU, memory, and disk resources. Impala operates more like SQL Server – data is cached in memory to improve query performance. Although Impala uses a separate query engine from Hive, it reads data that’s already in Hadoop, making it easy to query massive amounts of data without having to store your data twice.

Both Impala and Hive are great for businesses that need to query large amounts of data while avoiding the expense of massively parallel EDW solutions like Microsoft SQL Server PDW or Oracle Exadata. Hive is in a stable release cycle and, although Impala is still a beta product, many organizations are deploying one or both solutions to tackle their largest workloads.

Data Flow and ETL with Hadoop and Pig

SQL Server professionals are familiar with using SQL Server Integration Services (SSIS) to move data around their organization. SSIS provides a rich set of functionality for manipulating data, but it’s difficult to make SSIS operations run across multiple CPU cores, much less multiple servers.

Pig is a tool for creating parallel data workflows. Pig takes advantage of Hadoop’s core tools to provide rich functionality across huge amounts of data, and it makes it easy to perform step-by-step data manipulation over large data sources using a combination of different tools and functions. There are a number of good reasons to use Pig (parallel processing, sampling, and loose schema requirements), but it’s safe to say that Pig is a great tool for processing data with Hadoop.
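
As an illustration, here’s a sketch of a small Pig data flow driven from Java through Pig’s PigServer API. The input path and field layout are invented for the example:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ClickstreamRollup {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Each registered statement is one step in the data flow, roughly
    // analogous to a transformation in an SSIS data flow task, except
    // the work is spread across the whole cluster.
    pig.registerQuery(
        "clicks = LOAD '/data/clicks' USING PigStorage('\\t') "
        + "AS (user_id:chararray, url:chararray, ts:long);");
    pig.registerQuery("by_user = GROUP clicks BY user_id;");
    pig.registerQuery(
        "counts = FOREACH by_user GENERATE group AS user_id, "
        + "COUNT(clicks) AS click_count;");

    // Materialize the final relation back into HDFS; Pig only runs the
    // underlying MapReduce jobs when output is actually requested.
    pig.store("counts", "/data/click_counts");
  }
}
```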

Deep Analysis of Data

SQL Server professionals are used to having analytic insight readily available, either through SQL Server’s windowing functions or through SQL Server Analysis Services. Although Hadoop doesn’t natively provide tools for OLAP-style cubes or for windowing functions, it’s possible to gain insight from your data using Hadoop. Unfortunately, deep analytics are not Hadoop’s strong suit out of the box: teams looking to take advantage of large-scale data analytics will be doing a lot of the heavy lifting themselves.

Mahout is a set of libraries that can be used to distribute analytics around a cluster, but there are limitations to the flexibility and interactivity of Mahout. Developers looking for the ad hoc interactive capabilities of SQL Server Analysis Services (or even of a relational data warehouse) will be disappointed. Bulk computation can be performed disconnected from users, but Mahout and Hadoop don’t provide any kind of ad hoc querying capability.
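
To give a sense of what that heavy lifting looks like, here’s a sketch built on Mahout’s “Taste” collaborative filtering library, which computes user-based recommendations. The ratings file and user ID are placeholders:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ProductRecommender {
  public static void main(String[] args) throws Exception {
    // ratings.csv holds userID,itemID,rating rows; the file name and
    // user ID below are placeholders for the example.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Compute ten recommendations for user 42. This is batch-style
    // computation, not the interactive slicing you'd get from SSAS.
    List<RecommendedItem> items = recommender.recommend(42L, 10);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}
```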

Real Time Querying

So far, all of the use cases we’ve explored have been based on distributed batch processing and large-scale querying. Even though Impala is a vast performance improvement over Hive, Impala still responds in a matter of seconds to minutes – hardly fast enough for interactive querying. Databases are used for more than running massive reports, and this is where HBase comes into play.

HBase is a real-time, random-access, read-write database built on top of Hadoop. This isn’t a database like SQL Server with tables and joins; HBase is a NoSQL database loosely based on Google’s BigTable. There are tables and there are columns, but the schema isn’t as rigid as a relational database’s. Developers will be able to solve many problems with HBase, but there will be a bit of a learning curve as they learn the data model and update their data structures to work effectively with HBase. Data stored in HBase can even be queried through Hive or Impala, making it possible to combine transactional and reporting data in the same Hadoop cluster – the scale and redundancy of Hadoop make it easier to reduce load on any single system and avoid many of the problems associated with reporting from a transactional data source.
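
To show how different the programming model is from T-SQL, here’s a sketch of a single-row write and read using HBase’s Java client. The table name, column family, and row key are invented for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class OrderStore {
  public static void main(String[] args) throws Exception {
    // The 'orders' table, 'details' column family, and row key design
    // are all invented for this example.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "orders");

    // Write one cell: everything in HBase is raw bytes, and a put to a
    // single row is atomic.
    Put put = new Put(Bytes.toBytes("customer42#order1001"));
    put.add(Bytes.toBytes("details"), Bytes.toBytes("status"),
            Bytes.toBytes("shipped"));
    table.put(put);

    // Read it back by row key: a direct, random-access lookup rather
    // than the full scans that Hive and Impala queries perform.
    Get get = new Get(Bytes.toBytes("customer42#order1001"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("details"), Bytes.toBytes("status"))));

    table.close();
  }
}
```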

When Should You Use Hadoop?

Ultimately, you’re looking for an answer to the question “When should I use Hadoop?” It’s a difficult question to answer – Hadoop may make sense for part of a workload, or even for all of it. The best approach is to start by looking at your environment and asking questions like:

  • Can I keep my data on a single instance?
  • Can I keep my data on a single instance and do it cheaply?
  • Are my queries running fast enough?
  • Do I need complex, interactive, ad hoc analytics?
  • What type of latency is acceptable between data arrival, analysis, and queryability?

Understanding your workload is critical to determining if you’ll be able to use Hadoop to meet your needs. Having realistic expectations of Hadoop is equally critical. No part of Hadoop will solve all of the problems an organization is facing. Hadoop can mitigate some problems, but it presents a different set of challenges – being comfortable with the limitations of Hadoop will go a long way toward having a successful implementation.

 

  1. Awesome intro to Hadoop for the SQL community. I’m interested to see how Microsoft will integrate this with SQL Server.

  2. Hello my friend,
    Does Hadoop support CRUD and load balancing for SQL Server?

    HadoopServer1 HadoopServer2 … HadoopServerN

    Hadoop Provider (for SQL Server 2012)

    User1 User2 User3 … UserN (send CRUD commands)
    Does Hadoop support this kind of solution?

    • I believe that you’ve confused Hadoop with an ORM. Hadoop is a family of products based around the Hadoop Distributed File System (HDFS) and Hadoop MapReduce – a job scheduling and execution framework.

      For CRUD procedure creation, you should look into something like Entity Framework, nHibernate, or LLBLGen.

      For load balancing SQL Server, you should probably talk to an architect or consultant.

  3. Thanks.
    I need a solution for load balancing SQL Server across inserts, updates, deletes, and selects, with many users.
    Oracle has RAC.
    SQL Server: ? (Hadoop, or something else?)

  4. Thanks for the great presentation, Jeremiah.
    Do you mind sharing your thoughts about what types of queries will perform better on Hadoop as opposed to a relational database, and vice versa?

    Thanks,
    Matan

    • Hi Matan, great question. Honestly, the question is too complex for even a blog post comment, but I’ll give it a shot anyway.

      If you’re thinking about warehousing, any query that can easily be split among multiple nodes is likely to perform better on a Hadoop-based system. Likewise, anything that you can break down into a series of Map and Reduce phases will perform well.

      If you’re thinking about OLTP, things get tricky because there is no OLTP alternative in the Hadoop world that speaks a SQL dialect. Instead, you have to rearrange the way you query to make sense in HBase’s data model.

  5. Thanks for the information, Jeremiah.
    Could you please let me know if it is mandatory to know Java to learn Hadoop?
    I’m basically a database guy and have not worked with Java.

    Thanks,
    Abhilash

    • You don’t need to learn Java or a JVM language to use some features of Hadoop. Depending on how deep you want to go, though, you’re going to have to learn some kind of programming.
