Where Can You Use Hadoop?

“Where can you use Hadoop?” isn’t an easy question to answer. An enterprising or creative person could probably figure out ways to replace the entire database infrastructure with various components of Hadoop. I’m sure it’s being done right now and I’m sure that someone is being incredibly successful with it.

Asking the question “where can I do XYZ” will inevitably lead to the answer “everywhere… if you’re creative!” There’s a better question that we can ask.

Where Should I Start Using Hadoop?

Let’s face it: Hadoop is something that you should start thinking about. Microsoft is clearly investing in Hadoop as part of its enterprise data warehouse products, and it has partnered with Hortonworks to bring Hadoop to Windows.

One of the most obvious places to implement Hadoop is for ETL processes. ETL jobs are typically difficult to tune – data is streamed from an OLTP data source, processed in memory, and then streamed to another data source. Tuning the process to run faster on a single machine requires specific skills – a good ETL expert knows T-SQL, SSIS, and more than a little bit of .NET. These are important skills for an ETL expert to have; but we don’t always need an expert to get the job done.

How Can I Start Using Hadoop?

What if you could make a process run four times faster by running it on four computers? This is the basic premise of Hadoop – workloads are made faster by splitting them across multiple workers. Just as SQL Server splits a query across multiple threads, Hadoop is able to parallelize across multiple computers and each computer may parallelize the work across multiple threads.
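To make the idea concrete, here is a minimal sketch in Python. It is not Hadoop code: a thread pool on one machine stands in for a set of worker computers, and the names split_into_chunks, process_chunk, and parallel_total are hypothetical helpers invented for this illustration. It only shows the shape of the approach: split the input, let each worker handle its own slice, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_workers):
    """Divide the input into roughly equal slices, one per worker."""
    size = max(1, (len(records) + n_workers - 1) // n_workers)
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """The per-worker job: here, total the amounts in one slice."""
    return sum(chunk)


def parallel_total(records, n_workers=4):
    """Fan the slices out to the workers, then combine the partial totals."""
    chunks = split_into_chunks(records, n_workers)
    # Assumption: threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_totals = list(pool.map(process_chunk, chunks))
    return sum(partial_totals)


# Hypothetical input: 100 order amounts.
order_amounts = list(range(1, 101))
total = parallel_total(order_amounts, n_workers=4)
```

In a real cluster the slices would be blocks of files spread across machines and the framework would handle scheduling and failures, but the split, process, and combine steps are the same.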

We can take advantage of Hadoop’s easy scale out without really changing our tools. There’s a tool called Hive – it sits on top of Hadoop and translates SQL into MapReduce jobs in the back end. Hive isn’t going to be useful for real time querying, but it gives us the ability to perform transformations on huge amounts of data using a familiar language. If we need custom functionality, we just track down an enterprising developer to write a custom function. Just like SQL Server, it’s easy to grab custom functions from another source, install them, and use them in queries.
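As a rough illustration of what “translating SQL into MapReduce” means, consider a query like SELECT region, SUM(amount) FROM sales GROUP BY region. The sketch below walks through the map, shuffle, and reduce steps for that query in plain Python. The sales table, its columns, and the function names are all assumptions made up for this example, and this is a conceptual model rather than the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows from a "sales" table: (region, amount)
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]


def map_phase(rows):
    """Emit one (key, value) pair per row, as each mapper would."""
    for region, amount in rows:
        yield region, amount


def shuffle_phase(pairs):
    """Group values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values: the SUM(amount) ... GROUP BY region part."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle_phase(map_phase(sales)))
```

The point of a tool like Hive is that we write only the one-line query; the work of expressing it as map and reduce steps, and running those steps across many machines, is done for us.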

Where Else Can Hadoop Help?

While ETL is an obvious place to start using Hadoop, there are other places where it can help. Just like SQL Server, Hadoop is a rich ecosystem – it’s more than a one-dimensional tool. Portions of Hadoop can be used to create a distributed file system, machine learning tools, data processing frameworks, and large-scale random read-write data stores. You can use Hadoop to scale your data needs in many different directions. The most important thing is to pick a single pain that you’re having – typically ETL or reporting – and experiment with using Hadoop to make things faster or operate at a much bigger scale.

Want to Know More?

If you’d like to know more, make sure you check out my video on Hadoop Revisited.

Interested in learning more about Hadoop? Check out our Introduction to Hadoop training class.

  1. At core Hadoop is a way to execute logic in the language of your choice over raw data on multiple machines. It’s all just files – no rows, no tables, no records, no tuples – what you do from there is entirely your responsibility.

    With all respect to Jeremiah, I’d suggest those interested in Hadoop should also read “What Hadoop Is. What Hadoop Isn’t” by Mark Madsen at http://www.insideanalysis.com/2012/12/what-hadoop-is-what-is-isnt/.

    A quote: “The Hadoop stack is a data processing platform. It combines elements of databases, data integration tools and parallel coding environments into a new and interesting mix.” “It combines data storage, retrieval and programming into a single highly scalable package.”

  2. Hadoop is a very good product, but it is not useful to a DBA or anyone else unless he or she is a TRUE Java developer. Hadoop is developed using 100% Java.

    So I would like to know and learn more about how Hadoop is being used currently. What I see is that SQL Server will be the front end for Hadoop, and you need a developer who can write good Java code to build the bridge between Hadoop and SQL Server.

    Thanks
    Jay-

    • Hi Jay,

      While some Hadoop development is done in Java, much of the day-to-day business data analysis happens in tools like Pig, Hive, and Impala. These tools make it possible for non-Java devs to pull meaningful data out of massive data systems and, very frequently, push it in a distilled form back into something like SQL Server or SSAS. Developers are typically only dropping down into Java when it’s absolutely necessary or to implement custom user defined functions.

      If you’re interested in seeing more of the Java side of development, I suggest you check out one of the many Apache hosted mailing lists (http://hadoop.apache.org/mailing_lists.html) to see how other people are working with Hadoop.

    • Huh, I’ve been trying HD-Insight the last few weeks, which is MSFT & Horton’s Hadoop distribution for Windows Server (local, not cloud based), and you can submit Map/Reduce jobs using C# rather than Java (I think you could do the same with the Azure based Hadoop distribution too).

      To be honest, I don’t think we should limit ourselves because we haven’t played with a particular technology before. Java (or generally any popular enough programming language) is based on sound programming principles that can easily be picked up… I’m having to learn Python these days for some social network data mining, and it is so awesome. Sometimes it’s good to step out of your comfort zone.

      My only beef with the MSFT local Hadoop installation is that it does not yet support Mahout, which is only supported in the Azure distribution.

    • FYI:
      Hadoop cannot be defined as a product; rather, Hadoop is an open-source framework/platform from which you can derive products. When it comes to Hadoop, it all belongs to big enterprise and is not really new. Companies like Google, Amazon, and Facebook have been using this for decades.

      Thanks

  3. This post comes across wrong to me.
    Hadoop is not an ETL tool; it’s a MapReduce stack for scatter-gather-aggregate scale-out of compute jobs. If you want to compare it to something on the MS stack, it is more akin to HPC than SSIS!

    • You’re absolutely correct!

      There is a lot that could be said in a much more specific way in this post. If I were talking to a non-Microsoft audience (or assumed that they had more exposure to the ecosystem), I’d be much more specific. Inside the Microsoft community, the entire ecosystem of Hadoopery is still frequently referred to as Hadoop. Sometimes you have to start vague and bring in specifics over time.

      Thanks much for contributing, though, the differentiation is key.
