Where Can You Use Hadoop?
“Where can you use Hadoop?” isn’t an easy question to answer. An enterprising or creative person could probably figure out ways to replace the entire database infrastructure with various components of Hadoop. I’m sure it’s being done right now and I’m sure that someone is being incredibly successful with it.
Asking the question “where can I do XYZ” will inevitably lead to the answer “everywhere… if you’re creative!” There’s a better question that we can ask.
Where Should I Start Using Hadoop?
Let’s face it: Hadoop is something that you should start thinking about. Microsoft are clearly investing Hadoop as part of their enterprise data warehouse products. Microsoft has partnered with Hortonworks to bring Hadoop to Windows.
One of the most obvious places to implement Hadoop is for ETL processes. ETL jobs are typically difficult to tune – data is streamed from an OLTP data source, processed in memory, and then streamed to another data source. Tuning the process to run faster on a single machine requires specific skills – a good ETL expert knows T-SQL, SSIS, and more than a little bit of .NET. These are important skills for an ETL expert to have; but we don’t always need an expert to get the job done.
How Can I Start Using Hadoop?
What if you could make a process run four times faster by running it on four computers? This is the basic premise of Hadoop – workloads are made faster by splitting them across multiple workers. Just as SQL Server splits a query across multiple threads, Hadoop is able to parallelize across multiple computers and each computer may parallelize the work across multiple threads.
We can take advantage of Hadoop’s easy scale out without really changing our tools. There’s a tool called Hive – it sits on top of Hadoop and translates SQL into MapReduce jobs in the back end. Hive isn’t going to be useful for real time querying, but it it gives us the ability to perform translations on huge amounts of data using a familiar language. If we need custom functionality, we just track down an enterprising developer to write a custom function. Just like SQL Server, it’s easy to grab custom functions from another source, install them, and use them in queries.
Where Else Can Hadoop Help?
While ETL is an obvious place to start using Hadoop, there are other places where we can start using Hadoop. Just like SQL Server, Hadoop is a rich ecosystem – it’s more than a one dimensional tool. Portions of Hadoop can be used to create a distributed file system, machine learning tools, data processing frameworks, and large scale random read-write data. You can use Hadoop to scale your data needs in many different directions. The most important thing is to pick a single pain that you’re having – typically ETL or reporting – and experiment with using Hadoop to make things faster or operate at a much bigger scale.
Want to Know More?
If you’d like to more, make sure you check out my video on Hadoop Revisited.