Blog

I’m a Microsoft SQL Server MVP – I like to talk about SQL Server a lot. But, as Brent loves to point out, I really like data; I’m open to alternative database lifestyles like PostgreSQL or NoSQL when they solve a business problem. And, frankly, I like some of these databases so much that I’m using them to build stuff for clients; I went so far as to become a Cloudera Certified Developer for Apache Hadoop this week.

Which MVP has two thumbs and a Hadoop certification?

This guy has two thumbs and an obsession with Hadoop.

How I’m Using Hadoop and Hive

“What kind of information gold mine are we sitting on?” That’s the question one of our clients was asking themselves earlier this year. The client had been tracking user’s search parameters for several years. Over time the data grew to where it was impossible to query the search logs without bringing the line of business application to its knees. Faced with the prospect of buying a second SQL Server for analytics, they were considering trimming data out of the logging database.

When I sat down with the client, they said “We want to get a better understanding of how users are interacting with the site, the types of searches being performed, and uncover richer information around product pricing. We just can’t answer these questions right now.” I talked through different options like a relational data warehouse or SQL Server Analysis Services before we settled on using Hive hosted in Amazon’s Elastic MapReduce (EMR). Using Hive hosted in Elastic MapReduce lets the business meet their goals while minimizing costs – the entire Hive cluster is turned off once data processing is done.

Money is important to businesses – everyone wants more of it and nobody wants to spend any of it. When faced with the idea of buying a second server, a second SQL Server license, and a second set of really fast disks, the client balked. By using Hive hosted on EMR we are able to run the data warehouse on demand and only pay for the resources used – this keeps costs under $200 per month.

How I Approach An Engagement

I love new technology, but that doesn’t mean I view it as a cure all. As I worked with the client, we worked together to understanding the data and the business’s questions before proposing a solution. One of the most important parts of our conversation was focusing the scope of questions into different buckets – the majority of the questions were traditional data warehouse queries.

When we began the process, we used a list of questions to kick off our investigation.

  • What are the current problems you have querying this data?
  • Just how much data are we talking about?
  • What types of queries do you need to answer?
  • How does this data interact with the rest of your data?
  • How will this data be consumed?
  • What does your team’s skill set look like?

Once we went through the list of questions, I took the client’s requirements and technical experience and used that to find the best fit product for the business. In this case, the solution was Hive running on top of Elastic MapReduce. I discussed the pros and cons of the approach and once I had the go ahead on a technology choice, I started building a prototype of the data warehouse that the business could continue to build on using their existing querying skills, without having to learn new technologies and platforms.

How I Can Help You

In this case, I was able to help a business get started on the Hadoop platform using Hive. If your company is like most companies, you’re probably asking yourself questions like “Is Hadoop or Hive right for us?”, “How could we get started with this project?”, or “How would Hadoop or Hive fit into our current environment?” This is where I can help out – I will work with your team and create a plan that meets your goals, works with the existing skills and technology that you have on hand, and create a high level road map. I can even help you prototype your first system using less than $200 in computing time in Amazon, no servers required. Contact me to set up a time to talk.

↑ Back to top
  1. What query tool did the client use against your solution?

    • Great question! We will be loading the processed data back into SQL Server once the data has been through Hive. Since we are able to process more data concurrently (and without fear of locking), we’re able to materialize a higher number of queries; this alleviates the biggest problem that the data set is too large to query effectively on the existing system. Instead the client will be able to use tools they’re already familiar with to work with the data.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

css.php