Company Overview

Lands of America is the largest rural listing service in the nation. The network specializes in land for sale, including farms, ranches, mountain property, lake houses, river homes, beachfront homes, country homes, and residential homes in smaller towns across the country. These properties have many diverse uses, including recreational and agricultural activities such as hunting, fishing, camping, backpacking, horseback riding, four-wheeling, grazing cattle, gardening, vineyards, cropland, and raising horses and other livestock.

Business Challenges

Lands of America (LoA) has been collecting an immense amount of data about visitor usage and search patterns for several years, using Microsoft SQL Server as its data storage solution. LoA is happy using SQL Server for OLTP workloads, but with nearly 1 billion rows of data, previous attempts at combining OLTP and analytical queries on the same SQL Server instances caused poor reporting performance. LoA's goal was to provide this data to several thousand clients, so the solution needed to be optimized for their use.

The real-time requirements for data analysis are relaxed: a 24-hour lag between data collection and analysis delivery is acceptable for the business users. Many of the analysis questions took the form of "What is the average price range users are searching for in these 7 counties over the last 6 months?" This presented a problem: how could the business provide a product reporting on user trends without sacrificing core application performance? The issue was that LoA's production SQL Server was unable to answer these questions while still serving OLTP data.
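A question like the one above maps naturally onto a HiveQL aggregation. Here is a minimal sketch of what such a query could look like — the table name, column names, and county values are all hypothetical, since the actual schema was not published:

```sql
-- Hypothetical schema: user_searches(county, price_min, price_max, search_date).
-- County names are placeholders, not the actual counties analyzed.
SELECT
    county,
    AVG(price_min) AS avg_min_price,
    AVG(price_max) AS avg_max_price,
    COUNT(*)       AS search_count
FROM user_searches
WHERE county IN ('County1', 'County2', 'County3', 'County4',
                 'County5', 'County6', 'County7')       -- the 7 counties of interest
  AND search_date >= DATE_SUB(CURRENT_DATE, 180)        -- roughly the last 6 months
GROUP BY county;
```

Because the 24-hour lag is acceptable, a query like this can run as a scheduled batch job rather than interactively.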

LoA’s team was faced with two choices: they could purchase a second server used solely for analytical querying or they could evaluate other options. During discussions with the company’s development team, I reviewed solutions that would let LoA use their historical data, help the business make better decisions, and move the data processing load outside of SQL Server.

Working closely with LoA, I designed and implemented a solution using Apache Hive hosted in Amazon Elastic MapReduce (EMR) – EMR delivers managed Hadoop and Hive services, low-cost storage, and flexible computing resources. LoA was up and running with EMR and Hive in just a few weeks.

Use Case Description

LoA already leverages components of Amazon Web Services in conjunction with Microsoft SQL Server, so extending their usage to Elastic MapReduce and Hive was an easy addition. Long-term data storage is offloaded from SQL Server to Amazon S3 to lower storage costs compared to traditional storage options. S3 stores the detailed source records exported from SQL Server, and the data stored in S3 is accessed through Hive.

Hive is used as a separate data processing system. User search and activity data is aggregated along several key measures through Hive. The aggregated data is stored in S3 before being imported into an on-premise SQL Server for interactive querying and reporting.
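A common pattern for this kind of pipeline is a Hive external table layered over the S3 exports, with the aggregates written to a second S3 location for re-import into SQL Server. A sketch under those assumptions — every table name, column, and S3 path below is illustrative, not LoA's actual setup:

```sql
-- External table over the flat files exported from SQL Server.
-- Names and paths are hypothetical.
CREATE EXTERNAL TABLE raw_searches (
    search_id   BIGINT,
    user_id     BIGINT,
    county      STRING,
    price_min   DECIMAL(12,2),
    price_max   DECIMAL(12,2),
    search_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://example-bucket/exports/searches/';

-- Aggregate along the key measures and land the result back in S3,
-- ready to be re-imported into the reporting SQL Server.
INSERT OVERWRITE DIRECTORY 's3://example-bucket/aggregates/searches_by_county/'
SELECT county, search_date, AVG(price_min), AVG(price_max), COUNT(*)
FROM raw_searches
GROUP BY county, search_date;
```

Because the table is external, dropping it in Hive never deletes the underlying S3 files — the detailed source records stay intact as the long-term archive.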


Multi-dimensional search data provides many opportunities for complex analysis. This analysis is typically both CPU and disk intensive – it's difficult to provide effective indexing techniques for large analytic queries. Through Hive's ability to conduct large-scale analysis, LoA is able to uncover trends that would otherwise remain hidden in their data. Other options exist to perform this analysis, but they carry a significant hardware and licensing cost – the Microsoft SQL Server Fast Track Data Warehouse, for example. By utilizing commodity cloud computing resources and Apache Hive, LoA is able to gain insight across their collected data without a significant investment of capital – resources are consumed and paid for on demand.

By using Hive as the definitive store of historical data, LoA is able to reduce its local storage requirements. Older historical data can be removed from Microsoft SQL Server as it is loaded into Hive.
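The purge on the SQL Server side can be as simple as a batched, windowed delete, run only after the corresponding rows are confirmed in Hive/S3. A hedged T-SQL sketch with hypothetical table and column names:

```sql
-- Hypothetical names; the actual schema was not published.
-- Delete in batches to keep the transaction log manageable, and run this
-- only after the export has been verified in Hive/S3.
DELETE TOP (10000)
FROM dbo.UserSearches
WHERE SearchDate < DATEADD(MONTH, -6, GETDATE());
```

Repeating the batch until zero rows are affected keeps each transaction small and avoids long blocking on the production OLTP tables.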


Jeremiah Peschka
When I’m not working with databases, you’ll find me at a food truck in Portland, Oregon, or at conferences such as DevLink, Stir Trek, and OSCON. My sessions have been highly rated and I pride myself on their quality.
  1. Jeremiah, can you share your EMR cluster size, its performance metrics (a Hadoop TeraSort with a 100M-record/10 GB data set will do), and the economics of the setup (monthly spend for S3 storage, processing, and data movement to and from EMR)?

    • We never performed a TeraSort benchmark against the cluster – the only important benchmark for Lands of America was providing analysis and business value that they couldn’t achieve within their current SQL Server setup. Our goal was to keep costs under $200 a month and we succeeded admirably – some last-minute changes improved performance by 350%, giving us plenty of headroom for processing data more frequently or for processing more data at once. If you want more details, I cover the setup in my session at SQLintersection: Reporting in Production: A Hadoop Case Study.

  2. The link gives a 404 error. What was the processing goal of the MapReduce cluster compared to SQL Server? The reason I am asking is to get a better understanding of the use cases for EMR.

    • It looks like the SQLIntersection/DevIntersection site is having issues at the moment.

      We chose EMR to eliminate up-front physical equipment costs while performing analysis on heavily normalized data. Since cost was a factor, we didn’t want to stand up a persistent Hadoop cluster; we used Elastic MapReduce to provide on-demand Hadoop and Hive resources.

      Our primary use case was cost savings – keeping the cluster completely under our control means cost can be easily quantified in relation to the analysis. If Feature XYZ is added and execution time increases, it’s relatively simple to determine whether Feature XYZ’s cost to the business is greater or less than its benefit.

      Other people are using EMR to reduce management overhead or to run multiple Hadoop clusters on the same core data set without having to determine how to copy data between large production clusters.

  3. Jeremiah – this is a very interesting case study. I like your approach of keeping everything quantifiable for the business…(alluded to in your last comment as well as the Devs vs. DBAs presentation in Atlanta earlier this month)…

    It seems in this case, the business had well-defined reporting needs (“activity is aggregated along several key measures”)…if that wasn’t the case, and the business wanted more self-service/slice’n’dice capabilities, how might that have affected your recommendations?

    Also, can you briefly comment on the data movement from the on-premise OLTP db to EMR? Was it as simple as exporting the data to flat files and moving them up to EMR? And now that the detailed data is archived in EMR, is it safe to assume that it has been removed from the on-premise OLTP db?

    Sad that I won’t be able to make Dev/SQL Intersection 🙁

    • If the business has more self-service requirements, this would have been structured very differently. It’s possible to do self-service BI at scale using the Hadoop stack, but there are enough choices available that covering them in any detail would be worthy of at least a single blog post. Some of these are Shark/Spark, Impala, DataCube, and Lattice. The choice you make is going to depend on the feature set that you need and how you plan to interact with the data.

      Data movement between OLTP and EMR was as simple as exporting a chunk of data from SQL Server and loading it into Hive via flat files. The goal is to start removing on-premise data from the OLTP side in the near future.


  5. Brent,

    Since the customer was already leveraging Microsoft technologies I’m curious if you looked at SQL Server Analysis Services and if so why this wasn’t chosen. Thanks in advance.

    • Tim – great question. Reread Jeremiah’s post and pay particular attention to the cost requirements. SSAS requires a server license wherever you run it. Sure, you could save money by running it on the same server as your production OLTP instance – but then you sacrifice performance for both OLTP and reporting. You could buy a new server and SQL licensing, but there goes tens of thousands of dollars. Jeremiah’s solution here costs just a couple hundred dollars per month.

      • That is a valid observation Brent. I’m also curious about the Data Analysis tools that were used by the end users. Us geeks can talk about data storage and data movement all day long but unless the end users are able to access it and draw conclusions from it then it’s just a waste of time & money. Thanks again.

        • Brent’s advice of re-reading the article still holds true: “The aggregated data is stored in S3 before being imported into an on-premise SQL Server for interactive querying and reporting.”

          For this deployment, the business users kept working with familiar reporting tools like Excel and the reporting components built into their internal dashboards. No tooling changed, but the users suddenly had more data points to play with.

          Most major reporting vendors have written connectors to the queryable Hadoop interfaces (typically being Hive or Impala). I’ve even seen sophisticated drill through reporting built using nothing but Hadoop for pre-processing, text files, and data cubes written in client side JavaScript.

  6. Great post Jeremiah, thanks for sharing.

    Just curious if Sqoop was used in moving data between OLTP and Hive, and from Hive to the reporting SQL Server.

    • I’m glad you liked the post. We didn’t use Sqoop to move data – everything was done using C# and T-SQL to export the data. Because data moved between on-premise systems and AWS, there was no direct connection between the systems.
