LandsofAmerica.com is the largest rural listing service in the Nation. The Network specializes in land for sale, which includes farms, ranches, mountain property, lake houses, river homes, beachfront homes, country homes, and residential homes in smaller towns across the country. These properties have many diverse uses including recreational and agricultural activities like hunting, fishing, camping, backpacking, horseback riding, four wheeling, grazing cattle, gardening, vineyards, cropland, raising horses, and other livestock.
LandsofAmerica.com (LoA) has been collecting an immense amount of data about visitor usage and search patterns for several years using Microsoft SQL Server as their data storage solution. LoA is happy using SQL Server for OLTP workloads, but with nearly 1 billion rows of data, previous attempts at combining OLTP and analytical queries on the same SQL Server instances caused poor reporting performance. LOA’s goal was to provide this data to several thousand clients so it needed to be optimal for their use.
The real time requirements for data analysis are relaxed: a 24-hour lag between data collection and analysis delivery is acceptable for the business users. Many of the analysis questions took the form of “What is the average price range users are searching for in these 7 counties over the last 6 months?” This presented a problem: how could the business provide a product reporting on this data about user trends without sacrificing core application performance? The issue was LoA’s production SQL Server was unable to answer these questions and still serve OLTP data.
LoA’s team was faced with two choices: they could purchase a second server used solely for analytical querying or they could evaluate other options. During discussions with the company’s development team, I reviewed solutions that would let LoA use their historical data, help the business make better decisions, and move the data processing load outside of SQL Server.
Working closely with LoA, I designed and implemented a solution using Apache Hive hosted in Amazon Elastic MapReduce (EMR) – EMR delivers managed Hadoop and Hive services, low cost storage, and flexible computing resources. LoA was up and running with EMR and Hive in just several weeks.
Use Case Description
LandsofAmerica.com already leverages components of Amazon Web Services in conjunction with Microsoft SQL Server. Extending their usage to Elastic Map Reduce and Hive was an easy addition. Long term data storage is offloaded from SQL Server to Amazon S3 to lower storage costs compared to traditional storage options. S3 stores the detailed source records exported from SQL Server and data stored in S3 is accessed through Hive.
Hive is used as a separate data processing system. User search and activity is aggregated along several key measures through Hive. The aggregated data is stored in S3 before being imported into an on-premise SQL Server for interactive querying and reporting.
Multi-dimensional search data provides many opportunities for complex analysis. This analysis is typically both CPU and disk intensive – it’s difficult to provide effective indexing techniques for large analytic queries. Through Hive’s ability to conduct large-scale analysis, LandsofAmerica.com is able to uncover trends that would otherwise remain hidden in their data. Other options exist to perform analysis, but carry a significant hardware and licensing cost, like the Microsoft SQL Server Fast Track Data Warehouse. By utilizing commodity cloud computing resources and Apache Hive, LandsofAmerica.com is able to gain insight across their collected data without a significant investment of capital – resources are consumed on-demand and paid for on-demand.
By using Hive as the definitive store of historical data, LandsofAmerica.com is able to reduce their local storage requirements. Older historical data can be removed from Microsoft SQL Server as it is loaded into Hive.
Interested in learning more about Hadoop? Check out our Introduction to Hadoop training class.