An Introduction to Hive’s Partitioning
You’re probably thinking about building a data warehouse (just about every company is if they haven’t already). After reading SQL Server Partitioning: Not the Best Practice for Anyone and Potential Problems with Partitioning you’re wondering why anyone would partition their data: it can be harder to tune queries, indexes take up more space, and SQL Server’s partitioning requires Enterprise Edition on top of that expensive SAN you’re adding to cope with the extra space. Anyone who is looking at implementing table partitioning in SQL Server would do well to take a look at using Hive for their partitioned database.
Setting up partitioning functions in SQL Server is a pain. It’s left up to the implementor to decide if the partition function should use range right or range left and how partitions will be swapped in and out. Writing robust partitioning functions is stressful the first time around. What if we didn’t have to define a partition function? What if the database knew how to handle partitioning for us? Hive does just that.
Rather than leave the table partitioning scheme up to the implementor, Hive makes it easy to specify an automatic partition scheme when the table is created:
CREATE TABLE sales ( sales_order_id BIGINT, order_amount FLOAT, order_date STRING, due_date STRING, customer_id BIGINT ) PARTITIONED BY (country STRING, year INT, month INT, day INT) ;
As we load data it is written to the appropriate partition in the table. There’s no need to create partitions in advance or set up any kind of partition maintenance; Hive does the hard work for us. The hardest part is writing queries. It’s a rough life, eh?
You might have noticed that while the partitioning key columns are a part of the table DDL, they’re only listed in the
PARTITIONED BY clause. This is very different from SQL Server where the partitioning key must be used everywhere in a partitioned table. In Hive, as data is written to disk, each partition of data will be automatically split out into different folders, e.g.
country=US/year=2012/month=12/day=22. During a read operation, Hive will use the folder structure to quickly locate the right partitions and also return the partitioning columns as columns in the result set.
This approach means that we save a considerable amount of space on disk and it can be very fast to perform partition elimination. The downside of this approach is that it’s necessary to tell Hive which partition we’re loading in a query. To add data to the partition for the United States on December 22, 2012 we have to write this query:
INSERT INTO sales PARTITION (country = 'US', year = 2012, month = 12, day = 22) SELECT sales_order_id, order_amount, due_date, customer_id, cntry, yr, mo, d FROM source_view WHERE cntry = 'US' AND yr = 2012 AND mo = 12 AND d = 22 ;
This is a somewhat inflexible, but effective, approach. Hive makes it difficult to accidentally create tens of thousands of partitions by forcing users to list the specific partition being loaded. This approach is great once you’re using Hive in production but it can be tedious to initially load a large data warehouse when you can only write to one partition at a time. There is a better way.
With a few quick changes it’s easy to configure Hive to support dynamic partition creation. Just as SQL Server has a
SET command to change database options, Hive lets us change settings for a session using the
SET command. Changing these settings permanently would require opening a text file and restarting the Hive cluster – it’s not a difficult change, but it’s outside of our scope.
SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict;
Once both of these settings are in place, it’s easy to change our query to dynamically load partitions. Instead of loading partitions one at a time, we can load an entire month or an entire country in one fell swoop:
INSERT INTO sales PARTITION (country, year, month, day) SELECT sales_order_id, order_amount, due_date, customer_id, cntry, yr, mo, d FROM source_view WHERE cntry = 'US' ;
When inserting data into a partition, it’s necessary to include the partition columns as the last columns in the query. The column names in the source query don’t need to match the partition column names, but they really do need to be last – there’s no way to wire up Hive differently.
Be careful using dynamic partitions. Hive has some built-in limits on the number of partitions that can be dynamically created as well as limits on the total number of files that can exist within Hive. Creating many partitions at once will create a lot of files and creating a lot of files will use up memory in the Hadoop Name Node. All of these settings can be changed from their defaults, but those defaults exist to prevent a single
INSERT from taking down your entire Hive cluster.
What About Partition Swapping?
Much like SQL Server, Hive makes it possible to swap out partitions. Partition swapping is an important feature that makes it easy to change large amounts of data with a minimal impact on database performance. New aggregations can be prepared in the background
How do we perform a partition swap with Hive? A first guess might be to use the
INSERT OVERWRITE PARTITION command to replace all data in a partition. This works but it has the downside of deleting all of the data and then re-inserting it. Although Hive has no transaction log, we’ll still have to wait for data to queried and then written to disk. Your second guess might be to load data into a different location, drop the original partition, and then point Hive at the new data like this:
ALTER TABLE sales DROP IF EXISTS PARTITION (country = 'US', year = 2012, month = 12, day = 22) ; ALTER TABLE sales ADD PARTITION (country = 'US', year = 2012, month = 12, day = 22) LOCATION 'sales/partitions/us/2012/12/22' ;
It’s that easy: we’ve swapped out a partition in Hive and removed the old data in one step. . Truthfully, there’s an even easier way using the
SET LOCATION clause of
ALTER TABLE sales PARTITION (country = 'US', year = 2012, month = 12, day = 22) SET LOCATION = 'sales/partitions/us/2012/12/22' ;
Just like that, the new partition will be used. There’s one downside to this approach – the old data will still exist in Hadoop, only the metadata will be changed. If we want to clear out the old data, it’s going to be necessary to run drop down to HDFS commands and delete the old data out of Hadoop itself.
Is Hive Partitioning Right For You?
If you’re thinking about partitioning a relational database, you should give serious consideration to using partitioned tables in Hive. One of the advantages of Hive is that storage and performance can be scaled horizontally by adding more servers to the cluster – if you need more space, just add a server; if you need more computing power, just add a server. Hive’s approach to data skips some of the necessary costs of partitioning in SQL Server – there’s no Enterprise Edition to purchase, minimal query tuning involved (hint: you should almost always partition your data in Hive), and no expensive SAN to purchase.
For better or for worse – if you’re thinking about partitioning a data warehouse in SQL Server, you should think about using Hive instead.
Databases exist to store data and answer questions about that data. The way we ask questions has to change based on the database itself – SQL changes between different RDBMS vendors. Things get even stranger when you venture outside the RDBMS world and start storing your data in Hive. Even though Hive Query Language (HiveQL) looks like SQL, there are some subtle but important differences to querying that first timers will want to keep in mind.
It All Starts With A Question
Everything starts with a business requirement. In this case the requirement was simple: “We want to see a rolling average of price ranges over the last 6 months.”
Depending on which version of SQL Server you’re using, this is a pretty easy query to build; it can be accomplished using a CTE or a sliding window function. Since a CTE is the simpler example, we’ll start using it for our sample query.
WITH cte ( SELECT city_id, date, price FROM the_table ) SELECT x.city_id, x.date, AVG(y.price) AS average_price FROM cte AS x JOIN cte AS y ON x.city_id = y.city_id AND y.date > DATEADD(month, -6, x.date) GROUP BY x.city_id, x.date;
As a user of the database, we can safely assume that the database server is going to compute an average for the six months prior to
x.date. SQL Server’s optimizer is going to do a lot of work behind the scenes to make sure this query gets executed efficiently. The upside of using a database with a complex query optimizer is that the optimizer does the hard work for us – it determines the best way to deliver the data and we only have to worry about the questions we want answered. Things don’t work quite the same way with Hive.
Rolling Averages with Big Data
Hive can be queried using the Hive Query Language (HiveQL). While based on the SQL-92 standard, HiveQL doesn’t conform to the standard; even where Hive supports SQL functionality, there are gotchas to keep in mind. The first of these is language support: Hive doesn’t support CTEs. Admittedly, CTEs aren’t a part of SQL-92 but developers will find many language features missing or slightly different. Back to the topic at hand: it’s possible to write this query without using a CTE. The first step is transforming the body of the CTE into a sub-select and joining to the derived table:
SELECT x.city_id, x.date, AVG(y.price) as average_price FROM the_table x JOIN (SELECT city_id, date, price FROM the_table) y ON x.city_id = y.city_id AND y.date > DATEADD(month, -6, x.date) GROUP BY x.city_id, x.date ;
Unfortunately, this won’t fly in Hive – only equality conditions are supported on a join. There’s a valid reason for this – it’s very difficult to translate this type of join into a map/reduce job behind the scenes in Hive. Hive 0.10, just released last week, has support for
CROSS JOIN, but this version of Hive isn’t a viable option since it is still under heavy development.
Although not quite as simple, it’s still possible to make this query work so we can produce rolling averages:
FROM ( SELECT city_id, date, date_sub(from_unixtime(date), 180) AS 180_days_ago FROM table ) y SELECT y.city_id, y.date, AVG(x.price) FROM table x JOIN y ON x.city_id = y.city_id WHERE x.date > y.180_days_ago GROUP BY y.city_id, y.date ;
Tuning the Rolling Average
This query isn’t going to win any awards for performance. In fact, that rolling average query is a good candidate to win an award for terrible performance. Although Hive will perform some optimizations to the query we provided, the re-write process is nothing like what happens with full fledged cost-based optimizer. This query will produce a large amount of intermediate data before results are delivered and moving that data around the cluster takes a lot of time. What if there were a way to make this execute faster?
Let’s re-write this query so instead of computing a rolling average, we only compute a six month average based on today’s date. The first step is to grab just the data for today. Experienced SQL Server developers would say “Ah ha, I can do this using some kind of date math function in conjunction with
GETDATE().” Hive has its own function names that accomplish the same things:
SELECT * FROM table t WHERE date > date_sub(cast(unix_timestamp() as TIMESTAMP), 180) ;
What we’re trying to accomplish here is tell Hive, “Hey, I want to see all rows where
date is greater than the current time (in seconds since 1970), converted to a
TIMESTAMP, minus 180 days.” This doesn’t work, but it’s not by lack of syntactical correctness. There’s a bug in how Hive handles converting Unix time to a
TIMESTAMP (HIVE-3454). The documenation and spec says that during a
CAST, Hive should treat an
INTEGER as seconds since epoch and other numeric types as milliseconds since epoch.
What HIVE-3454 means is that our query will give us an average of data since some time in early January, 1970. While interesting, that’s not what we’re looking for. To make sure that we get the right time, the best approach is to perform date comparison using seconds instead of time functions. This next query almost works (it finds anything less than 180 days from right now):
SELECT * FROM table t WHERE unix_timestamp() - created < 180 * 24 * 60 * 60 ;
Since we really want to get an average based on the previous 180 days, we tell Hive to round to the current time to the nearest day
from_unixtime(unix_timestamp(),'yyyy-MM-dd'), subtract 180 days
datesub(value, 180) and then treat that as a temporal value
SELECT * FROM table t WHERE created > unix_timestamp(date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),180),'yyyy-MM-dd') ;
What’s The Best Solution?
Through developing these queries, I found that the best approach to computing a rolling average in Hive is to approach the problem in chunks. It’s necessary to be more explicit in the queries presented to Hive. While Hive doesn’t have a sophisticated query optimizer, Hive does give developers the ability to run queries across many servers and perform a massive amount of data processing very quickly. Although we have a lot of power at our disposal, we have to make a bit of extra effort when developing queries to ensure that they perform well.
I’m a Microsoft SQL Server MVP – I like to talk about SQL Server a lot. But, as Brent loves to point out, I really like data; I’m open to alternative database lifestyles like PostgreSQL or NoSQL when they solve a business problem. And, frankly, I like some of these databases so much that I’m using them to build stuff for clients; I went so far as to become a Cloudera Certified Developer for Apache Hadoop this week.
How I’m Using Hadoop and Hive
“What kind of information gold mine are we sitting on?” That’s the question one of our clients was asking themselves earlier this year. The client had been tracking user’s search parameters for several years. Over time the data grew to where it was impossible to query the search logs without bringing the line of business application to its knees. Faced with the prospect of buying a second SQL Server for analytics, they were considering trimming data out of the logging database.
When I sat down with the client, they said “We want to get a better understanding of how users are interacting with the site, the types of searches being performed, and uncover richer information around product pricing. We just can’t answer these questions right now.” I talked through different options like a relational data warehouse or SQL Server Analysis Services before we settled on using Hive hosted in Amazon’s Elastic MapReduce (EMR). Using Hive hosted in Elastic MapReduce lets the business meet their goals while minimizing costs – the entire Hive cluster is turned off once data processing is done.
Money is important to businesses – everyone wants more of it and nobody wants to spend any of it. When faced with the idea of buying a second server, a second SQL Server license, and a second set of really fast disks, the client balked. By using Hive hosted on EMR we are able to run the data warehouse on demand and only pay for the resources used – this keeps costs under $200 per month.
How I Approach An Engagement
I love new technology, but that doesn’t mean I view it as a cure all. As I worked with the client, we worked together to understanding the data and the business’s questions before proposing a solution. One of the most important parts of our conversation was focusing the scope of questions into different buckets – the majority of the questions were traditional data warehouse queries.
When we began the process, we used a list of questions to kick off our investigation.
- What are the current problems you have querying this data?
- Just how much data are we talking about?
- What types of queries do you need to answer?
- How does this data interact with the rest of your data?
- How will this data be consumed?
- What does your team’s skill set look like?
Once we went through the list of questions, I took the client’s requirements and technical experience and used that to find the best fit product for the business. In this case, the solution was Hive running on top of Elastic MapReduce. I discussed the pros and cons of the approach and once I had the go ahead on a technology choice, I started building a prototype of the data warehouse that the business could continue to build on using their existing querying skills, without having to learn new technologies and platforms.
How I Can Help You
In this case, I was able to help a business get started on the Hadoop platform using Hive. If your company is like most companies, you’re probably asking yourself questions like “Is Hadoop or Hive right for us?”, “How could we get started with this project?”, or “How would Hadoop or Hive fit into our current environment?” This is where I can help out – I will work with your team and create a plan that meets your goals, works with the existing skills and technology that you have on hand, and create a high level road map. I can even help you prototype your first system using less than $200 in computing time in Amazon, no servers required. Contact me to set up a time to talk.
There’s a lot of buzz around Hadoop. If you’re like most people, you’ve looked into Hadoop and found a bewildering array of products, terms, and technology surrounding Hadoop – Hive, Pig, HDFS, MapReduce, HBase, co-processors, ZooKeeper, etc. Knowing where to start can be daunting. Trying to make sense of the Hadoop world is certainly possible, but takes a lot of time. Thankfully, I’ve gone ahead and done the work for you.
This is an update of last year’s presentation Hadoop Basics for DBAs.
What is Hadoop?
Hadoop is a group of tools to help developers create bigger tools. More specifically, Hadoop is a basic set of tools that help developers create applications spread across multiple CPU cores on multiple servers – it’s parallelism taken to an extreme. Although Hadoop is a set of tools and libraries, there are a number of products that are lumped into the same bucket as Hadoop and, frequently, they’re all referred to as Hadoop. Instead of describing every piece of Hadoop in detail, I’m going to focus on the functionality that’s of the most interest to SQL Server professionals.
Data Warehousing in Hadoop
If you need to work with big data, Hadoop is becoming the _de facto_ answer. But once your data is in Hadoop, how do you query it?
If you need big data warehousing, look no further than Hive. Hive is a data warehouse built on top of Hadoop. Hive is a mature tool – it was developed at Facebook to handle their data warehouse needs. It’s best to think of Hive as an enterprise data warehouse (EDW) – Hive can be used to research complex interactions across your company’s entire history of data, in exchange for that power you have to understand that queries will return in minutes. Unlike traditional EDWs, Hive is spread across tens, hundreds, or even thousands of commodity grade servers.
Hive was designed to be easy for SQL professionals to use. Rather than write Java, developers write queries using HiveQL (based on ANSI SQL) and receive results as a table. As you’d expect from an EDW, Hive queries will take a long time to run; results are frequently pushed into tables to be consumed by reporting or business intelligence tools. It’s not uncommon to see Hive being used to pre-process data that will be pushed into a data mart or processed into a cube.
While Hive can operate on large volumes of data, it’s not the most efficient tool: Impala seeks to overcome some of the limitations of Hive by making better use of CPU, memory, and disk resources. Impala operates more like SQL Server – data is cached in memory to improve query performance. Although Impala uses a separate query engine than hive, it uses data that’s already in Hadoop, making it easy to query massive amounts of data without having to store your data twice.
Both Impala and Hive are great for businesses querying of amounts of data while avoiding expense of massively parallel EDW solutions like Microsoft SQL Server PDW or Oracle Exadata. Hive is in a stable release cycle and, although Impala is still a beta product, many organizations are deploying one or both solutions to tackle their largest workloads.
SQL Server professionals are familiar with using SQL Server Integration Services (SSIS) to move data around their organization. SSIS provides a rich set of functionality for manipulating, but it’s difficult to make SSIS operations run across multiple CPU cores, much less multiple servers.
Pig is a tool for creating parallel data workflows. Pig takes advantage of the Hadoop tools to provide rich functionality across huge amounts of data. Pig makes it easy to perform step-by-step data manipulation over large data sources using a combination of different tools and functionality. There are a number of great reasons to use Pig (parallel processing, sampling, and loose schema requirements), but it’s safe to say that Pig is a great tool for processing data with Hadoop.
Deep Analysis of Data
SQL Server professionals are used to having analytic insight available, either through SQL Server’s windowing functions or through SQL Server Analysis Services. Although Hadoop doesn’t natively provide tools for OLAP style cubes or for windowing functions, it’s possible to gain insight from your data using Hadoop. Unfortunately, deep analytics are not Hadoop’s strong suit out of the box. Teams looking to take advantage of large scale data analytics will be doing a lot of heavy lifting themselves.
Mahout is a set of libraries that can be used to distribute analytics around a cluster, but there are limitations to the flexibility and interactivity of Mahout. Developers looking for the ad hoc interactive capabilities of SQL Server Analysis Services (or even of a relational data warehouse) will be disappointed. Bulk computation can be performed disconnected from users, but Mahout and Hadoop don’t provide any kind of ad hoc querying capability.
Real Time Querying
So far, all of the use cases we’ve explored have been based on distributed batch processes and large scale querying. Even though Impala is a vast performance improvement over Hive, Impala is still responding in a matter of several seconds to several minutes – hardly fast enough for interactive querying. Databases are used for more than running massive reports, and this is where HBase comes in to play.
HBase is a real time, random access, read-write database built on top of Hadoop. This isn’t a database like SQL Server with tables and joins; HBase is a NoSQL database that’s loosely based on Google’s BigTable database. There are tables, there are columns, but the schema isn’t as rigid as a relational database. Developers will be able to solve many problems with HBase, but there will be a bit of a learning curve as they understand the data model and update their data structures to work effectively with HBase. Data stored in HBase can even be queried through Hive or Impala making it possible to combine transactional and reporting data in the same Hadoop cluster – the scale and redundancy of Hadoop make it easier to reduce load on any single system and avoid many problems associated with reporting from a transactional data source.
When Should You Use Hadoop?
Ultimately, you’re looking for an answer to the question “When should I use Hadoop?” This is a difficult question to answer. Hadoop may make sense for part of a workload, or even for all of it. The best way is to start by looking at your environment and asking questions like:
- Can I keep my data on a single instance?
- Can I keep my data on a single instance and doing it cheaply?
- Are my queries running fast enough?
- Do I need complex, interactive, ad hoc analytics?
- What type of latency is acceptable between data arrival, analysis, and queryability?
Understanding your workload is critical to determining if you’ll be able to use Hadoop to meet your needs. Having realistic expectations of Hadoop is equally critical. No part of Hadoop will solve all of the problems an organization is facing. Hadoop can mitigate some problems, but it presents a different set of challenges – being comfortable with the limitations of Hadoop will go a long way toward having a successful implementation.
Where Can You Use Hadoop?
“Where can you use Hadoop?” isn’t an easy question to answer. An enterprising or creative person could probably figure out ways to replace the entire database infrastructure with various components of Hadoop. I’m sure it’s being done right now and I’m sure that someone is being incredibly successful with it.
Asking the question “where can I do XYZ” will inevitably lead to the answer “everywhere… if you’re creative!” There’s a better question that we can ask.
Where Should I Start Using Hadoop?
Let’s face it: Hadoop is something that you should start thinking about. Microsoft are clearly investing Hadoop as part of their enterprise data warehouse products. Microsoft has partnered with Hortonworks to bring Hadoop to Windows.
One of the most obvious places to implement Hadoop is for ETL processes. ETL jobs are typically difficult to tune – data is streamed from an OLTP data source, processed in memory, and then streamed to another data source. Tuning the process to run faster on a single machine requires specific skills – a good ETL expert knows T-SQL, SSIS, and more than a little bit of .NET. These are important skills for an ETL expert to have; but we don’t always need an expert to get the job done.
How Can I Start Using Hadoop?
What if you could make a process run four times faster by running it on four computers? This is the basic premise of Hadoop – workloads are made faster by splitting them across multiple workers. Just as SQL Server splits a query across multiple threads, Hadoop is able to parallelize across multiple computers and each computer may parallelize the work across multiple threads.
We can take advantage of Hadoop’s easy scale out without really changing our tools. There’s a tool called Hive – it sits on top of Hadoop and translates SQL into MapReduce jobs in the back end. Hive isn’t going to be useful for real time querying, but it it gives us the ability to perform translations on huge amounts of data using a familiar language. If we need custom functionality, we just track down an enterprising developer to write a custom function. Just like SQL Server, it’s easy to grab custom functions from another source, install them, and use them in queries.
Where Else Can Hadoop Help?
While ETL is an obvious place to start using Hadoop, there are other places where we can start using Hadoop. Just like SQL Server, Hadoop is a rich ecosystem – it’s more than a one dimensional tool. Portions of Hadoop can be used to create a distributed file system, machine learning tools, data processing frameworks, and large scale random read-write data. You can use Hadoop to scale your data needs in many different directions. The most important thing is to pick a single pain that you’re having – typically ETL or reporting – and experiment with using Hadoop to make things faster or operate at a much bigger scale.
Want to Know More?
If you’d like to more, make sure you check out my video on Hadoop Revisited.