Celebrity cooking shows are popular around Brent Ozar Unlimited. We watch Top Chef for the creative cooking as much as the human drama. Contestants on Top Chef face huge challenges – they’re working alone, have a limited set of tools, have a fixed set of ingredients, and operate under ridiculously strict time guidelines. What makes the creativity of chefs even more interesting is the limitations they’re under: contestants have to work with strange, disgusting or difficult ingredients in order to win the favor of the judges.
Building successful software isn’t much different. Sure, there’s a team of collaborators to help us make decisions; but everyone on that team is frequently responsible for one area of the application. Success or failure depends on your ability to work with the requirements you’re given and please the final judges – the end users.
The Rules of the Contest
Top Chef contestants work within the rules of the contest – they have a limited amount of time with a limited number of ingredients to make an appealing meal. The core ingredients and the time period are non-negotiable. The show’s producers call these the rules, in the world of software these are business requirements, and in the world of programming we might call these application invariants. But no matter what you call them, these things can’t change. Requirements might be something like
- A picture can’t be viewed until four thumbnails have been generated.
- Property listings cannot be viewed until approved by the listing agent.
Contestants on Top Chef don’t immediately start cooking – although careful editing makes it look that way. Development teams, even agile teams, shouldn’t get immediately start coding once they have their hands on requirements. It’s important to look carefully at the requirements and make sure it’s possible to deliver something that the business (the judges) are happy to see. Winning chefs don’t immediately reach for a bottle of Frank’s Red Hot to give a dish a bit of pizzazz, they consider all of the options and match their condiments to the meal, so why do we always reach for the same tools?
It’s easy to be lulled by the familiar – hot dogs and burgers are easy, but they don’t win Top Chef. While you don’t need award winning code to win the game of delivering software, you do need to make the right choices to make life easier.
Working with black chicken and monkfish liver may not be the easiest thing, but contestants on Top Chef are routinely able to turn strange ingredients into masterpieces. Business requirements spell out how our applications have to behave at the end of the day, but you’ll notice that it doesn’t matter how you get there. Just make sure you get there – solve the business problem and move on.
Start with the Ingredients
It isn’t uncommon for Top Chef contestants to scrap their first ideas after a few minutes of work. Likewise, don’t be afraid to throw away your first idea. If you’re a pack rat, write your idea on a piece of paper and hide it from yourself. It’s okay to come back to your first idea, but it’s important to think about the problem in a different way.
What are the core ingredients of your application? Just because you have chicken, that doesn’t mean you should make chicken cordon bleu. Ask yourself, “What am I supposed to create?” A few applications I’ve come across in the last year are:
- Single sign-on systems
- Hosted property listings
- Utility easement tracking
- Document tracking and signing
Each of these applications has a different set of features and functionality. Would you use the same solution for each one? Looking at it a different way – would you serve the same meal for the Superbowl as you would for Christmas dinner?
Make an itemized list of the ingredients that you have on hand. Your requirements are your ingredients. They drive the way the users will interact with the data. As you investigate the requirements, make sure you ask the users questions like, “Do you need point in time recovery for easement property maps?” or “Is it a requirement that a user have a first name, last name, bio, and profile picture or would a user name and password be acceptable?” Understanding your requirements drives your choices.
It’s All About the Ingredients
Under all of your application code, you need somewhere to store your data. One of the Top Chef judges frequently asks “Where’s the protein?” when served a salad. As you work through application requirements, use these to ask yourself “Where’s the data?”
Top Chef contestants typically aren’t told that they need to make sweet glazed salmon, they’re told to use a set of ingredients and produce a fine meal. It’s up to the chef to determine whether to use rémoulade or tartar sauce and it’s up to you to make technical decisions. The business user isn’t going to know the answers to your technical questions, but they do know that a user only needs a user name and password to use the application.
Use the business requirements to help make your database design decisions – if an image doesn’t need to be transactionally consistent with all other data, you don’t need to store it in your relational database. The rules of the contest – the business requirements – should shape how you design your application. They give you both the restrictions and freedom you need to be creative.
What Will They Eat?
Food falls into distinct cuisines. If I gave you a choice between sushi or tapas, you’d be able to make an informed choice because you know the ingredients and style used for each style of cooking.
As you evaluate the business requirements, dig deeper and imagine the types of answers that users might look for in the data. Will users look for property along the path of a tornado where repairs need to be made? Are users searching for houses with specific features – e.g. find single family homes with 2 or more bathrooms and an attached garage? Or are users’ questions difficult to predict and completely free form?
Understanding how people will use the data guides the choices we make. If users will be performing free form text searches, a full text search engine like SOLR should be considered. If an application is pure OLTP, it’s possible that you can use a key-value database. Understanding application requirements means that you can decide whether you need to use SQL Server or you can investigate other options.
Some of the database cuisines to consider are:
- Relational database (SQL Server, PostgreSQL)
- Document database (CouchDB, MongoDB)
- Text Search (Lucene/SOLR, Elastic Search)
- Key-Value database (Riak, Cassandra)
The processing of picking a database can lead to conflict. Developers have their favorite new technologies they want to try and entrenched products are frequently favored above all others. Understanding how one database meets application requirements is important – if you don’t know which ingredients you have, you don’t know what to make; if you don’t understand the application invariants involved, you can’t know which option is the best.
Ultimately, making sure you pick the right tool for the job can lead to faster development, easier support, and better throughput.
What Do the Judges Think?
The most important thing, though, is what the judges think. It doesn’t matter if you’ve made the greatest chicken salad sandwich ever, if your work doesn’t meld with the judges’ expectations you won’t be taking home the prize. Understanding how the requirements influence the ways that users will work with data is critical if you want to be successful. Once you know how people will work with the tools, you’ll be able to make the right decisions for your application.
Everyone wants to make sure they’re getting the best performance out of their solid state storage. If you’re like a lot of people, you want to make sure you’re getting what you paid for, but how do you know for sure that the drive is performing well?
Watch that Average
The first way to monitor performance it to use some perfmon counters. Although there are a lot of perfmon counters that seem helpful, we’re only going to look at two:
- PhysicalDisk\Avg. Disk Sec/Read
- PhysicalDisk\Avg. Disk Sec/Write
As soon as you get a solid state drive in your server, start monitoring these numbers. Over time you’ll be able to trend performance over time and watch for poor performance. When the SSDs pass out of your valid performance guidelines (and they probably will), you can pull them out of the storage one at a time and reformat them before adding them back into the RAID array. Note it isn’t necessary to do this
Although it’s risky, this approach can work well for detecting performance problems while they’re happening. The downside is that we don’t have any idea that the drives are about to fail – we can only observe the side effects of writing to the SSDs. As SSD health gets worse, this average is going to trend upwards. Of course, you could also be doing something incredibly dumb with your hardware, so we can’t really use average performance as a potential indicator of impending hardware failure.
Which SMART Attributes Work for SSDs?
What if we could watch SSD wear in real time? It turns out that we’ve been able to do this for a while. Many vendors offer SMART status codes to return detailed information about the status of the drive. Rotational drives can tell you how hot the drive is, provide bad sector counts, and a host of other information about drive health.
SSDs are opaque, right? Think again.
SSD vendors started putting information in SMART counters to give users a better idea of SSD performance, wear, and overall health. Although the SMART counters will vary from vendor to vendor (based on the disk controller), Intel publish documentation on the counters available with their SSDs – check out the “SMART Attributes” section of the Intel 910 documentation. These are pretty esoteric documents, you wouldn’t want to have to parse that information yourself. Thankfully, there are easier ways to get to this information; we’ll get to that in a minute.
Which SMART Attributes Should I Watch?
There are a few things to watch in the SMART status of your SSDS:
- Write Amplification
- Media Wear-out Indicator
- Available Reserved Space
Write Amplification, roughly, is a measure of the ratio of writes issued by your OS compared to the number of writes performed by the SSD. A lower score is better – this can even drop below 1 when the SSD is able to compress your data. Although the Write Amplification doesn’t help you monitor drive health directly, it provides a view of how your use pattern will change the SSD’s lifespan.
The Media Wear-Out Indicator gives us a scale from 100 to 0 of the remaining flash memory life. This starts at 100 and drifts toward 0. It’s important to note that your drive will keep functioning after Media Wear-Out Indicator reports 0. This is, however, a good value to watch.
Available Reserved Space measures the original spare capacity in the drive. SSD vendors provide additional storage capacity to make sure wear leveling and garbage collection can happen appropriately. Like Media Wear-Out Indicator, this starts at 100 and will drift toward 0 over time.
It’s worth noting that each drive can supply additional information. The Intel 910 also monitors battery backup failure and provides two reserved space monitors – one at 10% reserved space available and a second at 1% reserved space available. If you’re going to monitor the SMART attributes of your SSDs, it’s worth doing a quick search to find out what your SSD controllers support.
How do I Watch the SMART Attributes of my SSD?
This is where things could get ugly. Thankfully, we’ve got smartmontools. There are two pieces of smartmontools and we’re only interested in one:
smartctl. Smartctl is a utility to view the SMART attributes of a drive. On my (OS X) laptop, I can run
smartctl -a disk1 to view the SMART attributes of the drive. On Windows you can either use the drive letter for a basic disk, like this:
smartctl -a X:
Things get trickier, though, for certain PCI-Express SSDs. Many of these drives, the Intel 910 included, present one physical disk per controller on the PCI-Express card. In the case of the Intel 910, there are four. In these scenarios you’ll need to look at each controller’s storage individually. Even if you have configured a larger storage volume using Windows RAID, you can still read the SMART attributes by looking at the physical devices underneath the logical disk.
The first step is to get a list of physical devices using WMI:
wmic diskdrive list brief
The physical device name will be in the
DeviceID column. Once you have the physical device name, you can view the SMART attributes with
smartctl like this:
smartctl -a /dev/pd0 -q noserial
Run against my virtual machine, it looks like this:
C:\Windows\system32> smartctl -a /dev/pd0 -q noserial smartctl 6.1 2013-03-16 r3800 [x86_64-w64-mingw32-win8] (sf-6.1-1) Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Device Model: Windows 8-0 SSD Serial Number: 0RETRD4FE6AMF823QE7R Firmware Version: F.2FKG1C User Capacity: 68,719,476,736 bytes [68.7 GB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ATA8-ACS, ATA/ATAPI-5 T13/1321D revision 1 SATA Version is: SATA 2.6, 3.0 Gb/s Local Time is: Sat Apr 27 08:35:03 2013 PDT SMART support is: Unavailable - device lacks SMART capability.
Unsurprisingly, my virtual drive doesn’t display much information. But a real drive looks something like this:
Holy cow, that’s a lot of information. The Intel 910 clearly has a lot going on. There are two important criteria to watch, simply because they can mean the difference between a successful warranty claim and an unsuccessful one
- SS Media used endurance indicator
- Current Drive Temperature
The Intel 910 actually provides more information via SMART, but to get to it, we have to use Intel’s command line tools. By using the included isdct.exe, we can get some very helpful information about battery backup failure (yup, your SSD is protected by a battery), reserve space in the SSD, and the drive wear indicator. Battery backup failure is a simple boolean value – 0 for working and 1 for failure. The other numbers are stored internally as a hexadecimal number, but the isdct.exe program translates them from hex to decimal. These numbers start at zero and work toward 100.
If you’re enterprising, you can take a look at the vendor specification and figure out how to read this data in the SMART payload. Or, if you’re truly lazy, you can parse the text coming out of smartcl or isdct (or the appropriate vendor tool) and use that to fuel your reports. Some monitoring packages even include all SMART counters by default.
The Bad News
The bad news is that if you’re using a hardware RAID controller, you may not be able to see any of the SMART attributes of your SSDs. If you can’t get accurate readings from the drives and you’ll have to resort to using the Performance Monitor counters I mentioned at the beginning of the article. RAID controllers that support smartmontools are listed in the smartctl documentation.
Special thanks go out to a helpful friend who let us abuse their QA Intel 910 cards for a little while in order to get these screenshots.
SQL Server execution plans provide a roadmap to query performance. Once you understand how to read the execution plan, you can easily identify bottlenecks and detours. In this high level session, Jeremiah Peschka will introduce you to the concepts of reading SQL Server execution plans including how to get an actual execution plan, how to read the plan, and how to dive deeper into the details of the pieces of the plan. This session is for developers and DBAs who have never looked at SQL Server execution plans before.
In this talk I mentioned a few tools.
There are many ways to scale out your database; many of these techniques require advanced management and expensive add-ons or editions. Database sharding is a flexible way of scaling out a database. In this presentation, Jeremiah Peschka explains how to scale out using database sharding, covers basic techniques, and shares some of the pitfalls. This talk is for senior DBAs, database architects, and software architects who are interested in scaling out their database.
More resources are available over in our sharding article.
Without looking in your wallet, do you know how much cash you have? Most of us know within a few dollars. Now, without looking in your SQL Server, do you know much data is cached in memory? You probably don’t and that’s okay; you shouldn’t know how much data SQL Server is caching in memory. We can’t control how much data SQL Server is caching, but we can control how we cache data.
Different Types of Cache
There are a lot of different ways to approach caching. One of the most prevalent ways involves thinking about cache in two different levels (much like CPU cache): first level cache and second level cache.
First level cache is an immediate, short-lived cache that works within a single session to attempt to minimize database calls. Unfortunately, first level cache is only used for the duration of a current session or transaction (depending on your terminology). This is very short lived and it’s only useful to the current process. While helpful, first level cache has a limited scope.
There’s another type of cache: second level cache. Second level cache exists outside of the current process and can be shared between multiple transactions, processes, servers, or even applications. When we talk about adding cache to an application, we really mean second level cache.
A Bit of Cache
Even the most basic of ORMs have a little a bit of cache available. The first level cache is used as a short lived buffer to reduce the amount of work that the ORM has to do. First level cache is used for caching objects in the current transaction and query text. Although this cache can be helpful for the current process, this cache isn’t shared across multiple processes or even multiple database batches. If we want to have a more robust cache, we have to look elsewhere.
ORMs like Entity Framework or the LLBLGen Framework don’t have a second level cache. It’s up to developers to add a cache when and where they need it. This exposes developers to additional concerns like cache invalidation, cache updates, and query caching. All of these features and functionality may not be necessary, but that’s an acceptable trade off – it’s up to developers to implement cache features in ways that support application requirements.
Although it takes up developer time, building the second level cache yourself has the benefit of creating a cache that’s suited to the application’s requirements. For many application level features, this is good enough. It’s important, though, that developers pick a caching layer capable of meeting their operational requirements. Operational requirements include horizontal scalability, redundancy and fail over, recovery of cached data, or customizable cache expiration on an object-by-object basis.
These basic ORMs aren’t really all that basic – they have full features in other parts of the ORM, but they only offer basic support for automatic caching through the ORM.
A Lot of Cache
You’ve got memory. You want to use it to cache data. What’s the easiest way to do that?
One of the easiest approaches to adding caching to your application is to use a framework that supports it out of the box. A number of ORMs, including both Hibernate and NHibernate, provide this support. Enabling cache is easy – just change a few lines in a configuration file and the cache will be available to your application. Things start getting tricky, though, when you examine the richness of the caching that’s provided by these tools.
Power comes with a price. When you’re getting starting with tools like Hibernate or NHibernate, there’s a lot to take in and many developers overlook these features. Developers can choose on an object by object basis which caching strategy should be applied. Based on business requirements we can choose to treat certain cacheable objects as read only while others can be used as a read/write cache. Some objects can be cached while others bypass the secondary cache entirely – there’s a lot of complexity for developers to manage.
While this can be overwhelming, this flexibility serves a purpose – not all features of an application have the same requirements. Some features can serve old data to users, other features need to be up to the minute or up to the second. Giving developers the ability to make these choices means that there is a choice to be made. Even if it’s a difficult one, developers can choose how the application behaves and can tailor performance and functionality to business requirements.
Making the Choice
If you’ve already got an existing project and you’re planning on adding a caching layer, don’t think that you have to re-implement your data access layer just to get better support for caching. Both approaches have their benefits and it’s far more important to be aware of which data needs to be cached and the best way to cache it.
Session state frequently ends up on a busy SQL Server. What seemed like a good idea in development turns into a problem in production. While there are valid business reasons for persisting session state to permanent storage; there are equally valid reasons to avoid using SQL Server as the permanent storage. We’ll investigate why session state poses problems for SQL Server and cover an alternate solution that allows for persistent session state. This talk is for developers and DBAs who want a better way to safely track ASP.NET session state.
Links and References
LandsofAmerica.com is the largest rural listing service in the Nation. The Network specializes in land for sale, which includes farms, ranches, mountain property, lake houses, river homes, beachfront homes, country homes, and residential homes in smaller towns across the country. These properties have many diverse uses including recreational and agricultural activities like hunting, fishing, camping, backpacking, horseback riding, four wheeling, grazing cattle, gardening, vineyards, cropland, raising horses, and other livestock.
LandsofAmerica.com (LoA) has been collecting an immense amount of data about visitor usage and search patterns for several years using Microsoft SQL Server as their data storage solution. LoA is happy using SQL Server for OLTP workloads, but with nearly 1 billion rows of data, previous attempts at combining OLTP and analytical queries on the same SQL Server instances caused poor reporting performance. LOA’s goal was to provide this data to several thousand clients so it needed to be optimal for their use.
The real time requirements for data analysis are relaxed: a 24-hour lag between data collection and analysis delivery is acceptable for the business users. Many of the analysis questions took the form of “What is the average price range users are searching for in these 7 counties over the last 6 months?” This presented a problem: how could the business provide a product reporting on this data about user trends without sacrificing core application performance? The issue was LoA’s production SQL Server was unable to answer these questions and still serve OLTP data.
LoA’s team was faced with two choices: they could purchase a second server used solely for analytical querying or they could evaluate other options. During discussions with the company’s development team, I reviewed solutions that would let LoA use their historical data, help the business make better decisions, and move the data processing load outside of SQL Server.
Working closely with LoA, I designed and implemented a solution using Apache Hive hosted in Amazon Elastic MapReduce (EMR) – EMR delivers managed Hadoop and Hive services, low cost storage, and flexible computing resources. LoA was up and running with EMR and Hive in just several weeks.
Use Case Description
LandsofAmerica.com already leverages components of Amazon Web Services in conjunction with Microsoft SQL Server. Extending their usage to Elastic Map Reduce and Hive was an easy addition. Long term data storage is offloaded from SQL Server to Amazon S3 to lower storage costs compared to traditional storage options. S3 stores the detailed source records exported from SQL Server and data stored in S3 is accessed through Hive.
Hive is used as a separate data processing system. User search and activity is aggregated along several key measures through Hive. The aggregated data is stored in S3 before being imported into an on-premise SQL Server for interactive querying and reporting.
Multi-dimensional search data provides many opportunities for complex analysis. This analysis is typically both CPU and disk intensive – it’s difficult to provide effective indexing techniques for large analytic queries. Through Hive’s ability to conduct large-scale analysis, LandsofAmerica.com is able to uncover trends that would otherwise remain hidden in their data. Other options exist to perform analysis, but carry a significant hardware and licensing cost, like the Microsoft SQL Server Fast Track Data Warehouse. By utilizing commodity cloud computing resources and Apache Hive, LandsofAmerica.com is able to gain insight across their collected data without a significant investment of capital – resources are consumed on-demand and paid for on-demand.
By using Hive as the definitive store of historical data, LandsofAmerica.com is able to reduce their local storage requirements. Older historical data can be removed from Microsoft SQL Server as it is loaded into Hive.
An Introduction to Hive’s Partitioning
You’re probably thinking about building a data warehouse (just about every company is if they haven’t already). After reading SQL Server Partitioning: Not the Best Practice for Anyone and Potential Problems with Partitioning you’re wondering why anyone would partition their data: it can be harder to tune queries, indexes take up more space, and SQL Server’s partitioning requires Enterprise Edition on top of that expensive SAN you’re adding to cope with the extra space. Anyone who is looking at implementing table partitioning in SQL Server would do well to take a look at using Hive for their partitioned database.
Setting up partitioning functions in SQL Server is a pain. It’s left up to the implementor to decide if the partition function should use range right or range left and how partitions will be swapped in and out. Writing robust partitioning functions is stressful the first time around. What if we didn’t have to define a partition function? What if the database knew how to handle partitioning for us? Hive does just that.
Rather than leave the table partitioning scheme up to the implementor, Hive makes it easy to specify an automatic partition scheme when the table is created:
CREATE TABLE sales ( sales_order_id BIGINT, order_amount FLOAT, order_date STRING, due_date STRING, customer_id BIGINT ) PARTITIONED BY (country STRING, year INT, month INT, day INT) ;
As we load data it is written to the appropriate partition in the table. There’s no need to create partitions in advance or set up any kind of partition maintenance; Hive does the hard work for us. The hardest part is writing queries. It’s a rough life, eh?
You might have noticed that while the partitioning key columns are a part of the table DDL, they’re only listed in the
PARTITIONED BY clause. This is very different from SQL Server where the partitioning key must be used everywhere in a partitioned table. In Hive, as data is written to disk, each partition of data will be automatically split out into different folders, e.g.
country=US/year=2012/month=12/day=22. During a read operation, Hive will use the folder structure to quickly locate the right partitions and also return the partitioning columns as columns in the result set.
This approach means that we save a considerable amount of space on disk and it can be very fast to perform partition elimination. The downside of this approach is that it’s necessary to tell Hive which partition we’re loading in a query. To add data to the partition for the United States on December 22, 2012 we have to write this query:
INSERT INTO sales PARTITION (country = 'US', year = 2012, month = 12, day = 22) SELECT sales_order_id, order_amount, due_date, customer_id, cntry, yr, mo, d FROM source_view WHERE cntry = 'US' AND yr = 2012 AND mo = 12 AND d = 22 ;
This is a somewhat inflexible, but effective, approach. Hive makes it difficult to accidentally create tens of thousands of partitions by forcing users to list the specific partition being loaded. This approach is great once you’re using Hive in production but it can be tedious to initially load a large data warehouse when you can only write to one partition at a time. There is a better way.
With a few quick changes it’s easy to configure Hive to support dynamic partition creation. Just as SQL Server has a
SET command to change database options, Hive lets us change settings for a session using the
SET command. Changing these settings permanently would require opening a text file and restarting the Hive cluster – it’s not a difficult change, but it’s outside of our scope.
SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict;
Once both of these settings are in place, it’s easy to change our query to dynamically load partitions. Instead of loading partitions one at a time, we can load an entire month or an entire country in one fell swoop:
INSERT INTO sales PARTITION (country, year, month, day) SELECT sales_order_id, order_amount, due_date, customer_id, cntry, yr, mo, d FROM source_view WHERE cntry = 'US' ;
When inserting data into a partition, it’s necessary to include the partition columns as the last columns in the query. The column names in the source query don’t need to match the partition column names, but they really do need to be last – there’s no way to wire up Hive differently.
Be careful using dynamic partitions. Hive has some built-in limits on the number of partitions that can be dynamically created as well as limits on the total number of files that can exist within Hive. Creating many partitions at once will create a lot of files and creating a lot of files will use up memory in the Hadoop Name Node. All of these settings can be changed from their defaults, but those defaults exist to prevent a single
INSERT from taking down your entire Hive cluster.
What About Partition Swapping?
Much like SQL Server, Hive makes it possible to swap out partitions. Partition swapping is an important feature that makes it easy to change large amounts of data with a minimal impact on database performance. New aggregations can be prepared in the background
How do we perform a partition swap with Hive? A first guess might be to use the
INSERT OVERWRITE PARTITION command to replace all data in a partition. This works but it has the downside of deleting all of the data and then re-inserting it. Although Hive has no transaction log, we’ll still have to wait for data to queried and then written to disk. Your second guess might be to load data into a different location, drop the original partition, and then point Hive at the new data like this:
ALTER TABLE sales DROP IF EXISTS PARTITION (country = 'US', year = 2012, month = 12, day = 22) ; ALTER TABLE sales ADD PARTITION (country = 'US', year = 2012, month = 12, day = 22) LOCATION 'sales/partitions/us/2012/12/22' ;
It’s that easy: we’ve swapped out a partition in Hive and removed the old data in one step. . Truthfully, there’s an even easier way using the
SET LOCATION clause of
ALTER TABLE sales PARTITION (country = 'US', year = 2012, month = 12, day = 22) SET LOCATION = 'sales/partitions/us/2012/12/22' ;
Just like that, the new partition will be used. There’s one downside to this approach – the old data will still exist in Hadoop, only the metadata will be changed. If we want to clear out the old data, it’s going to be necessary to run drop down to HDFS commands and delete the old data out of Hadoop itself.
Is Hive Partitioning Right For You?
If you’re thinking about partitioning a relational database, you should give serious consideration to using partitioned tables in Hive. One of the advantages of Hive is that storage and performance can be scaled horizontally by adding more servers to the cluster – if you need more space, just add a server; if you need more computing power, just add a server. Hive’s approach to data skips some of the necessary costs of partitioning in SQL Server – there’s no Enterprise Edition to purchase, minimal query tuning involved (hint: you should almost always partition your data in Hive), and no expensive SAN to purchase.
For better or for worse – if you’re thinking about partitioning a data warehouse in SQL Server, you should think about using Hive instead.
Getting data out of SQL Server is easy—you don’t need special tools. With just a few lines of code, you can easily send the results of a query to a file. When you’re building a prototype, this is also the easiest way to work; you create quick and dirty tools that get the job done. ETL (extract, transform, load) processes are no exception. During a recent project working with SQL Server and Hive, I assumed that most of the pain would involve Hive. I was sorely mistaken: SQL Server and dirty data were the source of our problems.
Extracting the Errors
Getting data out of SQL Server seemed painless. It only took a few SQL statements and several lines of C# before I had a working data extractor. Data came out of SQL Server and was written to disk in a delimited file format—to make life easy, I decided to use a tab character as the delimiter. The hardest part of the extraction process was checking the data for funny characters.
Hive doesn’t support quoted strings; this made it especially important to make sure that the source data was as clean as possible. Hive’s schema on read makes it possible to support files with a different number of fields in each record, but the downside of this flexibility is that stray delimiter characters will be picked up as a new column.
The initial portion of the data looked correct, so I proceeded with data loads. As we began loading more and more data, problems started appearing in the queries. Data wasn’t where it should be and some of the aggregations were failing—Hive was reporting errors converting strings to numbers. I was perplexed and started looking at the data in the source system. Everything looked correct. Rather than examine the source data, I looked into the extracted data and rapidly got lost in a sea of massive files because Hive’s MapReduce jobs create a detailed log of all activity.
While digging through the chain of MapReduce logs, I found references to the problem data. Ultimately, I found several lines of code where an enterprising user had put a tab character in a string. Not to be outdone, a few other users had managed to sneak newline characters into text as well. After creating additional data cleansing rules, I reloaded all of the source data. Somewhat surprisingly, everything began working at this point. The users and their desire to put random characters in text had been defeated by a simple
A Transformative Experience
Applying data transformations in Hive was anticlimactic. Really. The process of transforming source data into a queryable format was painless—barring date and time bugs, there’s nothing to report. Hive was able to fly through the source data, pivot several hundred gigabytes of tables, and produce complex aggregations in a matter of minutes.
The only difficulty came from figuring out the right techniques for aggregating data. That, however, is a different story.
There’s always a letdown. After a reasonable run of success building a prototype, something had to break and break it did.
After getting data out of SQL Server, pushing data into Hive, running queries, and pulling results out of Hive, you would assume that the hard part was over. After all, the only thing left to do was load the results into SQL Server. And that’s where the problems started.
There are a lot of techniques to load data into SQL Server and they pose different problems. Although the most efficient techniques involve using
BULK INSERT, they require that SQL Server has access to the files. Although that works, troubleshooting a combination of Active Directory and SQL Server security is nobody’s idea of a good time; sorting out the required permissions proved time consuming and difficult. Once security issues were resolved, a new error cropped up:
BULK INSERT may encounter problems loading data into tables created with LOB data types. It was easy to get around this problem by using the correct data types with appropriate precision. This wasn’t difficult to fix, but it did slow down our end to end testing.
In the end…
After a lot of troubleshooting security issues and data formatting issues, we were able to resolve our problems and get an end to end test running. It’s rewarding to watch data move through a processing pipeline, especially after struggling with multiple ETL problems.
ETL remains the most difficult part of database projects. Moving data between different instances of SQL Server can be problematic, depending on the methods used; moving data between SQL Server and different databases can cause problems in ways that you didn’t expect. Building end to end systems is difficult. If you haven’t done it before make sure you pad your schedule; I was able to solve these problems quickly through previous experience and by reaching out to my network, but everyone may not be so lucky. ETL can be problematic even for the most seasoned SQL Server professional.
Curious if you might have a problem suitable for Hadoop but don’t want to run into some of the same problems with your implementation? Take a look at the other Hadoop articles I’ve written or get in touch.
Aggregate functions are convenient – they solve a business need and they make development easy. Unfortunately, not all business requirements are so easy to solve. Let’s look at one example: rolling averages.
The Rolling Average
A rolling average is a simple concept; an average is computed over a fixed subset of data. Rolling average calculations are most frequently used with time series data and help remove short term fluctuations while highlighting long term trends – utility bills often feature a rolling average of consumption to help the customer understand their usage. Consumers aren’t concerned about electricity usage being high on one day in August when there was a heat wave; they want to see how their consumption is changing over time.
Rolling Averages with Common Table Expressions: 2005 – 2008R2
Starting with SQL Server 2005, it became easy to write a rolling average in a single T-SQL statement using a Common Table Expression. CTEs rapidly became popular; an incredible amount of prevously difficult functionality was made possible through CTEs including recursive queries and rolling averages. Take a look at this example in the
AdventureWorks2012 sample database:
WITH cte AS ( SELECT DENSE_RANK() OVER ( ORDER BY tm.YearName, tm.MonthOfYearNumber ) AS r , tm.YearName AS [year] , tm.MonthOfYearNumber AS [month] , SUM(SubTotal) AS SubTotal FROM dbo.TimeMaster tm LEFT JOIN Sales.SalesOrderHeader AS soh ON tm.ActualDate = soh.OrderDate WHERE tm.ActualDate BETWEEN '2005-07-01' AND '2008-08-01' GROUP BY tm.YearName , tm.MonthOfYearNumber ) SELECT cte1.[year] , cte1.[month] , AVG(cte1.SubTotal) AS AverageSubTotal FROM cte AS cte1 JOIN cte AS cte2 ON cte1.r > ( cte2.r - 12 ) AND cte1.r <> cte2.r GROUP BY cte1.[year] , cte1.[month] ORDER BY cte1.[year] , cte1.[month] ;
While not the most straightforward approach to constructing a rolling average, the CTE manages to get the job done. In this query, we are using the CTE to create a work table and then performing a self-join. This same sort of thing is possible using a temporary table or table variable, but the CTE accomplishes it in one statement and is, arguably, easier to read.
Common Table Expressions also hide a dark secret – SQL Server executes the CTE body every time the CTE expression,
cte in this example, is referenced. The more complex the Common Table Expression is, the more work that has to be performed. Running this rolling average with
STATISTICS IO turned on, it’s easy to see the multiple executions in the form of two scans on each table:
Table 'TimeMaster'. Scan count 2, logical reads 14, physical reads 1, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'SalesOrderHeader'. Scan count 2, logical reads 208, physical reads 1, read-ahead reads 102, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 1, logical reads 147, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
On a database this small, this doesn’t pose major performance problems, but this will cause big problems for a moderately sized database.
Rolling Averages with Window Functions: 2012 and beyond
SQL Server 2012 provided better support for windowing functions. Although support for
OVER() was already available in SQL Server 2005, SQL Server 2012 brings considerably more functionality to the table. By using the
ROW or RANGE clause of the windowing function, it’s possible to simplify the query and improve performance. Take a look:
SELECT YearName , MonthOfYearNumber , AVG(st) OVER ( PARTITION BY YearName, MonthOfYearNumber ORDER BY YearName, MonthOfYearNumber ROWS 12 PRECEDING ) FROM ( SELECT tm.YearName , tm.MonthOfYearNumber , SUM(COALESCE(SubTotal, 0)) AS st FROM dbo.TimeMaster tm LEFT JOIN Sales.SalesOrderHeader AS soh ON tm.ActualDate = soh.OrderDate WHERE tm.ActualDate BETWEEN '2005-07-01' AND '2008-08-01' GROUP BY tm.YearName , tm.MonthOfYearNumber ) AS x ;
Although the two queries are remarkably different, the biggest difference is the introduction of
ROWS 12 PRECEDING. This takes the place of the self join in the previous example. Instead of writing out a join ourselves, we simply tell SQL Server that we’d like an average of
st over the last 12 rows sorted by year and month. What kind of effect does this have on the work SQL Server performs?
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TimeMaster'. Scan count 1, logical reads 7, physical reads 1, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'SalesOrderHeader'. Scan count 1, logical reads 104, physical reads 1, read-ahead reads 102, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
111 reads instead of 369 reads. Clearly this change makes for a substantial performance improvement for SQL Server. We’ve reduced the number of reads, eliminated some query complexity, and made it somewhat obvious to the future developers how they could modify or build on this going forward. Changing the new query to a rolling average by day instead of by month is simple and requires even fewer lines of code:
SELECT ActualDate , st AS SubTotal , AVG(st) OVER ( ORDER BY ActualDate ROWS 365 PRECEDING ) AS RollingAverageSales FROM ( SELECT ActualDate , SUM(COALESCE(soh.SubTotal, 0)) AS st FROM dbo.TimeMaster tm LEFT JOIN Sales.SalesOrderHeader soh ON tm.ActualDate = soh.OrderDate WHERE tm.ActualDate BETWEEN '2005-07-01' AND '2008-08-01' GROUP BY ActualDate ) AS x ORDER BY x.ActualDate;
This performs the name number of logical and physical reads as the monthly rolling average using a window function.
There you have it – two different ways to perform a rolling average in SQL Server. One method is clearly a lot easier than the other. There are a number of optimizations in SQL Server 2012 to make it easy for you to build this functionality and to improve SQL Server performance at the same time.