Tag Archive: sqlserver

Understanding Database Choice

Everyone needs a distributed database right? Wrong. It’s easy to get so excited about trying out a new technology that you close your eyes to the problem you’re trying to solve. I came across a question on Stack Overflow where the poster was asking what kind of distributed database they needed. Rather than jump right into the fray and say “You don’t need it,” I took a step back and asked myself “What’s the real question here?”

Oh boy, it's an untested database!

Understand Your Requirements

We all hope that some day our business will be as successful as Google or Facebook or whatever the next big thing is. Those business got that way by first concentrating on doing something incredibly well and then growing as needed. Nobody reasonable wakes up and says “We’re going to build this incredibly robust application and here’s how we’re going to do it.” The internet is full of articles about how different companies have dealt with their problems of scaling. Do your job well and worry about success when it happens.

In the StackOverflow question, Ted (we’ll call this person Ted) was asking which distributed database they should use to scale their system. They gave some vague requirements for data throughput, but left out why the system needed to be distributed across multiple servers.

This triggered my buzzword detector. I think distributed databases are incredibly cool, but they have a place; Ted’s requirements didn’t match at all. without much explanation for why a distributed database would be important here, it was hard to even refute the argument about using a distributed database.

Distributed databases have some operational advantages – they tend to be more robust and tolerant of equipment failures, but that’s based on certain configuration details like using multiple server replicas. RDBMSes, out of the box, aren’t distributed across multiple servers, but there are a lot of features that have been built to make it possible to replicate data across data centers or to shard the database across multiple servers. The closest thing to a business requirement was that the database needed to be free; open source would be nice, too.

Understand Your Hardware

Business requirements only matter so much. Eventually somebody has to write data to a disk. Once that drive head starts moving, the best designed software won’t matter if the underlying hardware can’t keep up with the load. That’s why it’s important to know what your hardware is capable of handling. It’s just as important to know what your application is capable of producing

Ted needed a system that would be handling less than 100 transactions per second and would probably end up writing data at a rate of around 400 kilobytes per second. Neither of these requirements are show stoppers. Assuming that the server was going to be writing at a constant rate, the amount of data generated and kept would be around 10 terabytes of data a year. While it’s nothing to scoff at, it’s not an unheard of data generate rate. The thing is, almost any off the shelf database software can handle these kinds of load. Almost any off the shelf server can handle this kind of data throughput.

The requirement to handle ~100 requests per second at around 4 kilobytes per record isn’t a matter of choosing a database product, it’s a matter of designing a storage solution that can handle the ongoing needs of the business. When SAN space can be purchased from around $15,000 per terabyte, 10TB per year becomes a minor budget line item for all but the most cash strapped startup.

Understand Your Data

There was one feature I left out until now. It’s important to understand how your data will be used. Graph databases excel at helping users explore the relationship between different data points. Relational databases make it easy to build incredibly flexible, good enough solutions to most problems. Key value databases make it possible to do single key lookups in near constant time. The way that data is read limits the playing field.

Ted mentioned that almost of the data lookups were going to be by primary key lookups. If this were the only requirement for reading data, this problem could be solved by any database. Then he threw in a little hook – there would be some joins in the data. In the world of databases once you use the j-word, your options get very limited very quickly. You have to start thinking about querying patterns, potential optimizations, and the trade offs of read vs write optimizations.

If you do need joins, you can take one of two approaches – let the database do it, or write it yourself. Neither approach is difficult (and one is certainly easier than the other), but they’re both feasible – heck the hard part has already been done for you: someone else came up with the algorithm.

Understand Obscenity

Justice Potter Stewart in an attempt to classify obscenity said “I shall not today attempt further to define the kinds of material I understand to be embraced… but I know it when I see it.” Right now, there’s no good definition of what makes data into Big Data. Some people say that you’ve hit Big Data when you can no longer predict query performance. Some people use hard and fast numbers of data volume in bytes, in data churn rate, or in the massively parallel nature of the database. There’s no right or wrong answer and Big Data is something that varies from organization to organization.

It’s important to understand what problem you’re trying to solve, understand the volume of data, and understand how the data is going to be used before making the final selection. There are many ways to store data.

What would I have done in this situation? Taking into account that I know SQL Server well, I would use SQL Server. SQL Server can perform admirably as a glorified key value store. B+trees are pretty quick in most use cases and they balance many of the problems of simultaneously reading and writing data to provide a good enough solution to the problem (with great management tools on top). When business users demand better querying capability, it’s easy enough to start adding non-clustered indexes on top of the solution.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

Availability, Data Locality, and Peer-to-Peer Replication

I want to make something clear: high availability is not load balancing. The two options don’t have to be mutually exclusive, but they aren’t the same thing. Several months ago, I wrote about resolving write conflicts. Some of the approaches I mentioned for resolving write conflicts (such as taking a “last write wins” approach) involved using peer-to-peer replication. It’s important to understand conflict resolution and peer-to-peer replication. Since I’ve already talked about conflict resolution, let’s dig into how peer-to-peer replication fits into the mix.

Peer-to-Peer Replication and You

Peer-to-peer replication is a special and magical kind of replication, it works in a ring or mesh to make sure that one row’s updates will magically spread to all servers. You’d think that this would mean every server is equal, right?

In some distributed databases actions can take place on any server. Using Riak as an example, when you want to add a new record to the database you can write the record to any server and the record will be routed to the server responsible for handling that data. This is part of the beauty of a specific type of distributed database: the database is a collection of nodes that serve reads and writes without regard for hardware failure. There’s a lot of software trickery that goes into making this work, but it works quite well.

SQL Server’s peer-to-peer replication is a distributed database, just not in the sense that I’ve used the term previously. In SQL Server peer-to-peer replication every node is an exact copy of every other node: the same tables, rows, indexes, and views exist on every node. This is where the difficulty begins – if every row exists on every node, how do we know where to update data? The problem is that we don’t know where to update a row. There is no out of the box mechanism for determining row ownership.

Distributed database systems like Riak, Cassandra, HBase, and Dynamo work by assigning an owner to every record, shouldn’t we do the same thing with SQL Server? When we’re spreading data across a number of servers, we have to ensure that writes go to the correct location, otherwise we need to build a large number of checks in to ensure that all nodes have the appropriate updates and that everyone is working on the correct version of data. Otherwise, we run into conflicts. This is the reason I hinted at using peer-to-peer replication combined with write partitioning and a last write wins method of conflict detection. If changes to a row can only occur on server A, we don’t need to worry about updates on other servers – those updates can be ignored since they did occur not in the correct location.

The difficulty lies in finding a way to do all of this. SQL Server’s replication offers no routing functionality, it just replicates data to the appropriate subscribing servers. In order to make sure that data gets to the right place, there must be another piece to the puzzle. There must be a way to correctly locate data.

Record Ownership

If you absolutely must use peer-to-peer replication as a form of load balancing, record ownership is an important concept to consider. Regardless of whether the distributed database is relational or not, software still needs to be aware of where the definitive version of a record can be found. If there’s no way to determine which version of a record is the definitive version of a record, two updates can occur in different locations. This will undoubtedly lead to painful conflict scenarios. Instead of worrying about handling conflicts, we worry about getting data to the right place. Once we know that the data is in the right place, we can trust our database to be as accurate as possible.

I use the term record instead of row for an important reason: a record represents a complete entity in a system. A row may be part of a record (e.g. a line item in an order) but the record is the complete order.

Record ownership is a tricky thing to think about; how do you determine who owns any single row? What’s a fair and efficient way to handle this? Let’s take a look at different techniques and see how they stack up. Here’s a quick list of possible ways we can distribute row ownership in a database:

  • Random
  • Range-based
  • Static

Random Record Ownership

Randomness is frequently used to ensure an even distribution. Randomly bucketizing data turns out to be a very effective way of ensuring that data will be split very close to evenly across any arbitrary number of locations. The difficulty is in ensuring randomness.

Some systems like Riak and Cassandra use a hash function to distribute data ownership around the database cluster. Different nodes are assigned a range of values – if there are four servers in the distributed database, each one is roughly responsible for 1/4 of the data in the database (I’m simplifying, of course, but you get the drift). Special routing code takes care of getting data to clients and sending writes to the appropriate place. The location of a record is typically determined by applying a hashing function to the record’s key. In this way, we can always find a row at a later date: by applying a function to the key we can quickly find the row even if the number of servers in the cluster has changed.

This mechanism provides a reliable way to uniquely identify data and distribute it among many servers. This technique is difficult to accomplish with SQL Server. There is no peer-to-peer replication functionality in the SQL Server space that makes it easy to say “This record belongs on server A and this record belongs on server B.” There’s a reason for this: peer-to-peer replication is a high availability technology. It exists to make life easier in the unfortunate event that your data center slides into the ocean. It’s possible to build some kind of load balancing layer in SQL Server using SQL Server Service Broker (or just about any other technology), but the point remains that SQL Server doesn’t provide out of the box functionality to automatically implement random record based ownership.

Range-Based Record Ownership

Range-based ownership is far simpler than random record ownership. In range-based ownership a range of records are claimed by a single server. This could be users 1 through 100,000 or it could be users whose names start with ‘A’ through users whose names start with ‘K’. At a quick glance range-based record seems like it doesn’t have many down sides: it’s easy to determine where an appropriate record goes. My data goes to server A, your data goes to server B, his data goes to server C.

Range-based record ownership has a major flaw: some servers will experience more load than others. For example, if we’re partitioning by name we will quickly discover that first names aren’t very unique, at least not in Western cultures. In a survey of first names conducted in the UK, one quarter of women were likely to have only one of ten first names in 1994. One in three women was likely to be named Emily. Needless to say, data distribution will cause skew in the activity distribution on different servers. If one server accumulates a clump of very active users (e.g. a group of active early adopters), that server may experience a higher load than the others.

Designing an effective range-based record ownership scheme for SQL Server peer-to-peer replication is possible but very difficult. The effectiveness of the scheme depends on intimate knowledge of write patterns. Most of us don’t have the time to develop a deep understanding of how data is written and then develop a scheme that takes into account those patterns.

Static Record Ownership

With static record ownership, we assign each record to a server when it is created. This could be as simple as assigning a user to the closest server or it could mean assigning records to a server by some other arbitrary means. However this is accomplished, it’s important to remember that some piece code still must able to determine where a record should go and that the mechanism for identifying that initial location should be general purpose enough to meet your user’s needs in the long term.

There are several common ways to split out data. If you have a system that’s multi-tenant, it becomes easy to assign ownership for all of a single client/customer’s data to a single server. If that customer grows, you can buy a separate server or move them onto a different server with fewer users. Every record ends up having a composite key made up of the record identifier and the client identifier, but this is a small price to pay for clearly being able to separate data responsibility by client.

Another way to split out data is geographically. If I sign up for a service, it’s nice if the primary place to write my data is as close to me as possible. In this case, the service might have three data centers: in LA, in New York, and one in London. Much like using a multi-client architecture, a geographic method to determine ownership would use the location as part of the key for each record – records stored in LA would use a composite key with the data center location (‘LA’) and some other arbitrary key value to identify a unique record.

No matter what scheme you decide to use, static record ownership is an easy way to determine which SQL Server should be responsible for writes to a single record. An advantage of static record ownership is that routing can be handled in the application or a sufficiently sophisticated router can handle routing writes without any additional application code being added to the application – just a few load balancer rules will need to be created or changed
.

In Summation

Here’s the trick: throughout all of this we’ve ignored that order of events is important. We’ve just assumed that when data is being written, we’re guaranteeing the order of events. If the data is being written to random servers, there’s no guarantee of event order. In a naive system, a record might be written to one server and an update applied to a second server before the original record even shows up! Distributing data is difficult. Randomly distributing data is even more difficult. No matter how you distribute your data or distribute writes, remember that distributing data in SQL Server through peer-to-peer replication is a high availability technology. It can be co-opted for scale out performance improvement, but there are some design decisions that must be made.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

A DBA’s Guide to ORMs Webcast

Developers often use tools, like Entity Framework or NHibernate, to make it easier to work with the database. These tools sometimes cause problems for developers and DBAs alike. In this talk, we’ll talk about the terminology and techniques used with an ORM. We’ll also uncover ways to help DBAs work with developers, detect problematic queries, and improve performance in both the database and the application.

This 30 minute session is for DBAs who are unfamiliar with ORMs and who aren’t sure where to start.

Using ORMs with Stored Procedures

Most ORMs can use stored procedures instead of writing their own SQL. This is important when data is more complex than a single object to table mapping.

The n+1 Selects problem

The n+1 selects problem frequently occurs when displaying a list of items to a user. This can happen through a combination of looping in application code and lazy loading (only loading data when it’s explicitly needed). The ORM will generate multiple calls to the database, one for each object that’s used. Solving this problem depends on the particulars of the ORM that you’re using.

Query Cache Pollution

ORMs can cause the same problems that ad hoc SQL can cause – many plans will be generated and consume SQL Server’s memory. Grant Fritchey (blog | twitter) documented how this problem appears in NHibernate and how to detect it in NHibernate Recompiles and Execution Plans. Solutions abound and there’s an excellent write up of the history of this problem in NHibernate Parameter Sizes Controversy.

General ORM Links

Links to Common ORM Tools

  • NHibernate – a commonly used .NET ORM that is based on Hibernate
  • Hibernate – the grandaddy of Java ORMs and the inspiration for many others.
  • LLBLGen Pro – this is the Cadillac of ORMs. If there’s something you wished an ORM could do, odds are LLBLGen Pro can do it. It even provides tools to generate code for other ORMs.
  • NHibernate Profiler – it’s a profiler that developers can locally to grab only their own queries to the database.
  • Ruby on Rails – Ruby on Rails uses an ORM named ActiveRecord to do the heavy lifting.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

Notes on Scalability

We all hope that we’re going to succeed beyond our wildest expectations. Startups long for multi-billion dollar IPOs or scaling to hundreds, or even thousands, of servers. Every hosting provider is touting how their new cloud offering will help us scale up to unheard of heights. I’ve built things up and torn them down a few times over my career

Build it to Break

Everything you make is going to break, plan for it.

Whenever possible, design the individual layers of an application to operate independently and redundantly. Start with two of everything – web servers, application servers, even database servers. Once you realize that everything can and will fail, you’ll be a lot happier with your environment, especially when something goes wrong. Well designed applications are built to fail. Architects accept that failure is inevitable. It’s not something that we want to consider, but it’s something that we have to consider.

Distributed architecture patterns help move workloads out across many autonomous servers. Load balancers and web farms help us manage failure at the application server level. In the database world, we can manage failure with clustering, mirroring, and read-only replicas. Everything computer doesn’t have to be duplicated, but we have to be aware of what can fail and how we respond.

Everything is a Feature

As Jeff Atwood has famously said, performance is a feature. The main thrust of Jeff’s article is that making an application fast is a decision that you make during development. Along the same lines, it’s a conscious decision to make an application fault tolerant.

Every decision that has a trade off. Viewing the entire application as a series of trade offs leads to a better understanding about how the application will function in the real world. The difference between being able to scale up and being able to scale out can often come down to understanding key decisions that were made early on.

Scale Out, Not Up

This isn’t as axiomatic as it sounds. Consider this: cloud computing like Azure and AWS is at its most flexible when we can dynamically add servers in response to demand. To effectively scale out means that we need also to be able to scale back in.

Adding additional capacity is usually in the application tier; just add more servers. What happens when we need to scale the database? The current trend is to buy a faster server with faster disks and more memory. This process keeps repeating itself. Hopefully your demand for new servers will continue at a pace that is less than or equal to the pace of innovation. There are other problems with scaling up. As performance increases, hardware gets more expensive for smaller and smaller gains. The difference in cost between the fastest CPU and second fastest CPU is much larger than the performance gained – scaling up often comes at a tremendous cost.

don't be afraid to change everything

The flip side to scaling up is scaling out. In a scale out environment, extra commodity servers are added to handle additional capacity. One of the easiest ways to manage scaling out the database is to use read-only replica servers to provide scale out reads. Writes are handled on a master server because scaling out writes can get painful. But what if you need to scale out writes? Thankfully, there are many techniques available to horizontally scaling the database layer – features can be broken into distinct data silos, metadata is replicated between all servers while line of business data is sharded, or automated techniques like SQL Azure’s federations can be used.

The most important thing to keep in mind is that it’s just as important to be able to contract as it is to expand. As a business grows it’s easiest to keep purchasing additional servers in response to load. Purchasing more hardware is faster and usually cheaper than tuning code. Once the application reaches a maturity level, it’s important to tune the application to run on fewer resources. Less hardware equates to less maintenance. Less hardware means less cost. Nobody wants to face the other possibility, too – the business may shrink. A user base may erode. A business’s ability to respond to changing costs can be the difference between a successful medium size business and a failed large business.

Buy More Storage

In addition to scaling out your servers, scale out your storage. If you have the opportunity to buy a few huge disks or a large number of small, fast disks give serious thought to buying the small, fast disks. A large number of small, fast drives is going to be able to rapidly respond to I/O requests. More disks working in concert means that less data will need to be read off of each disk.

The trick here is that modern databases are capable of spreading a workload across multiple database files and multiple disks. If multiple files/disks/spindles/logical drives are involved in a query, then it’s possible to read data from disk even faster than if only one very large disk were involved. The principle of scaling out vs. scaling up applies even at the level of scaling your storage – more disks are typically going to be faster than large disks.

You’re Going to Do It Wrong

No matter how smart or experienced your team is, be prepared to make mistakes. There are very few hard and fast implementation guidelines about scaling the business. Be prepared to rapidly iterate through multiple ideas before finding the right mix of techniques and technologies that work well. You may get it right on the first try. It may take a number of attempts to get it right. But, in every case, be prepared to revisit ideas.

On that note, be prepared to re-write the core of your application as you scale. Twitter was originally built with Ruby on Rails. Over time they implemented different parts of the application with different tools. Twitter’s willingness to re-write core components of their infrastructure led them to their current levels of success.

Don’t be afraid to change everything.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

Silent CPU Killers

It’s always important to make sure that SQL Server is running at peak efficiency – nobody wants to waste resources. In the past we’ve concentrated on ways to reduce I/O – either network I/O or disk I/O. The increasing number of cores available in modern CPUs meant that CPU tuning wasn’t as much of a concern. With the licensing changes coming in SQL Server 2012, tuning SQL Server for CPU should be in front of every DBA and developer’s mind.

Old or Missing Statistics

Statistics help SQL Server determine the best way to access data. When we have accurate statistics, the database is able to examine the statistics and make good guesses about how much data will be returned by a given query. The statistics are then used to determine the most appropriate join or data access method (seek vs scan, for instance). The problem with statistics is that they’re only good for a short period of time. Statistics are recorded as a histogram that describes how data is distributed in a table or index. If the histogram is inaccurate, then SQL Server has no reliable way of knowing how much data is present.

Looking at the AdventureWorks sample database, we can see there’s an index on the SalesOrderHeader table IX_SalesOrderHeader_OrderDate. If we use DBCC SHOW_STATISTICS to look at the index, we can see that the most recent valued for RANGE_HI_KEY is ‘2008-07-31 00:00:00’. What happens when we query for data that’s outside of that range? SQL Server assumes that there’s only 1 row present no matter how many rows there are. There could be 0 rows, 5 rows, or 500,000 rows. SQL Server doesn’t have any statistics about the data, so it cannot make an accurate decision.

Old, or missing, statistics can be cause huge performance problems because SQL Server will have to make a bad guess (1 row) regardless of how much data is present. SQL Server may choose an inappropriate join type and perform a large number of additional reads; a nested loop join may be used when a merge or hash join would be more appropriate.

The solution is to make sure that statistics are kept up to date.

  1. Understand data change rates. If data is loaded as part of regular jobs, make sure that index maintenance is performed on a regular basis and that statistics are updated during index maintenance.
  2. Understand the type of data change. If data is written at the end of table, there’s a chance its data won’t be included in statistics (especially in the case of something like an order date). If a lot of data coming into the system is time sensitive, regular statistics updates will be necessary to keep things performing well.
  3. Make sure that AUTO_UPDATE_STATISTICS is enabled for all databases. Even if your jobs to update statistics fail to run, having AUTO_UPDATE_STATISTICS enabled will make sure that SQL Server updates your data as it changes. Statistics will not get updated as often as we’d like, but it is better than nothing.

Ordering Data

The ORDER BY clause exists with good reason, right? Users don’t just need to get data back from the database, they want to get it back in a specific order. While that’s true, ordering data has a definite cost associated with it, if the data isn’t already in the correct order. Take a look at this query:

SELECT SalesOrderID, OrderDate, DueDate, ShipDate, SalesOrderNumber, PurchaseOrderNumber FROM Sales.SalesOrderHeader WHERE OrderDate = '2007-02-01 00:00:00.000' ORDER BY SalesOrderNumber ; 

When we look at the actual execution plan, sorting the results take up 79% of the cost of the query. While this is a simple query, it’s easy to see how sorting can chew up a lot of CPU in SQL Server.

There’s no easy solution. However, many programming languages have the ability to use functional programming techniques to sort data in memory. The .NET framework makes it even easier by providing datatypes like the DataTable and DataSet. Both of these datatypes make it easy to sort data in memory in an application server. The idea here is that CPU is cheap on a Windows Server – we only pay to license the server, not each socket or CPU core. It is also much easier to horizontally scale the middle tier of an application than it is to horizontally scale SQL Server.

Functions

Functions are one of the biggest offenders when it comes to wasting SQL Server’s CPU time. The tricky part about functions is that they seem like they an easy way to encapsulate and re-use functionality. While functions accomplish this goal, there are other aspects of functions that make them a bad choice for many activities.

Simple, single statement table valued functions can be easily inlined by SQL Server. That is to say SQL Server will be able to include the execution plan of the function in the calling execution plan. The function’s execution plan can be optimized with the rest of the query. This can potentially lead to additional optimizations within the function. Multi-statement table valued functions and scalar functions won’t be optimized much of the time. Instead SQL Server will repeatedly execute the underlying function – once for every row in the result set. Over very large result sets this can result in a considerable amount of CPU utilization. IT’s often best to replace functions like this with inlined query logic. For a look at the behavior of functions in SQL server, take a gander at SQL Server Functions: The Basics.

Another downside to functions in SQL Server is that they frequently don’t get good statistics from the optimizer. When a table valued function returns data, SQL Server has no way to know (at compile time) how much data will be returned from the function. Instead it makes the assumption that there is only 1 row present in the result set from the TVF. This can result in poor join choice as well as bad data lookup decisions, just as if there were incorrect or missing statistics on an index or table.

Summing It Up

Keeping statistics up to date, moving ordering out of the database, and eliminating costly functions will go a long way to keeping your CPU utilization low. There are, of course, many other ways to combat high CPU, but these are some of the biggest culprits and some of the items that can be most easily fixed by database administrators and developers alike.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

What Good is a Pig?

Data matters. Every day we generate huge volumes of data. Processing all of this data presents challenges for many people.

Pig is a data flow language. It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently. Best of all, it supports many relational features, making it easy to join, group, and aggregate data. If you think this sounds a lot like an ETL tool, you’d be right. Pig has many things in common with ETL tools, if those ETL tools ran on many server simultaneously.

Where would you use Pig?

Case 1 – Time Sensitive Data Loads

Loading data is a key part of many businesses. Data comes in from outside of the database in text, XML, CSV, or some other arbitrary file format. The data then has to be processed into a different formats and loaded into a database for later querying. Sometimes there are a lot of steps involved, sometimes the data has to be translated into an intermediate format, but most of the time it gets into the database, some failure it to be expected, right?

Loading large volumes of data can become a problem as the volume of data increases: the more data there is, the longer it takes to load. To get around this problem people routinely buy bigger, faster servers and use more fast disks. There comes a point when you can’t add more CPUs or RAM to a server and increasing the I/O capacity won’t help. Parallelizing ETL processes can be hard even on one machine, much less scaling ETL out across several machines.

Pig is built on top of Hadoop, so it’s able to scale across multiple processors and servers which makes it easy to processes massive data sets. Many ETL processes lend themselves to being decomposed into manageable chunks; Pig is no exception. Pig builds MapReduce jobs behind the scenes to spread load across many servers. By taking advantage of the simple building blocks of Hadoop, data professionals are able to build simple, easily understood scripts to process and analyze massive quantities of data in a massively parallel environment.

Parallel rockets make a single pig faster

An advantage of being able to scale out across many servers is that doubling throughput is often as easy as doubling the number of servers working on a problem. If one server can solve a problem in 12 hours, 24 servers should be able to solve it in 30 minutes.

Case 2 – Processing Many Data Sources

Knowing the effectiveness of an advertisement is big business. For people buying ad space, it’s critical to know that just how effective their advertising is in both the physical and virtual space. Combining advertising information from multiple sources and mixing it together with web server traffic, IP geo-location, and click through metrics it’s possible to gain a deeper understanding of customer behavior and judge just how effective certain ads are in certain parts of the country.

Pig isn’t just designed to scale out over many servers. Pig can be used to complex data flows and extend them with custom code. A job can be written to collect web server logs, use external programs to fetch geo-location data for the users’ IP addresses, and join the new set of geo-located web traffic to click maps stored as JSON, web analytic data in CSV format, and spreadsheets from the advertising department to build a rich view of user behavior overlaid with advertising effectiveness.

Creating this rich view of data is possible because Pig supplies complex features like joins, sorting, grouping, and aggregation. The syntax is different than developers are used to but Pig’s focus on data flow makes it easy to write complex jobs. Rather than creating complex logic in SQL, developers can create jobs that walk through data step by step to deliver the best results. It’s easy to rapidly prototype these procedural jobs and performance tuning can be accomplished with relative ease.

Case 3 – Analytic Insight Through Sampling

Even in case 2, we’ve seen how Pig can provide some analytical insight into the massive quantities of data that are generated every day in the datacenter. It’s easy to fall into the trap of thinking that Pig is an ETL glue that moves data from a log file, processes it, and drops it off for another database to consume. Pig is more than just an ETL tool.

One of Pig’s strengths is its ability to perform sampling of large data sets. As Pig manipulates data, it’s easy to reduce the set of data that we’re operating on using sampling. By sampling with a random distribution of data, we can reduce the amount of data that needs to be analyze and still deliver meaningful results.

Summing Up

Pig isn’t a replacement for SQL Server Integration Services. Their use cases overlap for many tasks, but they also solve very different problems. Using Pig for all ETL processes will be overkill when the data can reasonably be handled within a single SQL Server instance. On the flip side, there are problems that are too large to quickly solve within a single SSIS process or package. In either situation you should pick the best tool for the job.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

SQL Server in EC2

The cloud is robust and reliable. The cloud solves all of our scaling needs. The cloud makes my poop smell like roses. While all of these statements are theoretically true it takes some effort to make them true in reality, especially when a database is involved.

Who Is Deploying SQL Server in EC2?

A question I hear a lot is, “Who is putting SQL Server into EC2?” Sometimes there’s a hint of incredulity in that question, hinting that people can’t seriously be deploying SQL Server into EC2. That’s far from the truth: many companies, large and small, are deploying SQL Server into Amazon’s cloud, with varying degrees of success.

Back to the question of who’s deploying SQL Server in EC2…

I work with a start up building a platform on a Microsoft stack. The front end is hosted on mobile devices, the middle tier is written in .NET, and the database is SQL Server. They’re just starting out and they don’t have the time to spend speccing out servers, ordering them, and waiting for delivery. It’s easy to spin up servers to test new features and functionality and if a feature doesn’t catch on the servers are turned back off. Traditional hosting facilities didn’t offer the the flexibility or response times that they were looking for, so they chose Amazon EC2.

Another company I work with is a ISV transitioning to a Software as a Service (SaaS) model. The ISV works in an industry where there is a lot of specialized knowledge and technical ability, but that knowledge and ability doesn’t have anything to do with keeping a server up and running. In the past they’ve added additional maintenance routines to the software that they sell to customers, but sometimes customers don’t have anywhere to host this mission critical software. The ISV has hosted a few customers on their own servers, but they don’t want to keep buying new hardware in response to customer demands – what happens when a customer leaves? The ISV hasn’t made the move to cloud computing yet, they’re feeling out the different options available, but EC2 provides a level of flexibility that they won’t get through traditional IT solutions.

What are some of the problems?

There are, of course, problems with every way that you can possibly deploy SQL Server. When you deploy on premises, you have to purchase hardware, wait for it to arrive, set it up, and then you have to maintain it. When you host your database with another hosting provider, you’re limited by the amount of server space they have available. When you host in EC2 there are a different set of problems.

Cost

One of the perceived problems with deploying into the cloud is the cost. You’re renting something by the hour (like that “friend” your uncle brought to Thanksgiving last year); and renting something by the hour can get really expensive when you want that something to be up and running all the time.

How much would it cost to keep a reasonable SQL Server running 365 days a year? $10,862.40 assuming that you’re using a Double Extra Large Hi-Memory instance (that’s 4x 2.61GHz Xeons with 34.2 GB of memory). You can really kick out the jams and get yourself a Quadruple Extra Large Hi-Memory instance (8x 2.66 GHz Xeons and 68.4GB of memory) for twice the price. Yeah, that’s expensive. Of course, you can also reserve the instance and just buy it outright for a fraction of that cost, but who wants to do that?

What would a similar server cost you from HP? You can get an HP ProLiant DL380 G7 with a pair of fancy pants Intel E5640 CPUs (that’s right around the same speed as the EC2 instance), 32GB of RAM, and 1.2TB of in chassis storage for about $8,600. That price doesn’t include an OS or any other licensing costs. It also doesn’t include power, cooling, or even a spare server sitting around ready to pick up the load if your primary server fails.

Storage isn’t tremendously expensive in Amazon’s cloud – 10 cents per GB per month of provisioned storage. Over the course of a year, 1 terabyte of storage is only $102.40 a month, and that storage is fairly redundant within a single data center.

Despite the cost, Amazon’s cloud is still incredibly popular with many businesses. Why? Simple: it’s easy to start with a small amount of resources and rapidly scale in response to customer demands. If something doesn’t pan out, you just stop the servers and delete the storage and you don’t have to worry about it anymore. The costs go away as soon as you stop using the compute cycles.

Noisy Neighbor

I used to live in an apartment. It wasn’t a great apartment. In fact, it wasn’t even a nice apartment. I could hear everything that my neighbors did. After a while, I knew about their girlfriends, their love of Super Tecmo Bowl, and I learned that they liked to listen to loud music on week nights when they didn’t have to work the next day.

My noisy neighbors made it difficult for me to get things done. When you’re sharing a host, noisy neighbors can make it difficult to get things done, too. This sort of thing can happen on a server right now – when one program is aggressive writing to disk, read performance will suffer. If you’re sharing a physical server with another guest operating system, you have no idea what that other OS is doing. It could be sitting there doing nothing, it could be chugging along doing 3d rendering, or it could be a BitTorrent server. You have no idea what’s going on in the next room and how it could be affecting your performance.

Unfortunately, there’s not a lot that you can do about noisy neighbors apart from moving. In the real world, you move to a new apartment or buy a house. It takes time and money to get a new place, but it’s feasible. In EC2 it’s a lot easier to get a new place: you just upgrade your instance. You pay more by the hour, but it’s easy to get a bigger place.

Crashes in general

To be fair, you can’t do much about crashes if you’re using someone else’s hosting service; when the data center loses power your server is going to go down. So, how do you protect yourself from crashes? Redundancy, of course.

Unfortunately, redundancy gets expensive fast. If you want a redundant mirror for that 10k a year server, you need a second 10k a year server. If you want multiple servers spread across the country, you’re paying for each one. Thankfully the cost of the servers includes the Windows license, but it’s still a large cost for many people to stomach.

SQL Server has a number of options to help you keep your servers up and running in case something horrible happens. Traditional database mirroring will work in EC2. You are free to implement whatever Rube Goldberg machine you want. It’s important to keep in mind that you have little control over your storage (there are no SAN snapshots or SAN mirroring that you can control) and there is no dedicated network that you can rely on. All of your connectivity goes over the open infrastructure within Amazon’s data centers. You’re fighting for bandwidth with Amazon, Netflix, Heroku, and everyone else in the cloud.

Limits: Number of Drives

How many drives can you connect to an installation of Windows locally? 26 or so, right? That’s how many letters there are in the alphabet. You can go beyond that using mount points to boost the amount of storage you have. For most purposes, though, you effectively can attach an unlimited number of drives to a Windows machine. Certainly more than you’ll need. SANs make it easy to create huge volumes, after all.

In EC2, that’s not quite the case. You can only attach 16 drives (Amazon call them volumes) to a server. It’s important to keep this limitation in mind when designing your strategy to move into Amazon’s cloud. If you need to store more than 16 terabytes of data, you’ll need to find a way to spread that data out across multiple EC2 instances. Considering some of what we’ll cover later on, this is going to be a really good idea.

Limits: Network Throughput

Did you know that there’s a pretty solid limit on network throughput in EC2? I didn’t either until I start performing some storage benchmarks (more on this soon). It turns out that there’s a single active gigabit connection into each host. That’s right: a gigabit connection per host. You might be sharing that server with a lot of other servers. If there are 4 other SQL Servers sharing the same gigabit connection, you could end up with some very poor performance.

Amazon recently added instances with 10 gigabit network cards, but they’re limited to Linux machines in the US-East region right now. In the future there might be instances with 10 gig connections that run Windows, but there’s nothing in EC2 right now to make the path between a Windows server and the storage any faster. To keep up to date on which instances have 10 gigabit ethernet, visit the list of EC2 Instance Types and search for “I/O Performance: Very High”.

In short, if your applications are chatty, you’re going to need to learn to break them apart into many smaller pieces to keep throughput high.

Limits: Instance Sizes

We just briefly touched on this in the last section – not every size of instance is available in every region. Typically, the US-East region gets the newest fanciest servers, at least as far as I can tell. Other regions slowly get new hardware over time. The name of the instance type (m1.small, m2.4xlarge) will stay the same over time, but it may not be possible to bring up instances of the same size in every region.

Performance: Instances

Just like any other computer, there are a finite number of resources available on any host. Likewise, the VMs that you spin up have a finite amount of resources available to them. Unlike your local data center, you have no control over which VMs share the same host.

Normally you could put different machines together and stagger their functionality around the clock so that guests that did bulk processing at night would be on the same host as machines that record orders during business hours. When everything is under your control, it’s very easy to make sure that any host isn’t overcommitted during a certain time window.

The only way that you can guarantee performance is to allocate larger instances that take up all of the physical resources of the host but cost more. By doing this you’re able to eliminate the effect of noisy neighbors. It’s important to understand how to get the best performance out of your resources.

Performance: Storage

If you want to keep data around for any length of time, it’s important to persist it somewhere. Persisting your data means that you have to write it and, if you aren’t writing it to your users’ computers you have to write it to the database. Hiding beneath all of that fancy database software is something as simple as a pile of disks for storing data in rows and tables. Making sure the disks are working as fast as they can is critical to keeping a database running at peak performance.

A Note About How Gigabit Ethernet Sucks and Infiniband is Your Friend

Gigabit ethernet sucks. If disk I/O is your absolute bottleneck, you only have two options – you need to write to as many drives as possible. We’ll find out why that isn’t entirely possible in EC2. The other option is to make the connection between your database server and the storage as fast as possible. Since you can’t get super fast storage in EC2, you’ll have to find another way to make your database fast.

Just How Fast is Gigabit Ethernet?

The question isn’t rhetorical. It’s important to understand how fast gigabit ethernet is. Gigabit ethernet can really only transfer, at most, 120 megabytes per second. That’s the theoretical maximum assuming that everything along the line is operating perfectly and there is no latency or additional hops to make.

We’re not likely to see this in the real world. You’re not likely to see this in your own server room and you’re sure as hell not going to see it in EC2.

Why is your storage going to be slow? You have keep in mind that not everything involved may be able to push data at 120 MB/s; the drives could be slow, fragment, or you could be sharing space with someone else who is doing a lot of reading and writing. Other people could be reading from the same host as you or they could be streaming movies from the host to a customer in the suburbs.

In an attempt to find an answer, I conducted several tests on an EC2 instance over the course of several days to answer the question, “Just how fast is gigabit ethernet?”

Testing Methodology

For all of the tests, I used the same m2.4xlarge instance. I installed it from a standard Windows Server 2008R2 SP1 AMI that Amazon supply. Nothing was changed with the operating system. This Windows was about as vanilla as it gets.

For my tests with the drives, I used EBS volumes that were allocated in complete 1TB chunks. I made two rounds of tests. In both cases, the drives were formatted with a 64k sector size. The first round of tests was done with drives that were quick formatted. The second round of tests were performed with drives that were not quick formatted. FYI, formatting 10TB of drives in EC2 takes about 2 days. I used a Large Windows instance to format the drives at a cost of about $26 in CPU time plus another $57.20 in I/O cost just to format the drives.

The first set of tests I ran involved using CrystalDiskMark. I used the default configuration and ran it across the quick formatted drives. Drives were configured with 1 EBS volume, 2 EBS volumes, 3 EBS volumes, and 4 EBS volumes. I conducted a second set of CrystalDiskMark tests using fully formatted drives with 1, 2, 3, 4, and 10 EBS volumes in a single drive.

The second set of tests was more complex and involved using SQLIO to simulate sequential and random reads from SQL Server over a period of time. All tests were performed on a single EBS volume, two striped EBS volumes, four striped EBS volumes, and a pair of striped four EBS volumes.

EC2 Storage Test Results

CrystalDiskMark Results - Quick Format

Right away, we can see that performance quickly spikes for sequential reads and writes. While the results for read decrease, this isn’t a reflect of the storage itself so much as it is a reflection of trying to cram a lot of data through a tiny ethernet pipe. Overall, the quick formatted drives don’t show a lot of performance improvement no matter how many spindles we throw at the problem.

CrystalDiskMark Results - Full Format

There are some things that we can conclude about the performance of disks in Amazon’s platform. First and foremost: format your drives. Don’t just use a quick format; use the slow format. The downside is that it takes time to format disks and time is money, especially in the cloud.By using a full format, disk performance improved by 30% for 4K writes with a queue depth of 32 and improved, on the whole, by around 15%. There were a few anomalies – specifically the 4k write performance spike with 4 EBS volumes and the sequential read performance spike for 2 and 4 EBS volumes. The only conclusion I can come to is that reads were cached and did not hit disk.

SQLIO - IOPS

The SQLIO tests show a different story. Right away we can see that sequential I/O performance peaks with a maximum of two volumes, but random reads and writes keep scaling right up until the tests stop – the more spindles that were used, the more random I/O that we could perform. And here’s where things get interesting again. Looking at the graphs you can see that sequential I/O falls apart a little bit north of 1000 IOPS. That’s not a limit of the underlying disks, we’re just maxing out the ethernet connection. When there’s a finite pipe limiting how fast we can read, of course the number of IOPS is going to stop pretty quickly.

SQLIO - MB per second

Looking at the throughput in terms of the amount of data (rather than the number of reads and writes) that we can move through EC2, it’s very easy to see where we hit the ceiling. With a theoretical maximum of 120 MB/s, it’s easy to see that sequential read and write max out when two volumes are use. It’s clear, though, that random reads and writes continue to scale as we add more drives. I suspect that if I had been able to add more spindles, the random write performance would have continued to scale up to 120 MB/s.

What does this mean for SQL Server in EC2?

You can get the random I/O performance that you want, but sequential read/write performance is not going to be good. Taking that a step further, here’s what you don’t want to do in EC2:

  • Data warehousing
  • Large ETL
  • Bulk data processing that requires sequential reads
  • Anything else that requires a lot of sequential read and write

On the flip side, what should you use SQL Server in EC2 for?

  • OLTP
  • Off site processing that isn’t time sensitive
  • Distributed write loads (sharding)

SQL Server hosted in EC2 is going to be very effective for highly random loads and especially for loads where data can be kept in memory (currently less than 68.4 GB).

A Sidebar about Ephemeral Storage

Don’t use it. It goes away and may never come back. Well, it goes away when your instance reboots.

Because it’s the drives that live in your server chassis. You can’t predict where your server is going to be when you turn it back on again, so you can’t predict if that data is going to stay with you or not. This is a best practice for virtualization, if you think about it. When you’re setting things up using VMware or Hyper-V you don’t want to attach long-term storage that’s in the server chassis, it makes your virtualization more brittle. Using local storage, if any thing happens (a motherboard fails, a drive fails, a CPU fan goes out, or memory fails) you’ve lost that host. You can’t bring that VM up on another server because the OS is still on hard drives in that broken server.

Local storage is great for temporary (ephemeral) things. Make sure that you take that into account when you’re designing your infrastructure. Fast, cheap, temporary storage is great for fast, cheap, temporary operations like TempDB or local scratch disks.

First Touch Penalty

Amazon mention in a lot of their documentation that there’s a penalty, and potentially a large one, for the first write to a block of storage. When you look at the numbers below you’ll realize that they weren’t kidding. Amazon further go on to claim that if you benchmark the storage (as I did) you’ll be horrified by the results (as I was) but that you shouldn’t worry because the first write penalty is only really obvious during benchmarking and will be amortized away for time in the real world.

Here’s where I have to call bullshit. If your workload consists almost entirely of writing new data, you’re going to see a first touch penalty until the drive is full. It won’t end. Writes will suck until your drive is full. Then you’ll add more drives and the suck will continue. This is compounded for relational databases where the currently accepted best practice is to ensure sequential writes by clustering on an arbitrarily increasing integer.

The only way to guarantee acceptable performance is to format your drives and write to them block by agonizing block until the storage is full formatted.

Staying Up in the Clouds – HA/DR Options

Clustering: Sad Trombone of the Cloud

You can’t cluster in EC2. Don’t try. There’s no shared storage. EBS volumes can only be attached to a single server at a time. You could share the drives, but you really shouldn’t do that. You could use some Windows based SAN system that mounts drives and pretends it’s a SAN, but without any guarantees of performance or longevity, why would you want to?

We’ll just leave this one as: You can’t cluster in EC2.

Availability zones and regions

In case you didn’t know, Amazon’s magical cloud is divided up into regions. Each region is made up of multiple availability zones. They’re all right next door to each other, but they’re in separate buildings with separate internet access and separate power.

Availability zones exist to protect users from localized failure. Hopefully, if there’s a problem in a data center, it’s only going to be a problem in one data center (one availability zone). If that zone goes down, the others should be up and running. If you want to keep your server up and running through local failures, you need to make sure you keep a copy of your data in multiple availability zones. If you want to keep your business up and running in the event of a regional failure (the smoking crater scenario), you need to keep copies of your data in multiple regions.

Region – national level.

Availability zone – local level.

Got it? Good.

Mirroring SQL Server in EC2

One of the easier ways to keep things up and running in the event of a failure is to use SQL Server’s mirroring. Using synchronous mirroring with a witness is going to put you in a great place in terms of disaster recovery. It’s both possible and advisable to put the primary in one zone, the secondary in another, and the witness in a third zone. We can feasibly survive some pretty heavy duty hardship by configuring SQL Server mirroring in multiple availability zones. The applications talking to SQL Server need to be able to fail over between the mirrors but that should be pretty easy for today’s modern developers to handle, right guys?

SQL Server Replication in EC2

Replication is always an option and is probably a better option when you want some kind of geographic fail over. Instead of diving into contentious issues around designing a replication topology, I’m going to hit on the pros and cons of replication in the cloud.

We’ve already seen that there significant limits to network throughput in EC2. We’re limited to, at most, a gigabit of storage throughput. We’re also limited by our ability to write to disk in 64k chunks – only a little bit more than 1000 IOPS are possible at a 64k write size. These limitations become a problem for transactional replication. In an on premise set up, transaction log performance is frequently a problem when replication is involved. If you think that the performance of the transaction log won’t be an issue in the cloud, think again. The I/O characteristics of Elastic Block Storage guarantee that the performance of the transaction log will become a problem under heavy load.

Replication latency is another concern for DBAs in EC2. Not only do reads and writes occur slower, but that read/write slowness means that transactions will be committed to the subscriber further and further behind the publisher. Of course most applications aren’t writing constantly, but it’s important to consider what might happen if the subscriber isn’t able to commit the load that’s being written throughout the business day – will it be able to catch up after hours? If it can catch up today, will it be able to catch up tomorrow?

Crashes happen in EC2, entire regions have gone offline. Unfortunately, at any kind of scale something is bound to be broken right now. What do you do when that happens? What do you do when that happens to the distribution database? (You are running the distributor on a separate server, right?) What do you do when the distributor goes offline and never comes back up, drives corrupted irreparably?

I don’t have the answers, but that’s something your recovery model must take into account when you’re deploying into the cloud.

SQL Server Backups in the Cloud

You gotta keep that data safe, right? Taking backups and hoping for the best isn’t going to cut it once you’ve moved all of your data up into the cloud. You need a retention and recovery strategy.

EBS Data Protection

Amazon’s Elastic Block Storage (EBS) looks like some kind of SAN drives that you can attach to your server. When you look more closely at the disks themselves, they look like drives attached to some kind of storage device powered by Red Hat Enterprise Linux. I’m overusing the phrase “some kind” because you just don’t know what’s back there.

There are, however, some certainties around EBS. An EBS volume lives within a single availability zone. There are multiple drives involved and multiple block level copies of your data. There could be a failure of a back end drive and you’ll never know about it.

The downside of EBS is that there’s no global availability of your data. There is no SAN replication, no global single back up of your data. If you want that, you’ll need to look into using Amazon S3 to provide greater safety.

Be Safe with S3

Amazon Simple Storage Service is a very simple storage service indeed. It’s a block level data store that refers to big binary chunks of data by name. It’s like a filesystem that exists everywhere. Sort of. The important part of S3 is that your data can be persisted at a very paranoid level over a very long period of time. In their own words S3 is “Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year.”

Why mention S3 at all? It’s not very fast and it’s not very easy to mount S3 storage to Windows. There are two things you can do with S3:

  1. Store SQL Server backups.
  2. Snapshot EBS volumes.

Storing SQL Server backups in S3 should be a no brainer for anyone. S3 storage is cheaper than EBS storage and it’s far more durable. Since S3 is global, it’s also possible to push data up to S3 from one region and pull it down in another. Your servers will live on.

S3 can also be used to store snapshots of EBS volumes. Don’t be fooled by the word “snapshot”. The first time you take a snapshot you will likely make a very bad joke about how this isn’t a snapshot, it’s an oil painting. Don’t do that.

The first EBS snapshot to S3 is going to be slow because there is a block level copy of the entire EBS volume happening in the background. All of your data is zipping off into the internet to be safely stored in a bucket somewhere. The upside is that only the blocks of data that get changed are going to get snapshotted. If you perform frequent snapshots then very little data will need to be sent to S3, snapshots will be faster, and recovery should be a breeze. Keep in mind, of course, that these are not crash safe SAN snapshots, so they shouldn’t be used for live databases. I’m sure you can think of cunning ways to work with this.

A Plan for Safe & Secure SQL Backups in EC2

What’s the safe way to manage backups in EC2? Carefully.

Were it up to me, here’s what you’d do:

  1. Backup your data to an EBS volume.
  2. Mount an S3 drive to your Server.
  3. Copy the data to the S3 drive using xcopy with the /J flag

Of course, if it really were up to me, you’d have mirroring set up with a primary in one availability zone, a secondary in another, and the witness in a third. Oh, and you’d be using synchronous mirroring.

Scaling SQL Server in EC2: Scaling Up

This is, after all, why we’re here, right? You’re not just reading this article because you want to learn about database backup best practices and how many IOPS some magical storage device can perform. You want to know how to give your SQL Server some more zip in EC2. There’s a finite limit to how much you can scale up SQL Server in EC2. That limit is currently 8 cores and 68.4GB of RAM. Those 8 cores currently are Intel Xeon X5550s that clock in at 2.66 GHz, but that might change by the time you’re reading this.

In EC2, there’s a limitation on how much we can scale up. Commodity hardware really does mean commodity hardware in this case. If you can’t scale up, the only thing left is to scale out.

Scaling SQL Server in EC2: Scaling Out

It’s incredibly easy to create more instances of SQL Server in EC2. Within a few clicks, you’re most of the way there. You can automate the process further using Amazon’s developer tools to script out additional configuration, attach local storage, and attach EBS storage. The developer tools make it a breeze to create and deploy your own custom images so you have pre-configured instances up and running. You can even use tools like Puppet or Chef to completely automate adding more servers to your environment.

Summing It Up

So, there you have it – it’s possible to get great performance for certain types of SQL Server operations in EC2. If you’re looking for a cheap way to set up a data warehouse, you should look elsewhere. You might be able to slap something together in EC2 that performs well enough, but there will be a lot of man power involved in making that solution work. If you’re looking for a quick way to scale your application up and out without worrying about buying hardware and making sure it’s being used appropriately, then you’ve come to the right place.

Want to read it later? Download the PDF.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

Indexing for Deletes

If you’re only indexing to make reads faster, you need to think again. Indexes can make every operation in the database faster, even deletes.

The Problem: Deletes are Very Slow

I was lazily researching new development techniques one day when I received an email from a client asking why deletes could be slow. I rattled off a few quick possibilities and promised that I’d look into it as soon as I was able to. Due to the vagaries of travel, it took me more time than I’d expected to dig into the problem, but I found something interesting that I should have come to mind right from the start.

This database contains some hierarchical data. My initial thought was that there was a cascading delete taking place in the hierarchy. After some initial checks into cascading deletes, fragmentation, statistics, and triggers, I ran an actual delete on a test system and found something very interesting – almost all of the time spent deleting the row was spent performing a clustered index scan on a different table.

What Happens During a Delete?

When you try to delete a row, a few things happen. SQL Server says “OK, let’s make sure that we can actually delete this row, what else depends on it?” SQL Server will check for dependent rows by examining all foreign keys. It will then check any related tables for data. If there is an index, SQL Server will use that index to check for related data. If there isn’t an index, though, SQL Server will have to scan the table for data.

Deletes and Table Scans

Don’t believe me? Try this out yourself.

Make a new database. Copy data in from the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables in AdventureWorks. I use the Import Data wizard to quickly copy data from one database to another.

ALTER TABLE Sales.SalesOrderHeader
ADD CONSTRAINT PK_SalesOrderHeader PRIMARY KEY (SalesOrderID);

ALTER TABLE Sales.SalesOrderDetail
ADD CONSTRAINT PK_SalesOrderDetail PRIMARY KEY (SalesOrderDetailID);

ALTER TABLE Sales.SalesOrderDetail
ADD CONSTRAINT FK_SalesOrderDetail_SalesOrderHeader
FOREIGN KEY (SalesOrderID) REFERENCES Sales.SalesOrderHeader(SalesOrderID)
ON DELETE CASCADE;

With these three statements in place, we’re able to create a situation where SQL Server has to perform a full table scan just to delete a single row. Make sure you’ve told SQL Server to include the actual execution plan and run this:

DELETE FROM Sales.SalesOrderHeader
WHERE SalesOrderID = 51721;

Once that query runs, the execution plan is going to look a bit like the execution plan below. If you add it up, 99% of the work comes from finding the rows to delete in the SalesOrderDetail table and then actually deleting them.

Who knew that deleting a row was so much work?

Making Deletes Faster

How would we go about making deletes like this faster? By adding an index, of course. Astute readers will have noticed the missing index information in that execution plan I took a screenshot of. In this case, the missing index looks something like this:

CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_SalesOrderID
ON Sales.SalesOrderDetail(SalesOrderID);

Before adding the index, the query had a cost of 2.35678. After adding the index, the delete has a cost of 0.0373635. To put it another way: adding one index made the delete operation 63 times faster. When you have a busy environment, even tiny changes like this one can make it faster to find and delete records in the database.

What’s All of This Mean?

When you’re looking into database performance problems, remember that you aren’t always reading just to return data to the user, sometimes you need to find data in order to delete it. Even when we’re trying to get rid of data, it can be helpful to have an index to make deletes go faster.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

Nine Reasons Developers Should Learn SQL

Let’s face it, there are a lot of cool new things you could be learning right now. It seems like there’s a new technology coming out every 12.8 seconds. Why the hell would you want to spend your free time learning a crufty old language like SQL? My reasons, let me show you them.

It’s portable

Computer science students are taught hundreds of techniques and theoretical constructs while they earn their degree. A lot of that information isn’t directly applicable in day to day programming tasks, but it introduces students to fundamental constructs that they can move between platforms – they have a common vocabulary and tool kit that they can take with them wherever they go.

Although every database vendor implements their own extensions, with every new version the vendors are moving their databases to be in line with the ANSI/ISO SQL standard. Standards compliance, while tricky, makes it possible to take your knowledge from platform to platform. If you learn SQL, you’ll be prepared to move from one database to another.

It never changes

My good friend and business partner jokes that he became a DBA because SQL hasn’t changed in 35 years. This is, largely, true. Vendors have implemented their own extensions that eventually make it into the SQL standard, but the core of SQL doesn’t change. Once you understand the basics of SQL and relational theory (it’s not that hard), you’ll find that you can continue to build on that knowledge and add features and functionality that you were relying on other tools or developers to implement.

It’s an easy place for performance gains

There are only a few places to implement performance gains in an application – the presentation layer, the application layer, and the storage layer. Let’s face it, your code is already well written and well tuned; getting any performance gains there is going to be like getting blood from a stone. The database, on the other hand, is an easy place to make a few simple changes (add an index, change a query slightly) and see tremendous performance improvements. Having spent a considerable portion of my career as an application developer staring at a profiler, I can attest to this. It’s possible to pry performance improvements out of application code, but modern frameworks and toolkits are typically so well-written that the database is usually a better place (read as easier place) to find low hanging fruit for performance improvements.

It’ll make you a better developer

The Pragmatic Programmer challenges developers to learn a new language a year; not because the landscape is constantly changing but because learning new languages exposes developers to new paradigms. There are different ways of thinking about problems that can lend themselves well to different solutions (I’ve recently learned a lot from diving into functional programming). Learning SQL will teach you to think in sets rather than iteratively. In the long term, this will change the way you think about working with data and lead to improvements in your database code.

Improve communication across teams

Have you ever tried to talk to someone who spoke your language but spoke a wildly different dialect? Communicating across a language barrier can be difficult outside of work, but it can be outright maddening when the success of a project depends on it. Learning SQL will give you a leg up when you’re communicating your goals to the DBA team. They won’t have to decipher your meaning and you can tell them exactly what you need. In the end there will be less miscommunication, things will get done faster, and you’ll no longer be “that frustrating developer.”

Job security

I hate to say it, but learning SQL may mean that you get to keep your job when Ted in the next cube gets canned during layoff season. The more skills you have and the more job functions you can perform, the more valuable you become to your current employer (and to a future employer). If you can work with the database team and the development team, you’re now a valuable asset that both teams depend on for success.

It’s really not that hard

Contrary to popular belief, SQL is not a difficult language to learn. It’s a different way of thinking, that’s for sure, but it’s not difficult. There are only a handful of commands, operators, and data types documented in the ANSI/ISO standard. While vendors may add their own features, there’s a compact core of knowledge that you can learn and apply everywhere you go.

Know when it’s not appropriate to do something in the database

The database is a phenomenal tool for solving many problems, but it’s also a horrible tool for solving even more problems. To put it another way: you probably shouldn’t be digging a trench with a hammer. By learning and understanding SQL, you’ll be able to make better decisions and move poorly performing code out of the database. In fact, you’ll be able to spot these problems before they’re even problems.

Once you understand SQL, you’ll have a much better grasp on the limitations of an RDBMS. You’ll know which portions of an application can safely live in a database and which will need to be moved further up the stack to a different layer. Some data validations should live with the data, some shouldn’t. Understanding how SQL works will help you determine which rules should remain in the database.

Simplify troubleshooting

Live applications are notoriously difficult to troubleshoot. The more complexity and layers that are involved, the more difficult it is to troubleshoot an application. A good understanding of SQL makes it possible to rapidly isolate problems that do exist in the database. To put it a different way: understanding SQL makes it easy to locate the problem in one of many different layers of your application.

Want to Learn More?

Where do you go if you want to learn more about SQL?

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts

Better Living Through Caching

The fastest query is one you never execute.

The premise is that one of the slowest parts of starting up an application isn’t starting the application itself, it’s loading the initial application state. This can become a problem when you’re loading many copies of your application on many servers, especially you’re in the cloud and paying for CPU cycles. In that article, a commenter proposes reading application start up state from a serialized blob; basically a chunk of memory written to disk. The trick is that the serialized blob is stored in cache rather than on disk or in a database. Sometimes you need to hit disk in order to refresh the cache, but the general idea is that all configuration info is stored in a single binary object that can be quickly read and used to start up an application to a known good state.

Caching for More Than Start Up Times

Once you start caching application start state, it’s natural to look for more places to introduce additional caching. Remember, the fastest query is the one that you never execute.

Most people already know that they can add caching to their application to improve performance and get around slower parts of the system. There are a number of well understood design patterns that focus around caching and its place in software architecture. A lot of people don’t take this one step further and use caching as a trick to avoid down time when they roll out updates.

You might be thinking “Wait a minute, doesn’t my database/SAN/operating system have some kind of cache?” You’re right, it does. Storage cache is your last line of defense before reading from disk. Why not cache things in your application and skip the network hit?

So what happens when you need to update the application? In the past you probably scheduled an outage in the middle of the night. Or maybe you performed rolling outages from server to server and then slowly brought features online across groups of servers. However you did it, it’s complicated, requires down time, and you need to have a rollback plan; rollbacks on large databases can take a lot of time.

What if instead of just caching configuration to avoid slow start up, you start caching all data (or as much as can fit into memory)? You’re doing that already, right? Why mention it again?

If you’re caching data already, it seems logical that your application is written with multiple tiers. Those tiers are probably divided out by application or by service. If so, there’s a lot of logical separation between different features and functionality. You might even be calling a read/write API as if it were a service provided by a third party. This is a perfect example of how you can cache your reads and avoid hitting lower layers of the application; the front end never needs to know that anything exists apart from the services that provide data.

If you can cache data at the service level, you can theoretically take your back end systems offline for maintenance and bring them back online with minimal disruption to your users. Ideally, there would be no disruption. You could queue up modifications during your maintenance window and then commit them to the database once the updated database, services, or features are back online.

The Beauty of Isolation

By isolating features and layers from each other, you can make your applications more responsive. Rather than relying on servers to respond quickly during application start times, you can make it possible to load binary configuration data from cache. Frequently run queries can be served even faster by caching results in memory. Down times can even be avoided by caching reads and writes during the maintenance window. Of course, caching writes can be difficult. You can start by caching reads and keep your application up most of your users; it’s better than shutting everyone out completely.


To learn more about caching on Windows, read up on AppFabric Cache. On the *nix side of things, there’s the tried and true memcache. More novel and exotic solutions exist, but AppFabric Cache and memcache are great places to get started.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

Website - Twitter - Facebook - More Posts