<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Brent Ozar Unlimited &#187; Jeremiah Peschka</title>
	<atom:link href="http://www.brentozar.com/archive/author/jeremiah-peschka/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.brentozar.com</link>
	<description></description>
	<lastBuildDate>Wed, 19 Jun 2013 15:28:11 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Database Benchmarks &#8211; Big Data Edition</title>
		<link>http://www.brentozar.com/archive/2013/06/database-benchmarks-big-data-edition/</link>
		<comments>http://www.brentozar.com/archive/2013/06/database-benchmarks-big-data-edition/#comments</comments>
		<pubDate>Mon, 10 Jun 2013 13:00:16 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[benchmarking]]></category>
		<category><![CDATA[bigdata]]></category>
		<category><![CDATA[cloud]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=19361</guid>
		<description><![CDATA[Until recently, database benchmarks have been performed by vendors in carefully controlled labs or by engineers at companies reporting on application specific workloads. TPC benchmarks, like TPC-H, provide metrics about the number of queries per hour and a cost per &#8230; <a href="http://www.brentozar.com/archive/2013/06/database-benchmarks-big-data-edition/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<p>Until recently, database benchmarks have been performed by vendors in carefully controlled labs or by engineers at companies reporting on application specific workloads. TPC benchmarks, like TPC-H, provide metrics about the number of queries per hour and a cost per query per hour. While these results can give us a guess about the total cost to implement a system, they have no bearing on the cost to operate a system. How much will it cost to maintain a high throughput system?</p>
<p><a href="http://amplab.cs.berkeley.edu/">UC Berkeley&#8217;s AMPLab</a> has provided a benchmark that makes it easier to compare both performance and cost of different database solutions. The <a href="http://amplab.cs.berkeley.edu/benchmark/">AMPLab Big Data Benchmark</a> provides a benchmark for several large scale analytic frameworks. Most importantly &#8211; it&#8217;s possible for anyone to reproduce these benchmarks using the tools provided by AMPLab.</p>
<h3>Performance Analysis</h3>
<p><div id="attachment_19363" class="wp-caption alignright" style="width: 190px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; float: right;"><a href="http://www.flickr.com/photos/alq666/84092165/"><img src="http://cdn.prod.brentozar.com/wp-content/uploads/2013/06/84092165_57db4877dd_m.jpg" alt="Just how fast is a Shark?" width="180" height="240" class="size-full wp-image-19363" /></a><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">Just how fast is a Shark?</p></div>First and foremost, the AMPLab benchmark provides a performance analysis of four products &#8211; Amazon Redshift, Hive, Shark, and Impala. Several query types are used to provide a general view of analytic framework performance. Not all frameworks are implemented in the same way, and providing a broad set of queries makes it possible for users to evaluate how a workload might perform in production.</p>
<p>Several exploratory queries, an aggregate, an aggregate with joins, and custom UDFs are tested at several sizes and with several variations. If these aren&#8217;t representative of a given workload, it&#8217;s possible to extend the <a href="https://github.com/amplab/benchmark">benchmark framework</a> to include representative queries on the sample data set. Ambitious teams could even go so far as to point the benchmark at their own data to discover which product provides the most benefit.</p>
<p>A great deal of flexibility is available to let teams benchmark potential solutions in Amazon Web Services &#8211; different servers and data sets can be repeatably tested and evaluated before settling on a platform.</p>
<h3>Price Analysis</h3>
<p>Typically, benchmarks are based on hardware: different database engines are compared on the same hardware. Other benchmarks are based on performance: how much performance (based on an arbitrary metric) can be eked out of any set of hardware.</p>
<p>Neither approach addresses the real concern of many businesses: cost.</p>
<p>Interestingly, the AMPLab benchmark is not based on hardware configuration. Instead of fixing on specific hardware types, the AMPLab benchmark is based on a cost metric. All systems were created in Amazon Web Services, making it easy to compare cost based on published instance costs. In the case of the initial AMPLab benchmark, the systems cost $8.20 per hour (the Amazon Redshift system cost $8.50 per hour).</p>
<p>This is important for the simple reason that we now have a far more useful way to compare the performance of different databases. For $8.20 &#8211; $8.50 an hour, on the workloads tested, I can make an easy decision about how I should perform my data analysis.</p>
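<p>As a rough sketch of what a cost metric buys us, a fixed hourly price can be turned into a cost per query. The hourly prices below are the ones quoted above; the runtimes, the <code>BenchmarkRun</code> class, and the <code>cost_per_query</code> helper are hypothetical placeholders, not AMPLab results.</p>

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    system: str
    hourly_cost_usd: float   # published cluster price per hour
    runtime_seconds: float   # HYPOTHETICAL measured query runtime


def cost_per_query(run: BenchmarkRun) -> float:
    """Dollars spent keeping the cluster up for the duration of one query."""
    return run.hourly_cost_usd * run.runtime_seconds / 3600.0


# Hourly prices come from the article; the runtimes are made-up placeholders.
runs = [
    BenchmarkRun("Amazon Redshift", 8.50, 30.0),
    BenchmarkRun("Shark", 8.20, 60.0),
]

# Cheapest system per query first.
for run in sorted(runs, key=cost_per_query):
    print(f"{run.system}: ${cost_per_query(run):.4f} per query")
```

<p>Swap in runtimes measured on your own workload and the same arithmetic gives a like-for-like price comparison.</p>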
<p>Taking the AMPLab benchmark a step further, we can customize the benchmark and see how our workloads will perform at different cost levels. If you&#8217;ve wondered whether you should use one Hadoop variant or another, SQL Server, or Amazon Redshift for cloud analytics, you can easily find out. For teams already using cloud based analytics frameworks, it&#8217;s easy to use these benchmarks to determine how workloads would fare on different systems or with different instance sizes.</p>
<h3>The Verdict</h3>
<p>The AMPLab benchmark produces results that most people in the RDBMS world would be happy about &#8211; Amazon Redshift comes out ahead of the competition. Equally unsurprising, the results are fastest when the entire result set can be coerced into memory. What&#8217;s surprising, though, is how well newcomers Shark and Impala perform when stacked up against an MPP database like Redshift. Sure, Redshift is about twice as fast as Shark, but Shark is a new product (the <a href="https://github.com/amplab/shark/commit/cdb7c241dcd8924322ef2c515f8bdf8d7771bb97">first source code commits</a> occurred on April 23, 2011) and I&#8217;m sure we can expect big improvements in the future. This is important, though, because it shows that tools like Shark and Impala compete in the same realm as MPP databases like Redshift, Teradata, and PDW.</p>
<h3>What&#8217;s It All Mean?</h3>
<p>Using the AMPLab benchmark we have an easy tool that lets us compare analytic database performance in a hosted environment. We can perform multiple tests to understand how our workload will perform within different database products and hardware environments. Continued improvements to both the underlying database platforms and the test framework itself should lead to interesting discussions, prototypes, and technology decisions.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/06/database-benchmarks-big-data-edition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mix and Match Databases: Dealing with Data Types</title>
		<link>http://www.brentozar.com/archive/2013/05/mix-match-databases-dealing-data-types/</link>
		<comments>http://www.brentozar.com/archive/2013/05/mix-match-databases-dealing-data-types/#comments</comments>
		<pubDate>Tue, 28 May 2013 12:15:57 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=10729</guid>
		<description><![CDATA[Moving between databases is hard enough, try using multiple databases in the same application and you might start thinking you&#8217;ve gone insane. Different application demands for accessibility, redundancy, backwards compatibility, or interoperability make this a possibility in the modern data &#8230; <a href="http://www.brentozar.com/archive/2013/05/mix-match-databases-dealing-data-types/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<p>Moving between databases is hard enough; try using multiple databases in the same application and you might start thinking you&#8217;ve gone insane. Different application demands for accessibility, redundancy, backwards compatibility, or interoperability make this a possibility in the modern data center. One of the biggest challenges of running a heterogeneous database environment is dealing with a world of data type differences. There are two main ways to work through this situation:</p>
<ol>
<li>Using a subset of data types.</li>
<li>Creating custom data type mappings.</li>
</ol>
<p>To make comparisons easier, I&#8217;m going to focus on SQL Server, PostgreSQL, and Azure Table Services. </p>
<h3>Using a Subset of Data Types</h3>
<p>The ANSI standard defines a number of data types that should be supported by database vendors but, as with all standards, there&#8217;s no guarantee that vendors will support all data types or even support them equally. The SQL Standard defines the following data types: <code>bigint</code>, <code>bit</code>, <code>bit varying</code>, <code>boolean</code>, <code>char</code>, <code>character varying</code>, <code>character</code>, <code>varchar</code>, <code>date</code>, <code>double precision</code>, <code>integer</code>, <code>interval</code>, <code>numeric</code>, <code>decimal</code>, <code>real</code>, <code>smallint</code>, <code>time</code> (with or without time zone), <code>timestamp</code> (with or without time zone), <code>xml</code> <a href="http://www.postgresql.org/docs/current/interactive/datatype.html">(1)</a>.</p>
<blockquote>
<p>As an example of differences between the ANSI standard and vendor implementations, the ANSI standard defines a <code>TIMESTAMP</code> data type that is implemented as a date and time with an optional time zone, whereas SQL Server defines <code>TIMESTAMP</code> as an arbitrary auto-incrementing unique binary number.</p>
</blockquote>
<p>Taking a look around, it&#8217;s easy to see that there are major differences between databases. An easy way to resolve this problem is to use only a small subset of the available data types. This choice seems attractive when we&#8217;re working with a language that doesn&#8217;t support rich data types. Some languages only have support for a limited number of data types (C provides characters, numeric data types, arrays, and custom <code>struct</code>s), while more advanced languages provide rich type systems. </p>
<p>Comparing our database solutions, <a href="http://msdn.microsoft.com/en-us/library/dd179338.aspx">Azure Table Services Data Model</a> supports a constrained set of data types. While rich type systems are valuable, the Table Services data model provides everything needed to build complex data structures. The simple data model also makes it easy to expose Azure Table Services data as Atom feeds that can be consumed by other applications. By opting for simplicity, the data model makes it possible to communicate with a variety of technologies, regardless of platform.</p>
<p>The downside of restricting an application to a limited set of data types is that it may become very difficult to store certain data in the database without resorting to writing custom serialization mechanisms. Custom serialization mechanisms make it impossible for users to reliably report on our data without intimate knowledge of how the data has been stored. </p>
<p>Compare the supported <a href="http://msdn.microsoft.com/en-us/library/dd179338.aspx">Azure Table Services data types</a> with <a href="http://msdn.microsoft.com/en-us/library/ms187752.aspx">SQL Server 2008 R2&#8217;s data types</a> and <a href="http://www.postgresql.org/docs/current/interactive/datatype.html">PostgreSQL&#8217;s data types</a>. There&#8217;s some overlap, but not a lot. Limiting your application to a subset of data types is really nothing more than limiting your application to a subset of data that it can accurately store, model, and manipulate. Everything else </p>
<h3>Custom Data Type Mappings</h3>
<p>Let&#8217;s assume we have an application that is built using PostgreSQL as the primary OLTP back end. We can expose a lot of our functionality through our cloud services as simple integers and strings, but there are some things that aren&#8217;t assured to work well when we move across different OLTP platforms. We can&#8217;t always map data types &#8211; how does <a href="http://www.postgresql.org/docs/current/interactive/datatype-net-types.html#DATATYPE-INET"><code>inet</code></a> map to SQL Server or Azure Table Services? There&#8217;s no immediately apparent way to map the <code>inet</code> data type to any other data type. </p>
<p>Clearly, custom data type mappings are not for the faint of heart. Decisions have to be made about gracefully degrading data types between databases so they can be safely reported on and reconstituted in the future. Depending on the application, <code>inet</code> could be stored as <code>Edm.String</code> in Azure Table Services or as <code>VARCHAR(16)</code> in SQL Server (which only works if we&#8217;re ignoring IPv6 addresses and the netmask). </p>
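<p>One way to picture that degrade-and-reconstitute step is a pair of small conversion functions. This is only a sketch: it assumes the application layer is Python and uses the standard library <code>ipaddress</code> module, and the function names <code>inet_to_string</code> and <code>string_to_inet</code> are invented for illustration. Unlike the <code>VARCHAR(16)</code> shortcut, it keeps the netmask and accepts IPv6, so the target string column would need to be wider.</p>

```python
import ipaddress


def inet_to_string(value: str) -> str:
    """Degrade an inet-style literal such as '192.168.1.5/24' to plain text."""
    # ip_interface keeps both the host address and the prefix length.
    return ipaddress.ip_interface(value).with_prefixlen


def string_to_inet(stored: str):
    """Reconstitute the address and netmask from the stored string."""
    return ipaddress.ip_interface(stored)


# The same mapping has to be applied on every platform that stores the value.
for literal in ("192.168.1.5/24", "10.0.0.1", "2001:db8::1/64"):
    stored = inet_to_string(literal)
    restored = string_to_inet(stored)
    print(stored, restored.ip, restored.network)
```

<p>The important part is less the code than the agreement: every database in the mix has to store the same canonical string, and that choice has to be documented.</p>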
<p>If this sounds like a recipe for confusion and disaster, you might be on to something. Using custom data type mappings across different databases can create confusion and requires custom documentation, but there is hope.</p>
<p>Applications using the database only need to know about the data types that are in the database. Reporting databases can be designed to work with business users&#8217; reporting tools. As long as the data type mappings do not change, it&#8217;s easy enough to keep the reporting databases up to date through automated data movement scripts.  </p>
<h3>What Can You Do?</h3>
<p>There&#8217;s a lot to keep in mind when you&#8217;re planning to deploy an application across multiple databases. Understanding how different databases handle different data types can ease the pain of querying data in multiple databases. There&#8217;s no reason to limit your application to one database; just be aware that there are differences between platforms that need to be taken into account.</p>
<h3>Further Reading</h3>
<p>Google have created their own cross application/platform data serialization layer called <a href="http://code.google.com/apis/protocolbuffers/docs/overview.html">protocol buffers</a>. If you&#8217;re looking at rolling your own translation layer, protocol buffers may fit your needs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/05/mix-match-databases-dealing-data-types/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Database Quick Fire Challenge</title>
		<link>http://www.brentozar.com/archive/2013/05/database-quick-fire-challenge/</link>
		<comments>http://www.brentozar.com/archive/2013/05/database-quick-fire-challenge/#comments</comments>
		<pubDate>Tue, 21 May 2013 13:00:35 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[SQL Server 2012 News]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=18986</guid>
		<description><![CDATA[Celebrity cooking shows are popular around Brent Ozar Unlimited. We watch Top Chef for the creative cooking as much as the human drama. Contestants on Top Chef face huge challenges &#8211; they&#8217;re working alone, have a limited set of tools, &#8230; <a href="http://www.brentozar.com/archive/2013/05/database-quick-fire-challenge/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<p>Celebrity cooking shows are popular around Brent Ozar Unlimited. We watch <em>Top Chef</em> for the creative cooking as much as the human drama. Contestants on <em>Top Chef</em> face huge challenges &#8211; they&#8217;re working alone, have a limited set of tools, have a fixed set of ingredients, and operate under ridiculously strict time guidelines. What makes the creativity of chefs even more interesting is the limitations they&#8217;re under: contestants have to work with strange, disgusting or difficult ingredients in order to win the favor of the judges.</p>
<p>Building successful software isn&#8217;t much different. Sure, there&#8217;s a team of collaborators to help us make decisions; but everyone on that team is frequently responsible for one area of the application. Success or failure depends on your ability to work with the requirements you&#8217;re given and please the final judges &#8211; the end users.</p>
<p><div id="attachment_18994" class="wp-caption alignright" style="width: 250px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; float: right;"><a href="http://www.flickr.com/photos/portofsandiego/8248733614/"><img src="http://cdn.prod.brentozar.com/wp-content/uploads/2020/01/8248733614_1ea4f96343_m.jpg" alt="Know your requirements" width="240" height="161" class="size-full wp-image-18994" /></a><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">Know your requirements</p></div><br />
<h3>The Rules of the Contest</h3>
<p><em>Top Chef</em> contestants work within the rules of the contest &#8211; they have a limited amount of time with a limited number of ingredients to make an appealing meal. The core ingredients and the time period are non-negotiable. The show&#8217;s producers call these the rules; in the world of software these are business requirements, and in the world of programming we might call these application invariants. But no matter what you call them, these things can&#8217;t change. Requirements might be something like:</p>
<ul>
<li>A picture can&#8217;t be viewed until four thumbnails have been generated.</li>
<li>Property listings cannot be viewed until approved by the listing agent.</li>
</ul>
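<p>To make the idea of an application invariant concrete, here is a minimal sketch of the first requirement. The <code>Picture</code> class and its fields are hypothetical and only illustrate the rule; they are not taken from any real application.</p>

```python
from dataclasses import dataclass, field

REQUIRED_THUMBNAILS = 4  # taken from the first requirement in the list


@dataclass
class Picture:
    """HYPOTHETICAL model; the name and fields are illustrative only."""
    filename: str
    thumbnails: list = field(default_factory=list)

    def add_thumbnail(self, thumbnail_path: str) -> None:
        self.thumbnails.append(thumbnail_path)

    def is_viewable(self) -> bool:
        # The invariant: no viewing until all four thumbnails exist.
        return len(self.thumbnails) >= REQUIRED_THUMBNAILS


picture = Picture("house.jpg")
for size in ("64", "128", "256", "512"):
    print(picture.filename, "viewable:", picture.is_viewable())
    picture.add_thumbnail(f"house_{size}.jpg")
print(picture.filename, "viewable:", picture.is_viewable())
```

<p>Notice that the rule says nothing about where the thumbnails are stored or how they are generated; that part is still up to you.</p>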
<p>Contestants on <em>Top Chef</em> don&#8217;t immediately start cooking &#8211; although careful editing makes it look that way. Development teams, even agile teams, shouldn&#8217;t immediately start coding once they have their hands on requirements. It&#8217;s important to look carefully at the requirements and make sure it&#8217;s possible to deliver something that the business (the judges) is happy to see. Winning chefs don&#8217;t immediately reach for a bottle of Frank&#8217;s Red Hot to give a dish a bit of pizzazz, they consider all of the options and match their condiments to the meal, so why do we always reach for the same tools?</p>
<p>It&#8217;s easy to be lulled by the familiar &#8211; hot dogs and burgers are easy, but they don&#8217;t win <em>Top Chef</em>. While you don&#8217;t need award winning code to win the game of delivering software, you do need to make the right choices to make life easier.</p>
<p>Working with <a href="http://www.bravotv.com/foodies/recipes/poached-black-chicken-mousse-amp-roulade-monkfish-liver-torchon-buttered-leeks">black chicken and monkfish liver</a> may not be the easiest thing, but contestants on <em>Top Chef</em> are routinely able to turn <a href="http://en.wikipedia.org/wiki/Top_Chef_Masters_(season_2)#Episode_6:_Scary_Surf_and_Turf">strange ingredients</a> into masterpieces. Business requirements spell out how our applications have to behave at the end of the day, but you&#8217;ll notice that it doesn&#8217;t matter how you get there. Just make sure you get there &#8211; solve the business problem and move on.</p>
<h3>Start with the Ingredients</h3>
<p>It isn&#8217;t uncommon for <em>Top Chef</em> contestants to scrap their first ideas after a few minutes of work. Likewise, don&#8217;t be afraid to throw away your first idea. If you&#8217;re a pack rat, write your idea on a piece of paper and hide it from yourself. It&#8217;s okay to come back to your first idea, but it&#8217;s important to think about the problem in a different way.</p>
<p>What are the core ingredients of your application? Just because you have chicken, that doesn&#8217;t mean you should make chicken cordon bleu. Ask yourself, &#8220;What am I supposed to create?&#8221; A few applications I&#8217;ve come across in the last year are:</p>
<ul>
<li>Single sign-on systems</li>
<li>Hosted property listings</li>
<li>Utility easement tracking</li>
<li>Document tracking and signing</li>
</ul>
<p>Each of these applications has a different set of features and functionality. Would you use the same solution for each one? Looking at it a different way &#8211; would you serve the same meal for the Superbowl as you would for Christmas dinner?</p>
<p>Make an itemized list of the ingredients that you have on hand. Your requirements are your ingredients. They drive the way the users will interact with the data. As you investigate the requirements, make sure you ask the users questions like, &#8220;Do you need point in time recovery for easement property maps?&#8221; or &#8220;Is it a requirement that a user have a first name, last name, bio, and profile picture or would a user name and password be acceptable?&#8221; Understanding your requirements drives your choices.</p>
<h3>It&#8217;s All About the Ingredients</h3>
<p>Under all of your application code, you need somewhere to store your data. One of the <em>Top Chef</em> judges frequently asks &#8220;Where&#8217;s the protein?&#8221; when served a salad. As you work through application requirements, use these to ask yourself &#8220;Where&#8217;s the data?&#8221;</p>
<p><em>Top Chef</em> contestants typically aren&#8217;t told that they need to make sweet glazed salmon, they&#8217;re told to use a set of ingredients and produce a fine meal. It&#8217;s up to the chef to determine whether to use rémoulade or tartar sauce and it&#8217;s up to you to make technical decisions. The business user isn&#8217;t going to know the answers to your technical questions, but they do know that a user only needs a user name and password to use the application.</p>
<p>Use the business requirements to help make your database design decisions &#8211; if an image doesn&#8217;t need to be transactionally consistent with all other data, you don&#8217;t need to store it in your relational database. The rules of the contest &#8211; the business requirements &#8211; should shape how you design your application. They give you both the restrictions and freedom you need to be creative.</p>
<h3>What Will They Eat?</h3>
<p>Food falls into distinct cuisines. If I gave you a choice between sushi or tapas, you&#8217;d be able to make an informed choice because you know the ingredients and techniques used for each style of cooking.</p>
<p><div id="attachment_18997" class="wp-caption alignleft" style="width: 250px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; float: left;"><a href="http://www.flickr.com/photos/carbonnyc/6144729060/"><img src="http://cdn.prod.brentozar.com/wp-content/uploads/2020/01/6144729060_058ffcf4b9_m.jpg" alt="Picky customers dictate features. Make them happy." width="240" height="159" class="size-full wp-image-18997" /></a><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">Picky customers dictate features. Make them happy.</p></div>
<p>As you evaluate the business requirements, dig deeper and imagine the types of answers that users might look for in the data. Will users look for property along the path of a tornado where repairs need to be made? Are users searching for houses with specific features &#8211; e.g. find single family homes with 2 or more bathrooms and an attached garage? Or are users&#8217; questions difficult to predict and completely free form?</p>
<p>Understanding how people will use the data guides the choices we make. If users will be performing free form text searches, a full text search engine like SOLR should be considered. If an application is pure OLTP, it&#8217;s possible that you can use a key-value database. Understanding application requirements means that you can decide whether you need to use SQL Server or you can investigate other options.</p>
<p>Some of the database cuisines to consider are:</p>
<ul>
<li>Relational database (SQL Server, PostgreSQL)</li>
<li>Document database (CouchDB, MongoDB)</li>
<li>Text Search (Lucene/SOLR, Elastic Search)</li>
<li>Key-Value database (Riak, Cassandra)</li>
</ul>
<p>The process of picking a database can lead to conflict. Developers have their favorite new technologies they want to try and entrenched products are frequently favored above all others. Understanding how one database meets application requirements is important &#8211; if you don&#8217;t know which ingredients you have, you don&#8217;t know what to make; if you don&#8217;t understand the application invariants involved, you can&#8217;t know which option is the best.</p>
<p>Ultimately, making sure you pick the right tool for the job can lead to faster development, easier support, and better throughput.</p>
<p><div id="attachment_18996" class="wp-caption alignright" style="width: 250px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; float: right;"><a href="http://www.flickr.com/photos/avlxyz/2970777195/"><img src="http://cdn.prod.brentozar.com/wp-content/uploads/2020/01/2970777195_01b6d5f4bc_m.jpg" alt="Presentation is everything" width="240" height="180" class="size-full wp-image-18996" /></a><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">Presentation is everything</p></div><br />
<h3>What Do the Judges Think?</h3>
<p>The most important thing, though, is what the judges think. It doesn&#8217;t matter if you&#8217;ve made the greatest chicken salad sandwich ever, if your work doesn&#8217;t meld with the judges&#8217; expectations you won&#8217;t be taking home the prize. Understanding how the requirements influence the ways that users will work with data is critical if you want to be successful. Once you know how people will work with the tools, you&#8217;ll be able to make the right decisions for your application.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/05/database-quick-fire-challenge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Monitoring SSD Performance</title>
		<link>http://www.brentozar.com/archive/2013/05/monitoring-ssd-performance/</link>
		<comments>http://www.brentozar.com/archive/2013/05/monitoring-ssd-performance/#comments</comments>
		<pubDate>Thu, 16 May 2013 13:00:25 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[SQL Server 2012 News]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=18906</guid>
		<description><![CDATA[Everyone wants to make sure they&#8217;re getting the best performance out of their solid state storage. If you&#8217;re like a lot of people, you want to make sure you&#8217;re getting what you paid for, but how do you know for &#8230; <a href="http://www.brentozar.com/archive/2013/05/monitoring-ssd-performance/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<p>Everyone wants to make sure they&#8217;re getting the best performance out of their solid state storage. If you&#8217;re like a lot of people, you want to make sure you&#8217;re getting what you paid for, but how do you know for sure that the drive is performing well?</p>
<h3>Watch that Average</h3>
<p>The first way to monitor performance is to use some <a href="http://brentozar.com/go/perfmon">perfmon counters</a>. Although there are a lot of perfmon counters that seem helpful, we&#8217;re only going to look at two:</p>
<ul>
<li>PhysicalDisk\Avg. Disk Sec/Read</li>
<li>PhysicalDisk\Avg. Disk Sec/Write</li>
</ul>
<p>As soon as you get a solid state drive in your server, start monitoring these numbers. Over time you&#8217;ll be able to trend performance and watch for degradation. When the SSDs pass out of your valid performance guidelines (and they probably will), you can pull them out of the storage one at a time and reformat them before adding them back into the RAID array. <em>Note</em> it isn&#8217;t necessary to do this</p>
<p>Although it&#8217;s risky, this approach can work well for detecting performance problems while they&#8217;re happening. The downside is that we don&#8217;t have any idea that the drives are about to fail &#8211; we can only observe the side effects of writing to the SSDs. As SSD health gets worse, this average is going to trend upwards. Of course, you could also be doing something incredibly dumb with your hardware, so we can&#8217;t really use average performance as a potential indicator of impending hardware failure.</p>
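<p>Trending those two counters can be as simple as a rolling average compared against whatever you decide your performance guideline is. The sketch below assumes the counter values have already been exported as a list of seconds; the sample data, the window size, and the 3 ms threshold are all hypothetical.</p>

```python
from statistics import mean


def rolling_average(samples, window):
    """Average of each consecutive run of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(len(samples) - window + 1)]


def windows_over_threshold(samples, window, threshold_seconds):
    """Indexes of rolling windows whose average latency exceeds the threshold."""
    averages = rolling_average(samples, window)
    return [i for i, avg in enumerate(averages) if avg > threshold_seconds]


# HYPOTHETICAL Avg. Disk Sec/Write samples, in seconds.
write_latency = [0.001, 0.001, 0.002, 0.004, 0.006, 0.009]

# Any index returned here marks a window that drifted past the guideline.
print(windows_over_threshold(write_latency, window=3, threshold_seconds=0.003))
```

<p>A rolling window smooths out one-off spikes, which matters because a single slow write says very little about the drive.</p>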
<h3>Which SMART Attributes Work for SSDs?</h3>
<p>What if we could watch SSD wear in real time? It turns out that we&#8217;ve been able to do this for a while. Many vendors offer SMART status codes to return detailed information about the status of the drive. Rotational drives can tell you how hot the drive is, provide bad sector counts, and a host of other information about drive health.</p>
<p>SSDs are opaque, right? Think again.</p>
<p>SSD vendors started putting information in SMART counters to give users a better idea of SSD performance, wear, and overall health. Although the SMART counters will vary from vendor to vendor (based on the disk controller), Intel publish documentation on the counters available with their SSDs &#8211; check out the &#8220;SMART Attributes&#8221; section of the <a href="http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-910-series-specification.pdf">Intel 910 documentation</a>. These are pretty esoteric documents; you wouldn&#8217;t want to have to parse that information yourself. Thankfully, there are easier ways to get to this information; we&#8217;ll get to that in a minute.</p>
<h3>Which SMART Attributes Should I Watch?</h3>
<p>There are a few things to watch in the SMART status of your SSDs:</p>
<ul>
<li>Write Amplification</li>
<li>Media Wear-out Indicator</li>
<li>Available Reserved Space</li>
</ul>
<p><strong>Write Amplification</strong>, roughly, is the ratio of the writes the SSD actually performs on flash to the writes issued by your OS. A lower score is better &#8211; this can even drop below 1 when the SSD is able to compress your data. Although <a href="http://en.wikipedia.org/wiki/Write_amplification">Write Amplification</a> doesn&#8217;t help you monitor drive health directly, it provides a view of how your usage pattern will change the SSD&#8217;s lifespan.</p>
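<p>To make the ratio concrete, here&#8217;s a minimal Python sketch. The function name and the gigabyte counts are made up for illustration; your drive&#8217;s vendor tooling decides which counters, if any, expose these two numbers.</p>
<pre><code>def write_amplification(flash_gb_written, host_gb_written):
    """Return writes performed on flash divided by writes issued by the host."""
    if host_gb_written == 0:
        raise ValueError("host_gb_written must be non-zero")
    return flash_gb_written / host_gb_written


# Hypothetical counters: the OS issued 400 GB of writes, the SSD wrote 520 GB to flash.
print(write_amplification(520, 400))  # 1.3

# Hypothetical compressing controller: only 300 GB reached flash, so the ratio drops below 1.
print(write_amplification(300, 400))  # 0.75
</code></pre>
<p>The second call shows the compression case: fewer writes hit the flash than the OS issued.</p>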
<p>The <strong>Media Wear-Out Indicator</strong> gives us a scale from 100 to 0 of the remaining flash memory life. This starts at 100 and drifts toward 0. It&#8217;s important to note that your drive will keep functioning after Media Wear-Out Indicator reports 0. This is, however, a good value to watch.</p>
<p><strong>Available Reserved Space</strong> measures how much of the original spare capacity is left in the drive. SSD vendors provide additional storage capacity to make sure wear leveling and garbage collection can happen appropriately. Like Media Wear-Out Indicator, this starts at 100 and will drift toward 0 over time.</p>
<p>It&#8217;s worth noting that each drive can supply additional information. The Intel 910 also monitors battery backup failure and provides two reserved space monitors &#8211; one at 10% reserved space available and a second at 1% reserved space available. If you&#8217;re going to monitor the SMART attributes of your SSDs, it&#8217;s worth doing a quick search to find out what your SSD controllers support.</p>
<h3>How do I Watch the SMART Attributes of my SSD?</h3>
<p>This is where things could get ugly. Thankfully, we&#8217;ve got <a href="http://smartmontools.org">smartmontools</a>. There are two pieces of smartmontools and we&#8217;re only interested in one: <code>smartctl</code>. Smartctl is a utility to view the SMART attributes of a drive. On my (OS X) laptop, I can run <code>smartctl -a disk1</code> to view the SMART attributes of the drive. On Windows you can use the drive letter for a basic disk, like this:</p>
<p><code>smartctl -a X:</code></p>
<p>Things get trickier, though, for certain PCI-Express SSDs. Many of these drives, the Intel 910 included, present one physical disk per controller on the PCI-Express card. In the case of the Intel 910, there are four. In these scenarios you&#8217;ll need to look at each controller&#8217;s storage individually. Even if you have configured a larger storage volume using Windows RAID, you can still read the SMART attributes by looking at the physical devices underneath the logical disk.</p>
<p>The first step is to get a list of physical devices using <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/aa394132(v=vs.85).aspx">WMI</a>:</p>
<p><code>wmic diskdrive list brief</code></p>
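<p>The physical device name will be in the <code>DeviceID</code> column, in the form <code>\\.\PHYSICALDRIVE0</code>. smartctl addresses the same disk as <code>/dev/pd0</code> (the number matches), so once you have the physical device name, you can view the SMART attributes with <code>smartctl</code> like this:</p>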
<pre><code>smartctl -a /dev/pd0 -q noserial
</code></pre>
<p>Run against my virtual machine, it looks like this:</p>
<pre><code>C:\Windows\system32&gt; smartctl -a /dev/pd0 -q noserial
smartctl 6.1 2013-03-16 r3800 [x86_64-w64-mingw32-win8] (sf-6.1-1)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Windows 8-0 SSD
Serial Number:    0RETRD4FE6AMF823QE7R
Firmware Version: F.2FKG1C
User Capacity:    68,719,476,736 bytes [68.7 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS, ATA/ATAPI-5 T13/1321D revision 1
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat Apr 27 08:35:03 2013 PDT
SMART support is: Unavailable - device lacks SMART capability.
</code></pre>
<p>Unsurprisingly, my virtual drive doesn&#8217;t display much information. But a real drive looks something like this:</p>
<div id="attachment_18908" class="wp-caption alignnone" style="width: 610px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center;"><a href="http://cdn.prod.brentozar.com/wp-content/uploads/2013/05/smartctl-910-01.jpg"><img class="size-large wp-image-18908" alt="Intel 910 smartctl output" src="http://cdn.prod.brentozar.com/wp-content/uploads/2013/05/smartctl-910-01-600x376.jpg" width="600" height="376" /></a><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">Intel 910 smartctl output</p></div>
<p>Holy cow, that&#8217;s a lot of information. The Intel 910 clearly has a lot going on. There are two important criteria to watch, simply because they can mean the difference between a successful warranty claim and an unsuccessful one:</p>
<ul>
<li>SS Media used endurance indicator</li>
<li>Current Drive Temperature</li>
</ul>
<p>The Intel 910 actually provides more information via SMART, but to get to it, we have to use Intel&#8217;s command line tools. By using the included isdct.exe, we can get some very helpful information about battery backup failure (yup, your SSD is protected by a battery), reserve space in the SSD, and the drive wear indicator. Battery backup failure is a simple boolean value &#8211; 0 for working and 1 for failure. The other numbers are stored internally as a hexadecimal number, but the isdct.exe program translates them from hex to decimal. These numbers start at zero and work toward 100.</p>
<p>If you&#8217;re enterprising, you can take a look at <a href="http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-910-series-specification.pdf">the vendor specification</a> and figure out how to read this data in the SMART payload. Or, if you&#8217;re truly lazy, you can parse the text coming out of smartctl or isdct (or the appropriate vendor tool) and use that to fuel your reports. Some monitoring packages even include all SMART counters by default.</p>
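<p>If you go the lazy route, the parsing can be very small. Here&#8217;s a rough Python sketch that pulls the two values mentioned above out of <code>name: value</code> lines. The sample text is made up for illustration (it is not captured from a real drive), and the function name is my own; check what your drive and tool actually print before relying on anything like this.</p>
<pre><code>import re

# Made-up sample in the "name: value" style; real output will differ per drive and tool.
SAMPLE_OUTPUT = """\
Current Drive Temperature:     31 C
Drive Trip Temperature:        85 C
SS Media used endurance indicator: 2%
"""

# The two attributes called out earlier in this post.
WATCHED = ("SS Media used endurance indicator", "Current Drive Temperature")


def parse_watched(text, watched=WATCHED):
    """Return {attribute name: integer value} for the watched attributes only."""
    readings = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        name = name.strip()
        if not separator or name not in watched:
            continue  # not a "name: value" line, or not one we care about
        match = re.search(r"\d+", value)
        if match:
            readings[name] = int(match.group())
    return readings


print(parse_watched(SAMPLE_OUTPUT))
</code></pre>
<p>From there it&#8217;s a short step to logging the readings on a schedule and trending them alongside your performance counters.</p>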
<h3>The Bad News</h3>
<p>The bad news is that if you&#8217;re using a hardware RAID controller, you may not be able to see any of the SMART attributes of your SSDs. If you can&#8217;t get accurate readings from the drives, you&#8217;ll have to resort to using the Performance Monitor counters I mentioned at the beginning of the article. RAID controllers that support smartmontools are listed in the <a href="http://smartmontools.sourceforge.net/man/smartctl.8.html">smartctl documentation</a>.</p>
<p><em>Special thanks go out to a helpful friend who let us abuse their QA Intel 910 cards for a little while in order to get these screenshots.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/05/monitoring-ssd-performance/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Basics of SQL Server Execution Plans (video)</title>
		<link>http://www.brentozar.com/archive/2013/05/the-basics-of-sql-server-execution-plans-video/</link>
		<comments>http://www.brentozar.com/archive/2013/05/the-basics-of-sql-server-execution-plans-video/#comments</comments>
		<pubDate>Wed, 15 May 2013 13:00:06 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[SQL Server 2012 News]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=19000</guid>
		<description><![CDATA[SQL Server execution plans provide a roadmap to query performance. Once you understand how to read the execution plan, you can easily identify bottlenecks and detours. In this high level session, Jeremiah Peschka will introduce you to the concepts of &#8230; <a href="http://www.brentozar.com/archive/2013/05/the-basics-of-sql-server-execution-plans-video/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<p><iframe width="640" height="360" src="http://www.youtube.com/embed/lH2_SI04PWQ?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>SQL Server execution plans provide a roadmap to query performance. Once you understand how to read the execution plan, you can easily identify bottlenecks and detours. In this high level session, Jeremiah Peschka will introduce you to the concepts of reading SQL Server execution plans including how to get an actual execution plan, how to read the plan, and how to dive deeper into the details of the pieces of the plan. This session is for developers and DBAs who have never looked at SQL Server execution plans before.</p>
<div class="hr"><hr /></div>
<p>In this talk I mentioned a few tools.</p>
<ul>
<li><a href="http://www.brentozar.com/blitzindex/">sp_BlitzIndex</a></li>
<li><a href="http://www.sqlsentry.net/plan-explorer/sql-server-query-view.asp">SQL Sentry Plan Explorer</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/05/the-basics-of-sql-server-execution-plans-video/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Basics of Database Sharding</title>
		<link>http://www.brentozar.com/archive/2013/05/the-basics-of-database-sharding/</link>
		<comments>http://www.brentozar.com/archive/2013/05/the-basics-of-database-sharding/#comments</comments>
		<pubDate>Wed, 01 May 2013 14:50:04 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[SQL Server 2012 News]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=18900</guid>
		<description><![CDATA[There are many ways to scale out your database; many of these techniques require advanced management and expensive add-ons or editions. Database sharding is a flexible way of scaling out a database. In this presentation, Jeremiah Peschka explains how to &#8230; <a href="http://www.brentozar.com/archive/2013/05/the-basics-of-database-sharding/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<p>There are many ways to scale out your database; many of these techniques require advanced management and expensive add-ons or editions. Database sharding is a flexible way of scaling out a database. In this presentation, Jeremiah Peschka explains how to scale out using database sharding, covers basic techniques, and shares some of the pitfalls. This talk is for senior DBAs, database architects, and software architects who are interested in scaling out their database.</p>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/W6pFKihvqH4?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>More resources are available over in our <a href="http://brentozar.com/articles/sharding">sharding article</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/05/the-basics-of-database-sharding/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>How Much Cache Do You Have?</title>
		<link>http://www.brentozar.com/archive/2013/04/how-much-cache-do-you-have/</link>
		<comments>http://www.brentozar.com/archive/2013/04/how-much-cache-do-you-have/#comments</comments>
		<pubDate>Thu, 18 Apr 2013 13:00:00 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[SQL Server 2012 News]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=18585</guid>
		<description><![CDATA[Without looking in your wallet, do you know how much cash you have? Most of us know within a few dollars. Now, without looking in your SQL Server, do you know how much data is cached in memory? You probably don&#8217;t &#8230; <a href="http://www.brentozar.com/archive/2013/04/how-much-cache-do-you-have/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<p>Without looking in your wallet, do you know how much cash you have? Most of us know within a few dollars. Now, without looking in your SQL Server, do you know how much data is cached in memory? You probably don&#8217;t and that&#8217;s okay; you shouldn&#8217;t know how much data SQL Server is caching in memory. We can&#8217;t control how much data SQL Server is caching, but we can control how we cache data.</p>
<h3>Different Types of Cache</h3>
<p>There are a lot of different ways to approach caching. One of the most prevalent ways involves thinking about cache in two different levels (much like CPU cache): first level cache and second level cache.</p>
<div id="attachment_18590" class="wp-caption alignright" style="width: 253px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; float: right;"><img class="size-medium wp-image-18590" alt="First level cache lives in the application and second level cache is in a separate service" src="http://cdn.prod.brentozar.com/wp-content/uploads/2020/01/cache-243x200.png" width="243" height="200" /><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">Green means go!</p></div>
<p>First level cache is an immediate, short-lived cache that works within a single session to attempt to minimize database calls. Unfortunately, first level cache is only used for the duration of a current session or transaction (depending on your terminology). This is very short lived and it&#8217;s only useful to the current process. While helpful, first level cache has a limited scope.</p>
<p>There&#8217;s another type of cache: second level cache. Second level cache exists outside of the current process and can be shared between multiple transactions, processes, servers, or even applications. When we talk about adding cache to an application, we really mean second level cache.</p>
<h3>A Bit of Cache</h3>
<p>Even the most basic of ORMs have a little bit of cache available. The first level cache is used as a short lived buffer to reduce the amount of work that the ORM has to do. First level cache is used for caching objects in the current transaction and query text. Although this cache can be helpful for the current process, this cache isn&#8217;t shared across multiple processes or even multiple database batches. If we want to have a more robust cache, we have to look elsewhere.</p>
<p>ORMs like Entity Framework or the LLBLGen Framework don&#8217;t have a second level cache. It&#8217;s up to developers to add a cache when and where they need it. This exposes developers to additional concerns like cache invalidation, cache updates, and query caching. All of these features and functionality may not be necessary, but that&#8217;s an acceptable trade off &#8211; it&#8217;s up to developers to implement cache features in ways that support application requirements.</p>
<p>Although it takes up developer time, building the second level cache yourself has the benefit of creating a cache that&#8217;s suited to the application&#8217;s requirements. For many application level features, this is good enough. It&#8217;s important, though, that developers pick a caching layer capable of meeting their operational requirements. Operational requirements include horizontal scalability, redundancy and failover, recovery of cached data, or customizable cache expiration on an object-by-object basis.</p>
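<p>To give a feel for what &#8220;building it yourself&#8221; involves, here&#8217;s a deliberately tiny Python sketch of a read-through cache with explicit invalidation. The class and function names are hypothetical, and a plain dictionary stands in for what would really be a separate cache service shared between processes.</p>
<pre><code>class SecondLevelCache:
    """Read-through cache; a dict stands in for a shared cache service."""

    def __init__(self, load_from_database):
        self._load = load_from_database  # called only on a cache miss
        self._entries = {}

    def get(self, key):
        if key not in self._entries:
            self._entries[key] = self._load(key)
        return self._entries[key]

    def invalidate(self, key):
        # Call this whenever the underlying row changes.
        self._entries.pop(key, None)


calls = []  # records every simulated database round trip


def load_customer(customer_id):
    """Hypothetical loader; a real one would query the database."""
    calls.append(customer_id)
    return {"id": customer_id, "name": "example"}


cache = SecondLevelCache(load_customer)
cache.get(42)          # miss: goes to the "database"
cache.get(42)          # hit: served from the cache
cache.invalidate(42)   # e.g. after the customer row is updated
cache.get(42)          # miss again: reloaded
print(len(calls))      # 2
</code></pre>
<p>Even in a toy like this, the application has to remember to call <code>invalidate</code> at the right moments, which is exactly the kind of concern that lands on developers when the ORM doesn&#8217;t handle it.</p>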
<p>These basic ORMs aren&#8217;t really all that basic &#8211; they have full features in other parts of the ORM, but they only offer basic support for automatic caching through the ORM.</p>
<h3>A Lot of Cache</h3>
<p>You&#8217;ve got memory. You want to use it to cache data. What&#8217;s the easiest way to do that?</p>
<p>One of the easiest approaches to adding caching to your application is to use a framework that supports it out of the box. A number of ORMs, including both Hibernate and NHibernate, provide this support. Enabling cache is easy &#8211; just change a few lines in a configuration file and the cache will be available to your application. Things start getting tricky, though, when you examine the richness of the caching that&#8217;s provided by these tools.</p>
<p>Power comes with a price. When you&#8217;re getting started with tools like Hibernate or NHibernate, there&#8217;s a lot to take in and many developers overlook these features. Developers can choose on an object by object basis which caching strategy should be applied. Based on business requirements we can choose to treat certain cacheable objects as read only while others can be used as a read/write cache. Some objects can be cached while others bypass the secondary cache entirely &#8211; there&#8217;s a lot of complexity for developers to manage.</p>
<p>While this can be overwhelming, this flexibility serves a purpose &#8211; not all features of an application have the same requirements. Some features can serve old data to users, other features need to be up to the minute or up to the second. Giving developers the ability to make these choices means that there is a choice to be made. Even if it&#8217;s a difficult one, developers can choose how the application behaves and can tailor performance and functionality to business requirements.</p>
<h3>Making the Choice</h3>
<p>If you&#8217;ve already got an existing project and you&#8217;re planning on adding a caching layer, don&#8217;t think that you have to re-implement your data access layer just to get better support for caching. Both approaches have their benefits and it&#8217;s far more important to be aware of which data needs to be cached and the best way to cache it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/04/how-much-cache-do-you-have/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Saving Session State (video)</title>
		<link>http://www.brentozar.com/archive/2013/03/saving-session-state-video/</link>
		<comments>http://www.brentozar.com/archive/2013/03/saving-session-state-video/#comments</comments>
		<pubDate>Wed, 27 Mar 2013 13:00:33 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[SQL Server 2012 News]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=18449</guid>
		<description><![CDATA[Session state frequently ends up on a busy SQL Server. What seemed like a good idea in development turns into a problem in production. While there are valid business reasons for persisting session state to permanent storage; there are equally &#8230; <a href="http://www.brentozar.com/archive/2013/03/saving-session-state-video/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<p>Session state frequently ends up on a busy SQL Server. What seemed like a good idea in development turns into a problem in production. While there are valid business reasons for persisting session state to permanent storage, there are equally valid reasons to avoid using SQL Server as the permanent storage. We’ll investigate why session state poses problems for SQL Server and cover an alternate solution that allows for persistent session state. This talk is for developers and DBAs who want a better way to safely track ASP.NET session state.</p>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/yyIV46c-XDY?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<h3>Links and References</h3>
<ul>
<li><a href="http://basho.com/riak/">Riak</a></li>
<li><a href="http://github.com/DistributedNonsense/CorrugatedIron">CorrugatedIron</a></li>
<li><a href="http://github.com/DistributedNonsense/CorrugatedIron.SessionState">CorrugatedIron.SessionState</a></li>
<li><a href="http://msdn.microsoft.com/en-us/library/hh361709.aspx">AppFabric Cache as Session State</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/03/saving-session-state-video/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>LandsofAmerica.com &#8211; Elastic Data Warehouse &#8211; A Case Study</title>
		<link>http://www.brentozar.com/archive/2013/03/landsofamerica-com-elastic-data-warehouse/</link>
		<comments>http://www.brentozar.com/archive/2013/03/landsofamerica-com-elastic-data-warehouse/#comments</comments>
		<pubDate>Tue, 26 Mar 2013 13:00:07 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[SQL Server 2012 News]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=18363</guid>
		<description><![CDATA[Company Overview LandsofAmerica.com is the largest rural listing service in the Nation. The Network specializes in land for sale, which includes farms, ranches, mountain property, lake houses, river homes, beachfront homes, country homes, and residential homes in smaller towns across &#8230; <a href="http://www.brentozar.com/archive/2013/03/landsofamerica-com-elastic-data-warehouse/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<h3>Company Overview</h3>
<p><a href="http://landsofamerica.com">LandsofAmerica.com</a> is the largest rural listing service in the Nation. The Network specializes in land for sale, which includes farms, ranches, mountain property, lake houses, river homes, beachfront homes, country homes, and residential homes in smaller towns across the country. These properties have many diverse uses including recreational and agricultural activities like hunting, fishing, camping, backpacking, horseback riding, four wheeling, grazing cattle, gardening, vineyards, cropland, raising horses, and other livestock.</p>
<h3>Business Challenges</h3>
<p>LandsofAmerica.com (LoA) has been collecting an immense amount of data about visitor usage and search patterns for several years using Microsoft SQL Server as their data storage solution. LoA is happy using SQL Server for OLTP workloads, but with nearly 1 billion rows of data, previous attempts at combining OLTP and analytical queries on the same SQL Server instances caused poor reporting performance. LoA&#8217;s goal was to provide this data to several thousand clients so it needed to be optimal for their use.</p>
<p>The real time requirements for data analysis are relaxed: a 24-hour lag between data collection and analysis delivery is acceptable for the business users. Many of the analysis questions took the form of &#8220;What is the average price range users are searching for in these 7 counties over the last 6 months?&#8221; This presented a problem: how could the business provide a product reporting on this data about user trends without sacrificing core application performance? The issue was LoA&#8217;s production SQL Server was unable to answer these questions and still serve OLTP data.</p>
<p>LoA&#8217;s team was faced with two choices: they could purchase a second server used solely for analytical querying or they could evaluate other options. During discussions with the company&#8217;s development team, I reviewed solutions that would let LoA use their historical data, help the business make better decisions, and move the data processing load outside of SQL Server.</p>
<p>Working closely with LoA, I designed and implemented a solution using <a href="http://hive.apache.org/">Apache Hive</a> hosted in <a href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a> (EMR) &#8211; EMR delivers managed Hadoop and Hive services, low cost storage, and flexible computing resources. LoA was up and running with EMR and Hive in just a few weeks.</p>
<h3>Use Case Description</h3>
<p>LandsofAmerica.com already leverages components of <a href="http://aws.amazon.com/">Amazon Web Services</a> in conjunction with Microsoft SQL Server. Extending their usage to Elastic MapReduce and Hive was an easy addition. Long term data storage is offloaded from SQL Server to <a href="http://aws.amazon.com/s3/">Amazon S3</a> to lower storage costs compared to traditional storage options. S3 stores the detailed source records exported from SQL Server and data stored in S3 is accessed through Hive.</p>
<p>Hive is used as a separate data processing system. User search and activity is aggregated along several key measures through Hive. The aggregated data is stored in S3 before being imported into an on-premises SQL Server for interactive querying and reporting.</p>
<h3>Impact</h3>
<p>Multi-dimensional search data provides many opportunities for complex analysis. This analysis is typically both CPU and disk intensive &#8211; it&#8217;s difficult to provide effective indexing techniques for large analytic queries. Through Hive&#8217;s ability to conduct large-scale analysis, LandsofAmerica.com is able to uncover trends that would otherwise remain hidden in their data. Other options exist to perform analysis, but carry a significant hardware and licensing cost, like the <a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/reference-architecture.aspx">Microsoft SQL Server Fast Track Data Warehouse</a>. By utilizing commodity cloud computing resources and Apache Hive, LandsofAmerica.com is able to gain insight across their collected data without a significant investment of capital &#8211; resources are consumed on-demand and paid for on-demand.</p>
<p>By using Hive as the definitive store of historical data, LandsofAmerica.com is able to reduce their local storage requirements. Older historical data can be removed from Microsoft SQL Server as it is loaded into Hive.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/03/landsofamerica-com-elastic-data-warehouse/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Introduction to Hive Partitioning</title>
		<link>http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/</link>
		<comments>http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/#comments</comments>
		<pubDate>Thu, 14 Mar 2013 13:00:10 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=18357</guid>
		<description><![CDATA[An Introduction to Hive&#8217;s Partitioning You&#8217;re probably thinking about building a data warehouse (just about every company is if they haven&#8217;t already). After reading SQL Server Partitioning: Not the Best Practice for Anyone and Potential Problems with Partitioning you&#8217;re wondering &#8230; <a href="http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/">Continue reading <span class="meta-nav">&#8594;</span></a><p>...<br /><i>Attending a fall conference? <a href="http://www.brentozar.com/services-we-provide/training/">Check out our Summit, SQL Rally, and SQL Intersection pre-con list.</a></i></p>
]]></description>
				<content:encoded><![CDATA[<h1>An Introduction to Hive&#8217;s Partitioning</h1>
<p>You&#8217;re probably thinking about building a data warehouse (just about every company is if they haven&#8217;t already). After reading <a href="http://www.brentozar.com/archive/2008/06/sql-server-partitioning-not-the-answer-to-everything/">SQL Server Partitioning: Not the Best Practice for Anyone</a> and <a href="http://www.brentozar.com/archive/2012/08/potential-problems-partitioning/">Potential Problems with Partitioning</a> you&#8217;re wondering why anyone would partition their data: it can be harder to tune queries, indexes take up more space, and SQL Server&#8217;s partitioning requires Enterprise Edition on top of that expensive SAN you&#8217;re adding to cope with the extra space. Anyone who is looking at implementing table partitioning in SQL Server would do well to take a look at using <a href="http://hive.apache.org">Hive</a> for their partitioned database.</p>
<h3>Partitioning Functions</h3>
<p>Setting up partitioning functions in SQL Server is a pain. It&#8217;s left up to the implementor to decide if the partition function should use <a href="http://www.brentozar.com/archive/2013/01/best-practices-table-partitioning-merging-boundary-points/#i-assumed-sql-server-would-remove-the-empty-filegroup">range right or range left</a> and how partitions will be swapped in and out. Writing robust partitioning functions is stressful the first time around. What if we didn&#8217;t have to define a partition function? What if the database knew how to handle partitioning for us? Hive does just that.</p>
<p>Rather than leave the table partitioning scheme up to the implementor, Hive makes it easy to specify an automatic partition scheme when the table is created:</p>
<pre><code>CREATE TABLE sales (
    sales_order_id  BIGINT,
    order_amount    FLOAT,
    order_date      STRING,
    due_date        STRING,
    customer_id     BIGINT
)
PARTITIONED BY (country STRING, year INT, month INT, day INT) ;
</code></pre>
<p>As we load data it is written to the appropriate partition in the table. There&#8217;s no need to create partitions in advance or set up any kind of partition maintenance; Hive does the hard work for us. The hardest part is writing queries. It&#8217;s a rough life, eh?</p>
<p>You might have noticed that while the partitioning key columns are a part of the table DDL, they&#8217;re only listed in the <code>PARTITIONED BY</code> clause. This is very different from SQL Server where the partitioning key must be used everywhere in a partitioned table. In Hive, as data is written to disk, each partition of data will be automatically split out into different folders, e.g. <code>country=US/year=2012/month=12/day=22</code>. During a read operation, Hive will use the folder structure to quickly locate the right partitions and also return the partitioning columns as columns in the result set.</p>
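<p>The folder naming is simple enough to sketch in a few lines of Python. This only illustrates the <code>column=value</code> layout described above; it is not Hive&#8217;s own code, and the function name is made up.</p>
<pre><code>def partition_path(partition_values):
    """Build the relative folder for one partition, in PARTITIONED BY order."""
    return "/".join(f"{column}={value}" for column, value in partition_values)


# Same partition as the example in the text.
print(partition_path([("country", "US"), ("year", 2012), ("month", 12), ("day", 22)]))
# country=US/year=2012/month=12/day=22
</code></pre>
<p>Because the partition values live in the folder names, a query that filters on <code>country</code> and <code>year</code> only has to look inside the matching folders.</p>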
<p>This approach means that we save a considerable amount of space on disk and it can be very fast to perform partition elimination. The downside of this approach is that it&#8217;s necessary to tell Hive which partition we&#8217;re loading in a query. To add data to the partition for the United States on December 22, 2012 we have to write this query:</p>
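<pre><code>-- Static partition load: the PARTITION clause supplies the partition values,
-- so the SELECT lists only the table's five non-partition columns.
INSERT INTO TABLE sales
PARTITION (country = 'US', year = 2012, month = 12, day = 22)
SELECT  sales_order_id,
        order_amount,
        order_date,
        due_date,
        customer_id
FROM    source_view
WHERE   cntry = 'US'
        AND yr = 2012
        AND mo = 12
        AND d = 22 ;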
</code></pre>
<p>This is a somewhat inflexible, but effective, approach. Hive makes it difficult to accidentally create tens of thousands of partitions by forcing users to list the specific partition being loaded. This approach is great once you&#8217;re using Hive in production but it can be tedious to initially load a large data warehouse when you can only write to one partition at a time. There is a better way.</p>
<h3>Automatic Partitioning</h3>
<p>With a few quick changes it&#8217;s easy to configure Hive to support dynamic partition creation. Just as SQL Server has a <code>SET</code> command to change database options, Hive lets us change settings for a session using the <code>SET</code> command. Changing these settings permanently would require opening a text file and restarting the Hive cluster &#8211; it&#8217;s not a difficult change, but it&#8217;s outside of our scope.</p>
<pre><code>SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
</code></pre>
<p>Once both of these settings are in place, it&#8217;s easy to change our query to dynamically load partitions. Instead of loading partitions one at a time, we can load an entire month or an entire country in one fell swoop:</p>
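<pre><code>-- Dynamic partition load: the last four SELECT columns map, in order,
-- to the partition columns (country, year, month, day).
INSERT INTO TABLE sales
PARTITION (country, year, month, day)
SELECT  sales_order_id,
        order_amount,
        order_date,
        due_date,
        customer_id,
        cntry,
        yr,
        mo,
        d
FROM    source_view
WHERE   cntry = 'US' ;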
</code></pre>
<p>When inserting data into a partition, it&#8217;s necessary to include the partition columns as the <em>last</em> columns in the query. The column names in the source query don&#8217;t need to match the partition column names, but they really do need to be last &#8211; there&#8217;s no way to wire up Hive differently.</p>
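<p>One way to check what a dynamic load actually created is to list the table&#8217;s partitions. This is a sketch that assumes the <code>sales</code> table has already been loaded for the US; the partitions listed will depend entirely on what is in the source data.</p>
<pre><code>-- List every partition in the table.
SHOW PARTITIONS sales ;

-- List only the partitions under country=US (a partial partition spec).
SHOW PARTITIONS sales PARTITION (country = 'US') ;
</code></pre>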
<p>Be careful using dynamic partitions. Hive has some built-in limits on the number of partitions that can be dynamically created as well as limits on the total number of files that can exist within Hive. Creating many partitions at once will create a lot of files and creating a lot of files will use up memory in the <a href="http://wiki.apache.org/hadoop/NameNode">Hadoop Name Node</a>. All of these settings can be changed from their defaults, but those defaults exist to prevent a single <code>INSERT</code> from taking down your entire Hive cluster.</p>
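<p>For reference, these limits are ordinary session settings and can be raised with <code>SET</code> in the same way as the dynamic partition options. The values below are illustrative only, not recommendations; pick numbers that suit your own cluster and Name Node.</p>
<pre><code>-- Maximum dynamic partitions a single statement may create in total.
SET hive.exec.max.dynamic.partitions = 5000;
-- Maximum dynamic partitions any single mapper or reducer may create.
SET hive.exec.max.dynamic.partitions.pernode = 500;
-- Maximum number of HDFS files a single job may create.
SET hive.exec.max.created.files = 200000;
</code></pre>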
<h3>What About Partition Swapping?</h3>
<p>Much like SQL Server, Hive makes it possible to swap out partitions. Partition swapping is an important feature that makes it easy to change large amounts of data with a minimal impact on database performance. New aggregations can be prepared in the background and swapped in once they are ready.</p>
<p>How do we perform a partition swap with Hive? A first guess might be to use <code>INSERT OVERWRITE TABLE ... PARTITION</code> to replace all data in a partition. This works, but it has the downside of deleting all of the data and then re-inserting it. Although Hive has no transaction log, we&#8217;ll still have to wait for the data to be queried and then written to disk. Your second guess might be to load data into a different location, drop the original partition, and then point Hive at the new data like this:</p>
<pre><code>ALTER TABLE sales 
    DROP IF EXISTS PARTITION 
    (country = 'US', year = 2012, month = 12, day = 22) ;

ALTER TABLE sales 
    ADD PARTITION (country = 'US', year = 2012, month = 12, day = 22) 
    LOCATION 'sales/partitions/us/2012/12/22' ;
</code></pre>
<p>It&#8217;s that easy: we&#8217;ve swapped out a partition in Hive <em>and</em> removed the old data in one step. Truthfully, there&#8217;s an even easier way using the <code>SET LOCATION</code> clause of <code>ALTER TABLE</code>.</p>
<pre><code>ALTER TABLE sales
    PARTITION (country = 'US', year = 2012, month = 12, day = 22)
    SET LOCATION 'sales/partitions/us/2012/12/22' ;
</code></pre>
<p>Just like that, the new partition will be used. There&#8217;s one downside to this approach: the old data will still exist in Hadoop, because only the metadata has changed. If we want to clear out the old data, we have to drop down to HDFS commands and delete it out of Hadoop itself.</p>
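<p>A rough sketch of that cleanup, run from the Hive command line, might look like the following. The directory shown is hypothetical; look up the partition&#8217;s real location before the swap rather than trusting this path, and double check it before deleting anything.</p>
<pre><code>-- Before the swap: note the Location value reported for the partition.
DESCRIBE FORMATTED sales
    PARTITION (country = 'US', year = 2012, month = 12, day = 22) ;

-- After the swap: delete the old directory using the Hive CLI's dfs command.
-- Hypothetical path; substitute the Location noted above.
-- Older Hadoop releases use -rmr instead of -rm -r.
dfs -rm -r /user/hive/warehouse/sales/country=US/year=2012/month=12/day=22;
</code></pre>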
<h3>Is Hive Partitioning Right For You?</h3>
<p>If you&#8217;re thinking about partitioning a relational database, you should give serious consideration to using partitioned tables in Hive. One of the advantages of Hive is that storage and performance can be scaled horizontally by adding more servers to the cluster &#8211; if you need more space, just add a server; if you need more computing power, just add a server. Hive&#8217;s approach to data skips some of the necessary costs of partitioning in SQL Server &#8211; there&#8217;s no Enterprise Edition to purchase, minimal query tuning involved (hint: you should almost always partition your data in Hive), and no expensive SAN to purchase.</p>
<p>For better or for worse &#8211; if you&#8217;re thinking about partitioning a data warehouse in SQL Server, you should think about using Hive instead.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
