I feel like the luckiest guy in the world: I get to take part in six major SQL Server events in 2012. Here’s where I’m going and why:
January 26-Feb 4: SQLCruise Miami
SQLCruise is the highlight of my year. This year, we’ve transitioned SQLCruise completely to Tim Ford – he owns the event start to finish. I’m really excited because I get to sit back and enjoy the event a little more, and Tim gets to do what he truly shines at: bringing good people together to learn and bond.
I’m doing three all-new in-depth sessions on SQLCruise Miami: Procedure Cache BI, Scaling SQL with Solid State, and SAN Features for Senior DBAs. All three are targeted at DBAs who have to make applications go faster. You need 3-5 years of experience with SQL Server, and you should already be familiar with the basics of reading execution plans and designing indexes.
I like SQLCruise because it’s a limited number of people together for an extended period of time. I get to know every single attendee, hear about their challenges at work, and have plenty of one-on-one time for design questions. Some attendees have started bringing their laptops armed with a SQL Server instance, a copy of their production database, and their toughest queries or design problems. I love discussing this stuff for hours over margaritas.
Trust me, this is the cheapest way to get the most honest consulting, and it’s not just from the presenters – it’s also from your fellow attendees. Learn more at SQLCruise.com.
Feb 28-Mar 2: MVP Summit Redmond and SQLSaturday Redmond
The annual international MVP shindig is the coolest benefit of being a Microsoft MVP. We have to pay for our flights to Seattle, but once we’re there, Microsoft picks up pretty much everything else. We get insider NDA sessions, and given the timing of the events below, I’m hoping we’ll get insider content on SQL Server 2012 at the MVP Summit this year. The training really isn’t my favorite part – I just like getting the chance to spend more time with my favorite community and Microsoft people. There’s no substitute for face time with these folks.
The public isn’t invited to the MVP Summit, but the good news is that the MVP Summit will bring a ton of great speakers to SQLSaturday Redmond! This might be the best SQLSaturday all year. I try to work in SQLSaturdays whenever one matches my travel schedule, and this one fits perfectly. I bet lots of out-of-town MVPs will make a similar decision.
March 25-31: SQL Connections Las Vegas
This Connections is billing itself as a SQL Server 2012 launch event, and it just so happens that the conference date exactly lines up with the next SQLBits conference too. Does this mean we’ll see SQL Server 2012 released here? Who knows – it’s tough for conference planners and software vendors to get all the stars to line up simultaneously.
Connections feels like the most training-oriented of the conferences: it focuses on quality, not quantity. There are fewer simultaneous sessions, but the presenters tend to be very polished professionals who are very accustomed to being onstage. (Presenters get free registration, hotel, and a fee per session.) It lacks the rowdy, party, community feeling of the PASS Summit, but there’s clearly an audience for a more traditional learning event. I like both events for different reasons – I meet more friends at the Summit, but I make more consulting business connections at Connections.
I’m doing a few sessions at Connections, but the official itinerary isn’t out yet. The only one I can announce for sure is my all-day Virtualization and SAN Basics for DBAs session, which will be a $399 post-con session on March 30th. Registration is open now.
May 24-June 2: SQLCruise Alaska
I’m doing the same presentations that I did on SQLCruise Miami. Neither cruise is more junior or senior than the other – it just boils down to whether you’d like to see the islands or the icebergs.
There’s one big difference, though: in Alaska, I teamed up with Jeremiah, Kendra, and a friend of ours to get the Garden Villa suite. SQLCruisers will all get to hang out with us here:
Look like fun? Book yourself a cheap inside room (they’re as low as $679 per person right now) and register for SQLCruise.
Fall: SQL Connections Las Vegas
Sometime in the fall, I’ll be returning to Vegas for another round of Connections. The dates, exact location, and session agendas are still to be determined.
November 6-9: PASS Summit Seattle
The PASS Summit is the biggest event of the year in terms of quantity: thousands of SQL Server professionals from all over the world gather together to talk shop. It’s a zoo of nonstop activity, a whirlwind of tweeting and meeting that goes by in a blur. I love getting the chance to meet up with some of my best friends for the briefest of moments.
There are at least a dozen sessions going on simultaneously at any given time, but as big as the event is, there’s fierce competition for speaking spots. Speakers get free registration (roughly a $1,500 value), so speakers from all over the world vie for spots. The call for sessions hasn’t gone out yet, so I have no idea whether I’ll be accepted, but I’d pay out of my own pocket to go even if I didn’t get the chance to speak.
And I Wish I Could Hit More!
The good news – heck, the GREAT news – is that I’ll be hitting so many major conferences next year. I pinch myself when I think about this because it just seems too good to be true.
The bad news is that I’d love to do more: I would love to do SQLBits, TechEd, a few CodeCamps, storage conferences, and VMworld, but with just these ones above, I’m already away from home for six weeks. In order to keep a good work/life balance, I try to only travel one week per month, and that only leaves me 6 more weeks of travel in 2012. Since I like to eat and pay rent, I gotta use those other 6 weeks for client projects to make moolah. <sigh> Decisions, decisions.
Here’s what I’ve been reading lately on the SQL Server front. I’d like to thank all three authors and publishers for providing these books for review.
Denny’s a jack of all trades, and a Microsoft Certified Master of SQL Server to boot. I’ve had the pleasure of knowing Denny for a few years, and he’s on the short list of people who I’d trust with my server password list. (I wouldn’t trust him with my laptop password, though, because he’s got a wicked sense of humor.) When Denny first told me he was writing a security book by himself, I asked him why – it seemed like such a dry topic. Amazingly, Denny was able to bring this material to life in a friendly, readable way.
It’s easy to recommend this book for companies that store financial, medical, or personally identifiable data in their databases, but going beyond that, I think every company with a multi-DBA team should have a copy of this book on their bookshelf. A lot of my clients ask questions that are answered well in this book, and this book is way cheaper than a HIPAA/PCI/SOX audit.
If you’re a solo DBA at a shop, I probably wouldn’t recommend this book, though. The topics covered in this book take time to implement, and they’re usually beyond the scope of what a single person has the time to do when they’re firefighting. Yes, I wish all our databases were secure, but solo DBAs probably need to start with more basic security concepts such as how to configure service accounts, and that’s not covered here. This focuses on senior DBA level knowledge.
Troubleshooting SQL Server by Jonathan Kehayias and Ted Kreuger
Jonathan and Ted are forum monsters: they’re constantly patrolling user questions looking for ways they can help. As a result, they’re well-versed in the typical problems DBAs face and the best ways to solve those problems. They’ve compiled the results into a book. The book reads like a very polished forum answer: if you’ve got a CPU problem, turn to Chapter 3. You’ll learn what tools to use to diagnose the issue, the most common problems, and the best solutions.
Jonathan was one of the coauthors on Professional SQL Server 2008 Internals and Troubleshooting (Paperback – Kindle), and at first glance, it might sound like those two books are similar. This book is different because it takes a problem-and-solution approach, whereas our Pro 2008 Internals teaches the reader about internals first, then explains troubleshooting tools you can use to look under the hood. I’d say that Troubleshooting SQL Server is a faster approach to getting ‘er done, and I’m adding it to my list of recommended books for SQL Server DBAs.
Lemme just start out by saying that I don’t like posting negative reviews. I first received a review copy of this, marked it up, and emailed the publisher and author with a list of questions. Despite a long dialog, I never got answers to the technical issues, so I tossed the book in the closet. I wouldn’t have posted this review, but I noticed a disturbing number of five-star reviews for the book on Amazon. It wouldn’t be fair for me to let people spend their money on this, because by buying it and following the advice, they’re hurting their SQL Servers.
Here’s a few examples of the advice:
- P335 – in a section on designing maintenance tasks in a mission-critical environment, the reader is instructed to set up maintenance plans that reorganize indexes, rebuild indexes, and update statistics. This is the worst approach possible for a mission-critical environment: it will do all three tasks against every index in the database, every time. If we absolutely had to use this approach, we could accomplish the same thing by simply rebuilding all indexes – but again, even that is a bad idea in a mission-critical environment. It even tells the reader to create separate schedules, but doesn’t say how or when these tasks should run.
- P361 – “Place transaction logs on RAID5 rather than RAID1+0 for sequential workloads.” Not true, as Microsoft documented in Table 1A of this doc.
- P360 – in a section on building a scalable server: “…edition-wise it must be either DataCenter or Enterprise edition to host production data.” What, Standard isn’t good enough for production?
- P70 – in the wait stats troubleshooting query, useless wait stats aren’t filtered out. The example screenshot shows that the server is bottlenecked by SQLTRACE_INCREMENTAL_FLUSH_SLEEP, XE_TIMER_EVENT, and FT_IFTS_SCHEDULER_IDLE_WAIT, none of which are wait events that a DBA should spend time investigating. Contrast this with the well-thought-out, production-tested queries in Troubleshooting SQL Server, and it’s pretty disappointing.
- P360 – “Configure the storage and memory L2 cache per CPU has a minimum of 2 MB to improve memory.” There are so many grammatical and technical errors here I don’t even know where to begin.
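For context on that P70 point: a wait stats query is only useful if it filters out the wait types that accumulate even on an idle server. A minimal sketch of the idea (the filter list here is illustrative, not exhaustive):

```sql
-- Illustrative sketch: top waits with common benign/background wait types
-- filtered out, so real bottlenecks float to the top.
SELECT TOP 10
       wait_type,
       waiting_tasks_count,
       wait_time_ms / 1000.0 AS wait_time_seconds
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN
      ('SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'XE_TIMER_EVENT',
       'FT_IFTS_SCHEDULER_IDLE_WAIT', 'LAZYWRITER_SLEEP',
       'SLEEP_TASK', 'REQUEST_FOR_DEADLOCK_SEARCH',
       'BROKER_TO_FLUSH', 'CHECKPOINT_QUEUE')
ORDER BY wait_time_ms DESC;
```

Production-grade versions filter dozens more idle waits than this, but even this much keeps SQLTRACE_INCREMENTAL_FLUSH_SLEEP from showing up as your “bottleneck.”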
To top it all off, many of the book’s sample queries simply don’t work – keywords are mashed together with no spacing, like page 69’s “SELECTTOP” and “CROSSAPPLYsys”. Ouch.
Needless to say, I don’t recommend adding this book to your collection, but if you’re itching for holiday reading, check out my recommended SQL Server book list.
Microsoft has made announcements about bringing Hadoop to Windows, but what does that mean for SQL Server? In this talk, Jeremiah Peschka will cover the Hadoop ecosystem at a high level and discuss specific use cases for Hadoop. By the end of this talk, you should have a general idea of how different parts of Hadoop can be used to augment SQL Server’s rich functionality. This talk is for database professionals who are looking for ways to extend the capability of SQL Server with different pieces of the Hadoop ecosystem.
- Apache Hadoop Project Page
- Hadoop World Follow-Up – I was lucky enough to attend Hadoop World in 2010; here’s some of the knowledge I came away with.
- Hadoop Revisited – a review of the Hadoop ecosystem changes from 2012, including Hadoop, Hive, and Pig
Getting Started with Hadoop – Downloadable Virtual Machines
- Cloudera Hadoop Downloads (includes virtual machines)
- Hortonworks Sandbox
- Microsoft HDInsight Server
Data Warehousing with Hadoop and Hive
- Getting Started with Hive
- Introduction to Table Partitioning in Hive
- Computing Rolling Averages in Hive
- ETL and Data Processing with Hive, Hadoop, and SQL Server
- Hive and Hadoop Data Warehouse: Lessons Learned (video)
Hungry for more? Take a look at Hadoop Revisited – a survey of the Hadoop landscape at the close of 2012.
What is Hadoop?
Hadoop is a group of tools to help developers create bigger tools. More specifically, Hadoop is a basic set of tools that help developers create applications spread across multiple CPU cores on multiple servers – it’s parallelism taken to an extreme. Although Hadoop is a set of tools and libraries, there are a number of products that are lumped into the same bucket as Hadoop and, frequently, they’re all referred to as Hadoop. Instead of describing every piece of Hadoop in detail, I’m going to focus on the functionality that’s of the most interest to SQL Server professionals.
Data Warehousing in Hadoop
If you need to work with big data, Hadoop is becoming the _de facto_ answer. But once your data is in Hadoop, how do you query it?
If you need big data warehousing, look no further than Hive. Hive is a data warehouse built on top of Hadoop. Hive is a mature tool – it was developed at Facebook to handle their data warehouse needs. It’s best to think of Hive as an enterprise data warehouse (EDW): Hive can be used to research complex interactions across your company’s entire history of data, but in exchange for that power, you have to accept that queries will return in minutes. Unlike traditional EDWs, Hive is spread across tens, hundreds, or even thousands of commodity-grade servers.
Hive was designed to be easy for SQL professionals to use. Rather than write Java, developers write queries using HiveQL (based on ANSI SQL) and receive results as a table. As you’d expect from an EDW, Hive queries will take a long time to run; results are frequently pushed into tables to be consumed by reporting or business intelligence tools. It’s not uncommon to see Hive being used to pre-process data that will be pushed into a data mart or processed into a cube.
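As a taste of what that pre-processing looks like, here’s a hypothetical HiveQL sketch (the table and column names are made up for illustration) that rolls raw clickstream data up into a reporting table:

```sql
-- Hypothetical HiveQL: summarize raw events into a table that a BI tool
-- or data mart load can consume. Table and column names are illustrative.
INSERT OVERWRITE TABLE daily_page_views
SELECT to_date(event_time) AS view_date,
       page_url,
       COUNT(*)            AS views
FROM clickstream
GROUP BY to_date(event_time), page_url;
```

It reads like ANSI SQL, which is exactly the point – behind the scenes, Hive compiles this into MapReduce jobs that run across the whole cluster.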
While Hive can operate on large volumes of data, it’s not the most efficient tool: Impala seeks to overcome some of the limitations of Hive by making better use of CPU, memory, and disk resources. Impala operates more like SQL Server – data is cached in memory to improve query performance. Although Impala uses a separate query engine from Hive, it works with data that’s already in Hadoop, making it easy to query massive amounts of data without having to store your data twice.
Both Impala and Hive are great for businesses querying large amounts of data while avoiding the expense of massively parallel EDW solutions like Microsoft SQL Server PDW or Oracle Exadata. Hive is in a stable release cycle and, although Impala is still a beta product, many organizations are deploying one or both solutions to tackle their largest workloads.
Data Flow and ETL with Hadoop and Pig
SQL Server professionals are familiar with using SQL Server Integration Services (SSIS) to move data around their organization. SSIS provides a rich set of functionality for manipulating data, but it’s difficult to make SSIS operations run across multiple CPU cores, much less multiple servers.
Pig is a tool for creating parallel data workflows. Pig takes advantage of the Hadoop tools to provide rich functionality across huge amounts of data. Pig makes it easy to perform step-by-step data manipulation over large data sources using a combination of different tools and functionality. There are a number of great reasons to use Pig (parallel processing, sampling, and loose schema requirements), but it’s safe to say that Pig is a great tool for processing data with Hadoop.
Deep Analysis of Data
SQL Server professionals are used to having analytic insight available, either through SQL Server’s windowing functions or through SQL Server Analysis Services. Although Hadoop doesn’t natively provide tools for OLAP style cubes or for windowing functions, it’s possible to gain insight from your data using Hadoop. Unfortunately, deep analytics are not Hadoop’s strong suit out of the box. Teams looking to take advantage of large scale data analytics will be doing a lot of heavy lifting themselves.
Mahout is a set of libraries that can be used to distribute analytics around a cluster, but there are limitations to the flexibility and interactivity of Mahout. Developers looking for the ad hoc interactive capabilities of SQL Server Analysis Services (or even of a relational data warehouse) will be disappointed. Bulk computation can be performed disconnected from users, but Mahout and Hadoop don’t provide any kind of ad hoc querying capability.
Real Time Querying
So far, all of the use cases we’ve explored have been based on distributed batch processes and large scale querying. Even though Impala is a vast performance improvement over Hive, Impala still responds in a matter of several seconds to several minutes – hardly fast enough for interactive querying. Databases are used for more than running massive reports, and this is where HBase comes into play.
HBase is a real-time, random access, read-write database built on top of Hadoop. This isn’t a database like SQL Server with tables and joins; HBase is a NoSQL database that’s loosely based on Google’s BigTable database. There are tables, there are columns, but the schema isn’t as rigid as a relational database’s. Developers will be able to solve many problems with HBase, but there will be a bit of a learning curve as they understand the data model and update their data structures to work effectively with HBase. Data stored in HBase can even be queried through Hive or Impala, making it possible to combine transactional and reporting data in the same Hadoop cluster – the scale and redundancy of Hadoop make it easier to reduce load on any single system and avoid many problems associated with reporting from a transactional data source.
When Should You Use Hadoop?
Ultimately, you’re looking for an answer to the question “When should I use Hadoop?” This is a difficult question to answer. Hadoop may make sense for part of a workload, or even for all of it. The best way is to start by looking at your environment and asking questions like:
- Can I keep my data on a single instance?
- Can I keep my data on a single instance and do it cheaply?
- Are my queries running fast enough?
- Do I need complex, interactive, ad hoc analytics?
- What type of latency is acceptable between data arrival, analysis, and queryability?
Understanding your workload is critical to determining if you’ll be able to use Hadoop to meet your needs. Having realistic expectations of Hadoop is equally critical. No part of Hadoop will solve all of the problems an organization is facing. Hadoop can mitigate some problems, but it presents a different set of challenges – being comfortable with the limitations of Hadoop will go a long way toward having a successful implementation.
When something remarkable is around for a while, it becomes easy to take it for granted.
Today, I’ll share three awesome things about SQL Server with tips to give you an edge when using each one.
SQL Server is a huge product. We’ve got the database engine, Analysis Services for cubes, Integration Services to process and transform data, and an increasing set of other complex products for working with data documented in Books Online.
Tip: When using Books Online, check the Community Content at the bottom of the page first for any errors or questions others have raised— this can save you time and trouble if an important detail is missing or in error. Check other versions of the same page using the link at the head of the topic, and always read critically. Even encyclopedias can’t be perfect!
Microsoft’s documentation doesn’t stop with Books Online. Microsoft also publishes in-depth whitepapers for SQL Server. I recommend you periodically review the list to make sure you know what’s available, then set aside time to read the topics which are most useful to you.
We also have sites where community members publish high quality documentation for SQL Server. The SQL Server community is so vast that it’s impossible to mention them all. Three of my favorite sites for technical documentation are:
- Erland Sommarskog’s texts on SQL;
- SQL Server Stairways on SQLServerCentral;
- The SQL Server Customer Advisory Team Blog and SQL CAT articles.
Tip: When you see unusual behavior in SQL Server, search Microsoft Connect. This is where users report bugs and suggestions for the product. Make sure you log in to search: external search engines typically won’t lead you to what you need to find.
We have many ways to check out what’s currently happening in SQL Server. The instrumentation for the product is very well developed and gives a lot of options to see what’s processing at any given time.
The top four ways I check to see what’s happening are by using:
- Dynamic Management Objects – We can find out an awful lot with queries— and more with each version of the product;
- SQL Trace – Our old friend, sometimes used with SQL Server Profiler;
- Extended Events – The new, leaner and meaner hotness for tracing: SQL 2008 and higher;
- Perfmon Counters – Windows and SQL Server specific counters.
Each of these techniques has its own strengths, and Extended Events is becoming increasingly powerful. Together, all of these methods provide a vast array of information about what’s happening in and around SQL Server.
Gathering data with each technique has its own cost. It takes experimenting and research to know what you can get away with and where you need to hold back.
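As an example of the first technique on that list, here’s a minimal DMV sketch showing what’s executing right now:

```sql
-- Minimal sketch: currently executing requests, with their wait type
-- and query text, via the dynamic management objects.
SELECT r.session_id,
       r.status,
       r.wait_type,
       r.total_elapsed_time,
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;  -- exclude this query itself
```

Queries like this are cheap to run, which is exactly why the DMVs are usually the first stop before reaching for a trace.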
Tip: When it comes to finding out what’s going on in production, treat your investigation like a database change. That means you should test your method against another environment, even if you can’t reproduce the issue there. Always think about how your method of investigating may be impacted by increased load, and make sure you have a way to monitor its impact.
SQL Server ships with rich, user-friendly tools.
SQL Server Management Studio covers a lot of ground with different features. It helps us navigate a given installation and also develop Transact SQL. It has built-in reports to help describe what’s going on at instance and database levels.
It has sqlcmd mode if you prefer a different kind of scripting and want to interact more with Windows. You can use Object Explorer Details to select multiple items like Windows Explorer. We can view Graphical Execution Plans. Or, we can just run Transact SQL statements in multiple windows.
Tip: SSMS lets you choose to generate a script for almost every action you want to do rather than just executing it through the GUI. This feature is just plain awesome. Even if you choose to execute a change through the GUI, I recommend you always script out the command and save it off as a record of your change.
Want more SSMS tips? Check out Jes Schultz Borland’s blog post Tips and Tricks to Make SQL Server Management Studio Awesome.
What’s Cool About These Things Together
SQL Server may be complicated, but we have a lot of ways to work with it and to learn more about it.
As the SQL Server product grows, the tools and increasingly improved instrumentation allow us to understand the product more deeply. Books Online is just a launching pad – the tools and the methods we have to see what’s going on in SQL Server help us take off from there.
Together with community members who want to share knowledge, this creates an interested, invested group of technologists who write about what they’re learning. And that’s something we really shouldn’t take for granted.
The most popular tech Q&A site in the world serves 12-14 million web pages per day with Microsoft SQL Server 2008 R2. They’re passionate about performance, and they’ll share the scalability lessons they learned along the way. This session is aimed at production DBAs who manage SQL Servers that need to go faster and SQL programmers who don’t understand why their database won’t deliver queries quicker.
More of My Free Training Videos
These two technologies can make a very big — and very bad — difference in how your SQL Server performs. Wouldn’t it be great if you could get the real, honest lowdown from a virtualization administrator, a SAN administrator, and a DBA? Wouldn’t it be even better if one person had done all three, and could give you the pros and cons of each point of view?
In this one-hour session, I explain how virtualization changes CPU, memory, and monitoring, and I show how to get specific recommendations for your make & model of SAN:
Data matters. Every day we generate huge volumes of data. Processing all of this data presents challenges for many people.
Pig is a data flow language. It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently. Best of all, it supports many relational features, making it easy to join, group, and aggregate data. If you think this sounds a lot like an ETL tool, you’d be right. Pig has many things in common with ETL tools – if those ETL tools ran on many servers simultaneously.
Where would you use Pig?
Case 1 – Time Sensitive Data Loads
Loading data is a key part of many businesses. Data comes in from outside of the database in text, XML, CSV, or some other arbitrary file format. The data then has to be processed into a different format and loaded into a database for later querying. Sometimes there are a lot of steps involved, and sometimes the data has to be translated into an intermediate format, but most of the time it gets into the database – some failure is to be expected, right?
Loading large volumes of data can become a problem as the volume of data increases: the more data there is, the longer it takes to load. To get around this problem people routinely buy bigger, faster servers and use more fast disks. There comes a point when you can’t add more CPUs or RAM to a server and increasing the I/O capacity won’t help. Parallelizing ETL processes can be hard even on one machine, much less scaling ETL out across several machines.
Pig is built on top of Hadoop, so it’s able to scale across multiple processors and servers, which makes it easy to process massive data sets. Many ETL processes lend themselves to being decomposed into manageable chunks, and Pig is no exception. Pig builds MapReduce jobs behind the scenes to spread load across many servers. By taking advantage of the simple building blocks of Hadoop, data professionals are able to build simple, easily understood scripts to process and analyze massive quantities of data in a massively parallel environment.
An advantage of being able to scale out across many servers is that doubling throughput is often as easy as doubling the number of servers working on a problem. If one server can solve a problem in 12 hours, 24 servers should be able to solve it in 30 minutes.
Case 2 – Processing Many Data Sources
Knowing the effectiveness of an advertisement is big business. For people buying ad space, it’s critical to know just how effective their advertising is in both the physical and virtual space. By combining advertising information from multiple sources with web server traffic, IP geo-location, and click-through metrics, it’s possible to gain a deeper understanding of customer behavior and judge just how effective certain ads are in certain parts of the country.
Pig isn’t just designed to scale out over many servers. Pig can be used to create complex data flows and extend them with custom code. A job can be written to collect web server logs, use external programs to fetch geo-location data for the users’ IP addresses, and join the new set of geo-located web traffic to click maps stored as JSON, web analytic data in CSV format, and spreadsheets from the advertising department to build a rich view of user behavior overlaid with advertising effectiveness.
Creating this rich view of data is possible because Pig supplies complex features like joins, sorting, grouping, and aggregation. The syntax is different from what developers are used to, but Pig’s focus on data flow makes it easy to write complex jobs. Rather than creating complex logic in SQL, developers can create jobs that walk through data step by step to deliver the best results. It’s easy to rapidly prototype these procedural jobs, and performance tuning can be accomplished with relative ease.
Case 3 – Analytic Insight Through Sampling
In Case 2, we saw how Pig can provide some analytical insight into the massive quantities of data that are generated every day in the datacenter. It’s easy to fall into the trap of thinking that Pig is just ETL glue that moves data from a log file, processes it, and drops it off for another database to consume. Pig is more than just an ETL tool.
One of Pig’s strengths is its ability to perform sampling of large data sets. As Pig manipulates data, it’s easy to reduce the set of data that we’re operating on using sampling. By sampling with a random distribution of data, we can reduce the amount of data that needs to be analyzed and still deliver meaningful results.
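Pig’s SAMPLE operator does this in a single line; here’s the same idea as a rough HiveQL sketch (the table name is illustrative, and rand() is evaluated per row in Hive):

```sql
-- Illustrative: keep roughly 1% of rows, chosen at random, before running
-- heavier analysis. In Pig the equivalent is: sampled = SAMPLE raw_events 0.01;
SELECT *
FROM raw_events
WHERE rand() < 0.01;
```

Run the expensive analysis against the sample first; if the results look meaningful, promote the job to the full data set.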
Pig isn’t a replacement for SQL Server Integration Services. Their use cases overlap for many tasks, but they also solve very different problems. Using Pig for all ETL processes will be overkill when the data can reasonably be handled within a single SQL Server instance. On the flip side, there are problems that are too large to quickly solve within a single SSIS process or package. In either situation you should pick the best tool for the job.
Want to learn more about Hadoop and Pig? Check out our Introduction to Hadoop training class.
The NOLOCK hint: we all use it sometimes, and some of us use it almost all the time. In this session Kendra Little shows three phenomena which will cause you to look differently at the NOLOCK hint. If you have two years of experience working with TSQL and database structures, this session is for you.
For more videos like this:
Step right up, ladies and gentlemen, and I will sell you the solution to all of your database needs. That’s right, it’s Doctor Codd’s Third Normal Form, guaranteed to cure all ailments of the schemata, pragmata, and performata. Doctor Codd’s Form will make your data performant and compressive. Accept no substitutes or imitators; Doctor Boyce’s lamentable attempts cannot soothe your aches and pains like Doctor Codd’s Third Normal Form. Why, with just a simple application of Doctor Codd’s Third Normal Form, thrice daily, you’ll be jumping around to the tune of normal forms and transactions in no time!
Anyone pushing a single idea is pushing snake oil, plain and simple. They’re selling you a warm and fuzzy feeling that you’ll make your problems go away by following their simple prescriptions. Deviation from the remedy will, of course, result in problems, failure, and potentially phlebitis.
Can I Cure You?
No, I can’t. Well, I can, but I’m not going to. Not yet, at least. You need to pay attention, first.
The Forms, Both Magnificent and Normal, Are Not a Panacea
Slavish adherence to normalization is bad for your health. There are as many reasons to not normalize data as there are reasons to normalize your data. Don’t believe me? What if I asked you to design a database to persist items that we might sell in a store?
It’s easy, at first, to design an items table with a few columns to describe the main properties that we want to persist about an item in our store. Problems begin when different departments in the store need to save different properties. Different parts of our IT systems will need different views of the data. While adding a column is trivial on small databases, adding a column in a large database is decidedly non-trivial. Eventually the database boils down to an item_properties table, and at that point the database becomes impossible to query reasonably.
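That item_properties end state is the classic entity-attribute-value pattern; a sketch of what it tends to look like (column sizes are illustrative):

```sql
-- Illustrative: the generic property-bag (entity-attribute-value) table
-- the text describes. Every attribute becomes a row, and every report
-- becomes a pile of self-joins and pivots.
CREATE TABLE item_properties (
    item_id        INT          NOT NULL,
    property_name  VARCHAR(50)  NOT NULL,
    property_value VARCHAR(MAX) NULL,
    PRIMARY KEY (item_id, property_name)
);
```

Flexible to write, miserable to query: fetching ten properties for one item means ten rows, and comparing items by property means a join per property.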
A Solution Most Surprising
We can solve this problem a few ways, but with Microsoft’s Hadoop announcements, it makes sense to look at what the non-relational world can offer. HBase is a real-time column-oriented database that runs on top of Hadoop.
HBase is helpful for modeling dynamic properties because of its flexible data model. While HBase does have tables, rows, and columns, there are some powerful differences. HBase’s columns are split up into column families – logical groupings of columns. Once a column family has been created, columns can be added to it on the fly.
Jumping back to our example, instead of modeling an item_properties table, we can create an items table and create column families to store properties specific to a department or for a common purpose. Rather than create many tables, we can add a shipping_info column family, an accounting column family, and a sales_promotion column family. Over time, this flexible data model can be used to populate reporting tables in an enterprise data warehouse. Rather than focus initial efforts on building a robust general-purpose schema in an RDBMS, it’s easy to create a flexible schema in HBase and pull out the data we need for reporting later.
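To see why the column-family shape helps, here’s a toy in-memory model in Python – a sketch of the storage layout, not the real HBase API (the family and column names are the hypothetical ones from the example above). Families are declared up front; the columns inside them appear on first write, with no schema change:

```python
from collections import defaultdict

class ColumnFamilyTable:
    """Toy model of HBase's layout: row key -> column family -> column -> value."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = defaultdict(lambda: {f: {} for f in self.families})

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # The column itself needs no declaration -- it springs into
        # existence inside its family on the first write.
        self.rows[row_key][family][column] = value

    def get(self, row_key, family):
        return self.rows[row_key][family]

items = ColumnFamilyTable("shipping_info", "accounting", "sales_promotion")
items.put("sku-1001", "shipping_info", "weight_kg", "2.5")
items.put("sku-1001", "accounting", "unit_cost", "12.99")
# A brand-new column for one department, added with no schema change:
items.put("sku-1001", "shipping_info", "hazmat", "false")
print(items.get("sku-1001", "shipping_info"))  # -> {'weight_kg': '2.5', 'hazmat': 'false'}
```

Each department reads only its own family, so the shipping system never has to care what accounting has bolted on.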
A Final Commentary on Data
Denormalization doesn’t have to be a dirty word. There are many reasons to denormalize data. Ultimately, the process of shredding data apart should depend not on blind adherence to the principles of normalization but on the needs of the applications that consume the data. If you have a log file processing application, does it make sense to read log files from disk into a relational database? Shredding every log entry into multiple columns doesn’t make sense when log files are processed only infrequently and used mainly to produce aggregations.
Even when you eventually need to query the log file data, there are tools suited to performing SQL-like operations across flat files. Hive provides a SQL-like querying layer on top of the Hadoop framework, making it possible to run bulk queries across large volumes of data stored in flat files and spread across many servers.
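As a small illustration of consumption-driven structure, here’s a Python sketch that produces an aggregation straight from flat log lines, without ever shredding them into table columns (the log format here is made up for the example; at scale, Hive over HDFS plays the same role):

```python
from collections import Counter

# Hypothetical web-server-style log lines; in practice these would be
# streamed from files on disk rather than held in a list.
log_lines = [
    "2012-01-15 GET /index.html 200",
    "2012-01-15 GET /missing 404",
    "2012-01-16 POST /orders 200",
    "2012-01-16 GET /index.html 200",
]

# The only structure we impose is the one the consumer needs:
# the HTTP status code at the end of each line.
status_counts = Counter(line.split()[-1] for line in log_lines)
print(status_counts["200"])  # -> 3
```

If the business only ever asks "how many errors per day?", this is the whole schema the data needs.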
Know how data is used; know the problem that the business wants to solve. Let the principle of consumption drive the structure of your information. You will thank me, some day, for freeing you from the false rigor of normalization.
TempDB is an important part of a healthy database server. To keep your databases healthy, you can’t just monitor the user databases – you have to pay attention to the whole system. Join Jeremiah Peschka and the Brent Ozar Unlimited® team as they take a look at how to monitor an overlooked cause of database health problems:
Monitoring file space used in TempDB:
SELECT SUM(user_object_reserved_page_count) * 8 AS usr_obj_kb,
       SUM(internal_object_reserved_page_count) * 8 AS internal_obj_kb,
       SUM(version_store_reserved_page_count) * 8 AS version_store_kb,
       SUM(unallocated_extent_page_count) * 8 AS freespace_kb,
       SUM(mixed_extent_page_count) * 8 AS mixedextent_kb
FROM sys.dm_db_file_space_usage;
Historical information about TempDB usage:
SELECT TOP 5 *
FROM sys.dm_db_task_space_usage
ORDER BY (user_objects_alloc_page_count + internal_objects_alloc_page_count) DESC;
Determine which queries are using large amounts of TempDB:
SELECT t1.session_id, t1.request_id, t1.task_alloc, t1.task_dealloc,
       t2.sql_handle, t2.statement_start_offset, t2.statement_end_offset, t2.plan_handle
FROM (SELECT session_id, request_id,
             SUM(internal_objects_alloc_page_count) AS task_alloc,
             SUM(internal_objects_dealloc_page_count) AS task_dealloc
      FROM sys.dm_db_task_space_usage
      GROUP BY session_id, request_id) AS t1
JOIN sys.dm_exec_requests AS t2
  ON t1.session_id = t2.session_id
 AND t1.request_id = t2.request_id
ORDER BY t1.task_alloc DESC;
Performance counters to watch:
- SQLServer:Databases \ Log File(s) Size (KB)
- SQLServer:Databases \ Log File(s) Used (KB)
- SQLServer:Transactions \ Free Space in tempdb (KB)
- SQLServer:Transactions \ Version Store Size (KB)
- SQLServer:Transactions \ Version Generation Rate (KB/s)
- SQLServer:Transactions \ Version Cleanup Rate (KB/s)
TempDB DMV Resources:
Microsoft’s white paper, Working with TempDB, is an additional resource on TempDB performance.