Here’s to the Ones Working Today

Happy New Year’s Eve!  And no, I’m not looking for a kiss.

Are you working today?  I usually did.  I’ve been in IT for over a decade, and I worked most holidays (not to mention most weekends).  It was the best time to get outage windows to do a lot of things that would help my systems stay online.  Before that, I was in hospitality – hotels and restaurants.  You’d better believe I worked every single holiday.  Every.  Single.  One.

I worked holidays and weekends because that was what my job required.  I wanted to be really good at my job, and that meant being around when my customers and my systems needed me the most.  I wanted to make sure I took good care of everything.

I might have grumbled a little, but truth be told, I was proud of myself.  Every time I pulled into an empty office parking lot after dark and waved my badge to get in, I knew I was going the extra mile.  I didn’t do it for other people – I did it for myself.  It was the right thing to do, and that itself was a reward.

Meet the Fukushima 50

In 2011, a major earthquake struck off the coast of Japan.  The Fukushima Daiichi nuclear power plant suffered a series of catastrophic failures that went from bad to worse to unbelievably horrific.  The reactors leaked radiation that put the employees onsite in serious danger.

You can’t just walk away from a failing nuclear reactor.  It doesn’t shut itself down gracefully.  Trained professionals have to make decisions about mitigating risk, and then they have to do dangerous things to save others.

Fortunately for all of us, the Fukushima 50 were passionate about doing the right thing.  These employees stayed at the plant while others were evacuated.  They put their lives on the line to stop the radiation leaks and save lives.  They were prepared to die to carry out their duties.

Thankfully, we IT professionals are rarely faced with real life-or-death choices.  I hope 2013 doesn’t bring a challenge like that to you, but knowing how many admins work weekends and holidays by choice, I bet you’d do the right thing.

Every Consultant Needs Three Things

Tired of workin’ for the man? Want to live the glamorous life of jet setting around from place to place, working on really challenging problems, and eating at foodie restaurants?  You just need three things.

1. A price.

Right now, people probably don’t put a value on your time.  You can’t exactly put up an hourly rate sign, but you can start to put in some gentle barriers to make sure people respect the worth of your time.

When I was a DBA and people walked into my cube asking me to do something, I pulled up my task list in RTM.  The cool thing about RTM is that it even works on mobile devices, so I can access the exact same task list from anywhere.  I would show them the list and say:

“Here’s what I’m working on right now.  If I push these aside, see the names next to each request?  That’s the person who I’ve promised it to, and here’s the dates when they need it by.  Can you run interference for me and get them to delay the dates on theirs?”

It worked magic – people suddenly understood that there was a cost to my time.  Often, they were even completely willing to pay that cost.  They’d put skin in the game by going to these other executives and bargaining for my time, and they’d be forced to use political favors in order to get what they wanted.  Even though I didn’t profit directly, there was still a new cost to my time, and I wasn’t the one paying the cost.  I stopped acting as the go-between – I left it up to the consumer to pay for my time.

Later on, I got gutsier with meeting invites I received, too.  I’d reply back (without accepting the invite) and say:

There’s not an agenda attached to this meeting invite.  Can you give me a quick rundown of the decisions that we need to make during this meeting?  I’d like to make sure I come prepared, and I might be able to get the work done even earlier.  If you’re not sure what will be discussed, I’ll need to skip the meeting – I’ve got a lot of irons in the fire right now.

When I did get the meeting agenda, I busted my hump to do whatever was required ahead of time, and I’d send it to the meeting holder and copy all of the attendees.  My goal was to give them whatever they wanted without actually having the meeting.  It worked wonders.

But if I didn’t get the answer I needed, I didn’t attend the meeting.  If somebody fired off an email to my boss and said, “Dammit, Brent’s presence is urgently required,” I had my boss trained well enough to ask, “For what deliverable?  He’s really busy.”

2. A service.

In the beginning of my IT career, my service offering was “fixer.”  When something expensive and technical was going to hell in a handbasket, I wanted to be the first number on everybody’s speed dial.  I specialized in reverse-engineering stuff I’d never seen before and figuring out the root cause.

That worked great as a full time employee of small to midsize companies, but it doesn’t work for consulting.  To understand why, you have to know the difference between consultants and contractors.  Consultants advise you on what to do, and contractors do what you tell ’em.  If you’re a great fix-everything guy, you end up as a contractor there for the long term.  (There’s nothing wrong with contracting – but remember, this post is about consulting.)

Over time, I ended up specializing in turning around SQL Servers in bad shape.  If you had a SQL Server problem that nobody else could solve, I was your Winston Wolf.  I got even more specialized and focused on SQL Servers that used storage area networks (SANs) or VMware.  It’s good to have a generalist background, but if you focus your service really tightly, you can do an amazing job at that service.  This is especially true when you specialize in an expensive technology.  If you’re having SQL Server CPU usage problems on a 40-core server, and Enterprise Edition costs $7,000 per core, then my services look pretty darned cheap.
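To put rough numbers on that last point, here's a minimal sketch of the arithmetic.  The core count and per-core price come from the example above; the consulting fee is a made-up figure purely for illustration, not a quote.

```python
# Rough license math for the example in the text: 40 cores at $7,000 per core.
CORES = 40
PRICE_PER_CORE_USD = 7_000  # figure quoted in the text, not an official price list


def license_cost(cores: int, price_per_core: int) -> int:
    """Total per-core licensing cost in dollars."""
    return cores * price_per_core


def consulting_share(consulting_fee: int, cores: int, price_per_core: int) -> float:
    """Consulting fee expressed as a fraction of the licensing cost."""
    return consulting_fee / license_cost(cores, price_per_core)


total = license_cost(CORES, PRICE_PER_CORE_USD)
print(f"License cost: ${total:,}")  # License cost: $280,000

# Hypothetical engagement fee, chosen only to make the comparison concrete.
HYPOTHETICAL_FEE_USD = 14_000
share = consulting_share(HYPOTHETICAL_FEE_USD, CORES, PRICE_PER_CORE_USD)
print(f"Consulting as a share of licensing: {share:.0%}")  # 5%
```

Swap in your own core count and fee; the point is only that the fee tends to be small next to the licensing bill.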

Often, I’m brought into shops where a few local generalist consultants have struggled with a problem for months.  I parachute in, use a few slick proprietary scripts and tools, and get right to the root of the problem in hours.  I’m able to do that because I specialize in just one product (SQL Server) and I know that product forwards and backwards.  It’s the same reason your general practitioner refers you to a specialist doctor when you’ve got ear/nose/throat or back problems – even though it’s all just the body, there are specialized skills for different parts of it.

I don’t wanna fix the printer.  I wanna be the one guy who gets called in when there’s a specialized SQL Server problem – and that’s where the final piece comes together.

3. A reputation.

When people are having a problem, and your skills are the answer, you want them to immediately say to themselves, “Man, there’s only one guy we need to call, and I know exactly who he is.”  It takes a long, long time to build up that reputation.  If you don’t have it, you have to rely on advertising and marketing, and then you’re in competition with a big pile of other consultants who are doing the exact same thing.

You have to start building your reputation right now – and I don’t mean by blogging, I mean with your own coworkers.  When you walk into a meeting, are they excited to see you?  Do other departments call and ask for you by name?  Do they say, “We gotta get so-and-so in here because I just know she’ll take care of this once and for all”?

You can’t get this reputation by being a jerk.  You can’t be the one who has all kinds of rules and always says “NO!”  You have to understand the difference between positive and negative reinforcement, and you’ve gotta use the former way more than you use the latter.

Every coworker and manager you have – they’re your test clients.  Right now, they’re not paying anything at all for your services.  Use them as your test market by becoming an internal company consultant for SQL Server.  If you can get raving fans inside your company, you’ve got a chance at becoming a consultant.

Probably the best gauge of future consultancy success is to ask yourself, “If I quit this job tomorrow, and I offered my former users a contract with a price and a service, would they make budget room for me?”  Don’t think of asking your manager, because one of your manager’s jobs is to make you feel welcome and loved no matter how bad your personal skills are.  Think about the users.  They’ve got real budgets, real business needs, and real feelings that they’ve probably expressed to you.  If they’d gladly – excitedly – hand you their budget money, then you’re ready to take a shot.

If not, go buy the book Secrets of Consulting: A Guide to Giving and Getting Advice Successfully (or the Kindle version).

sp_BlitzIndex® Holiday Week Edition

Only one thing could have dragged me away from the soft glow of the electric leg lamp glowing in the window… sp_BlitzIndex®

It’s one of those weeks when things get nice and slow. Your business users and managers are all out of the office due to holidays. Your inbox is blissfully quiet. You get a few moments to step back, make sure everything’s running, and catch up on a few of those things you just never have time to look into.

This was always one of my favorite weeks of the year as a production DBA. This isn’t a week when you want to do anything risky, but it’s the perfect week to learn about how your database servers have been running.

It’s Time to Check Your Index Sanity

Good news – unless you’ve just restarted all your SQL Servers, they’re still caching tons of information about your index performance. Even though it’s a quiet week, now is a great time to check on your indexes – is there something crazy hiding in your schema that you need to devote some time to in the new year?

Our free sp_BlitzIndex® stored procedure is designed to give you insight into your index schema and performance. sp_BlitzIndex® rolls through your database and looks for potential gotchas and issues with your indexes. It reads only metadata (no use of heavy DMVs or anything that needs to scan pages in your data tables themselves), then diagnoses what looks like it may get a little bit crazy in your database – everything from heaps with active deletes to multi-column clustered indexes.

sp_BlitzIndex® version 1.4 is Out

Just in time for the holidays, sp_BlitzIndex® is out with fresh updates.

The biggest new feature is that sp_BlitzIndex® now diagnoses “Abnormal Psychology” in your indexes. This diagnosis finds indexes of specific types that require special handling and alerts you to their existence. We let you know if we find indexes using page or row compression, or columnstore, XML, or spatial indexes.

sp_BlitzIndex® also includes a few fixes for bugs users reported after our “Instant Index Insight” webcast. The stored procedure now works no matter what the default collation of your SQL Server instance is, and we added a few lines of code that prevent problems if you’re using any default user settings that might produce a problem.

What if You Find Something?

If you find that your indexes are hiding legions of problems, don’t panic. This isn’t the week to panic – this is the week to observe, learn, and plan. Each diagnosis has a URL so that you can learn more about the diagnosis. Spend some time with us and dig into the issue, then plan to make improvements in 2013.

How to Get Started

To spend your week getting to know your database schema and index performance better, get sp_BlitzIndex® from our download page.

Which SQL Server MVP Has Two Thumbs and a Hadoop Certification?

I’m a Microsoft SQL Server MVP – I like to talk about SQL Server a lot. But, as Brent loves to point out, I really like data; I’m open to alternative database lifestyles like PostgreSQL or NoSQL when they solve a business problem. And, frankly, I like some of these databases so much that I’m using them to build stuff for clients; I went so far as to become a Cloudera Certified Developer for Apache Hadoop this week.

How I’m Using Hadoop and Hive

“What kind of information gold mine are we sitting on?” That’s the question one of our clients was asking themselves earlier this year. The client had been tracking users’ search parameters for several years. Over time the data grew to where it was impossible to query the search logs without bringing the line of business application to its knees. Faced with the prospect of buying a second SQL Server for analytics, they were considering trimming data out of the logging database.

When I sat down with the client, they said “We want to get a better understanding of how users are interacting with the site, the types of searches being performed, and uncover richer information around product pricing. We just can’t answer these questions right now.” I talked through different options like a relational data warehouse or SQL Server Analysis Services before we settled on using Hive hosted in Amazon’s Elastic MapReduce (EMR). Using Hive hosted in Elastic MapReduce lets the business meet their goals while minimizing costs – the entire Hive cluster is turned off once data processing is done.

Money is important to businesses – everyone wants more of it and nobody wants to spend any of it. When faced with the idea of buying a second server, a second SQL Server license, and a second set of really fast disks, the client balked. By using Hive hosted on EMR we are able to run the data warehouse on demand and only pay for the resources used – this keeps costs under $200 per month.

How I Approach An Engagement

I love new technology, but that doesn’t mean I view it as a cure all. As I worked with the client, we focused on understanding the data and the business’s questions before proposing a solution. One of the most important parts of our conversation was focusing the scope of questions into different buckets – the majority of the questions were traditional data warehouse queries.

When we began the process, we used a list of questions to kick off our investigation.

  • What are the current problems you have querying this data?
  • Just how much data are we talking about?
  • What types of queries do you need to answer?
  • How does this data interact with the rest of your data?
  • How will this data be consumed?
  • What does your team’s skill set look like?

Once we went through the list of questions, I took the client’s requirements and technical experience and used that to find the best fit product for the business. In this case, the solution was Hive running on top of Elastic MapReduce. I discussed the pros and cons of the approach and once I had the go ahead on a technology choice, I started building a prototype of the data warehouse that the business could continue to build on using their existing querying skills, without having to learn new technologies and platforms.

How I Can Help You

In this case, I was able to help a business get started on the Hadoop platform using Hive. If your company is like most companies, you’re probably asking yourself questions like “Is Hadoop or Hive right for us?”, “How could we get started with this project?”, or “How would Hadoop or Hive fit into our current environment?” This is where I can help out – I will work with your team to create a plan that meets your goals and works with the existing skills and technology that you have on hand, and then build a high level road map. I can even help you prototype your first system using less than $200 in computing time in Amazon, no servers required. Contact me to set up a time to talk.

Hadoop Revisited (Video)

There’s a lot of buzz around Hadoop. If you’re like most people, you’ve looked into Hadoop and found a bewildering array of products, terms, and technology surrounding Hadoop – Hive, Pig, HDFS, MapReduce, HBase, co-processors, ZooKeeper, etc. Knowing where to start can be daunting. Trying to make sense of the Hadoop world is certainly possible, but takes a lot of time. Thankfully, I’ve gone ahead and done the work for you.

This is an update of last year’s presentation Hadoop Basics for DBAs.

What is Hadoop?

Hadoop is a group of tools to help developers create bigger tools. More specifically, Hadoop is a basic set of tools that help developers create applications spread across multiple CPU cores on multiple servers – it’s parallelism taken to an extreme. Although Hadoop is a set of tools and libraries, there are a number of products that are lumped into the same bucket as Hadoop and, frequently, they’re all referred to as Hadoop. Instead of describing every piece of Hadoop in detail, I’m going to focus on the functionality that’s of the most interest to SQL Server professionals.

Data Warehousing in Hadoop

If you need to work with big data, Hadoop is becoming the _de facto_ answer. But once your data is in Hadoop, how do you query it?

If you need big data warehousing, look no further than Hive. Hive is a data warehouse built on top of Hadoop. Hive is a mature tool – it was developed at Facebook to handle their data warehouse needs. It’s best to think of Hive as an enterprise data warehouse (EDW) – Hive can be used to research complex interactions across your company’s entire history of data. In exchange for that power, you have to accept that queries will return in minutes. Unlike traditional EDWs, Hive is spread across tens, hundreds, or even thousands of commodity grade servers.

Hive was designed to be easy for SQL professionals to use. Rather than write Java, developers write queries using HiveQL (based on ANSI SQL) and receive results as a table. As you’d expect from an EDW, Hive queries will take a long time to run; results are frequently pushed into tables to be consumed by reporting or business intelligence tools. It’s not uncommon to see Hive being used to pre-process data that will be pushed into a data mart or processed into a cube.
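To make the "push results into tables" pattern concrete, here's a small sketch. It does not run against Hive: Python's built-in sqlite3 module stands in for the warehouse so the example is self-contained, and the table and column names (search_log, search_summary, term, max_price) are hypothetical. The shape of the work is the point: a slow aggregate runs once, and the reporting tool reads the small summary table afterwards.

```python
import sqlite3


def build_summary(conn: sqlite3.Connection) -> None:
    """Rebuild the summary table from the raw search log in one batch step."""
    conn.execute("DROP TABLE IF EXISTS search_summary")
    conn.execute(
        """
        CREATE TABLE search_summary AS
        SELECT term,
               COUNT(*)       AS searches,
               AVG(max_price) AS avg_max_price
        FROM search_log
        GROUP BY term
        """
    )


# In-memory database standing in for the warehouse (assumption: not Hive itself).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_log (term TEXT, max_price REAL)")
conn.executemany(
    "INSERT INTO search_log VALUES (?, ?)",
    [("shoes", 50.0), ("shoes", 70.0), ("hat", 20.0)],
)

build_summary(conn)

# A reporting tool would read the pre-aggregated table, not the raw log.
rows = conn.execute(
    "SELECT term, searches, avg_max_price FROM search_summary ORDER BY term"
).fetchall()
print(rows)  # [('hat', 1, 20.0), ('shoes', 2, 60.0)]
```

In a real Hive deployment the aggregate would be written in HiveQL and the batch step might take minutes rather than milliseconds, which is exactly why you run it ahead of time.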

While Hive can operate on large volumes of data, it’s not the most efficient tool: Impala seeks to overcome some of the limitations of Hive by making better use of CPU, memory, and disk resources. Impala operates more like SQL Server – data is cached in memory to improve query performance. Although Impala uses a different query engine than Hive, it uses data that’s already in Hadoop, making it easy to query massive amounts of data without having to store your data twice.

Both Impala and Hive are great for businesses querying large amounts of data while avoiding the expense of massively parallel EDW solutions like Microsoft SQL Server PDW or Oracle Exadata. Hive is in a stable release cycle and, although Impala is still a beta product, many organizations are deploying one or both solutions to tackle their largest workloads.

Data Flow

SQL Server professionals are familiar with using SQL Server Integration Services (SSIS) to move data around their organization. SSIS provides a rich set of functionality for manipulating data, but it’s difficult to make SSIS operations run across multiple CPU cores, much less multiple servers.

Pig is a tool for creating parallel data workflows. Pig takes advantage of the Hadoop tools to provide rich functionality across huge amounts of data. Pig makes it easy to perform step-by-step data manipulation over large data sources using a combination of different tools and functionality. There are a number of great reasons to use Pig (parallel processing, sampling, and loose schema requirements), and it’s safe to say that Pig is a great tool for processing data with Hadoop.

Deep Analysis of Data

SQL Server professionals are used to having analytic insight available, either through SQL Server’s windowing functions or through SQL Server Analysis Services. Although Hadoop doesn’t natively provide tools for OLAP style cubes or for windowing functions, it’s possible to gain insight from your data using Hadoop. Unfortunately, deep analytics are not Hadoop’s strong suit out of the box. Teams looking to take advantage of large scale data analytics will be doing a lot of heavy lifting themselves.

Mahout is a set of libraries that can be used to distribute analytics around a cluster, but there are limitations to the flexibility and interactivity of Mahout. Developers looking for the ad hoc interactive capabilities of SQL Server Analysis Services (or even of a relational data warehouse) will be disappointed. Bulk computation can be performed disconnected from users, but Mahout and Hadoop don’t provide any kind of ad hoc querying capability.

Real Time Querying

So far, all of the use cases we’ve explored have been based on distributed batch processes and large scale querying. Even though Impala is a vast performance improvement over Hive, Impala is still responding in a matter of several seconds to several minutes – hardly fast enough for interactive querying. Databases are used for more than running massive reports, and this is where HBase comes into play.

HBase is a real time, random access, read-write database built on top of Hadoop. This isn’t a database like SQL Server with tables and joins; HBase is a NoSQL database that’s loosely based on Google’s BigTable database. There are tables, there are columns, but the schema isn’t as rigid as a relational database. Developers will be able to solve many problems with HBase, but there will be a bit of a learning curve as they understand the data model and update their data structures to work effectively with HBase. Data stored in HBase can even be queried through Hive or Impala making it possible to combine transactional and reporting data in the same Hadoop cluster – the scale and redundancy of Hadoop make it easier to reduce load on any single system and avoid many problems associated with reporting from a transactional data source.
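If the "tables and columns, but not rigid" idea is hard to picture, here's a toy in-memory model of it. This is not the HBase client API, just a sketch under a few assumptions: column families are declared up front, each row can carry whatever columns it likes inside a family, and rows are read back in key order. The class name ToyTable and the sample row keys are made up for illustration.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a column-family table; not the HBase client API."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed when the table is created
        self.rows = defaultdict(dict)   # row key -> {"family:qualifier": value}

    def put(self, row_key, family, qualifier, value):
        """Write one cell; qualifiers are free-form, families are not."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        """Random access by row key; a missing row comes back empty."""
        return dict(self.rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in sorted key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, dict(self.rows[key])


searches = ToyTable(families=["search"])
searches.put("user1#001", "search", "term", "red shoes")
searches.put("user1#002", "search", "term", "hat")
searches.put("user1#002", "search", "max_price", "50")  # only this row has the column
searches.put("user2#001", "search", "term", "blue shoes")

print(searches.get("user1#002"))
print([key for key, _ in searches.scan("user1", "user2")])
```

Notice there's no join anywhere: the row key design (here, user plus a sequence number) decides which questions are cheap to answer, and that's the bit of the learning curve that catches relational folks.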

When Should You Use Hadoop?

Ultimately, you’re looking for an answer to the question “When should I use Hadoop?” This is a difficult question to answer. Hadoop may make sense for part of a workload, or even for all of it. The best way to answer it is to start by looking at your environment and asking questions like:

  • Can I keep my data on a single instance?
  • Can I keep my data on a single instance and do it cheaply?
  • Are my queries running fast enough?
  • Do I need complex, interactive, ad hoc analytics?
  • What type of latency is acceptable between data arrival, analysis, and queryability?

Understanding your workload is critical to determining if you’ll be able to use Hadoop to meet your needs. Having realistic expectations of Hadoop is equally critical. No part of Hadoop will solve all of the problems an organization is facing. Hadoop can mitigate some problems, but it presents a different set of challenges – being comfortable with the limitations of Hadoop will go a long way toward having a successful implementation.

The First Step to the Poor Man’s Runbook

In theory, before you introduce a new system – database server, load balancer, virtualization infrastructure, etc – you build a robust runbook that documents how you’ll handle every conceivable scenario.  When there’s any kind of failure, you’ll simply turn to chapter X and start going through a precise checklist that will guide you to the promised land of uptime.

Yeah, right.  In reality, you’re behind the 8 ball.  Everybody wants to go live with brand spankin’ new technology right now – even if we have absolutely no experience troubleshooting it.  Do it live, they say.

Here’s the easy way:

  • Find a room with a big whiteboard and a projector
  • Gather one person from each team (networking, systems, database, app, etc)
  • Connect to the system in question via remote desktop or whatever
  • Write a list on the whiteboard of every component involved

For example, on a SQL Server 2012 AlwaysOn Availability Group system, I connect to Failover Cluster Manager and list all of the components:

  • Servers
  • Drives (local, SAN, quorum if applicable)
  • IP addresses
  • Services (local & clustered)

For each component, ask:

  • When it fails, what will the symptoms look like?
  • How will it affect the system as a whole?
  • When we suspect that the component failed, who do we call to troubleshoot it further?
  • How long will we wait for them to figure out if it’s broken?
  • After that time, what’s our Plan B?

If we wrote down all of the answers, we’d have a runbook – but remember, we’re probably under the gun, so we probably won’t produce something that good.  That’s completely okay.  Let’s just get started by thinking through the complexity of the system and envisioning what failure might look like.

In complex systems, nothing ever fails in a way that’s completely obvious and intuitive.  There’s no warning message in the event log that says, “The root cause is that Bob in Accounting decided to grab your cluster’s admin IP address for his new virtual server.  Go tell Bob to get his own unique IP address, and everything will be fine.”  Even if you’ve never experienced a failure like that, you might be able to recognize the symptoms if you imagine what a cluster admin IP failure would look like.  Document that, and you’re on your way to a killer runbook – which means faster recovery and easier troubleshooting.
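If the whiteboard photo is all you walk away with, that's fine.  But if somebody wants to type it up, here's one way the answers could be captured so the gaps stay visible.  It's only a sketch: the field names mirror the five questions above, and the sample entry's wording (symptoms, contact) is invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ComponentEntry:
    """One whiteboard row: a component plus the answers to the five questions."""
    name: str
    symptoms: str = ""                   # When it fails, what will the symptoms look like?
    impact: str = ""                     # How will it affect the system as a whole?
    contact: str = ""                    # Who do we call to troubleshoot it further?
    wait_minutes: Optional[int] = None   # How long will we wait for them?
    plan_b: str = ""                     # After that time, what's our Plan B?


def unanswered(entry: ComponentEntry) -> List[str]:
    """Return the names of the fields nobody has filled in yet."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) in ("", None)]


# Hypothetical entry based on the cluster admin IP scenario; the wording is made up.
cluster_ip = ComponentEntry(
    name="Cluster admin IP address",
    symptoms="Cluster name stops responding; duplicate IP warnings on the network",
    contact="Networking team",
)

print(unanswered(cluster_ip))  # ['impact', 'wait_minutes', 'plan_b']
```

Leaving fields blank is allowed on purpose.  A half-finished entry with the gaps called out still beats no entry at all when you're under the gun.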

“Don’t Touch That Button!” Four Dangerous Settings in SQL Server (video)

Every software product has its gotchas. SQL Server has some settings which sound like a great idea but can cause major problems for performance and availability when used improperly. In this 30 minute video, Kendra Little takes you on a tour of SQL Server’s most dangerous settings, from priority boost to lightweight pooling. She’ll explain why you need to be cautious and how you can check if your SQL Servers are configured safely.

Looking for the links discussed in the video? Scroll on down the page…

Hungry to Learn Even More?

Dig into these links from the video.

Priority Boost

Read the fine print on Priority Boost in Books Online here.

CPU Affinity Mask and I/O Affinity Mask

Check out Bob Dorr’s explanation of the problems with using CPU and I/O Affinity Masking here.

LightWeight Pooling, aka Fiber Mode

Read “The Perils of Fiber Mode” by Ken Henderson here.

Instant Index Insight: How to Use sp_BlitzIndex® (video)

You probably don’t have enough time to dig through DMVs trying to figure out which indexes you should add or drop. Whether you’re a DBA or developer, junior or senior, you’re probably too busy doing your real job to master all the index best practices – and now you don’t have to. In this 30 minute video, Kendra Little introduces you to sp_BlitzIndex®, a free script which you can immediately run to see if your indexes are healthy, or if they are heading towards insanity. Want to try out the tool? Check out sp_BlitzIndex®.

sp_Blitz® v16: Snapshots, Recompiles, ShrinkDB, and More

I don’t blog every release of sp_Blitz® (we pushed v15 out silently with a few bug fixes) but we added a lot of improvements and fixes in this version – and by we I mean you.  After I blogged about v14’s release earlier this week, that encouraged a lot of people to come out of the woodwork and contribute code.  I’m still going through all the submissions and adding ’em in, but I’m pushing this one out the door now because it’s got some cool stuff:

    • Chris Fradenburg @ChrisFradenburg added check 81 for non-active sp_configure options not yet taking effect and improved check 35 to not alert if Optimize for Ad Hoc is already enabled.
    • Rob Sullivan @DataChomp suggested adding an output variable, @Version, to manage multiple-server installations.  This way you can query all your servers and get back what version they currently have installed.
    • Vadim Mordkovich added check 85 for database users with elevated database roles like db_owner, db_securityadmin, etc.
    • Vladimir Vissoultchev rewrote the DBCC CHECKDB check to work around a bug in SQL Server 2008 & R2 that reports dbi_dbccLastKnownGood twice.
    • We added checks for database snapshots, stored procs with WITH RECOMPILE in the source code, Agent jobs with SHRINKDATABASE or SHRINKFILE in the steps, and a check for databases with a max file size set.
    • We added the @CheckServerInfo parameter (default 0). It adds server inventory data in checks 83-85 for things like CPU, memory, and service logins. None of these are problems, but if you’re using sp_Blitz® to assess a server you’ve never seen, you may want to know more about what you’re working with. (Kendra’s idea!)
    • Tweaked check 75 for large log files so that it only alerts on files > 1GB.
    • Fixed a few case-sensitivity bugs.
    • Added WITH NO_INFOMSGS to the DBCC calls to ease life for automation folks.  I was surprised by the number of requests we got for this – turns out a lot of people are doing widespread patrols of their servers with sp_Blitz®!
    • Works with offline and restoring databases. (Just happened to test it in this version and it already worked – must have fixed this earlier.)

    If you’d like to contribute code, contact us.  Pro tip: if your code is written in a way that I can just copy/paste into sp_Blitz®, it’ll get published a lot faster.  I get a lot of contributions that are various DMV queries, but if I have to rework it to handle multiple databases simultaneously, work differently for 2005/2008/2012, and handle case-sensitive collations, then it takes me much longer to implement (sometimes months).

    You can download sp_Blitz® now and stop getting surprised by your SQL Server’s hidden past.  Enjoy!

Whither Hadoop?

Where Can You Use Hadoop?

“Where can you use Hadoop?” isn’t an easy question to answer. An enterprising or creative person could probably figure out ways to replace the entire database infrastructure with various components of Hadoop. I’m sure it’s being done right now and I’m sure that someone is being incredibly successful with it.

Asking the question “where can I do XYZ” will inevitably lead to the answer “everywhere… if you’re creative!” There’s a better question that we can ask.

Where Should I Start Using Hadoop?

Let’s face it: Hadoop is something that you should start thinking about. Microsoft is clearly investing in Hadoop as part of their enterprise data warehouse products. Microsoft has partnered with Hortonworks to bring Hadoop to Windows.

One of the most obvious places to implement Hadoop is for ETL processes. ETL jobs are typically difficult to tune – data is streamed from an OLTP data source, processed in memory, and then streamed to another data source. Tuning the process to run faster on a single machine requires specific skills – a good ETL expert knows T-SQL, SSIS, and more than a little bit of .NET. These are important skills for an ETL expert to have; but we don’t always need an expert to get the job done.

How Can I Start Using Hadoop?

What if you could make a process run four times faster by running it on four computers? This is the basic premise of Hadoop – workloads are made faster by splitting them across multiple workers. Just as SQL Server splits a query across multiple threads, Hadoop is able to parallelize across multiple computers and each computer may parallelize the work across multiple threads.
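Here's a minimal sketch of that premise in plain Python, not Hadoop itself.  The log lines are dealt out to a handful of workers, each worker counts search terms in its own chunk (the map step), and the partial counts are merged at the end (the reduce step).  Threads stand in for separate machines here, so don't expect a real speedup from this toy; the function names and sample data are assumptions for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count search terms in one chunk of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def split_into_chunks(items, workers):
    """Deal the items out into `workers` roughly equal chunks."""
    return [items[i::workers] for i in range(workers)]


def count_terms(lines, workers=4):
    """Run the map step on every chunk in parallel, then reduce the results."""
    chunks = split_into_chunks(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    logs = ["red shoes", "blue shoes", "red hat", "green hat"]
    totals = count_terms(logs)
    print(totals["shoes"], totals["blue"])  # 2 1
```

The important property is that map_chunk never needs to see another worker's data.  That independence is what lets the same job spread across four computers instead of four threads.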

We can take advantage of Hadoop’s easy scale out without really changing our tools. There’s a tool called Hive – it sits on top of Hadoop and translates SQL into MapReduce jobs in the back end. Hive isn’t going to be useful for real time querying, but it gives us the ability to perform transformations on huge amounts of data using a familiar language. If we need custom functionality, we just track down an enterprising developer to write a custom function. Just like SQL Server, it’s easy to grab custom functions from another source, install them, and use them in queries.

Where Else Can Hadoop Help?

While ETL is an obvious place to start using Hadoop, there are other places where we can start using Hadoop. Just like SQL Server, Hadoop is a rich ecosystem – it’s more than a one dimensional tool. Portions of Hadoop can be used to create a distributed file system, machine learning tools, data processing frameworks, and large scale random read-write data. You can use Hadoop to scale your data needs in many different directions. The most important thing is to pick a single pain that you’re having – typically ETL or reporting – and experiment with using Hadoop to make things faster or operate at a much bigger scale.

Want to Know More?

If you’d like to know more, make sure you check out my video on Hadoop Revisited.

Interested in learning more about Hadoop? Check out our Introduction to Hadoop training class.