Blog

Two DBAs Walk Into a Bar

8 Comments

Every year, Quest Software convenes a Customer Advisory Board at the home office in Aliso Viejo, California.  We bring some of our highly experienced customers out for a couple of days to talk about what’s happening with technology, and good times are had by all.

At the welcome reception, I was struck by how many of them had a great sense of humor.  I was standing in a room with a couple dozen seasoned and talented IT people, yet jokes were flying around left and right.  Let’s face it, us geeks aren’t exactly renown for our ability to light up a room.  We can be funny-ha-ha in private or on the intertubez, but this was a group of near-strangers that were cutting things up.

Funny Ha Ha
Funny Ha Ha

The morning after, as I was thinking back about it, I figured there were two possibilities:

Option 1: We Hand-Picked Funny Customers

If we hand-picked the comedians, then I bet we’re not the only company who would do something like that.  I bet funny people are more likely to get invited to events like this because they’re more enjoyable to be around.  After all, who wants to invite a bunch of grumpy people to hang out in a conference room for a few days?  If I wanted that, I’d just call a team meeting. <rimshot>

For the record, funny customers doesn’t mean happy customers – we’re not surrounding ourselves with yes-men by any means.  To force the products to get better, we have to hear brutally honest and honestly brutal feedback.

Option 2: Funny People Are More Likely to Succeed in IT

Who gets promoted into higher-level IT positions like DBA management?

To answer that question, back up and ask yourself who does the actual promotions.  When was the last time the CIO came to you and said, “Hey, I’m thinking about appointing a new DBA manager.  Who’s the best person for the job?”

Promotions aren’t handled by underlings – the higher-ranking folks pick and choose who they promote.  For better or for worse, they often don’t do it with the feedback of the rest of us.  They think they know what we think and who we like, and they might take it into account, but they may not.  However, they know exactly who they like – they like people who are easy to get along with, easy to interact with, and easy to sit in a conference room with.  Sadly, management involves meeting after meeting after meeting.  Why bring bores to the meeting?

Either way, if you wanna get ahead, have a sense of humor about what you’re doing.  If you think things are too serious in your job now, then brace yourself, because problems get worse as you go up the corporate ladder.  Getting promoted to management means worrying about who you’re going to lay off, how you’re going to handle Johnny’s drug problem, or how you’re going to decide who gets a raise.  Employees are much harder to manage than technology, because servers don’t bring a gun into the office to take revenge.

Life is short and workdays are long.  To get the most out of both, get yourself a sense of humor pronto.  Stop taking yourself so seriously, and people will be more likely to invite you to fun stuff and bring you up the corporate ladder.

(Note – I’m on vacation, so I probably won’t be responding to comments for a few days. I scheduled this post ahead of time. I’m on a sailboat in Lake Michigan, and I’ll respond if I get within wireless range. And yes, I’m bringing my laptop.)


Nice Overalls You Got There

Humor
6 Comments

Neither snow, nor rain, nor heat, nor gloom of night stays long-distance sailors from the swift completion of their appointed rounds.  Chicago-Mac race sailors stand a pretty good chance of running into bad weather, so I picked up a set of foulies.

Not Pictured - Dignity
Not Pictured – Dignity

Pictured here: bibs, just one part of the foul weather gear.  Sailing in foul weather has its own unique set of needs.  Of course the gear has to be waterproof because sailors get soaked from spray and rain, but the bibs also need heavily reinforced knees and rear ends.  (Insert sailor joke here.)  Sailboats have rough non-skid coatings all over the deck so folks don’t slip off the boat.  Unlike powerboats, sailboats spend long periods of time heeling sideways due to the strength of the wind, so traction is important.  The same rough surfaces that help Brent stay on the boat also wreak havoc on knees and butts.

The overalls make me look like a circus freak.  The photo would lead you to believe that I’m seven feet tall.  Everybody looks this way, though – at least, that’s what the salesman said when he wasn’t laughing.

Not pictured: a red foul weather jacket.  Word has it that yellow attracts biting flies.  We’re doing 4-hour shifts, and if it’s raining, that means four straight hours of trudging around a sailboat in the rain and spray.  To make that misery more enjoyable, the jacket has a built-in interior iPod pocket with cable routing for headphones.

Other goodies: 3/4 finger gloves (to keep your fingertips free to tie knots), boots, and a Tyvek jumper to keep the flies off in light weather.

All in all, I spent a disturbing amount of money, but I want Mother Nature at my side at the gambling table.  I don’t want to bet on dry, calm weather in Lake Michigan.  To find out how my bet went, let’s take a look at the current satellite map for the first leg of our journey, going from Whitehall (top right of the map) to Chicago (bottom left) for the race start.

Current Radar for the First Leg
Current Radar for the First Leg

Yep, looks like I did okay there.

Tracking Our Progress

We’re heading out late this afternoon from Whitehall, Michigan to Chicago.  Google Maps shows the car route, but they don’t have routes for sailors, oddly.  I’ll check in when we get to Chicago in a day or so.

Starting Friday at around 3pm, you’ll be able to track each boat’s progress.  We’re aboard the Hannah Frances.  When in doubt, look towards the back of the pack.  Don’t let Friday’s progress fool you – we’ll be pulling ahead only because they let the cruising boats start a day early.  (I wouldn’t be surprised if tracking wasn’t turned on until Saturday morning for the race boat start, either.)

Other links:

And now, I’m off to the sailboat!


SQL Azure Pricing: $10 for 1GB, $100 for 10GB

Microsoft Azure
23 Comments

Microsoft’s pricing for SQL Server in the cloud, SQL Azure, has been announced.  It’s free for now, but around November it’ll be:

  • $9.99 per month for 0-1GB
  • $99.99 per month up to 10GB
SQL Azure
SQL Azure

There’s currently a 10GB maximum size cap for SQL Azure.  For larger data storage needs, you’ll need to break the database into smaller sizes.

Scaling SQL Azure Applications

If you think you’re going to need 100GB in the near term, it probably makes sense to break your application up into multiple separate databases from the get-go (10 x $9.99 = $99.99 anyway) and just make really sure none of the individual databases exceed 10GB.  Ugh. I’m surprised by this strategy because Microsoft’s never been pro-sharding before, and now they’re asking database developers to make that jump.  One of the things I love about SQL Server is that it scales so well: we can start with just a single database and grow it like crazy. DBAs who like that approach will not like Azure’s 10GB-per-database limits.

I’ve worked with companies who spin up a new database for every new client, copying the same database hundreds or thousands of times, and that has some serious maintenance challenges.  Those companies will love SQL Azure’s scaling model, though; if each client has their own SQL Azure database, then it’s easy to pass the database hosting costs on to your clients.

Beep Beep, Back That Database Up

The bandwidth costs for SQL Azure are $.15 per GB of outbound bandwidth.  Assuming that you don’t compress the data before you pull it out of the cloud, that means daily backups of a 1GB database will add another $4.50 per month, and a 10GB database will add another $45/month.  Daily backups will cost about half of what your monthly service charges cost.

It’s not completely clear from the press release, but if Microsoft follows Amazon’s pricing model, bandwidth between the Microsoft cloud services will not incur a cost.  That would mean it might make sense to spin up an Windows Azure computing application for $.12 per hour, use that application to compress your SQL Azure database, and then send the compressed data off to Azure storage for backup.  That would eliminate the data in/out costs, and minimize the Azure storage costs ($.15/GB).  Database administrators would back up their SQL Azure data to Azure Storage, keep a history of backups there, and restore them to SQL Azure faster when needed.

Of course, there’s no native backup support in SQL Azure, and it’s not clear whether Windows Azure will include tools like SQL Server Integration Services.

SQL Azure Service Level Agreements (SLAs)

The cloud isn’t 100% reliable, as the CloudComputing Incidents Database attests.  The press release gives SLAs for Azure computing and storage, but not SQL Azure.  The computing SLA says:

“For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains your Internet facing roles will have external connectivity at least 99.95% of the time. Additionally, we will monitor all of your individual role instances and detect within two minutes when a role instance’s process is not running and initiate corrective action.”

They’re essentially pushing the SLA back onto you: you need to make sure to spin up two web servers in two different datacenters, and only then they’ll guarantee 99.95% availability.  That’s 4.38 hours of downtime per year, 21.56 minutes of downtime per month.  Note that they didn’t distinguish between planned and unplanned downtime.

The storage SLA says:

“For storage, we guarantee that at least 99.9% of the time we will successfully process correctly formatted requests that we receive to add, update, read and delete data. We also guarantee that your storage accounts will have connectivity to our Internet gateway.”

Note that bold part, “that we receive” – that’s because the second part of the SLA guarantees that your storage accounts will have connectivity.  Nice.  The three-nines rate means 8.76 hours of downtime per year, 43.2 minutes per month.

If you’re a DBA scoffing at these costs and availabilities, make sure you have your own numbers ready before your boss asks.  Sooner or later, your boss is going to compare costs and SLAs in the cloud with your own internal costs and SLAs.  You need answers.

My Internal Costs vs SQL Azure

A quick back-of-the-envelope calculation using a 2-cpu server with 8gb ram, SATA drives, Windows 2008 and 2 CPUs of SQL Server Standard at list price puts me around $15,000.  (Yes, enterprises get dramatically discounted stuff, but enterprises don’t need SQL Azure at rack price either.)

Let’s say I use this server for three years – that’s $416 per month.  That does not include:

  • Power
  • Connectivity costs (but neither does the Azure $9-$99 price, either.  Remember that bandwidth costs extra for Azure.)
  • Management (but neither does Azure, since you still have to roll some of your own utilities.  Remember that Azure doesn’t support things like Profiler.)
  • Backups (but neither does Azure, and no, Microsoft telling me “it’s backed up” doesn’t count.)
  • Clustering or geographic high availability.  I probably wouldn’t achieve three nines of uptime with this configuration, but if I wanted to go for that, I’d add a second server in another location with SQL Server’s database mirroring.

The tough part of all this is the future:

  • Will SQL Azure’s costs go down? Hardware prices always go down, so it’s interesting to try to compare long-term pricing between the two.
  • Will SQL Azure add more features? I can back up a locally hosted database easily, but backing up Azure is going to be a little tricky for now.  If I want to add filestream data or TDE, that’s a piece of cake with local databases, but not with Azure.
  • Will SQL Azure stick? If I had a dollar for every piece of technology built then Microsoft abandoned, I’d be Steve Jobs.  The nice thing about developing for SQL Azure is that it’s a subset of SQL Server anyway.  Worst case scenario, Microsoft abandons SQL Azure – you just light up your own SQL Server and deploy your app there anyway.

Learning More About SQL Azure

For more information about the charges that will be appearing on your credit card next year:


Why Excel in the Browser Matters to DBAs

3 Comments

Today at the Worldwide Partner Conference in New Orleans, Microsoft announced a free web-based version of Microsoft Office 2010.  Much like Google Docs, Office users will be able to collaborate on the same document at the same time from anywhere, even without Office installed.  Users will be able to edit documents from IE, Firefox, Safari, even iPhones.

I’m completely sold on cloud-based collaboration technologies, and I’ve been an avid user of Google Docs for years.  It helps me break down walls between internal and external collaborators.  Right now, as we speak, several of us are editing a Google Docs spreadsheet to vote on the winners of the Best Thing I Learned At PASS contest.  I was able to set up a new document, share it with several people, and get collaboration working without any VPN problems, permissions, file sharing difficulties, you name it.  Presto, any one of several people can edit my spreadsheet.

Now replace “contest results” with “P&L statement”, and start asking questions.

Picture your CFO sharing the P&L results with several high-ranking executives.  Are all the cells locked?  Are you sure?  What happens when someone changes numbers?  How good is your change tracking?  Where did those sales figures come from?  Can you show the family tree for any given cell of data?

When I was a data warehouse administrator, end users constantly came up to me and said, “The data’s wrong in the data warehouse.  This here number on my report can’t possibly be right.”  To troubleshoot it, I had to ask:

  • What’s the calculation used to build this number?  Are we sure we’re summing when we’re not supposed to average?
  • What’s the SQL query used to get the data?  Are the joins right?  Are the group-bys right?
  • Where did the underlying data come from?  Did last night’s ETL bring in some bogus data, or is the data wrong in the source system?

Reports built by end users are troublesome enough, but when data can be shared and edited by anyone under open circumstances, change management gets more challenging.  In the future, us database administrators will need to pay even more attention to how reports are built and where those numbers are coming from, lest we get thrown under the bus for spreadsheet edits by L337CPA.


What I Want Versus What I Can Afford

Professional Development
35 Comments

I got an interesting comment on my article about the Top 10 Interview Questions for Senior DBAs.  AngryDBA said:

“Man, I don’t mean to sound harsh but..you’re an expert DBA? I’m guessing you wouldn’t survive one of my interviews. I only expect the candidates to get 50%. I’ve had them all too. PhDs, Masters in blah-blah-blah, Captain of the Patterns Team at Yale majoring in C#, writing joins since she was in Pampers. Uh huh.”

Answering this requires more than just a comment.

My Dream House Checklist

My Kind of Yard
My Kind of Yard

I use RememberTheMilk to store a lot of stuff, including my list of things that I’m looking for in my dream house. I’m planning long term, baby! It includes things like:

  • No ground-level windows – we don’t like people looking in or breaking in.  We like high windows on the ground floor (like 7′ up).  The other floors can have floor to ceiling windows, though.
  • On the water, but no nearby vegetation – we hate bugs.  Ideally, there wouldn’t be a plant on the property.  I’m not kidding.
  • Within walking distance of at least 2 restaurants – Erika and I really enjoy eating at restaurants.  It’s nice to have other people take care of everything once in a while.  Okay, often.
  • Gadget nook in the entryway – I like having my electronic gear near the door so that I can grab it on the way out.  Ideally, I’d have an inset nook in the wall with electric outlets, and the whole thing would be concealed so that wires weren’t dangling all over the place.
  • Lots of AC ventilation in the kitchen – we love a really cold house.  I’m talkin’ 65 degrees.  Erika loves to cook, but the kitchen always gets hot, because no house’s AC is ever designed to pump that much cold air into the kitchen while keeping the rest of the house tolerable.  Speaking of which….
  • Very powerful but very quiet AC – I don’t want to hear the air conditioning kick on and off, but I want the house at 65, and no, the answer isn’t leaving the air running full blast 24/7.
  • Silent garage door openers – I get up early in the morning and I like taking drives.  I hate it when the garage door opener vibrates in a way that you can hear it in the bedroom.  (Yes, there’s a lot of silent items on this list.)

I could go on and on – the checklist has over 20 items on it at the moment.  That doesn’t include my set of tear sheets with all of the interior design features and furnishings we’ve liked over the years, too, or my Delicious bookmarks tagged dreamhouse.

My Current Housing Checklist

My Kind of Price
My Kind of Price

Back in the real world, since I’m not making seven figures (yet), my housing search checklist is decidedly more utilitarian:

  • Washer/dryer connections – I don’t want to go to a community laundry facility.
  • Good phone/cable wiring – I need high speed internet access and old building wiring presents problems with that.
  • Two parking spots – we’re looking at downtown one-bedroom lofts, and those don’t always come with two spots.
  • Low crime rate – I don’t want to have to carry a gun when I walk the dog.

Notice the difference?  Champagne tastes, beer budget.  I can’t go asking about waterfront property with what I’m payin’.  Asking about waterfront property when I can’t afford it doesn’t make me look good – and in fact, my real estate agent is pretty quickly going to start rolling her eyes when I call.  “Here comes this bozo again, asking about waterfront property for fifty large.  What a jerk.”

How This Relates to Database Administrators

If money wasn’t an object, my interview question list would simply be:

  • Is your name Paul Randal?

However, money is always an object.  Even when you think money isn’t an object, you have to ask yourself if you would hire one super-expert-senior DBA, or hire two solid but not super-expert DBAs.  My DBA interview questions aren’t designed to separate Paul Randal from the rest – they’re designed for companies who need to find somebody reliable without spending a fortune.

One of my favorite sayings is that there’s two kinds of questions: the ones designed to find out how much the other person knows, and the ones designed to show off how much you know.  Showing off in interviews doesn’t impress the candidate – and in fact, it does the exact opposite.  Humiliating a candidate makes them bitter about your company before you even make them an offer.  If your candidates rarely achieve 50%, then you’ve got a disconnect between your tastes and your budget.

Start asking questions that your candidates might be able to answer.

Otherwise, you’re just showing them you’ve got a lot of DBA-ness.


Happy Birthday to @SQLBelle

4 Comments

Donabelle Santos (TwitterBlog) celebrated her birthday today, and her coworkers at Black Ninja Software set her up.  They pinged me to ask if I’d call her to surprise her on her birthday, and one thing led to another.  Next thing you know, she was getting an urgent support call from Ted Striker.  Ted (played by me) was having a few SQL Server problems.  I slowly laid them out one at a time:

  • He’d upsized his Access stuff to SQL Server
  • Using one database per client
  • For 1,600 clients
  • With the databases hosted on a USB hard drive
  • And he wanted to bring in more data from tables stored in Microsoft Word

She kept her composure much better than I would have, and she never once blurted out, “What the hell were you thinking?!?!”  Congratulations, Donabelle!  Here’s to hoping the rest of your customers this year are brighter than Ted.  And now if you’ll excuse me, I have to go bomb the storage depots at Daiquiri.


Stunt Car Drivers, Eggs Benedict, and You

11 Comments

Wanna be the best at what you do?

Start paying attention to more than just “what you do.”

What Ken Block Does

Ken Block is a rally car driver, which means he slides cars around dangerous obstacles at high speed with great precision.  Thing is, though, lots of people can do that.  Standing out in a field of adrenaline junkies means going above and beyond the job description and giving your fans a truly amazing experience.  He recently did a video with James “Captain Slow” May of Top Gear, and it’s quite a watch – even just to hear James May say the phrase “Facetube.”  While watching, it may help to know that May is also a pilot with his own planes, so he may be especially nervous at the thought of sliding sports cars around someone else’s planes.

At about four minutes in, Block slides his Subaru alongside a taxiing airplane.

That’s amazing.

It’s amazing for his car control, but to just see the car control alone is missing the point. It’s impressive not just for how incredibly difficult it is, but also for how entertaining it is.  The whole is more than the sum of its parts, and what Ken Block “does” as a rally car driver is just one small part of this video.  It’s about stunts, it’s about humor, it’s about timing, it’s about moviemaking, and more.  Sure, Ken Block wins trophies, but paying attention to these kinds of details make him money.  People are willing to shell out big bucks when you get every part of the experience right, and I’ll give you an example from my upcoming travels this month.

Molecular Gastronomy: Stunt Cooking

Eggs Benedict at WD-50
Eggs Benedict at WD-50

The phrase “molecular gastronomy” might sound like food for geeks, but that’s missing the point.  The chefs might be geeks, but you don’t have to be geeky in order to appreciate the food.  Rather, molecular gastronomy is food for people who like to watch stunt car drivers.

On July 29th, I’m going to have a fourteen-course dinner, and one of the courses will be Eggs Benedict.  It’s going to look like the picture shown here.  (Hopefully.)

I know what you’re thinking: “Brent, you’re an idiot.  That’s not Eggs Benedict.”  Exactly – in the same way that what Ken Block is doing isn’t driving.  If you suspend skepticism for a minute and zoom in on that food, you’ll see that it’s chock full of stunts.

Sauce is a liquid.  Liquids, by their very nature, can’t be deep fried, right? You can’t take a lump of mayo, drop it into a deep fryer, and expect anything recognizable to emerge, no matter how thoroughly you coat that lump in bread crumbs.  But see those little cubes in the picture?  That’s deep fried hollandaise sauce.  See the cylindrical column?  Egg yolk.  The razor-thin sheet of something?  That’d be bacon.

Deep frying liquid is a stunt – but that alone doesn’t make good eats.  You have to have a reason to deep fry the liquid, the reason being a final result dish that is somehow in desperate need of deep fried liquid.  Otherwise, you’re not really doing anything more revolutionary than deep fried Twinkies at the state fair.

Like stunt car driving, when it’s done amazingly well, molecular gastronomy is an art involving more than one sense – and Wylie Dufresne does it very well.  Don’t take my word for it – read the WD-50 review by the New York Times.  That place is one of only two restaurants on my bucket list – the other being Paula Deen’s restaurant in Savannah, Georgia.  Hard-core readers will recall that I blogged about visiting The Lady and Sons a couple years ago during a road trip, and this is pretty much a polar opposite experience.

The key word shared between WD-50, The Lady and Sons, and Ken Block’s driving videos is experience, and I’m not talking about the number of years you’ve been punching the clock.

What’s Your User Experience Like?

When someone steps into your cubicle to work with you on a problem, are they scared?  Intimidated?  Pissed off?  Most of us database administrators have ugly, nasty reputations for always saying no and never explaining why.  Think about what that’s like for your customers – and yes, even if people aren’t paying you directly, they’re your customers.

I’ve never given Ken Block a dime, but in a way, I’m his customer, and he’s gone way out of his way to build a killer experience for me.  He’s building a brand that he can sell to advertisers like DC Shoes.  Web sites work the same way, too; in today’s Coding Horror blog entry, Jeff Atwood talks about the business of building a successful web site:

“Despite Benjamin’s well reasoned protests, the source code to Stack Overflow is, in fact, actually, kind of … well, trivial. Although there is starting to be quite a lot of it, as we’ve been beating on this stuff for almost a year now. That doesn’t mean our source code is good, by any means; as usual, we make crappy software, with bugs. But every day, our tiny little three person team of speedy-but-doomed Velociraptors starts out with the same goal. Not to write the best Stack Overflow code possible, but to create the best Stack Overflow experience possible. That’s our mission: make Stack Overflow better, in some small way, than it was the day before. We don’t always succeed, but we try very, very hard not to suck — and more importantly, we keep plugging away at it, day after day.”

Users aren’t giving any money directly to StackOverflow, but Jeff knows he has to build a killer end user experience because that experience pays off in other ways.  Your career has the same goals.  Even though the developers, project managers and end users may not pay your salary, they do have political capital with your boss.  They can make or break your career in an instant.

After years of working with people who want to throw triggers everywhere, people who don’t understand the basics of indexes, people who want their server to run 24/7 for zero cost, and people who call you every weekend for trivial support issues, it’s easy to become cynical and angry.  It’s easy to let yourself slip into growling when you pick up the phone.  Heck, I have to refocus on this all the time – I get frustrated too, despite the shiny-happy-upbeat-please-your-customers stuff I post on the blog.  Building a good end user experience is a never-ending journey.

To be the best at what you do, it’s not enough to just do what you do. Lots of people toil away just like you every day cranking out widgets. The difference is making people want to go out of their way in order to work with you and to watch you work.


SQL Server Index Tuning Tip: Identify Overlaps

Indexing
21 Comments
Performance Tuning 101 - Add More Spoilers
Performance Tuning 101 - Add More Spoilers

If you’ve got performance troubles with an application that stores data in SQL Server, and especially if it’s a home-grown application (not a store-bought app), you can get dramatic performance improvements simply by focusing on some basic indexing techniques.  These tips and tricks pay off more than pouring money into hardware that might look good sitting in the datacenter, but doesn’t really make the application significantly faster.

When I go into a shop to speed up an application I’ve never seen before, two of my favorite quick-hits are from the index performance tuning queries from SQLServerPedia:

  • Find unused indexes – these are indexes the SQL Server engine says it’s not using.  Unused indexes incur a speed penalty because SQL Server still has to add/update the indexes as records change, so they make writes slower.
  • Find missing indexes – these are indexes SQL Server wishes it had available.

I’m not going to cover those in detail this week because I’ve already recorded tutorial videos over at SQLServerPedia for those, but I do want to focus on something these queries won’t pick up.  Sometimes a table has two nearly-identical indexes, and they’re both being used for reads.  Take these two:

Performance Tuning 201 - Even Wood 2x4s Can Be Spoilers
Performance Tuning 201 - Even Wood 2x4s Can Be Spoilers

They’re two different indexes, and they’re both getting used – but does that mean we need them both?

They’re very nearly identical – but the second index has one extra field.  When the SQL Server engine gets a query that needs RunID, SiteID, DataSource, OutputType, and PeriodType – but not QuotaItemDriverID – then it will use the first index.  When it gets a query that needs all six fields, then it’ll use the second index.

In cases like this, I prefer to drop that first index and let the slightly bigger index pick up the slack.  Reading a slightly larger index will take slightly more time: if a query didn’t need that QuotaItemDriverID field, it still has to pull it off the disk in order to perform the query.  However, dropping the index pays off during inserts/updates/deletes, because it’s one less index SQL Server has to manage.  It also makes the database smaller, thereby making database maintenance tasks smaller/faster.

If:

  • I have two indexes with the exact same fields in the same order, but
  • One has 1-2 extra fields, and
  • There aren’t include fields, or the include fields are the same

Then I’ll drop the shorter index with extreme prejudice.

When Indexes Have Include Fields

If they have “include” fields, then I’ll merge the include fields between the two indexes to make one index to serve both needs.  Say we have these two indexes:

The first index includes the YTDRevenue field, but the second index doesn’t.  If I just drop the first index, then queries that needed that field won’t get the full speed benefits from the second index.  To merge the two indexes, I need to drop both indexes and recreate the second one with the YTDRevenue field included, like this:

In this example, I tacked the YTDRevenue field on to the end of the include field list.  The order of the included fields doesn’t matter, since SQL Server doesn’t sort by those.

Performance Tuning 301 - Beauty Just Adds Weight
Performance Tuning 301 - Beauty Just Adds Weight

Things to Watch Out For

In my examples, I kept things simple by omitting all of the extra indexing options like partitioning and sorting in TempDB.  When doing index tuning in real life, though, you’ll want to check those options to make sure they’re consistent from index to index.

Field order matters in indexes; if two indexes have the same fields but in different order, that doesn’t mean you can drop one of them.

Ideally, after making index changes, we would restart the SQL Server instance to reset the DMV counters that monitor index use.  In reality, though, that’s not so easy to pull off, so we need to log our changes to understand what the changes have been.  After making index changes, log the changes somewhere. I keep the output of the index performance tuning DMV queries in Excel spreadsheets because it’s easier to email those back and forth from machine to machine, especially when I’m consulting. The next time you do index performance tuning on the same database, you can use the historical spreadsheets to determine whether or not your changes worked the way you’d planned.

SQL Server 2008 Query Performance Tuning Distilled
SQL Server 2008 Query Performance Tuning Distilled

Learn More About SQL Server Index Tuning

I really like Grant Fritchey’s book SQL Server 2008 Query Performance Tuning Distilled, and I wrote a book review about it.  I can’t recommend it highly enough, and I’d start there.

If you don’t have the patience to wait for a book, here’s a few more blog posts about performance tuning:

Performance Tuning with Perfmon – how to set up Perfmon, what SQL Server Perfmon counters to track what the indicators mean.

Data Mining Your SQL Server Perfmon Counters – want to take your Performance Monitor statistics to a new level? I wrote an article on SQLServerPedia explaining how to use Microsoft’s free Table Analysis Tools for the Cloud plugins to dive deeply into your data.

Primary Keys and Indexes – I explain the concepts behind keys and indexes using phone books as an example. Indexes have huge impacts on performance, and if you master these you can make your server go a whole lot faster without spending more money.

SQL Server 2005 Setup Checklist – some simple configuration tweaks can get you 20-30% performance increases right from the start without spending any extra money.


The Chicago-Mac Sailboat Race

SQL Server
4 Comments
The Race Course
The Race Course

Since 1898, sailors have gathered at the Chicago Yacht Club each summer to race their sailboats up Lake Michigan to the Mackinac (pronounced mackinaw) Island.  The 333 mile course from Navy Pier to the lighthouse makes it the longest annual freshwater sailing race in the world.

The sailboats race around the clock for days as the crew work in shifts, sailing for a few hours and then sleeping for a few. Night sailing, storms, and quiet windless calms make this a memorable experience.

I’m nowhere near qualified enough to get a crew position on one of the real race boats, but I tried it in high school aboard a friend’s cruiser.  At the time, cruising sailboats weren’t technically allowed in the race, but bystanders shuffle down to Chicago and start at the same time as the serious racers.  We lived about halfway up Lake Michigan, so we were proud that we even made it down to Chicago for the start.  We made it about halfway back up before calm winds and a problematic engine made us give up.  I still fondly remember steering the boat in the middle of the night, watching the compass and the stars, talking to friends about what we planned to do with the rest of our lives.

The Hannah Frances
The Hannah Frances

This year, I’m honored to be able to give it another shot.  The Chicago Yacht Club started a separate class for cruising boats recently, and I’ll be aboard the Hannah Frances.  Mike Cook’s a good friend of mine, and he tolerates my complete ignorance of how to tie a knot.  (I was a Boy Scout – how come I know absolutely nothing about how to tie lines together?!?)

We have no delusions of winning, but we do have delusions of finishing.  The Hannah Frances is a wonderful boat rigged for easy shorthanded sailing and relaxed self-tacking, but fast, she is not.

We’re hoping to finish the race in under 4 days, but that means a lot more than 4 days of sailing.  We’re leaving in two weeks – Wednesday, July 16th – for a couple/few days of sailing down to the Chicago starting line on Saturday.  Then it’s four days of sailing up Lake Michigan, a day of partying with the other sailors on Mac Island, and then another few days of sailing back to White Lake.  By the end of it all, the crew will be intimately familiar with the boat and with each others’ quirks.  (Mike’s already warned me that if I want to listen to Death Cab for Cutie, I’d better bring headphones.)

Over the next couple of weeks, I’ll blog a little about race preparations.  Sailboat racing really is a sport, and it’s harder work than it looks.  For starters, I have to go pick up a Tyvek suit to fend off the black vampire flies.


More Thoughts on Blog Plagiarism

14 Comments

In the aftermath of the InformationFlash plagiarism incident, several questions have come up from the site’s webmaster and from other bloggers.

Is it okay if the plagiarizer isn’t making money?

No.  Authors work really, really hard to create their original content.  Seeing someone else pass it off as their own, whether there’s a charge or not, reduces the value of our hard work.

If I took the whole content of The Manga Guide to Databases and reproduced it here on my blog, I wouldn’t be making a dime off it.  However, I’d be robbing the author of income.  Even if that author was giving away the work for free, the author might be benefitting in a way that I don’t understand yet, so I need to contact the author before republishing their copyrighted work.

Is it okay if I don’t understand my blog aggregation software?

No.  If you pick up a gun, it’s your responsibility to understand how it works. The first time it accidentally goes off and shoots somebody, you might be able to get away with claiming you didn’t know it was loaded.  After several people complain about gunshot injuries, though, you need to put the gun down.

Just as you can go to a local gun club to learn about firearm safety, you can get help with RSS aggregators too.  Post a message in the product’s support forum, contact other users of the product, or post a message on StackOverflow.  But whatever you do, don’t wave that thing around until you understand what you’re doing.

Shouldn’t the bloggers change their feeds to prevent theft?

Bloggers can choose whether to include the full article or just a few words in the RSS feed.  In my series on how to start a technical blog, I recommend using the full article because readers like it a lot more.  They don’t want to click through to read your full article on your site.  (Personally, I hate the holy hell out of blogs who just include the abstract, and their content has to be insanely good for me to subscribe to one of those kinds of blogs.)

Even if the blogger changes their feed to just include an abstract, it still doesn’t prevent syndication sites from stealing content with screen-scraping techniques.  Then the naysayers would say, “It’s the blogger’s fault for not requiring a username and password in order to read the blog.”

If we have another site pop up like InformationFlash, I’ll probably end up including a copyright note at the bottom of every blog entry.  It’ll say something like, “If you’re not reading this article at BrentOzar.com or SQLServerPedia.com, it was stolen.”  I hate doing that, though, because it looks crappy.  It’s like bolting the TV remote to the nightstand.

Is it okay if end users submit the copyrighted blogs?

No.  When the owner of copyrighted content notifies you that your site has their stuff on it, and they want it taken down, you have to take it down pronto.  YouTube is a good example because people try to upload copyrighted data all the time.  If the original content owner files a DMCA complaint at YouTube, then YouTube acts quickly to take the content down.

Just as a side note – if you try to claim some other user uploaded the copyrighted content, you need to be *very* prepared to show database records and web server access logs to prove the site administrator wasn’t the one uploading content.

How come it’s okay when Digg or DotNetKicks does it?

Because those sites don’t publish the full content of the article.  They show the first few words of the article, and if the reader is interested, they click through to the full content of the article on the blogger’s site.

InformationFlash was showing the entire article, start to finish, without even showing the author’s name.  That isn’t promoting the authors at all.  To make matters worse, InformationFlash had a Google PageRank of a whopping zero – meaning it wasn’t promoting anyone other than itself by stealing content.

Then is it okay if the site promotes the bloggers?

No. When you’re taking copyrighted content from bloggers, you have to get their permission first, period.

Some authors are completely okay with you republishing their work as long as you attribute them appropriately and link back to them.  For example, I’ve told SQL Server Magazine they’re free to use any material from my blog as long as they quote me.  (Part of this is a selfish reason: despite what Compete thinks, I’m pretty sure SQL Server Magazine has more readers than I do.)

Is it okay if it’s not illegal?

Even if you register your domain name anonymously and ignore all incoming emails, sooner or later people are going to figure out your real name.  They’re going to post your name in public along with an explanation of what happened.  That kind of information will turn up in Google searches, and it’ll make for very ugly job interviews and client negotiations down the road.

Besides, don’t you want to be successful?  Your site simply can’t become a success by alienating the very people upon whom your site depends for content.  You can be successful by working with the community and making sure everything is a win-win.  It’s not easy, and it’s not cheap, but it works in the long run.

Stealing is easy and cheap – but the long-term outlook is not so good.


How to Take Action When Your Content is Plagiarized

23 Comments

If your copyrighted blog content shows up in whole on another site without proper attribution like InformationFlash.com is doing, here’s a few steps you can take. IANAL (I Am Not A Lawyer), so YMMV (Your Mileage May Vary).

Send the Webmaster a Cease & Desist Letter

Get a sample cease & desist letter and tailor it to include your own content information. Identify the exact copyrighted blog post that’s showing up on their site.

The webmaster may not be aware of the plagiarism. Sometimes end users post copyrighted material on their own without the webmaster being aware. In other cases, the admin themselves may be doing the copying. Sending a Cease & Desist to the webmaster helps them understand that you didn’t give them permission to post it on their site.

The User Causing All The Problems
The User Causing All The Problems

Some sites like InformationFlash don’t make it easy – they don’t publish any personal information on their site, and they try to hide behind private domain registrations. They only accept emails through a contact form, thereby making it impossible to guarantee message delivery. No problem – keep reading.

Send Their ISP’s Abuse Department a DMCA Takedown Notice

The Digital Millennium Copyright Act protects the intellectual property rights of people who create content, like bloggers. Title II of the DMCA is an agreement between you (the copyright holder) and internet service providers (the web hosting company). As long as the copyright holders notify the ISP and the ISP reacts appropriately, then the ISP is not liable for the copyright infringement. Only the plagiarist is liable. That means web hosting companies and internet providers react swiftly and fairly to complaints of copyright infringement.

Get a sample DMCA notice to hosting companies and send it to the web host. In the case of InformationFlash, you can send it to abuse@dreamhost.com. I took the extra measure of sending one DMCA takedown notice per copyrighted article to show the extent of the problem.

Send Search Engines a DMCA Notice Too

If the site’s webmaster and their web host still don’t react, we have another weapon: the search engines. Before doing that, find out if the site even turns up in search results – the search engines may have already received DMCA takedowns for the site in question. Go to your favorite search engine and type the name of your blog post in quotes, like this:

“Top 10 Developer Interview Questions About SQL Server”

Look at the search results and find out if the offending site shows up. In the case of InformationFlash, it doesn’t show up – even if I add the word InformationFlash to the search. That’s awesome – Google’s already figured out that the site’s up to no good. In order to send a DMCA notice to a search engine, you have to show that their site will show up in a search for your work.

Each search engine has a different procedure for getting sites delisted:

There’s also a sample DMCA notice to search engines that you can use, but make sure to adapt it to each search engine.

Ask for Help From Fellow Bloggers

If you syndicate your blog with SQLServerPedia, email me about the offending site. If you blog at any other site, email the head honcho. All of us are writers, and all of us take plagiarism very, very seriously.

A cynic might ask, “But wait – how is this different than blog syndication at SQLServerPedia?” I’m glad you asked.

  • You ask us to syndicate your content. We don’t go poaching content.
  • We work with you to set up specialized feeds so that you choose what to syndicate.
  • We slather your name all over the place, making it abundantly clear that it’s yours.

If someone takes your syndicated content without your permission, and if you complain to me about it, I will make every effort to go after the offending party with all of the resources available to me. If you want them to syndicate your content straight off your site, that’s completely okay – but they need to take it from your site with your permission, not from SQLServerPedia. You, as a blogger, are completely welcome to syndicate with as many sites as you’d like.

In the case of InformationFlash.com, we’ve already sent them C&D letters, yet they’re still using (y)our content inappropriately. I hate to have to take it to the next step, and I hate to name names in public on my blog. I try to give everyone involved the benefit of the doubt and give them time to do the right thing. If they don’t do the right thing, then I want to make sure the public knows the names of the individuals involved and what they’re doing.

My next post will explain why companies should think twice before hiring individuals who plagiarize intellectual property, whether as full time employees or consultants.

InformationFlash-Content-Copied-From-Brent-OzarUpdate 6/27: as I expected, InformationFlash syndicated my content despite the post actually being about InformationFlash stealing content.  Rather awkward.  Here’s a screenshot of their plagiarized content, as well as a screenshot of a blog post they plagiarized from Gail Shaw.  Also note the name of the user who submitted the content – either their admin account has been hacked, or the site’s administrator is responsible for plagiarizing the content.  The top of the page notes that they aggregate information via RSS, but remember that we’ve already sent them a cease & desist once, and they agreed to do it – they’re just not doing it.

Update 6/28: Dreamhost contacted me and said they’re taking the site down due to our DMCA complaints.  It’s not clear whether the takedown is permanent.  I want to thank Dreamhost for acting quickly to protect the intellectual property rights of bloggers.

Update 6/29: I got emails with questions from the site’s webmaster and from a few bloggers, so I added the answers in a followup post with More Thoughts on Blog Plagiarism.


My Michael Jackson Story

2 Comments

I’ve been joking a lot on Twitter about the passing of the King of Pop.  Somewhere between the Jesus Juice and the Elephant Man, he’d lost a lot of credibility in his fading years.  Earlier in both of our lives, though, things were different.

This Had Me Written All Over It
This Had Me Written All Over It

In middle school when he was at the peak of his popularity, I desperately wanted a red and black leather jacket like his.  I mean, desperately.  I had enough money saved up, and Sears carried one that I could afford.  That right there should tell you everything about my level of style – I aspired to own a piece of clothing carried by the most unhip of 1980s retailers.

My parents, having slightly more taste than me, would not allow me to purchase the jacket.  I was upset, mortified, angry, you name it.  Today, my father sports a diamond earring inspired by Jimmy Buffet – but I digress.

Instead, I ended up buying a large boom box, with which I played songs like Thriller, Bad, and Billie Jean.  Over time, my tastes changed to Huey Lewis and the News, but the King of Pop will always make me wanna get out on the dance floor and perform ill-advised moves that show off my complete lack of physical grace.

So today I’ll be listening to the Essential Michael Jackson collection I just picked up off Amazon MP3 for $17, dancing around the desk, and I won’t stop til I get enough.


How to Get More Twitter Followers

7 Comments

Yesterday, Kevin Kline ran across the WeFollow list of top twitterers for the SQL tag and remarked:

How to Climb a Mountain
How to Climb a Mountain

I hear that same question privately every now and then, and it’s not that hard.  I’ve got the simple answers to get yourself to the top of the popularity list!

Set Up Searches for Key Phrases

If you’re interested in SQL Server, there are tools you can use like RSS feeds from Search.Twitter.com that will alert you whenever someone mentions SQL Server.  That way you can jump right into their conversation and interrupt help them.  They will surely be impressed by your knowledge and your willingness to help, and they’ll follow you for your insight.

The drawback, though, is that there’s a lot of conversations happening on Twitter at any given time.  It’s seriously hard work to keep up with all of them.  You could devote your time to Twitter searches, or maybe hire a savvy assistant to proactively run your Twitter profile, but sometimes even a human being isn’t enough.  At that point, you’ll want to bring in the machines.

Set Up Robots to AutoRespond For You

Clippy
Clippy

Twitter has a cool set of APIs that you can use to build a robot.  Whenever someone mentions a topic, like say SQL Server, you can build an automatic response that says something like:

“I see you’re trying to build a database.  Would you like some help?  I’ll be your best friend.”

If you’re really good with your autoresponses, people will never guess that your witty responses are coming from an automated, heartless piece of software.  Bonus points if they try to carry on a conversation with you, and you have another autoresponse for that.  They’ll line up to follow your Twitter account in no time.  To see an example of a bot in action, check out @joe_kl.

Follow Everybody You Can Find

Go crazy with the Follow button.  Follow anybody and everybody regardless of what they’re talking about.  They might follow you back just out of sheer politeness.

There’s a catch, though: Twitter will yank your account if you follow too many people too fast.  Every few days, go into your Friends page in Twitter, which lists the people you’re following.  You can identify the ones who are following you back because there’s a “Direct Message” link – you can only send DM’s to people who are following you.  Unfollow anybody who doesn’t have a “Direct Message” link next to their name, and presto, it’ll keep your list shorter and let you follow more people.

When you unfollow people, they may get alerted about this if they’re using a service like NutshellMail.  At that point, they’re going to know you’re a bit of a spammer, because they’re going to guess that you followed them just to try to bait them into following you back.  This isn’t a problem at first, but if you try that same trick repeatedly, it pisses off users because they know you’re just an absolute slimeball.  (Doing it even once makes you a slimeball, though.)

Give Stuff Away to People Who Follow You

Announce that once a month, you’re going to pick a random follower and give them something juicy like a gift certificate or a free iPhone.  People will do almost anything for a Klondike bar, I hear.

Once you start, though, it’s like a drug addiction.  If you don’t keep giving things away, people will stop following you, and worse, they’ll start UNfollowing you.  Of course, if you’re in the business of professional marketing, you should have no problem justifying giving away portable hard drives or Macbooks in order to get your spam message out to a larger audience.  Heck, even just the Twitter population as a whole may not be enough, and you may want to…

Send Spam Emails Asking People to Follow You

The Simple Twitter Book
Download My Free Twitter Book

The majority of humanity isn’t on Twitter yet, so when these measures aren’t enough, it’s time to kick it up a notch.  Send out a broadcast spam email to everyone you can find asking them to join Twitter and follow you.

I’ve been watching the Twitter follower counts of one particular publication who chose to spam me with an invite like this.  I was curious to see if it worked – I had this vision of people saying, “Wow, this is awesome!  I don’t get enough spam through my email client, and it takes so darned long to get it.  I’ll go sign up right now and follow them for up-to-the-minute spam in 140 character chunks!”  Not surprisingly, it doesn’t appear to be working.

Or, Uh, Maybe Just Be Yourself

Maybe I’m old school, but I like to get my followers the old-fashioned way: I earn them.

Don’t follow people just to game the metrics. Unless you’re Ashton Kutcher, nobody really gives a rip how many followers you have.  Twitter is about relationships.  It’s about caring, not calculations.  If you’re out to prove you’ve got the biggest numbers, cut straight to the chase and start giving away free pr0n.

Be yourself, not your company. I follow some company accounts because they have truly kick-ass products.  I want to hear every single bit of news about the cool new stuff they produce.  I work for a company too, but I don’t use my Twitter account as a pimp platform.  If you ask me questions about our products, I’ll be glad to talk with you about it, but not in public on Twitter.  Nobody wants to listen to somebody else buying a used car on Twitter, for example.

Join the conversation. Don’t just spew garbage out automatically – listen, help, and engage.  When you jump into a stranger’s conversation and start blathering about yourself, your opinions or your product, people see through your act.  In meatspace, you can identify the failure of your technique by watching the panicked horror in their facial expression, but on Twitter it’s not so clear.  If the technique doesn’t work in meatspace, it won’t work here either.

Remember that kid in middle school whose mom always sent him in with a bag of cookies trying to make friends?  The one who kept running into you and your buddies and just standing around until he could inject himself into the conversation?  The one that everybody said was trying too hard?  Don’t be That Guy.


Top 10 Developer Interview Questions About SQL Server

Knowing good SQL questions to ask during an interview with a developer can help you filter out the best candidates from the ones who aren’t the most qualified.  There’s a huge difference between “It worked on my machine” and “It scales well in production.”  These interview questions will help you filter out the bad apples before you hire them.

10. Explain why DBAs don’t like cursors.

I like to phrase this interview question this way because I’m not saying the DBA is right – I’m just asking the developer to explain the DBA’s point of view.  I don’t have a problem with the developer rolling their eyes as they explain the answer, but I have a problem with the developer being surprised by the question.

The candidate gets bonus points if they seem even vaguely aware of the terms “set-based processing” and “row-based processing”, but that’s purely a bonus.  (I wish I could say that these concepts are requirements, but in today’s economic market, companies don’t always want to pay top dollar to get the best candidates.)

9. Where do you like business logic – in the app or in the database? Why?

Personally, I like stored procedures because they’re easier for us DBAs to test, tune and tweak. On the other hand, the developer community isn’t always as fond of stored procs.  For their side, see these posts by Jeff Atwood:

I don’t mind what arguments the coder candidate uses, but I want to see ’em put some thought into it.  No matter which angle they take, I’ll play the devil’s advocate and prod them with arguments just to see how they react.

8. Explain when and how transactions should be used.

Not In The Oprah Book Club, Oddly
Not In The Oprah Book Club, Oddly

Start with just that open-ended interview question, and if they have trouble getting started, give them a scenario.

“Say we’ve got a table for Orders, and a table for OrderDetails.  Someone places an order for two books – Bacon: A Love Story and the hit bestseller Eat What You Want and Die Like A Man.  Tell me what happens.”

After they’ve answered, ask them when transactions should not be used.  I don’t want my developers wrapping anything inside a transaction unless it absolutely needs to be.  (Unlike bacon, which should be used as often as possible for wrapping purposes.)

7. Explain referential integrity and where it can be enforced.

If they stumble on the question, circle back to the Orders and OrderDetails tables we used as examples earlier.  What’s an orphan?  How do we make sure that we don’t end up with OrderDetails for records with no matching Order record?  Where are all the places we could enforce referential integrity?  (Think foreign keys, triggers, the application, or not at all.)  Have you worked in places where there was no referential integrity, and what problems did you run into?

6. What’s the fastest way to get a thousand records into the database?

I’m not looking for the best answers – I’m just looking to hear that they’ve done some work to performance tune their queries.  If they’re doing fully logged individual record inserts, one at a time, into a data warehouse-size system, we’re going to have problems down the road.  (Yes, I’ve actually worked with a BI developer that did millions of individual inserts per night in full recovery mode and thought the performance was the database’s fault.)

Bonus points if they link back to the previous interview question and talk about whether or not they should disable constraints or referential integrity during data loads.  (I don’t care what their final answer is, but I just want them to know the pros and cons.)

5. What’s the difference between a primary key and a clustered index?

This is almost a bonus question.  Most of the time, the candidate doesn’t know because it’s a function of the data modeler or architect, not the developer.  However, I want to see how the candidate reacts to tough questions.  Ideally, they say in a relaxed tone of voice, “I’m not sure, but I know who I’d ask.”  If they don’t mention where they’d go, ask them where they go for SQL Server answers.  Speaking of which…

Bonus Points for This Candidate
Bonus Points for This Candidate

4. What’s your StackOverflow name?

I don’t need to see a high reputation, but I do want to see an awareness of the site.  This interview question serves two purposes: it finds out if they’re serious enough to be active in the community, and it shows them that you’re okay with their community activity.  Start a conversation with them about the level of internet time that you find acceptable in the office, and encourage them to share their knowledge with their peers.  This sells the candidate on your shop.

3. Tell me about a time when a DBA got mad at you.

This is a spin on the classic interview question, “Tell me about a time when you failed.”  Implemented a user-defined function, trigger, CLR in the database, or something else that made the DBA freak out?  I want to hear that the candidate listened to what the DBA had to say, good or bad.

If they say it’s never happened, rest assured it’s going to happen soon.

2. How can you tell if a query will scale for production?

I want to hear that they do things like load tests or maybe look at execution plans.

I’m sometimes comfortable when a senior developer says things like, “I can pretty well tell when something isn’t going to scale, because I know the production boxes really well.”  The key is asking a followup question about times when things didn’t scale.

1. When is the DBA right?

Always, kid.  Always. (Okay, uh, maybe not.)

Learn More in Our Interviewing and Resume Tips


Questions About Automation & Patch Management

12 Comments

Opinion poll time!  I got asked a few questions, and I’m curious about what the rest of you think:

Question 1: How do you feel about Run Book Automation for databases?

Have you used it?  Would you want to?  Why or why not?  If you’ve never done it and never would, please respond too – don’t just be scared off by terms you haven’t heard before.  (I hadn’t heard of it before I came to work for Quest.)

Question 2: Do you have multiple copies of the same database?

Do you have several copies of the same database that you need to keep in sync, whether it’s schema or data?

Question 3: If yes, how do you feel about automated patch and configuration management of databases?

Do you have enough databases that you would consider building a “gold” standard, deploying it, watching for changes, catching out-of-band scenarios, syncing them back to the gold standard, etc.?  Why or why not?

Let me know what you think in the comments.  I’ll hold back my own answers for a couple of days.


SPWho2.com: StackOverflow user and tag statistics

3 Comments

I took the StackOverflow database dump, brought it into SQL Server, and did some slicing and dicing by tags and users.  I wanted to find the answers to questions like:

I’m having a lot of fun with the data, and I thought other people might enjoy it too.

To make it easier, I built SPWho2.com, a site with StackOverflow user and tag statistics.  It’s not terribly attractive yet, unless like me, you find numbers attractive.  It’s a side hobby for me right now, and over time I’ll add in more data visualization with graphs and trends.  If there’s anything you’d like to see added, let me know.


Finding old, unanswered StackOverflow questions

SQL Server
0

So you wanna start answering questions on StackOverflow, but you’re frustrated because it seems like people are fillin’ out answers even faster than you can type?  Find yourself up at 3am hitting Refresh just hoping a new question comes in?

Necromancer Badge Shirt
Necromancer Badge Shirt

Here’s a new trick:

  1. Import the StackOverflow database into SQL Server
  2. Run this query to find old, unanswered questions
  3. Profit!  Err, no, but watch your reputation go up as you solve old questions.

It turns out there’s a little (and I do mean little) niche of questions ripe for the pickin’.  I spent half an hour around midnight last night answering a handful of old SQL Server questions and I’ve already gained 75 points – pretty surprising for a Friday night.  Even more surprising to me, I’d posted an answer to a question that had been dormant since December (6 months ago), and the questioner already checked my answer and thanked me for it.  See, answering old questions really does help people, not just game the system for reputation points!  (That’s how I rationalize it anyway.)

Another boost: you’ve got a better chance of earning the Necromancer badge, which is awarded when someone answers a question more than 60 days old and then gets 5 upvotes on their answer.

Oh, and the shirt?  Yeah, I’ve set up a few StackOverflow badge shirts, including badges that don’t exist (yet) like Fastest Gun and Bounty Hunter.


I’m Going to Disn-errr, #SQLPASS!

#SQLPass
0

Woohoo!  I’m presenting three sessions at the PASS Summit in Seattle this year.  I’m doing one with Jason Massie (TwitterBlog) on Social Computing for the Database Professional (I don’t have the abstract handy for that one), and these two:

Session: Yes, I’m Actually Using The Cloud

Security in the Cloud
Security in the Cloud

There’s a lot of hype around cloud-based databases. After you get past the knee-jerk reaction about security, what else matters? Is it time to buy in, and what should you watch out for? Brent explains some of the pros and cons hes experienced running SQL Servers in the cloud, and will demonstrate how easy it is to fire up a new SQL Server in the cloud.

Brent’s involved with StackOverflow.com as an advisor, and he’ll talk about the decisions they made about whether to host production and/or disaster recovery servers in the cloud.

Session Goals

  1. Learn to estimate an application’s costs in the cloud
  2. Learn options for cloud-based disaster recovery
  3. Learn how to talk to developers and managers about cloud database options

Session Outline

  • What’s In The Cloud Now
    • Cover the services currently available in the cloud for DBAs
    • Demonstrate starting up a new SQL Server in Amazon EC2
  • The Elephant In The Room
    • Security, and why it matters
    • Why (and when) it doesn’t matter
  • Disaster Recovery in the Cloud
    • Explain methods of log shipping and database mirroring to the cloud
    • Cover drawbacks of current solutions
    • Show how I’m using it now
    • Talk about why StackOverflow chose not to use it
  • Predicting Cloud Costs
    • How to query MSDB to predict log shipping costs in the cloud
    • How to use Perfmon to predict bandwidth costs of a cloud-based SQL Server
  • How to Talk Cloud to Managers
    • How to lay out the pros and cons of cloud-based DR
  • How to Talk Cloud to Developers
    • How to explain the risks of cloud-based hosting with today’s solutions
  • Questions & Answers

Session: DRP101 – Learn the Difference Between Your Log And Your Cluster

Hurricane Ike (from the presentation)
Hurricane Ike in Houston

Developers and accidental DBAs: if you know more about how SQL Server handles crashes and disasters, you’ll be able to make a better decision about how to prepare. In this session, Brent will cover all of SQL Servers backup and high availability options at a high level, including clustering, log shipping, mirroring, replication and more. He’ll show the pros and cons of each, and teach you how to pick the right method for your application. We won’t have enough time to dive into actual implementation demos due to the number of solutions we’ll cover, but we’ll show screen shots and give links to the best resources for each method.

Session Goals

  1. Learn the difference between high availability and disaster recovery
  2. Learn real-world drawbacks of each solution
  3. Learn which methods complement each other for even better protection

Session Outline

  • High Availability and Disaster Recovery – What’s the Difference?
    • Show examples from each category of crash
    • List the SQL Server protection methods for both
  • HA: Clustering
    • Explain the hardware requirements
    • Point out the single point of failure (the SAN)
    • To plan for SAN failure, we can use clustering in combination with…
  • HA: Synchronous Database Mirroring
    • Explain the hardware requirements (and separate drive arrays)
    • Live demo of setup and failover
    • List drawbacks (accidental failovers, single-db failovers, index maintenance jobs, etc)
  • HA, DR: Replication
    • Explain basic requirements
    • Cover drawbacks (schema changes, 3rd party application support, monitoring, management overhead)
  • DR: Asynchronous Database Mirroring
    • Explain the differences between synch and asynch
    • Give examples of when each is more appropriate
  • DR: Log Shipping
    • Discuss how it’s set up
    • Talk about non-DR uses (development servers, testbeds)
    • Note SQL 2008’s advantage due to compression
  • DR: SAN Replication
    • Cover hardware requirements, expense
    • Discuss advantage of using a single DR method for all applications (SQL/Oracle/Exchange/etc)
  • DR: Virtualization Replication
    • Explain requirements and how it works
    • Note that it only works for virtual servers
  • My Personal Favorites
    • Explain why I like to use a combination of clustering and log shipping
  • Backup Best Practices
    • Things to do or avoid doing no matter which HA/DR method you choose, such as:
    • Back up to a different SAN in possible
    • Do regular fire drills
    • Get the tapes offsite as fast as possible
    • How to estimate timelines for managers
  • Questions & Answers

I’m excited to see everybody there!


The Best Thing I Learned at #SQLPASS

#SQLPass
1 Comment

I went to my first Professional Association for SQL Server Summit in 2007. I’d gone to a few local SQL gatherings, talked to other area DBAs, and worked on some pretty cool projects.  I really liked what I was doing, and I thought I was doing a decent job.  I knew I wasn’t a rocket scientist, but I kept the trains running on time.

Early Version of Query Analyzer
Early Version of Query Analyzer

At the PASS Summit, though, it was as if a giant light bulb clicked on above my head.

I learned that the stuff you really need to know isn’t in Books Online.

You wanna learn the stuff that makes your server fly, the stuff that saves your rear when the database is suspect, the stuff that makes you look like a genius amongst your peers, the stuff that makes the developers’ eyes get big, the stuff that makes you wave your hands in the air and shout “I’M KING OF THE WORLD, FOR TINYINT VALUES OF WORLD!”

Everybody knows the stuff in Books Online – that stuff ain’t enough.  You wanna learn the secret tips, the undocumented goodies, the up-up-down-down-left-right-left-right-B-A cheat codes.  They say there’s no /faster switch for SQL Server, and they’re wrong. The SQL Server world is chock full of sorta-kinda-barely-official things that don’t get enough press, yet make a very real performance difference.  The classic example from last year was Jimmy May’s session on partition alignment.

PASS is running a contest to find the best thing you’ve learned at PASS, so in the blogosphere you should be seeing a few folks chime in with their experiences.


How to Import the StackOverflow XML into SQL Server

SQL Server
16 Comments

UPDATE 2013: This code is no longer available. The size of the StackOverflow export has grown beyond what you can import with this method.

Want to play around with the StackOverflow database export?  Here’s how to import the XML files into SQL Server, and some notes about the tables and data schema.

Script to Import StackOverflow XML to SQL Server

This T-SQL script will create six stored procedures:

  • usp_ETL_Load_Badges
  • usp_ETL_Load_Comments
  • usp_ETL_Load_Posts
  • usp_ETL_Load_Users
  • usp_ETL_Load_Votes
  • usp_ETL_Load_PostsTags (which isn’t one of the StackOverflow tables – more on that in a minute)

The XML import code is from an excellent XML tutorial by Denny Cherry.  The scripts create a table (named Badges, Comments, Posts, Users, Votes) for each XML file.  The schema matches the XML file with one exception – I added an identity field to the Badges table.  The rest already had Id fields.  The tables don’t have any indexes to speed querying. I would highly recommend that you not change the schema of any of these tables, because I’ll be giving out more scripts over the coming days and weeks that rely on the base tables.  If you want to add more data, add additional tables.  Plus this will keep your importing clean anyway – you can dump and reload the StackOverflow data repeatedly as long as you keep that data separate.

After importing, the database is about 2gb of data.  Be aware that depending on your database’s recovery model and how you run these stored procs, your log file may be 2gb as well. None of the sentences in this paragraph blend together well, which bothers me but not quite enough to stop publishing the blog entry. Anyway, on we go.

If the table already exists when the stored proc runs, the table contents are deleted using the TRUNCATE TABLE command, which requires hefty permissions.  If you don’t have admin rights on the box, substitute DELETE for the five TRUNCATE TABLE commands.  Using DELETE will take significantly longer to run.  For reference, with TRUNCATE TABLE, the stored procs take around 10 minutes on my faster machines, and around half an hour on my slower virtual machines.

These stored procs only work for the new database dump released on Monday morning, not the one released last week.  If you get invalid XML errors while importing, you’ve got the older database dump.  Go get the fresh hotness.

Now for some schema notes, and I’m going to go out of alphabetical order because everything links back to the Users table.  I’m only going to cover the fields that aren’t immediately obvious:

Users Table

  • Id – primary key, identity field from the original StackOverflow database.  Id 1 is “Community”, which is a special user that denotes community ownership, like wiki questions and answers.
  • LastAccessDate – this is useful because it tells you when the data export was last updated.  If you’re doing queries for things like the last 30 days, check the most recent date here.
  • Age – the user enters this manually, so it’s not terribly reliable as I discovered earlier.
  • AboutMe – I’m using an nvarchar(max) field here, but you can go with a shorter field like nvarchar(2000).
  • UpVotes and DownVotes – the number of votes this user has cast.

Posts Table

In StackOverflow, questions and answers are both considered posts.  If a record has a null ParentId field, then it’s a question.  Otherwise, it’s an answer, and to find the matching question, join the ParentId field up to Posts.Id.

  • Id – primary key, identity field from the original StackOverflow database.
  • Title – the title of the question.  Answer titles will be null.
  • OwnerUserId – joins back to Users.Id.  If OwnerUserId = 1, that’s the community user, meaning it’s a wiki question or answer.
  • AcceptedAnswerId – for questions, this points to the Post.Id of the officially accepted answer.  This isn’t necessarily the highest-voted answer, but the one the questioner accepted.
  • Tags – okay, time to blow out of the bullet points for a second.

StackOverflow limits you to five tags per question (answers aren’t tagged), and all five are stored in this field.  For example, for question 305223, the Tags field is “<offtopic><fun><not-programming-related><jon-skeet>”.  It’s up to you to normalize these.  If you’d like to normalize them out into a child table, check out the usp_ETL_Load_PostsTags stored proc, which creates a PostsTags table with PostId and Tag fields.  Each Posts record (questions only) will then have several child records in PostsTags.

Next, check the contents of the Tag field carefully.  StackOverflow allows periods in the tag, like the .NET tag and ASP.NET tag.  However, in the database, these are stored as “aspûnet”.  Just something to be aware of.

Comments Table

  • Id – primary key, identity field from the original StackOverflow database.
  • PostId – the post parent for this comment.  Joins to the Post.Id field.
  • UserId – who left the comment.  Joins to the User.Id field.

Badges Table

  • Id – an identity field for a primary key.  This number is meaningless – I just added it for some referential integrity.
  • UserId – joins back to Users.Id to show whose badge it is.
  • Name – the name of the Badge, like Teacher or Nice Answer.
  • CreationDate – when the user achieved the badge.

Votes Table

This stores the votes cast on posts, but the key field is VoteTypeId.  The VoteType table wasn’t included in the export, so this table isn’t too useful yet, but if the guys give me the OK I’ll post the contents of that table here.  The Votes table doesn’t include *who* cast the votes, and I’ve got my hands full analyzing the other tables anyway, so I haven’t been interested in the VoteTypes yet.

All of the Id fields except for Badges.Id are from StackOverflow’s original database.  In theory, these numbers will not change, which means if you build your own child table structures like UserBaconPreferences, and you join via User.Id, you should be able to blow away and reload the Users table with every new StackOverflow database dump.  That’s the theory, but in reality, you shouldn’t rely on anybody else’s ID fields, because there’s no reason to believe these won’t completely change down the road.  Who knows – Jeff might switch over to GUIDs as primary keys.

Sample Questions Query

Once you’ve got it all together, you can do some fun stuff. Let’s look at some overall statistics about questions (not answers):

And some of the results are:

  • Questions – 176,137
  • Average Score – 1.89
  • Average View Count – 311
  • Distinct Questioners – 39,795 (meaning anyone who has asked a single question has asked an average of 4.4 questions – there may be some odd stuff in here around anonymous questions though, haven’t looked at that yet)
  • Average Answer Count – 4
  • Average Comment Count – 2.3
  • Closed Questions – 3,656 (or 2% of all questions)
  • Average Questioner Reputation – 1,506
  • Average Questioner Age – 30 (but remember, that’s unreliable)

I’m just getting started playing with it, and I’ll have a fun new StackOverflow statistics toy available for everybody to play with in a couple of days.  In the meantime, you can download the StackOverflow database dump via BitTorrent and download my ETL stored procs.

Update: Sample StackOverflow Queries in the SQLServerPedia Wiki

Jon Skeet had an excellent idea: we need a wiki to store interesting queries.  Wouldn’t you know, I happen to run one!  I added a section in SQLServerPedia for sample StackOverflow database queries.