When people buy my virtualization training video, one of the followup questions I get most often via email is, “Can I build SQL Server clusters in VMware and Hyper-V?”
In theory, yes. Microsoft’s knowledge base article on SQL Server virtualization support says they’ll support you as long as you’re using configurations listed in the Server Virtualization Validation Program (SVVP).
But the real question for me isn’t whether or not Microsoft supports virtual SQL Server clusters.
The question is about whether you can support it.
Us geeks usually implement clusters because the business wants higher availability. Higher availability means faster troubleshooting when the system is down. We need to be able to get the system back up and running as quickly as possible. Getting there usually means reducing the amount of complexity; complex systems take longer to troubleshoot.
Adding virtualization (which also means shared storage) makes things much tougher to troubleshoot. If the business wants a highly available SQL Server, ask yourself these questions before virtualizing a SQL Server cluster:
- Do you have a great relationship between the SQL Server, storage, and network teams?
- Do all of the teams have read-only access to each others’ tools to speed up troubleshooting?
- Do all of the teams have access to the on-call list for all other teams, and feel comfortable calling them?
- Do you have a well-practiced, well-documented troubleshooting checklist for SQL Server outages?
- Does your company have a good change control process to avoid surprises?
- Do you have an identical environment to test configuration changes and patches before going live?
If the answer to any of those questions is no, consider honing your processes before adding complexity.
But the Business is Making Me Do It!
They’re making you do it because you haven’t clearly laid out your concerns about the business risk. Show the business managers this same list of questions. Talk to them about what each answer means for the business. Was there a recent outage with a lot of finger-pointing between teams? Bring that up, and remind the business about how painful that troubleshooting session was. Things will only get worse under virtualization.
To really drive the point home, I like whiteboarding out the troubleshooting process for a physical cluster versus a virtual cluster. Show all of the parts involved in the infrastructure, and designate which teams own which parts. Every additional team involved means longer troubleshooting time.
Once the business signs off on that increased risk, then everyone’s on the same page. They’re comfortable with the additional risk you’re taking, and you’re comfortable that you’re not to blame when things go wrong. And when they do go wrong – and they will – do a post-mortem meeting explaining the outage and the time spent on troubleshooting. If the finger-pointing between the app team, SQL Server DBAs, network admins, virtualization admins, and sysadmins was a problem, document it and share it (in a friendly way) with management. They might change their mind when it’s time to deploy the next SQL Server cluster.
Whether you want help choosing between an active/passive and an active/active cluster, or if you’re the kind of DBA who knows that’s not even the right name for failover clustered instances anymore, check out our SQL Server clustering training page.
Scaling up is hard: big hardware gets expensive fast. Scaling out is equally difficult; interesting design problems creep in to scale out solutions. One of the more troublesome issues architects face when scaling out is the issue of distributing identity. It’s often advantageous for object identity to be unique and portable across all database instances in an application – we may need to move a user from server A to server B. Once multiple database servers are involved, a centralized sequence generator can be come a single bottleneck or may even run out of values under heavy load. A solution is needed to generate unique values outside of the database while avoiding the performance problems of random, or semi-random, GUIDs.
Sequential Beginnings: Identity and IDENTITY
In the beginning, there was an empty schema. At this point an architect returned from the coffee pot and made a decision about database identities or GUIDs. There many valid reasons to make a decision in either direction – identities or sequences are controlled in the database and can be used to ensure a physical order to data. Who knew that spinning hard disks work better when data is sequential? (By the way, sequential GUIDs may not be sequential.)
You can make a lot of arguments for the right or wrong way to do things from a logical perspective, but DBAs do have a pretty good point when they say that randomness can cause problems for database performance. Generating sequential identifiers in the database may not be the most elegant solution to identity problems, but it does ensure that data is written in an order that makes sense in a world of spinning disk drives.
Database controlled sequential identity has one problem: sequential identities will need to be generated on a single server. Under sufficient load, that server will become a bottleneck for application scalability. To move past a single server as a bottleneck, a more robust and load tolerant solution is needed.
Distributing Identity: The Case for GUIDs
Architects and developers may be thinking that identities are great, but what happens when everything gets further apart?
As applications grow (or even by design), it becomes common to see teams become worried about distributing workload across many servers, using queues, or even splitting the data out into multiple databases or database servers. It’s at this point in the discussion that things get heated and people start throwing around the idea that GUIDs are the only way to solve this problem. Or that you just can’t rely on identity from the database and application generated identity is the only identity that matters.
This is where the war about numbers vs GUIDs gets nasty and someone gets their feelings hurt. Ignoring the size of GUIDs, I can say that I’ve witnessed several GUID collisions in production systems. GUIDs are only theoretically unique – they may even create a problem that you didn’t know you had.
A Better Solution for Distributed Identity
Combining distributed identity and ordered data seems like it’s a hard problem. Random GUIDs can’t be guaranteed to be unique, sequential GUIDs can’t be guaranteed to be non-overlapping, and database generated identities require a persistent connection to a database (or else they require a looser idea of identity than some folks are comfortable with).
Moving away from the database as the central keeper of all knowledge and identity is difficult and many teams seem to equate moving identity out of the database with moving all logic and functionality out of the database. This doesn’t have to be the case. The database can still be used to provide a considerable amount of declarative functionality and logic but identity generation can be moved outside of the database.
Twitter and Boundary have solved the problems of distributed sequencing by moving the work away from the data tier. Both solutions solve the problem by treating a number as if it were an array of information. The first portion of a sufficiently large number is a timestamp; the timestamp is stored as the number of milliseconds since a previous point in time. The next number is a worker identifier – this can be anything that uniquely identifies the device generating the sequence. Finally there’s a sequence itself. The sequence is typically small (between 8 and 16 bits) and it starts counting again from 0 every time the millisecond counter changes.
The machine identifier, usually the MAC address, doesn’t matter specifically, but we do need to be able to reliably generate separate sequences of IDs. This doesn’t have to be a MAC address, it could be any number of bytes that identify a unique source of sequences. By storing milliseconds since epoch as the first portion of the key, we’re able to produce a sequence that’s mostly ordered with some random jitter at the intermediate levels. On the whole, though, our inserts will be ordered.
If you’re on the .NET side of the house, I solved this problem with a library called Rustflakes. The implementation is lifted wholesale from Boundary’s flake and the generated sequence values are a .NET
decimal which lines up with SQL Server’s
DECIMAL data type – it’s nothing more than an ordered 128-bit number. Which, coincidentally, is the same size as a GUID.
Wrapping it Up
There’s no easy solution to this problem. GUIDs are an easier approach, but they introduce additional load and maintenance overhead on the data storage mechanism. Distributed sequences don’t solve all of the problems of GUIDs, but they provide additional flexibility for database administrators, developers, and application architects alike.
I recently finished helping the StackExchange team migrate their SQL Server 2008 infrastructure to SQL Server 2012. These servers power StackOverflow, DBA.StackExchange.com, the new AskPatents partnership with the US Patent and Trademark Office, and hundreds of other Q&A sites that together form one of the biggest web site networks. It’s one of the most visible success stories for the Microsoft web stack.
How StackExchange Used SQL Server 2008
Before we talk about their new infrastructure, let’s look at their business needs. I’ve blogged about how we designed the StackOverflow recovery strategy back in 2009, but things have changed a little since then.
We store two types of data, and let’s start with the one you already know – the questions and answers. The public-facing web site data (questions, answers, comments, users, reputation, etc) is valuable, but it’s not priceless. From a business perspective, we could afford to lose a few minutes of this data in the event of a disaster, and we could take a few hours of downtime. (SQL Server can indeed do zero-data-loss solutions, but they’re not exactly cheap, so I try to avoid those where I can.)
The second type of data is growing more and more valuable: Careers.StackOverflow.com, for example. Careers merges your online reputation (like the number of answers you make on StackOverflow) in with your traditional resume. Employers can find people who don’t just say they’re good at .NET – they actually are, and they’ve proven themselves by getting high votes from their peers. Employers pay to post tech jobs and reach the best candidates. There’s real money involved with this kind of data, so a few hours of downtime is less acceptable.
Here’s how StackExchange used SQL Server 2008 for data protection, simplified a little for illustration:
- NY-SQL1 – primary. All web site connections go here.
- NY-SQL2 – hot standby using SQL Server asynchronous database mirroring. When NY-SQL1’s not doing transactions or playing sudoku, it sends the transaction log data over the network wire to NY-SQL2, which then applies the same changes. If NY-SQL1 had a serious hardware failure, we could manually go live with NY-SQL2 with a little work. Note that the storage doesn’t have to be the same – although that can introduce some challenges. Unfortunately, this server isn’t accessible to end user queries – its only purpose is madly scribbling down the live transactions.
- File Share – I’m a big believer in doing your full and transaction log backups to a different server, directly over the network. If you back up locally and then copy to a network, you’ve got more things you have to manage, and more things that can go wrong.
- NY-SQLREPORTS1 – once the databases are backed up to a file share, other SQL Servers can restore the backups for development or reporting purposes. You can also run DBCC checks against these databases.
- OR-SQL1 – offsite server with a delayed, somewhat out-of-date copy of the NY-SQL1 databases. If we lost the entire NYC datacenter, we could go live offsite, but we’d lose the changes since the last nightly backup.
That infrastructure diagram is simplified, and there’s some other business needs I’m not covering here. In reality, we had multiple sets of these servers, and the automatic backup/restore strategy was more complicated. I explained one version of this in my post on how to really compress SQL Server backups.
This setup worked well, but it entailed a lot of moving parts: database mirroring, log shipping, manual failovers involving a lot of scripting, and some painful management. I was really excited to simplify the infrastructure with SQL Server 2012.
Why SQL Server 2012 Made Sense for StackExchange
Microsoft brought its A-game to SQL 2012 when they introduced AlwaysOn Availability Groups. Now, along with the production SQL Server, we can have up to four more replicas in any number of datacenters. The replicas stay within a few seconds of the live server, yet people can query the replicas without blocking production load. This is awesomely useful for things like API.StackExchange.com and the Data Explorer (which lets anyone query any Stack database live).
Here’s a simplified version of the new SQL Server 2012 infrastructure:
Now we’ve got three separate pieces of hardware, and their purposes are a lot simpler:
- NY-SQL1 – the primary StackOverflow database.
- NY-SQL2 – the secondary StackOverflow database. NY-SQL1 sends transaction log data directly to this SQL Server, which then applies the updates. Other users (like API and Data Explorer) can query this server, and we could run backups here (but we don’t, for reasons I’ll explain later).
- OR-SQL1 – the offsite copy. NY-SQL1 also directly sends transaction log data here – it doesn’t flow through multiple replicas. Other users can query this server too, so we can make use of both datacenters’ internet connections if we wanted to. We run offsite backups here, so we’ve got redundant backups in both NYC and OR.
A little bit more about the backups – while SQL Server 2012 can indeed offload backups to replicas, the Stack team chose to continue doing their backups on the primary. Our logic was that if replication broke, we wanted to continue to get good backups – something you can’t guarantee if the backups are happening on a secondary. Therefore, we do full and transaction log backups on the primary (NY-SQL1) plus full backups on OR-SQL1 to have local offsite copies.
This diagram has a cluster, but no shared storage. This is one of the most attractive features for StackExchange because they wanted to avoid buying complex, expensive shared storage for this project. They’re speed freaks, and it’s tough to beat local SSDs in RAID 10 for that.
Finally, this diagram is really, really oversimplified. In reality, we’ve got:
- Multiple clusters – one for the highest load sites, and one for everything else
- Multiple availability groups – because we want to be able to fail over databases in groups.
- Multiple live sites – for example, NY-SQL1 doesn’t have to be the primary server for all databases in the cluster. Some sites have their primary home in Oregon rather than NYC.
- Multiple networks, differently powered servers, and more
The new infrastructure solved StackExchange’s business problems, but it introduced a lot of technical challenges.
Our Challenges with Availability Groups
We found a new clustering bug in Windows 2008R2. One night, all four of the SQL Server instances hosting secondary replicas crashed at the exact same time – across two separate clusters. This clearly was bigger than a SQL Server issue since it was happening in separate clusters, so we opened up a Microsoft support case immediately. Thankfully, it didn’t affect the primaries, so we were able to keep StackOverflow live on SQL Server 2012. After weeks of troubleshooting (and multiple crashes per week in the meantime), Microsoft narrowed it down to a race condition in Windows Server’s clustering code and gave us a hotfix. (It hasn’t gone public yet, but I’ll blog about that when it does.) UPDATE Dec 2012 – http://support.microsoft.com/kb/2777201
We ran into networking issues that we still don’t understand. While going live with the second set of clusters, we were stuck for hours with known problems with Windows caching ARP addresses. Once we got past that, we ran into the same symptoms described in this Exchange 2010 cluster post. SQL Server 2012 relies on Windows clustering in much the same way as Exchange does, so we tried the Exchange fix – changing an IP address temporarily. It worked.
We ran into worker thread issues in SQL Server. In the middle of the night (doesn’t that always happen?), one of our primary servers ran out of worker threads and simply stopped sending data to the other replicas. When more threads freed up again, it didn’t restart sending data. Microsoft has blogged about the number of worker threads required for AlwaysOn AGs, and we were well above that number. We manually raised the number of worker threads anyway, and SQL Server still didn’t start replicating again. There isn’t a command to restart Availability Group communication in cases like this – ALTER DATABASE XXX SET HADR RESUME doesn’t work because it’s not at the database level. To fix it without restarting the replicas, Nick came up with the idea of just changing a replica setting (like from read-only-intent to allow-all-reads) and that started communication flowing again.
Through all of this, the troubleshooting tools failed us. During our worker thread issues, the AlwaysOn dashboard reported that everything was A-OK – even when it had stopped replicating data. Even worse, the log_send_queue_size field in the DMV sys.dm_hadr_database_replica_states stopped updating itself once we ran out of worker threads. It looked like only 10MB of data was queued up – but we were actually behind by almost 10GB. Thankfully we caught it manually within a few hours, but this could have been a catastrophic problem. Using AlwaysOn Availability Groups is like flying a plane: you can’t just rely on one metric on one gauge and think everything’s okay. You need multiple copies of every gauge, and you have to watch ‘em all.
Microsoft’s support failed us for a while too. I don’t envy Microsoft support at all; it’s really tough supporting a product as big as SQL Server. New features make things even tougher because it’s hard to anticipate what data will be needed to solve a support case. However, I was pretty frustrated when support repeatedly asked questions like, “How is the cluster configured?” SQL Server has had complex integration with Windows Failover Clustering for years, and support needs a better way to get the data they need. Problems with AlwaysOn Availability Groups will require a lot of background information to get a good fix, and right now, the answer is spending a lot of time on the phone trying to explain things. That led to a bad experience for the Stack guys.
I had to fact-check Microsoft support’s answers. During the investigations, I monitored communications with support and interjected where necessary. For example, one support engineer told StackExchange to enable full memory dumps – without any hint of a problem that might cause on a box with 384GB of memory. I had to explain to the support engineer why he shouldn’t be making that recommendation without also explaining the drawbacks. This kind of thing isn’t a new problem with SQL Server 2012, and customers shouldn’t have to get a Microsoft Certified Master involved just to filter answers from Microsoft support.
I don’t think this stuff will get better with SP1. For years, I’ve heard people say, “I’m not installing the first version of Windows/SQL/Exchange/AngryBirds. I’m going to wait until Service Pack 1 comes out.” SP1 usually brings a slew of fixes, and indeed, there’s already been three cumulative updates. Waiting for SP1 also gets you the benefit of better community resources – I’ve already Googled for new error numbers and come up empty, which means I’ve got some blogging to do. But even with bug fixes and better resources, I don’t think AlwaysOn Availability Groups is going to get any easier to design, configure, and troubleshoot. It’s a very complex feature built atop a lot of complex features.
The Bottom Line: It Worked, but It’s Not the Easy Button
Microsoft SQL Server 2012’s new AlwaysOn Availability Groups solve real business problems at StackExchange, allowing us to aim for higher availability with shorter failover times, but it introduced some really tough challenges.
The StackExchange team consists of the smartest sysadmins I’ve ever had the pleasure of working with (and they’re hiring!) and they don’t have a full time SQL Server DBA. I wouldn’t recommend SQL Server clusters without a full time database administrator, though, and ideally a full time team of DBAs. StackExchange’s implementation wasn’t really that complex, and yet we still spent days and days working with Microsoft support to get the bugs ironed out. Along the way, we ran into Windows and SQL Server bugs – always a risk with a new feature – and it requires a highly skilled team to get to the bottom of these complex problems fast. Thankfully StackExchange has that team, but when faced with similarly complex challenges, I’m not sure how other companies are going to react.
Learn more about the challenges of AlwaysOn Availability Groups at BrentOzar.com/go/alwayson. If you’d like help navigating the obstacles in your own AlwaysOn infrastructure, contact me.
Strong development practices don’t spring up overnight; they take time, effort, and teamwork. Database development practices are doubly hard because they involve many moving pieces – unit testing, integration testing, and deploying changes that could have potential side effects beyond changing logic. In this session, Microsoft SQL Server MVP Jeremiah Peschka will discuss ways users can move toward a healthy cycle of database development using version control, automated testing, and rapid deployment.
Tools Mentioned in the Webcast
- SQL Server Alerts – Having basic SQL Server alerts in place can give you warnings of possible hardware failure.
- tSQLt – A database unit testing framework for SQL Server.
- Fixed database roles and user-defined database roles help define granular security.
- Many libraries exist for performing database migrations – they’ll typically be bundled with an ORM. Even Entity Framework supports database migrations!
South Korean musician Psy’s hilarious music video for Oppa Gangnam Style has been tearing up the charts:
Sure, it’s catchy, but it turns out there’s a message behind it. Gangnam is an absurdly wealthy area of South Korea, like America’s Beverly Hills meets The Hamptons meets Tokyo. Just like anywhere else in the world, the people who have inspire the people who don’t – but only inspired to spend like they have it, not inspired to actually make it. As a result, there’s a running joke in South Korea about women who spend $2 on ramen for lunch and then spend $6 on coffee.
Psy’s actually from Gangnam, and his video is a series of cut-together scenes that at first look glamorous – but then turn out to be pretty sub-par. For example, it opens with him sunning in the sand, but as the camera zooms out, you can see he’s really at a children’s playground, looking like an idiot. He’s really making a social comment, and you can read more in The Atlantic’s piece on Oppa Gangnam Style.
In America, it’d be like a gangster rapper poking fun at guys who claim to be tough, wealthy pimps, but who drive their $100,000 car home to their $50,000 house and proceed to spend the weekend doing chores while their wives relax on the couch. (I love it already.)
SQL Gangnam Style
Everybody wants 80 cores and 1 terabyte of memory on their database servers – and yeah, we work on those. It’s fun to open Task Manager and look at all the pretty pixels. I’ll save you the math: at sticker price, that’s over half a million dollars just for the SQL Server licensing alone. (Thankfully, in these circles, nobody pays sticker price.)
Notice how those CPUs are just sitting there idle, not doing anything? Hmm.
SQL Server 2012’s new core-based licensing means we have to stop flushing our money down the drain on accident. Our SQL Server needs a balanced diet: we can’t afford to spend $6 on coffee and skip lunch.
SQL Server’s version of a balanced diet is a balance between:
- How fast the SQL Server’s CPUs can consume data
- How fast the storage can deliver the data
- How fast the pipes are between the CPUs and the storage
And here’s how you figure out if your SQL Server is Gangnam Style:
- How to Prove It’s a SAN Problem Video – I explain how to test your storage throughput.
- Building the Fastest SQL Servers Video – How to use Microsoft’s Fast Track Reference Architecture guidelines.
- My Storage Best Practices – a collection of posts and references.
It’s fun to dance around doing the pony trot while yelling that you’re Gangnam Style, but don’t do SQL Gangnam Style. Spread your budget around to get the right combination of CPU, storage, and throughput in order to make your app fly.
SQL Server 2012 was the first version of SQL Server to support being installed on Windows Server Core Edition (Windows without a GUI to the rest of the world). With the impending release of Windows Server 2012 there are a lot of people who are excited about running SQL Server 2012 on Windows Server 2012 in their data center. Let’s look at what you’ll need to do to become fluent and productive with Windows Server Core.
You are sitting in front of a terminal. There is a keyboard here.
Connecting to Windows Server Core is a lot like firing up a text adventure game from the early 1980s. You’re greeted with a blinking cursor and not much else. Ready to get started? I am, too. Our first step is to change the default shell from
powershell.exe. You can only do make this change through the registry. Just fire up
regedit.exe and LOL your way to pain.
Or, you can start PowerShell (type
powershell.exe and press enter) and then make the registry change there. I found instructions through some random blog, to save you clicking, though, here’s the code:
$RegPath = "Microsoft.PowerShell.CoreRegistry::HKEY_LOCAL_MACHINESOFTWAREMicrosoftWindows NTCurrentVersionwinlogon" Set-ItemProperty -Confirm -Path $RegPath -Name Shell -Value 'PowerShell.exe -noExit -Command "$psversiontable'
That looks awesome, right? Now that you’ve made the change you’ll need to restart the server to see if it took effect. Just click on the Start icon and… oh yeah. Well, thankfully we can just use PowerShell to get this info through
Get-Help reboot. PowerShell may have to go out to the web and download and index a help collection. Your servers can get out to the web, right? While you’re finding a network administrator to give you permission to download PowerShell help files, I’m going to sit here and play Zork.
After the Reboot
At this point, you probably think I’m behaving like a jerk. You’re right. But I’m doing it to prove a point – as much as I am excited by the idea of Windows Server Core, I’m completely terrified of it and you should be, too.
How comfortable are you solving problems without RDPing into a server? If you really want to test yourself, bribe that network administrator with an ice cream sandwich and have them block your use of port 3389 for a week. We’ll see how you feel after that.
If you’re still excited about the prospect of a world without RDP, take stock of everything that you do on a daily basis. Evaluate how comfortable you are with automating every task via either T-SQL, PowerShell, custom .NET code, or a combination of all three. This isn’t all that difficult if you’re a DBA. If you’re an admin of a different sort, this might start to get fun.
After the Double Reboot
The biggest hurdle is troubleshooting – it’s going to be difficult for you to troubleshoot hot production problems without a GUI. Sometimes you really want to remote into that server, admit it. Here’s what I want you to do: every time you solve a production issue, write down every step you take. Make sure you include all the poking around and false starts at solving the problem. Yes, I even mean staring at Task Manager and yelling. Do this for the next month.
After you’ve resolved each issue I want you to script out every step you performed through T-SQL and PowerShell. Some of those remote tasks aren’t looking so pretty, are they? Maybe there is additional logging you can enable to make the diagnosis easier. Perhaps there’s monitoring you can put in place. But right about now I’d be willing to bet a sack of chalupas that you’re getting nervous about a move to nothing but Server Core.
While you’re busy being nervous, I want you to stand up and walk over to everyone else who needs to access a server. Ask them to do the exact same thing that you’re doing – every big problem gets every step written down and then scripted out after the fact. DBAs aren’t the only people who need to access a SQL Server, so make sure that all of the other administrators are on board and are comfortable with performing their regular tasks through a combination of PowerShell and custom .NET development.
After a Month of Scripting…
I’ll be blunt – I’m excited about Windows Server 2012 No More GUI Edition, but I’m also nervous as hell about the first time I run into this scenario on a production system. You should be, too.
Windows Server Core makes an incredible amount of sense for professionals who are administering hundreds of servers. The people doing this right now already have tools in place that make their job easier, or they’ve solved the problem by hiring an army of off-shore administrators. Windows Server Core makes a lot of sense at a point when you have sophisticated monitoring and you are in a place where you can treat your infrastructure as an array of sensor covered parts – you need to know that something is about to fail before it actually fails so you can take action long before you end up putting out a fire.
This isn’t for everyone. If you’re managing less than 10 servers, Windows Server Core isn’t for you. If you’re managing a volatile number of servers, elastic demand, or a virtualized army of siloed application servers I bet you’ve already figured out how to solve your administration woes. If not, take a look into Windows Server Core. If you’re managing a critical piece of infrastructure, think long and hard about what you’re doing and take my advice – script everything you’re doing for a month and then see how you feel.
Going into Amazon Web Services (AWS) architecture training last week, I expected to get whiteboard-level training for project planners, and it met expectations. We spent time talking about how Amazon’s most popular services fit together and how they can be used to solve our business needs. Nothing in the training itself was groundbreaking for me – I keep up with the AWS blog enough to know what’s going on – and I think I actually learned more about the state of consulting than I did about the cloud. Here’s the highlights:
Most attendees were consultants trying to get their heads into the cloud. They weren’t hands-on implementers – they just wanted to be able to point their customers to the right services. Their heads were swimming with all kinds of crazy (no, really) ways to use cloud services to solve client pain points, like connecting Amazon-based storage to on-premise servers over Fiber Channel over Ethernet. (Jeremiah’s retort made me laugh out loud in the middle of training: “HELLOOOOOOOOOOOOOOOOO IIIIIIIIIIIIIII AMMMMMMMMMMM LATENCYYYYYYYYYYYYYYYYYYYYYYYYYY”)
The rest were implementers tasked with moving their company’s operations into the cloud. They had already built successful technology companies using on-premise hardware, and now they wanted to migrate out of the expensive server-and-space problems. They wanted to know the best practices on how to make this transition.
DevOps is going to be the story of 2013. In the cloud, you use scripts to monitor load and adjust your infrastructure automatically. Put another way, your code makes continuous decisions about how much your monthly bill will be. The smartest coders will use a combination of on-demand instances, reserved instances, and the new black market. This blew away all of the attendees, and they asked a lot of questions like, “Who’s responsible for this – development or admins?” It’s neither – welcome to the world of DevOps. I was pretty surprised that even Amazon’s trainers didn’t bring up this concept, because it’s fundamental now. To learn about DevOps, check out What Is DevOps?, then Dmitriy Samovskiy’s excellent primer from way back in 2010 (he’s @Somic), then Mike Loukide’s historical perspective.
Cloud service sprawl is going to make SQL Server sprawl look like nothin’. Amazon’s architecture best practices actively encourage people to think of technology as a bunch of loosely coupled black boxes. Sure, that works today to build a very scalable infrastructure quickly, but think about how it’ll look 5-10 years from now. If you’re a CIO and you get a big bill for thousands of instances of hundreds of services, how do you even begin to identify where to cut costs? Be careful what you ask for – I’d be terrified to inherit a bunch of uninstrumented black boxes.
It’s really hard to train people on cloud services. Cloud services change everything: storage, networking, app/database interplay, caching, replication, you name it. To get the most value out of cloud architecture training, you have to come armed with a really solid understanding of networking, hardware, server virtualization, service-oriented architecture, and more. During the sessions, if any one attendee wasn’t familiar with any one term, we had to go off on a 5-15 minute tangent. That was pretty frustrating for those of us who came prepared.
Everybody’s struggling to keep their skills up. Technical services are changing faster than ever, and Amazon’s own trainers often had to stop and ask each other because things had changed since the training material was last updated. (Example: “Wait – the prices on this slide are wrong – they’ve dropped.”)
Developers who know the cloud can run rings around everybody who doesn’t. The most junior programmer in the room with some background in Amazon Web Services was far, far, far more capable of making business-changing improvements than the most savvy (but hands-off) consultant. One guy actually said, “I wish there was some kind of drag-and-drop Visio-style tool to just link these services together to build stuff.” Bad news: GUI tools can’t keep pace with the rapid changes in cloud services, and the guy who can code the cloud wins.
Thinking About Attending the AWS Architecture Training?
I’d recommend that you spend a few days working with the online services first. Consider doing the following tasks:
- Set up a Windows virtual machine in EC2
- Assign it a publicly accessible Elastic IP address
- Configure security so that you can remote desktop to it
- Add an EBS volume and hook it up to your Windows server
- Create an S3 volume, upload some files to it, and download them via a web browser
Those simple infrastructure experiments will get you familiar with the basics of configuring the most common Amazon Web Services capabilities. Many of the other services build atop EC2, EBS, and S3, so this ground knowledge will help you get the most out of the architecture training.
Next, read the most recent month or two of posts from the Amazon Web Services blog. When you don’t understand a term, spend a few minutes Googling it. When you feel comfortable digesting posts like Amazon EC2 Reserved Instance Marketplace and AWS Cost Allocation for Customer Bills, good news! You’re not just ready for AWS training – you’ve actually given it to yourself for free.
I’m not quite sure I can recommend the existing AWS architecture training that I took – it was a rough mix of out-of-order content and basic-level questions from attendees. I think the training would be much more valuable if you attend in a very high-tech area like Silicon Valley or Amazon’s home base in Seattle. There, you might meet more hands-on implementers and have more valuable real-life experiences to share. (For example, in Dallas, I was the only attendee who could answer questions about how to help clients choose between on-demand and reserved instances.) I’ve discussed my experience with the Amazon training team, and they were already making changes to reorganize the content and raise the bar on testing attendees to make sure they’re qualified for the course.
The Biggest Thing I Learned
Even when you don’t learn much from a class’s training material, that teaches you something: you know more than you thought you knew.
I wasn’t all that upset that the material wasn’t new to me. This just means I need to step up my blogging game here – we’ve gotta do a better job of sharing the things we’ve learned about cloud infrastructure. People find it valuable – after all, they’re paying to attend the Amazon classes – and I heard a lot of good questions that we can address here in the blog.
We’ll be sharing more of our success (and failure) stories in migrating database applications to the cloud including our very own site. Yep, BrentOzar.com is finally moving from a single virtual server to a highly available and elastic infrastructure using Amazon RDS, EC2, and load balancing. We’re aiming to go live next month before the conference season starts, and you’ll know when we do – we’re bringing an amazing new look to the site, too.
I get a lot of email, and I’m zealous about staying at Inbox Zero using the Getting Things Done productivity techniques.
One of my favorite ways to handle email fast is by using templates. Email programs like Outlook let you set up multiple signatures, and I use those to respond fast to common types of emails. When I get a job posting from a recruiter, I just click Insert Signature, Recruiters, and in pops this:
Thanks for the email about the position! If you can forward on the full job description and salary range, I’ll be happy to pass it along to my peers, too. I know a lot of people who are kinda-sorta-looking, but not actively in the job market.
After being repeatedly burned (hey, buddy, check out this job – oh, sorry, I didn’t know it only paid $X for Y years of experience) I have a policy against forwarding jobs without a salary range and a full job description. I hate that I have to say this, but “depends on experience” isn’t clear enough. Thanks for your understanding.
Have a great day!
The recruiter is completely clear on what they need to do next, and there’s no hard feelings. Most of the time, believe it or not, the recruiter really does reply back with a rate – and of course, the rate is a joke. (After all, one of the reasons companies hire a recruiter is that they’ve run out of contacts who are willing to work under their conditions.)
When I get a technology question, I use a similar approach. No matter how straightforward the question is, I want to teach people how to fish rather than give them a single fish. In most cases, I use this signature:
This is a really interesting question. I wish I had the time to dedicate to give you an answer that’s just as interesting, but there’s a lot of possibilities here, and I want to make sure you get the best help possible from as many people as possible.
The best way to get help on something like this is to narrow down the question as tightly as possible, then post it on http://StackOverflow.com for programming, http://ServerFault.com for systems administration, or http://DBA.StackExchange.com or http://SQLServerCentral.com for database administrators.
Include as much example code as you can so that others can quickly run it and get an idea of what you’re asking, and they might even post your code back with fixes. If you don’t get good answers within a day or two, let me know and I’ll take a look at the question to see if there’s a way we can get better answers together. Plus, bonus points: the next time someone Googles for this question, they’ll see the answer too.
This one’s a little tougher sell. Sometimes people are excited because they didn’t know about these killer free Q&A resources, and they’re thankful to learn. Other times, I gotta be honest – people are a little pissed off that I would respond so quickly with what is clearly a canned template.
I don’t use that Q&A signature in all cases, though. If it’s a paying client, I answer the question. (Duh.) Also, if I know someone’s asking a very specialized question and there’s only a few people in the world who know the answer, I’ll reply back with an introduction to one of those individuals so they can get the right answer fast. If someone’s asking a very personalized architecture question, and they include details that probably aren’t suited for public posting, then I’ll explain my options for consulting engagements. My public help like blog posts, presentations, and Q&A sites are free – I like helping as many people as I can, all at once. Helping people individually in private, though – that doesn’t really scale, and I have to charge for my time for those kinds of questions.
Using these templates helps me spend more time helping more people, and helps people get better answers and better jobs, too.
As a mid-level DBA, you know that SQL Server relies on the log file to record transactions and modifications in the database, ensuring integrity until data is written to the data file. Internally, one physical log file is comprised of many virtual log files. If you’ve ever been curious about how the log file works, this session will help you understand that. I’ll cover how transactions are written to the log, how virtual log files are used, and the optimal settings for databases of various sizes.
People bring me in when they’re having data problems. They can’t store data fast enough, they can’t make it reliable enough, they can’t hire people to manage it, etc. When I’m in the conference room, it’s because there’s a fire in the
You know how it is as a DBA, though – DBA means Default Blame Acceptor. Everybody thinks the fire started in the database, but often it’s the SAN, VMware, crappy code, bad third party apps, or any number of combustible materials. The company gets more and more concerned about the growing smoke, and they ask the DBAs, “Who’s a SQL Server expert you can call to put this fire out?” The DBA thinks about my crazy blog posts and blurts my name out – mostly because he wants to find out if I’ll show up in the Richard Simmons costume. (That costs extra.)
Now put yourself in my shoes: I show up in a conference room or on a WebEx, and there’s a huge problem somewhere in the infrastructure. Everybody involved is pointing fingers at each other, and they’re all armed with volumes of reports proving that it’s not their problem. In a matter of 3-4 days, I need to:
- Find the real root cause of the problem
- Prove it to everyone involved using their own language
- Show a few possible solutions and recommend the right one
- Convince them to implement it as quickly as possible
SQL Server isn’t the only fire extinguisher, and I have to know how to put out data fires with other tools. Amazon’s got a ridiculous list of services that are easy to get started with, including:
- Relational Database Service – point, click, and deploy Microsoft SQL Server, Oracle, and MySQL instances. Amazon manages the backups, patching, and security. The MySQL ones even support readable replicas and replication to multiple datacenters.
- DynamoDB – super-fast NoSQL database hosted on SSDs. You pick how fast you want it to go, and Amazon makes it happen.
- Glacier – store your backups in the cloud for $.01 per gigabyte per month with no cost for incoming data.
- Import/Export – ship them a USB drive, and they’ll hook it up to the cloud. For folks with slow upload links, this is the fastest way to move your data online.
That’s why I’m in Dallas, Texas for a few days attending Amazon Web Services Architect Training. It’s a three-day design session that covers how to design solutions with their services. It’s not going to make me a Certified Master of Cloudiness across their broad range of tools, but that’s not the point. Clients don’t usually want me to do the work myself: they want me to find the right answer fast, get the staff on the right page, and let the staff knock out the work together.
If you’re a data professional, and you’re frustrated when people keep saying it’s a database problem when it’s not, what are you doing to bridge the gap? Are you frustrated that The Other Guy doesn’t know anything about SQL Server? Or are you reaching out to learn The Other Guy’s technology to help him to see where the smoke is coming from?