When you see the cover of Database Reliability Engineering, the first question you’re probably gonna ask is, “Wait – how is this different from database administration?”
“…for a long long time, DBAs were in the business of crafting silos and snowflakes. Their tools were different, their hardware was different, and their languages were different. (…) The days in which this model can prove itself to be effective and sustainable are numbered. This book is a view of reliability engineering as seen through a pair of database engineering glasses.”
The book absolutely delivers: it’s a 250-page version of the concepts in Google’s Site Reliability Engineering book (which I love) targeted at people who might currently call themselves database administrators, but want to go to work in fast-paced, high-scale companies.
How Senior DBAs should read this book
Jump to page 189, the Data Replication section of Chapter 10. Campbell & Majors explain the differences between:
- Single-leader replication – like Microsoft SQL Server’s Always On Availability Groups, where only one server can accept writes for a given database
- No-leader replication – like SQL Server’s peer-to-peer replication, where any node can accept writes
- Multiple-leader replication – like a complex replication topology where only 2-3 nodes can accept writes, but the rest can accept reads
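To make the three categories concrete, here's a rough sketch that labels a topology purely by how many of its nodes accept writes, using the definitions in the list above. The `Node` type, the `classify` function, and the sample topologies are all hypothetical names I made up for illustration; none of this comes from the book or from any database's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A hypothetical database node; only the write flag matters here."""
    name: str
    accepts_writes: bool


def classify(nodes: list[Node]) -> str:
    """Label a topology using the three categories from the list above."""
    writers = sum(1 for node in nodes if node.accepts_writes)
    if writers == 0:
        raise ValueError("no node accepts writes")
    if writers == 1:
        return "single-leader"      # one writable node, the rest read-only
    if writers == len(nodes):
        return "no-leader"          # every node can take writes
    return "multiple-leader"        # a few writers, the rest serve reads


# Illustrative topologies (made-up node names).
availability_group = [Node("primary", True), Node("secondary-1", False), Node("secondary-2", False)]
peer_to_peer = [Node("a", True), Node("b", True), Node("c", True)]
mixed = [Node("w1", True), Node("w2", True), Node("r1", False), Node("r2", False)]

for topology in (availability_group, peer_to_peer, mixed):
    print(classify(topology))
```

The real differences the chapter covers (conflict handling, failover, consistency) are far bigger than a head count of writers. This only shows where each category sits.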
The single-leader replication discussion covers pages 190-202 and does a phenomenal job of explaining the pros & cons of a system like Availability Groups. Those 12 pages don’t teach you how to design, implement, or troubleshoot an AG. However, when you’ve finished those 12 pages, you’ll have a much better understanding of when you should recommend a solution like that, and what kinds of gotchas you should watch out for.
That’s what a Database Reliability Engineer does. They don’t just know how to work with one database – they also know when certain features should be used, when they shouldn’t, and, from a big-picture perspective, how to build automation that works around their weaknesses.
I love those 12 pages as a good example of just how big in scope this 250-page book really is. The authors have very, very deep knowledge – not just database specifics, but how the database interacts with applications and business requirements. They abstract their experience just enough to make it relevant to all data professionals, yet keep the language clear enough that it’s still directly mappable to the technologies you use today.
For example, it doesn’t teach you how to use version control to treat your infrastructure as code. It just tells you that you should, and gives you a few key terms to look for as you start to build that skill.
You’re going to learn new terms and techniques. It’s going to take you years to turn them into a reality in your current organization. That’s okay – it’s about broadening your horizons.
How managers should read this book
Managers, you’re gonna read this and go, “Wow! I want a DBA team that thinks like this!”
Go back, read chapter 2 (Service-Level Management) carefully, and start working on it now with the staff that you have. Start crafting your service level objectives and defining how you’re going to measure them. In my experience, this is the single toughest part of the book, and it relies on the business stakeholders being able to come to a consensus. It’s a political problem, not a technical problem, and as a manager, it’s the part that you have to deliver.
That chapter’s recap includes two lines I adore, emphasis mine:
“The SLOs (Service Level Objectives) create the rules of the game that we are playing. We use the SLOs to decide what risks we can take, what architectural choices to make, and how to design the processes needed to support those architectures.”
Availability and latency are to database reliability engineers as revenue and profits are to salespeople. You wouldn’t dream of telling your sales team, “Ah, just get the best price you can, and we’ll be okay.” You can’t do that with your reliability engineers, either.
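If “defining how you’re going to measure them” sounds abstract, here’s a minimal sketch of what checking a measurement window against an SLO could look like. The targets (99.9% availability, 99th-percentile latency under 100 ms) and every function name are assumptions I picked for illustration, not numbers or code from the book. Your real targets are whatever your stakeholders agree on.

```python
import math


def availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(
    successes: int,
    total: int,
    latencies_ms: list[float],
    *,
    availability_target: float = 0.999,   # assumed target, not from the book
    latency_target_ms: float = 100.0,     # assumed target, not from the book
    latency_pct: float = 99.0,
) -> bool:
    """True only if both the availability and latency objectives are met."""
    available_enough = availability(successes, total) >= availability_target
    fast_enough = percentile(latencies_ms, latency_pct) <= latency_target_ms
    return available_enough and fast_enough


# Made-up sample window: 10,000 requests, 5 failures, latencies of 1..100 ms.
sample_latencies = [float(ms) for ms in range(1, 101)]
print(meets_slo(9995, 10000, sample_latencies))
```

The code is the easy part. Getting everyone to agree on those three default values is the hard part.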
How developers & sysadmins should read this book
If you’re coming into database administration for the first time, some of the concepts are going to be familiar to you (release management, SLOs, monitoring, not treating human error as the root cause).
Chapters 10-12 will seem terrifying.
In those chapters, you’ll learn a lot of very big concepts (ACID, the CAP theorem, caching, message systems). When you read those, your eyes may get large, and your ego may get small. Don’t freak out: just by reading these chapters, you’re already ahead of what most database administrators know about those topics.
See, most of us DBAs resemble the way Campbell & Majors describe the starts of their careers at the beginning of the book: accidental DBAs. We didn’t go to school for this, and most of us don’t have computer science backgrounds. Reading chapters 10-12, you’ll think you’re getting a crash course on something that everybody else already knows well. Good news – we don’t know it well either. (That’s also part of why I told DBAs to start with pages 190-202.)
And yes, I do recommend this book.
It’s the kind of book that’s easy to read, and hard to implement. Seriously, just agreeing on the SLOs described in chapter 2, and then getting them monitored, takes most traditional companies months.
Over time, the brand names and open source tools will change, but the concepts are going to be rock solid for at least a decade. This book is a great waypoint marker set about 5-10 years in the future for most of us, but it’ll be one you’ll be excited to work towards.
You can get Database Reliability Engineering on Kindle, in paperback, or on O’Reilly Safari. If you like this kind of thing, you should also pick up the Site Reliability Engineering book – it’s fantastic.