Satya Slips a SQL Server Standard Secret?

Everybody wants to know the next SQL Server release date, pricing, and feature lists, but anybody who knows those facts is bound by non-disclosure agreement and can’t say it publicly. Every now and then, though, we get lucky and someone from Microsoft slips up. That’s why I pay particular attention to Microsoft interviews as we approach release dates.

Microsoft’s perhaps-next-CEO new CEO Satya Nadella sat down for an interview with Gigaom. It included this interesting line:

…everybody who buys a SQL server, in fact in this release, gets an always-on secondary in Azure.

Satya Nadella

Satya Nadella

Well isn’t that special? Satya would be the guy to know, too – he used to run the Server and Tools group (which includes SQL Server), and he’s since been promoted to Cloud and Enterprise (which includes both SQL Server and Azure).

Will SQL Server 2014 Standard Include AlwaysOn AGs?

Right now, SQL 2012 only offers AlwaysOn Availability Groups in the expensive Enterprise Edition. 2012 Standard Edition’s closest HA/DR feature in Standard Edition is database mirroring, but that’s listed on the Walking Dead Deprecated Features List.

But what if Standard Edition includes a crippled AlwaysOn AG? Currently, SQL Server 2012 Standard Edition includes a crippled version of database mirroring that only offers synchronous replication, not asynchronous. It’s less desirable for many customers due to the transaction slowdowns if you have to commit at both servers before a transaction is considered committed. Microsoft could treat Standard AGs the same way – only offering the less-desirable version in Standard.

Standard’s AlwaysOn AGs could also be limited in numbers. While 2014 Enterprise will offer up to 8 readable replicas, perhaps Satya’s “an always-on secondary” means Standard gets exactly one secondary.

Satya could even mean that only an Azure-based replica is included – not an on-premise one. This would be an interesting turn of events because it would require vendor lock-in to just Microsoft’s cloud rather than Amazon’s or a colo vendor.

What Did Satya Mean by “Gets an AlwaysOn Secondary”?

Could he mean that Microsoft is really willing to include the AlwaysOn virtual machine? Like you get some kind of licensing key that unlocks one free Azure VM running SQL Server 2014, and an easy way to set up HA/DR between your on-premise SQL Server and your new free virtual machine?

This would be revolutionary because Microsoft would be seen as a real vendor partner to SQL Server users. Instead of buying expensive hardware and data center space to protect your SQL Servers, you could just rely on Microsoft’s cloud.

At first glance, this would look like Microsoft going into competition with server vendors like Dell, HP, and IBM Lenovo, plus competing with cloud vendors and data centers like Amazon and Rackspace. But hey, that’s exactly what Microsoft has been doing lately – going directly into competition with vendors that used to be partners. The Surface competes with laptop partners, and the Nokia deal competes with Windows Phone partners. This could just be the next step, especially given another Satya quote in the interview:

So one of the things is, I have an always-on database, where is it running? The always on database is kind of on your private cloud and kind of on Azure and so the distinction even goes away.

If Microsoft can move your database into their cloud, they stand to sell more virtual machines and services. Maybe your database’s disaster recovery is becoming Microsoft’s loss leader. I’d love this, because we all need to do a better job of protecting our databases from disaster.

Or DID Gigaom Misquote Satya?

Maybe Satya said “Everybody who buys SQL Server Enterprise Edition gets an AlwaysOn Secondary in Azure.” After all, in this one quote, Gigaom managed to mis-capitalize both SQL Server and AlwaysOn. It’s not inconceivable that the quote was butchered.

The Trouble with Keys

Scaling up is hard: big hardware gets expensive fast. Scaling out is equally difficult; interesting design problems creep in to scale out solutions. One of the more troublesome issues architects face when scaling out is the issue of distributing identity. It’s often advantageous for object identity to be unique and portable across all database instances in an application – we may need to move a user from server A to server B. Once multiple database servers are involved, a centralized sequence generator can be come a single bottleneck or may even run out of values under heavy load. A solution is needed to generate unique values outside of the database while avoiding the performance problems of random, or semi-random, GUIDs.

Sequential Beginnings: Identity and IDENTITY

In the beginning, there was an empty schema. At this point an architect returned from the coffee pot and made a decision about database identities or GUIDs. There many valid reasons to make a decision in either direction – identities or sequences are controlled in the database and can be used to ensure a physical order to data. Who knew that spinning hard disks work better when data is sequential? (By the way, sequential GUIDs may not be sequential.)

You can make a lot of arguments for the right or wrong way to do things from a logical perspective, but DBAs do have a pretty good point when they say that randomness can cause problems for database performance. Generating sequential identifiers in the database may not be the most elegant solution to identity problems, but it does ensure that data is written in an order that makes sense in a world of spinning disk drives.

Database controlled sequential identity has one problem: sequential identities will need to be generated on a single server. Under sufficient load, that server will become a bottleneck for application scalability. To move past a single server as a bottleneck, a more robust and load tolerant solution is needed.

Distributing Identity: The Case for GUIDs

Architects and developers may be thinking that identities are great, but what happens when everything gets further apart?

As applications grow (or even by design), it becomes common to see teams become worried about distributing workload across many servers, using queues, or even splitting the data out into multiple databases or database servers. It’s at this point in the discussion that things get heated and people start throwing around the idea that GUIDs are the only way to solve this problem. Or that you just can’t rely on identity from the database and application generated identity is the only identity that matters.

This is where the war about numbers vs GUIDs gets nasty and someone gets their feelings hurt. Ignoring the size of GUIDs, I can say that I’ve witnessed several GUID collisions in production systems. GUIDs are only theoretically unique – they may even create a problem that you didn’t know you had.

A Better Solution for Distributed Identity

Combining distributed identity and ordered data seems like it’s a hard problem. Random GUIDs can’t be guaranteed to be unique, sequential GUIDs can’t be guaranteed to be non-overlapping, and database generated identities require a persistent connection to a database (or else they require a looser idea of identity than some folks are comfortable with).

Moving away from the database as the central keeper of all knowledge and identity is difficult and many teams seem to equate moving identity out of the database with moving all logic and functionality out of the database. This doesn’t have to be the case. The database can still be used to provide a considerable amount of declarative functionality and logic but identity generation can be moved outside of the database.

Twitter and Boundary have solved the problems of distributed sequencing by moving the work away from the data tier. Both solutions solve the problem by treating a number as if it were an array of information. The first portion of a sufficiently large number is a timestamp; the timestamp is stored as the number of milliseconds since a previous point in time. The next number is a worker identifier – this can be anything that uniquely identifies the device generating the sequence. Finally there’s a sequence itself. The sequence is typically small (between 8 and 16 bits) and it starts counting again from 0 every time the millisecond counter changes.

The machine identifier, usually the MAC address, doesn’t matter specifically, but we do need to be able to reliably generate separate sequences of IDs. This doesn’t have to be a MAC address, it could be any number of bytes that identify a unique source of sequences. By storing milliseconds since epoch as the first portion of the key, we’re able to produce a sequence that’s mostly ordered with some random jitter at the intermediate levels. On the whole, though, our inserts will be ordered.

If you’re on the .NET side of the house, I solved this problem with a library called Rustflakes. The implementation is lifted wholesale from Boundary’s flake and the generated sequence values are a .NET decimal which lines up with SQL Server’s DECIMAL data type – it’s nothing more than an ordered 128-bit number. Which, coincidentally, is the same size as a GUID.

Wrapping it Up

There’s no easy solution to this problem. GUIDs are an easier approach, but they introduce additional load and maintenance overhead on the data storage mechanism. Distributed sequences don’t solve all of the problems of GUIDs, but they provide additional flexibility for database administrators, developers, and application architects alike.