Geographic Sharding is Usually a Pretty Bad Idea.


Let’s say you built a web app and it became globally popular. You’ve got users scattered across North America, Europe, and Asia.

In theory, you could shard your data globally: build 3 sets of SQL Servers, one on each continent, and split up their data.

In reality, this usually creates more problems than it solves.

Users are mostly active in their own time zones. So your North American database servers will be getting hammered while your European and Asian database servers are sitting around bored. That’s a great thing if you need to plan database maintenance, but it’s a terrible thing if you’re paying a ton of money for hardware & licensing that’s sitting around unused at night.

You have to figure out where to get a user’s data. Simple tutorials usually hand-wave away this complexity, but it’s much, much harder than it looks. Just because someone’s web traffic looks like it’s coming from a European IP address doesn’t mean it’s a European user. At login time, you’re going to need to check against a centralized list of clients – and that means you need to keep that list in sync across all continents. When a new user is added, you’ve got something akin to a distributed transaction to coordinate across that central list plus the user’s home continent. It’s doable – but it’s much harder than tutorials suggest.
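The login-time lookup described above might look something like this sketch. Everything here – the shard hostnames, the directory structure, the function names – is an illustrative assumption, not a real system’s schema; the point is that the routing decision comes from a centralized directory, never from the client’s IP address.

```python
# Hypothetical sketch: route a login to the right continental shard.
# Shard hostnames and the directory contents are made up for illustration.

SHARDS = {
    "NA": "sql-na.example.com",
    "EU": "sql-eu.example.com",
    "ASIA": "sql-asia.example.com",
}

# Stand-in for the centralized user directory that every continent
# has to keep in sync; in reality this is its own replicated datastore,
# and keeping it consistent is most of the hard work.
USER_DIRECTORY = {
    "alice@example.com": "NA",
    "bjorn@example.com": "EU",
}

def shard_for_login(email: str) -> str:
    """Look up the user's home shard. Never trust the client's IP."""
    region = USER_DIRECTORY.get(email)
    if region is None:
        raise LookupError(f"unknown user: {email}")
    return SHARDS[region]
```

Note that even this toy version quietly assumes the directory is already consistent everywhere – which is exactly the distributed-transaction problem the paragraph above warns about.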

Users will move. For example, I lived in Iceland for 9 months, with an Icelandic address. I was subject to the GDPR while I lived there. I couldn’t believe how much my web experience changed – sites like Facebook were suddenly so much more polite about asking for permissions for things. How long should a site wait before moving my data to a different continent? How long is it legally allowed to wait, for data residency purposes? And how will you automate this process? With a global app so big that it needs multiple sets of database servers, you can’t wait around for meatbags to push buttons to do this maintenance task.
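One possible shape for that automation is a policy function like the sketch below. The threshold is a made-up number, and a real policy also has to satisfy data-residency law and trigger the actual (much harder) data move – this only shows the “when should we migrate?” decision.

```python
from collections import Counter

# Illustrative policy: flag a user for migration when their last N
# logins all came from a different continent than their home shard.
# N = 30 is an arbitrary assumption, not a recommendation.

def should_migrate(home_region: str,
                   recent_login_regions: list[str],
                   threshold: int = 30) -> bool:
    """Return True when the user's recent activity is entirely
    concentrated in a single region that isn't their home shard."""
    if len(recent_login_regions) < threshold:
        return False  # not enough history to decide yet
    recent = recent_login_regions[-threshold:]
    (top_region, count), = Counter(recent).most_common(1)
    return top_region != home_region and count == threshold
```

The expensive part – copying the data, cutting over the directory entry, and doing it all without downtime – starts after this function returns True.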

Managers will want reports across continents. I’m not just talking about your customers, either: your own company’s internal managers will want to see near-real-time reports in a single place that sum up global activity. You’re either going to need to build a reporting layer to query multiple continents and combine the data, or use a data warehousing process to periodically merge the data across multiple continents.
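The “reporting layer” option above boils down to fanning the same aggregate query out to every continent and combining the results. Here’s a minimal sketch, with the per-shard query stubbed out with fake numbers (a real version would run a SQL query against each continent’s server and handle a region timing out):

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["NA", "EU", "ASIA"]

def query_shard(region: str) -> int:
    # Stand-in for a real query against that continent's SQL Server,
    # e.g. SELECT COUNT(*) FROM dbo.Signups WHERE SignupDate = today.
    # These numbers are fake, for illustration only.
    fake_results = {"NA": 1200, "EU": 450, "ASIA": 310}
    return fake_results[region]

def global_daily_signups() -> int:
    # Each per-continent query is a cross-ocean round trip, so run
    # them concurrently rather than one after another.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        return sum(pool.map(query_shard, SHARDS))
```

The data-warehouse alternative trades this query-time fan-out for a periodic merge job, accepting staleness in exchange for not depending on three continents being reachable at report time.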

Even users want to see data across continents. If you’re building a global social media network, for example, then every time @Cristiano posts a photo, his 634 million followers around the world are going to want to see it, immediately. Where do you store his data & photos?
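One common answer to that question – and it’s an assumption on my part, not something the post prescribes – is to write the post to the author’s home shard and then replicate hot accounts’ content to every continent so reads stay local. The threshold and region names here are illustrative:

```python
# Hypothetical placement rule: globally popular accounts get their
# content copied to every continent; everyone else stays on their
# home shard. The follower threshold is an arbitrary assumption.

FOLLOWER_REPLICATION_THRESHOLD = 1_000_000

def regions_to_store(author_home_region: str,
                     follower_count: int,
                     all_regions=("NA", "EU", "ASIA")) -> list[str]:
    if follower_count >= FOLLOWER_REPLICATION_THRESHOLD:
        # Celebrity account: replicate everywhere so reads are local.
        return list(all_regions)
    return [author_home_region]
```

Of course, “copy it everywhere, immediately” is itself a cross-continent replication problem – which is the whole point: every answer here creates the next problem.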

This stuff is all doable, and yes, companies do it.

But when they do, they dedicate entire teams of developers to solving the problems listed above. This isn’t something you wanna do over a weekend with a pizza and a case of beer.

If you’re not ready to commit entire teams of people to solving these problems, do the obvious stuff first: performance tune your existing database servers. And of course, you can learn how to do it, and if you can’t wait for that, you can hire me for quick pain relief. Both of those options are way cheaper, faster, and easier than geographic sharding.


1 Comment

  • I saw a really cool sharding solution using Oracle for an organization that implemented basically an automated inventory system in several remote locations with only satellite networking available, while reporting in the central location needed to be fairly fresh and logically within the same database(s).

    At each remote location they had an Oracle instance that recorded the data locally, then did some sort of Oracle magic to send stub records in real time (minus the latency) to the central location (I assume this was a hash or something), so that if they needed basic counts, they could get that information immediately, and then the more detailed information would trickle in asynchronously.

    Logically they had only one database with many partially autonomous shards, all still ACID compliant and consistent within each ledger, without creating a ton of blocking as each new record from the slow locations came in. Similar to bi-directional replication, but it allegedly actually worked.

