Since the dawn of time, database vendors have called out to developers, “Store everything inside the database. You know you want to. Here, we’ll make it easy for you by adding data types like binary and features like filestream.”
Ah, but this siren song leads you to your eventual doom.
Users wanna store historical versions of files. In the beginning, the design requirements might agree to only save the most recent version of a file, but that’s gonna change. Users are going to wanna save every version over time – and even when they don’t, they’re going to create “new” files by saving them under an ever-so-slightly different name. You know that 2022_Budget_Final_v2_with_Changes.xlsx file, and all its grandparents? Yeah, they’re all living in the database, forever.
Users create new versions frequently. You know how you click File, Save all the time because you’re used to crappy applications that crash and lose your progress? Now think about that hitting a relational database server every time, multiplied by all of the users who are working in this system.
Users don’t purge anything. Over time, the files look like an episode of Hoarders. Nobody can clean things out because nobody knows what’s safe to delete, so “we’d better keep it all just to be safe.”
Files are huge. Relational databases work great with tables broken up into rows and columns. When we insert rows or update columns, we’re only affecting a relatively small amount of data each time, making it easy to track in transaction logs and write to storage quickly. Not so with files, which usually contain incompressible things like images.
A bigger database means slower backups. Not just your full backups, either – we’re talking slower transaction log backups, too. As your users constantly save new versions of files, those versions get written into the transaction log, and that’s gotta get backed up. (Filestream users: before you get all excited that your feature is “different”, go double-check your log backup sizes, because filestream doesn’t work the way you think it works.)
A bigger database means tougher HA & DR. Not only do we have to back up the transaction log, but we have to push those transactional changes over to our other replicas. This huge log volume means it’s very challenging to do synchronous replicas in other data centers, and it’s very easy for other data centers to get way behind on transactions. When a user adds a 100MB file, and then another user tries to insert a single row, that row has to wait until the 100MB file synchronizes first. The low-value file storage can cause much worse recovery time objectives for our more valuable data.
Database servers suck at re-synchronizing data. When replication breaks, or you need to re-add an Availability Group replica (or log shipping or replication or whatever), SQL Server isn’t smart enough to only synchronize the different data, or sync data from the closest replica. This isn’t a big deal with small transactional tables, but throw terabytes of file data in, and resynchronization is a nightmare.
Finally, database servers are an expensive, challenging bottleneck. If you make a list of all the servers in your company, and rank them by things that make you tear your hair out, the relational database servers are probably at the top of every list. Their licensing and support costs are expensive, they’re hard to performance tune, and you just don’t have that many people on staff who are truly experts on them.
All of this might come as a surprise if you’ve only been reading the brochure from the database vendors. That’s because they want you to store files in the database. Can you think about why they might be motivated to sing you that siren song? They make more money when you’re locked into their system for everything.
There’s a better way:
store files in a file system.
File systems have been around for a really long time, and they have tools to help address these challenges. Best of all, they’re insanely, insanely inexpensive, especially compared to relational databases.
When you think that storing files in a relational database is the right answer, stop for a second, write up your list of requirements, and go talk to your systems administrator.
Because if you’re not doing a relational join with it,
it doesn’t belong in a relational database.