Oops, You Wrote a Database

Those who fail to study databases are doomed to write them

·

5 min read

Dear Sir,

I am afraid to inform you that you have written a database. I know you just wanted some "simple persistence" and that "a basic key-value store will do". Maybe keep it in memory as an object, or read/write simple JSON files on disk or to a cloud KV store. You said that "Postgres is overkill" and "ORMs create impedance mismatches", and yet, six months later, you have a mountain of application code dedicated to caching, updating, and defensively reading your data — breaking every time you change your data model. You moved 30% faster for your 1 month MVP but it is now slowing you down 30% per team member per month.

Surely, you've read Reddit has two tables and Dan Pritchett's BASE: An ACID Alternative and you don't mind writing some extra migration and defensive code in userland to be web scale. But after working on the app for multiple weeks and hiring more people you are having trouble remembering what goes where, so you start writing down a list of all the important entities and their attributes and the range of their values. Perhaps you manually maintain them, or you pull in something like a tRPC or an Apollo GraphQL to get some extra dev tooling and codegen.

The other problem you soon encountered is that there would be weird Heisenbugs cropping up in your KV stores where a user update would go through and show up in one feed but not be updated in another, especially when multiple users and apps access the same sets of data. Your new team members suggest adding a Pending state to all the fast updates, and then waiting for success on the slow/error-prone updates to then do a second update to Complete. We're moving fast, and we're ensuring consistency in userland while keeping things simple. Maybe you just needed this in like 5 places throughout the app, so you extracted some utility code to wait for confirmation of updates for success, or roll them back on partial failure. Maybe you even had someone split out the Pending updates to a separate "log" since if the app crashes we don't want to lose any user data. Maybe when updates happen, you want other things to happen, so you devise an ingenious "hook" system that triggers more code to run when your not-a-database code ends.

Then you saw that there were bits of data that always get accessed and updated together. They are inconsistently named and it's tiring to always write the same 3 lines of code needed to join them together every time. In the spirit of keeping things DRY, you wrote a class with all these CRUD operations across fields, but also taking care to have intuitive, guessable API naming with a consistent grammar. Maybe it needs to be learnable by others, and maybe you want to expose it to end users (your app's users) for them to make their own queries whether through plain text or autogenerated UI.

Your app launched out of beta, and you got real users! Performance became an issue. There are a myriad of ways to tackle it and your growing team wrote more code to use them all:

  • The same queries get asked again and again? Maybe we can just memoize the reads, save the results and return them without rerunning the read, join, and aggregate code.

  • We can predict which queries get asked again and again? Maybe we can just pick a few of those queries and precompute all the results. This way, we're faster on initial queries, not just the subsequent ones.

  • We can't predict which queries will happen? But we want to make sure we don't request data we don't need? And nested data dependencies mean some queries wait on other queries? Maybe we'll compile a little graph and caching layer to run everything faster. Maybe give it an unassuming name like Dataloader.

  • Some writes are slow? Like this one? When your team's book club read Designing Data Intensive Applications, someone had the bright idea of splitting out some high fan-out writes and pushing some of the load into the third normal form.

In the last leg of your journey to avoid using a database, your new high paying Enterprise customers demand assurance that you have taken the necessary security measures:

  • How do you ensure users can't edit documents they don't own?

  • How do you ensure that a devious hacker who is snooping around your undocumented but publicly exposed APIs can't read what they shouldn't?

  • How do you ensure that people who shouldn't have access to your customer's data, don't?

  • How do you reassure them that YOU don't have access to their data? How do you keep their data where they need/want it kept? How do you delete their data when they want but also recover it if they made an oopsie (or worse, you made an oopsie)?

  • When something bad happens, how do you go back in time to figure out what's wrong? How do you know who did what and when?

Your engineers sigh, but those enterprise contracts are juicy. You write more and more and more code and have them audited by a fancy security firm to get the thumbs up.

A schema, transaction manager, write ahead log, change data capture/stored procedures, query language, caching, indexing, query planning, security/authorization, recovery, and an audit log.

Dear Sir, you have written a database.


written as a database complement to Dear sir, you have built a compiler.

Further discussions on Hacker News and Twitter.

See also Stop Building Databases.


P.S. more things Are Database than you might think!