A while back, we embarked on a journey to increase the scalability and reliability of Turso Cloud by orders of magnitude. We have already discussed how we are using Deterministic Simulation Testing and the Antithesis deterministic hypervisor to keep iterating at high speed and with high confidence, even when it meant rewriting our entire stack.
But that is just part of the puzzle. We've also completely reworked our storage system. Our new architecture allows us to operate without local disks, running entirely on S3. If you're familiar with transactional database design, you might be raising an eyebrow right now. There's a broad understanding that this approach is neither possible nor desirable for a transactional database because the latency would be too high.
But here's why this matters: this architecture makes it much simpler to run the software in an organization's own cloud account. Commonly known as the Bring-Your-Own-Cloud model, or BYOC, it is the holy grail for companies wanting to keep their data within their walls while still having someone else handle all the database headaches. No more painful Kubernetes StatefulSets. No more compliance nightmares. Just your data, in your cloud, managed by some sucker somewhere.
In this post, we'll show how a combination of two key elements—AWS's new S3 Express One Zone and Turso Cloud's massive multi-tenant architecture—makes this seemingly impossible architecture not just feasible but advantageous. We'll explore this architecture and discuss the trade-offs we're making.
Once, in my 20s, I was coming out of the shower and, before I had the chance to get dressed, I had an epiphany about how to fix a bug that was destroying my soul. Instinctively, I rushed to sit in my computer chair, without noticing there was a wasp on it.
Yes, you guessed it right: I got stung. By a wasp. On my testicles.
You might wonder why, in an article about technology, I'm sharing this painful anecdote. It's because until I had to deal with Kubernetes StatefulSets, being stung there was the biggest pain I'd ever felt. It is now a distant second.
Having a wasp munch on deeznuts was such a painful experience that I don't wish it on anybody. With the possible exception, of course, of the creator of YAML. And yet I would rather feed a nest of wasps a daily dose of meatballs than spend my life managing Kubernetes StatefulSets.
But here's the thing: these days, everyone wants to have their cake and eat it too. Companies want their databases running in their own cloud accounts (for that warm fuzzy security feeling) but also don't want to deal with the headache of managing them. It's the classic "I want all the control but none of the work" dilemma. This is what the cool kids call Bring-Your-Own-Cloud, or BYOC.
More often than not, BYOC translates into Kubernetes StatefulSets. And while it's not impossible to operate a BYOC offering this way, it's definitely not easy. For those lucky enough not to know much about Kubernetes, the system is designed to move pods (a fancy name for groups of containers) from one place to another, restart them, recreate them, and so on. Something misbehaving? Kubernetes will automatically kill that pod and bring another one up in its place. Resources tight? It brings up more pods and spreads the load among them.
But when you have state, things get much harder. The scheduler becomes severely restricted in what it can do and where it can place a pod. If pods need to be moved to different nodes, the data needs to move with them.
Our previous architecture was heavily reliant on state, which is not surprising for a database. We've been storing backups on S3 for a long time, but writes were acknowledged to the client right away, with backups done asynchronously. The common wisdom is that writing to S3 before acknowledging the write to the client is both unacceptably slow and expensive. This means that despite the S3 component, our architecture was not diskless: there was a period when writes were not really durable anywhere except the local disk.
A truly diskless architecture means that database compute nodes behave how Kubernetes expects: they can be moved anywhere, at any time, and the data can always be retrieved from S3.
This approach not only frees our team from managing things (allowing us to focus on more interesting endeavors like that juicy rewrite of SQLite in Rust), but it also means that deployment in customer cloud accounts can be properly automated.
S3-native architectures are becoming a staple of analytical databases. For transactional databases, they're less common. The reason? S3 is supposedly slow and expensive. But is it, really? Let's find out.
From an EC2 instance in the us-east-1 region, we executed 1000 writes and reads of sizes 4kB, 128kB, and 512kB. The code can be found in this gist, and here are the results:
| Operation | avg (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|
| Download 4kB | 19 | 23 | 42 |
| Download 128kB | 24 | 33 | 46 |
| Download 512kB | 25 | 34 | 48 |
| Upload 4kB | 31 | 60 | 102 |
| Upload 128kB | 53 | 124 | 199 |
| Upload 512kB | 83 | 176 | 233 |
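For reference, the measurement loop looks roughly like the sketch below (the exact code is in the gist; the bucket name here is a placeholder):

```python
import os
import time
import statistics
import boto3

# A rough sketch of the benchmark approach (the real code is in the gist
# linked above; the bucket name is a placeholder). Time 1,000 PUTs of a
# fixed-size payload and report avg/p95/p99 in milliseconds.
s3 = boto3.client("s3")
BUCKET = "my-latency-test-bucket"  # placeholder
payload = os.urandom(4 * 1024)     # 4kB; repeat with 128kB and 512kB

samples_ms = []
for i in range(1000):
    start = time.perf_counter()
    s3.put_object(Bucket=BUCKET, Key=f"bench/{i}", Body=payload)
    samples_ms.append((time.perf_counter() - start) * 1000)

samples_ms.sort()
print(f"avg: {statistics.mean(samples_ms):.1f} ms")
print(f"p95: {samples_ms[int(0.95 * len(samples_ms))]:.1f} ms")
print(f"p99: {samples_ms[int(0.99 * len(samples_ms))]:.1f} ms")
```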
For a 4kB write, a common segment size for database transactions, the latency is already in the double digits. And don't get me started on the P99—it's absolutely horrendous.
What about the price? According to AWS, it costs $0.005 per 1,000 PUT requests. Our cheapest paid plan on Turso Cloud costs $4.99 per month and allows developers to write 25M rows per month. In the worst case, where each transaction writes a single row, that's 25 million PUTs, and at $0.005 per 1,000 they would cost us $125.00 per month. I'm no Warren Buffett, but in the words of Tony Stark, not a great plan.
The good news: Amazon now has the other S3. They missed a great opportunity to name it S4 or even S3-XY, and went instead with "S3 Express One Zone" (or "S3 Express" for friends and family).
S3 Express comes with some limitations. For starters, it's not available in all regions yet. Also, as the name suggests, it's not replicated to multiple availability zones. Data durability is still 99.999999999% (that's eleven nines) according to AWS, but the uptime SLA is that of a single availability zone, which according to AWS is 99.95%.
But by doing this, AWS can guarantee single-digit millisecond latency. We put that to the test with the same 4kB, 128kB, and 512kB reads and writes, both from within the bucket's availability zone and from a different one:
From an instance in the same availability zone as the bucket:

| Operation | avg (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|
| Download 4kB | 3.8 | 4 | 4 |
| Download 128kB | 4.3 | 5 | 6 |
| Download 512kB | 5.5 | 6 | 7 |
| Upload 4kB | 6.4 | 7 | 7 |
| Upload 128kB | 5.5 | 7 | 7 |
| Upload 512kB | 7.5 | 8 | 10 |
From an instance in a different availability zone:

| Operation | avg (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|
| Download 4kB | 4.4 | 5 | 5 |
| Download 128kB | 4.9 | 6 | 6 |
| Download 512kB | 6.1 | 7 | 8 |
| Upload 4kB | 7 | 8 | 8 |
| Upload 128kB | 6.2 | 7 | 8 |
| Upload 512kB | 7.7 | 9 | 10 |
As we can see, the results are much better and within AWS's promise. Not only is the latency very low, but it's also highly consistent and predictable.
For context, unbeknownst to many people who haven't had their souls forever tarnished by writing (and debugging) databases, even with a local disk, data isn't really durable until you call the fsync system call, which costs around 2ms. This means S3 Express is only adding about 4ms in the 4kB case, which is perfectly acceptable for most traditional SQL workloads.
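To make that concrete, here is what durability on a local disk actually requires (a minimal sketch; the file name is illustrative):

```python
import os

# Minimal sketch: a write() alone leaves the data in the OS page cache,
# where a power failure can still lose it. Only fsync() forces it to
# stable storage, and that call is where the ~2ms goes.
fd = os.open("wal-segment.db", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"\x00" * 4096)  # a 4kB transaction segment, not yet durable
os.fsync(fd)                  # ~2ms: now the data is actually on disk
os.close(fd)
```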
It's important to acknowledge that for some ultra-latency-sensitive workloads, even ~4ms of latency added to the write path may be too much. But for the vast majority of applications, it's entirely within acceptable parameters.
What about the cost? From the same AWS pricing table, 1,000 PUT requests cost $0.0025 with S3 Express, half the price of standard S3. At 25M single-row writes a month, that's $62.50 in PUTs against a $4.99 plan, so we'd still be losing about $57 a month per customer. Not exactly a sustainable business model!
And here's where the key innovation of Turso Cloud becomes apparent: our server is a massive multi-tenant system designed to host millions of SQLite files per node. Of course, to get to millions, most databases would have to be inactive. But even if we host 100 very active SQLite databases per node, we can batch writes for all of them. The cost doesn't change substantially, and the latency remains comfortably within single digits even for larger writes.
We may wait a couple of extra milliseconds even when we could start a new upload to S3 Express right away. This gives us the chance to accumulate transactions for a large number of databases. If we amortize that $57 a month across, say, 100 databases, the cost drops to just $0.57 per database per month. That leaves us with plenty of budget to account for other things like reads and compute, while still leaving valuable real estate unclaimed on Mr. Abraham Lincoln's face.
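A minimal sketch of the batching idea (the bucket name, batch window, and framing are illustrative, not Turso's actual implementation):

```python
import time
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "example--use1-az4--x-s3"  # hypothetical S3 Express directory bucket
BATCH_WINDOW_SECONDS = 0.002        # ~2ms: the extra wait that buys amortization

pending = []  # (database_id, wal_frames) waiting for the next flush

def submit(db_id, frames):
    """Queue a transaction from one tenant database."""
    pending.append((db_id, frames))

def flush():
    """Upload every pending transaction as one object: one PUT, many databases."""
    if not pending:
        return
    # Length-prefixed framing (illustrative) so the batch can be split apart
    # again on recovery; the per-request cost is paid once for the whole batch.
    body = b"".join(
        len(db_id).to_bytes(2, "big") + db_id.encode("utf-8")
        + len(frames).to_bytes(4, "big") + frames
        for db_id, frames in pending
    )
    s3.put_object(Bucket=BUCKET, Key=f"batches/{uuid.uuid4()}", Body=body)
    pending.clear()

# Transactions from different databases arrive during the window...
submit("db-alpha", b"\x00" * 4096)
submit("db-beta", b"\x01" * 4096)
time.sleep(BATCH_WINDOW_SECONDS)  # wait for more tenants to join the batch
flush()                           # ...and are made durable by a single PUT
```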
But what about reads? Certainly, adding 5ms to every read operation isn't anyone's dream. Thankfully, that's not needed. The data still resides on local media: there's no need to run the system without any disks, just without any local state that marries the compute nodes to the local disk. The local disk becomes a write-through cache: once data is on S3 Express, we also write it locally.
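Sketched out, the read path looks something like this (paths and bucket name are illustrative, not our actual layout):

```python
import os
import boto3

# Hedged sketch of the write-through cache read path. Reads are served from
# the local copy; a cold cache (e.g. a freshly scheduled pod) falls back to
# S3 and repopulates itself as it goes.
s3 = boto3.client("s3")
BUCKET = "example--use1-az4--x-s3"  # placeholder directory bucket
CACHE_DIR = "/var/cache/dbpages"    # placeholder local path

def read_pages(key: str) -> bytes:
    path = os.path.join(CACHE_DIR, key)
    try:
        with open(path, "rb") as f:
            return f.read()          # hot path: local disk, no network
    except FileNotFoundError:
        data = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:  # repopulate the cache for the next read
            f.write(data)
        return data
```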
If a pod needs to be replaced, you can stream the data from S3 into the new pod and switch when ready. If a pod suddenly dies, another one can be brought in immediately and start reading directly from S3. This isn't fast, but in disaster scenarios, you'd have to restore from a backup anyway. With a diskless architecture, you don't lose availability: the data can be served immediately from the new pod, at the expense of higher latencies while a background recovery process executes.
A 99.95% uptime SLA is great for many use cases. For those who need more, it's always possible to write data to multiple availability zones, which is something we're considering for the future. As we saw from our experiments, the latency of crossing a zone for S3 Express is still very good: just one added millisecond. And the cost of writing to two zones is the same as the cost of writing to standard S3 (two PUTs at $0.0025 each add up to standard S3's $0.005 per 1,000).
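If we did, the write path would look roughly like this sketch (bucket names are hypothetical; the write is acknowledged only after both zones confirm):

```python
import concurrent.futures
import boto3

# Hedged sketch of a possible multi-zone variant, not a shipped feature.
# Each write goes to two S3 Express buckets in different availability
# zones in parallel; success requires both PUTs to land.
s3 = boto3.client("s3")
ZONE_BUCKETS = ["example--use1-az4--x-s3", "example--use1-az6--x-s3"]

def durable_write(key: str, body: bytes) -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(s3.put_object, Bucket=bucket, Key=key, Body=body)
            for bucket in ZONE_BUCKETS
        ]
        for f in futures:
            f.result()  # raises if either zone's PUT failed
```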
Operating stateful services is challenging. Operating them inside someone else's AWS account is even harder, given the limited ability to perform manual operations when things don't go according to plan. To overcome this, many infrastructure providers offering the BYOC model opt for an S3-first architecture, where data is always resident on S3, with an optional copy on local disks for fast reads.
Traditionally, it's believed that this architecture is too slow for transactional databases. But we've proven this assumption wrong through a unique combination of fast access through S3 Express One Zone and our massive multi-tenant architecture that amortizes costs across databases.
The result is a truly diskless architecture that delivers:

- Durable writes acknowledged with single-digit millisecond latency, thanks to S3 Express One Zone
- Request costs amortized across many databases by our multi-tenant batching
- Stateless compute nodes that Kubernetes can move, kill, or replace at will
- Fully automated deployments into customer cloud accounts
This architecture forms the perfect foundation for organizations seeking the security and compliance benefits of keeping their data in their own cloud while still enjoying the operational benefits of a fully managed database service.
And most importantly, nobody needs to get stung on their sensitive parts to make it work.
Get started with Turso Cloud today at turso.tech