A deep look into our new massive multitenant architecture

A deep dive into the reasons behind us rewriting our software stack

Glauber Costa

Over the 20+ years we have been in the software industry, we have learned a few important lessons. The two most important of them? Never reinvent the wheel, and never rewrite what is working. Those things are fun to do, but they usually end in pain.

Our company was recently faced with an important scalability problem. How are we solving it? We reinvented the wheel, and rewrote what is working.

Is that a symptom of our – let’s call it stupidity, since the r-word is banned from polite society – or one of those rare cases where you have to know the rules to know when to break them?

In this post we will lay down the facts. And you’ll be the judge of that.

#The problem

Turso is a serverless database based on a fork of SQLite. We believe that in the coming years, the cost of a single database needs to drop to zero. If the cost of a database drops to zero, you can have millions of them. That can be used to support short-lived and ephemeral databases, per-user AI contexts, and so on. A file-based database is the best way to make that happen.

#The status quo

Once you sign up for Turso, our control plane provisions a small VM for you. Inside that VM you can create a lot of SQLite files, and then either access them over HTTP or replicate them to your own server or mobile device.

From the point of view of a specific database, this is fully serverless: if you access a specific database, you pay for it. If you don’t, it costs you nothing. But we still need to keep the VM up for you. To keep costs under control, if you don’t access any of your databases after a while, your VM scales to zero.

We have this 2-layer approach in place because isolation between individual databases is good, but it is not perfect. In theory we should be able to have near-perfect isolation between different SQLite connections. But in practice there are issues, some related to resource allocation and consumption, others to security, that keep us from doing this for the time being.

#Moving fast, but bugs slow you down

The approach we have been using to make sure everything is perfect and bug free is the KimStyle method, summarized in the image below:

With the difference that Kim is certainly better dressed than us, the method is similar. It relies on developers moving very carefully and knowing what they are doing. That, plus automated testing, of course. But automated testing is limited: it is only great at making sure that issues you predicted could happen, or have already seen in the field, won't happen again.

As our platform keeps onboarding new users, some of them with incredibly wild workloads, they keep unearthing more issues. That is a double whammy: first, it makes us less confident about throwing a lot of unrelated customers into the same multitenant server. Second, the time we spend fixing those bugs is time we are not spending moving the product towards our goal. And a lot of those bugs can be very difficult to even reproduce outside of a real-life setting.

#An example bug

One of our customers created ~5,000 databases in a couple of hours. Sometimes, after creating a couple of them in quick succession, database creation would deadlock.

Usually deadlocks occur when you have a pair of mutexes and acquire them in an inverted order: thread1 acquires mutex A and waits on B, but thread2 acquires mutex B and waits on A.
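As a minimal illustration of that classic case (a standalone sketch, not code from our server), two threads that take the same two locks in opposite orders will, sooner or later, hang forever:

use std::sync::{Arc, Mutex};
use std::{thread, time::Duration};

fn main() {
  let a = Arc::new(Mutex::new(0));
  let b = Arc::new(Mutex::new(0));

  let (a1, b1) = (a.clone(), b.clone());
  let t1 = thread::spawn(move || {
    let _ga = a1.lock().unwrap(); // thread1: takes A first...
    thread::sleep(Duration::from_millis(50));
    let _gb = b1.lock().unwrap(); // ...then waits on B
  });
  let t2 = thread::spawn(move || {
    let _gb = b.lock().unwrap(); // thread2: takes B first...
    thread::sleep(Duration::from_millis(50));
    let _ga = a.lock().unwrap(); // ...then waits on A: both threads hang
  });

  t1.join().unwrap();
  t2.join().unwrap();
}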

Turns out our deadlock happened with a single mutex. More specifically, in this code:

let blocking_task = tokio::task::spawn_blocking({
  let mutex = mutex.clone();
  move || loop {
    eprintln!("blocking thread start");
    // A plain std::sync mutex, held only while a synchronous
    // function runs: there is no .await in sight.
    let guard = mutex.lock().unwrap();
    some_sync_function();
    drop(guard);
    eprintln!("blocking thread end");
  }
});

When using async Rust, you are not supposed to hold a synchronous mutex across an .await, and this code looks fine. But the function being called looked like this:

fn some_sync_function() {
  // the implicit await: block_on parks this thread until the future resolves
  tokio::runtime::Handle::current().block_on(some_async_function());
}

This code doesn't have an explicit await, but an implicit one, due to the block_on inside the function. Whenever all futures resolved immediately and control was never yielded to the executor, it all worked fine. When one of the futures was pending, this code would deadlock. Of course, whether or not any future will yield depends on many factors: whether data is cached, whether a socket is ready, the orientation of the stars, or whatever else.
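Seen another way (a sketch, with State standing in for whatever the mutex guards), this is what the blocking loop amounts to once the hidden await is spelled out:

// roughly the same loop with the await made explicit: a std::sync::Mutex
// guard held across an .await, exactly the pattern async Rust warns about
async fn equivalent_loop(mutex: std::sync::Arc<std::sync::Mutex<State>>) {
  loop {
    let guard = mutex.lock().unwrap();
    some_async_function().await; // the await that block_on was hiding
    drop(guard);
  }
}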

When a ticket like this is opened, all you can do is blindly guess which conditions you have to emulate to trigger it, or read the entire codebase until you come up with a reasonable theory.

This one took us weeks. Skill issue on our side? Maybe. But look no further than issues like CrowdStrike, the Linux CUPS CVE, and really any systems-level programming, to understand that those things happen. All the time.

In a departure from the original KimStyle, both the engineer who wrote the bug and his family are still alive and doing well. But the weeks we spent tracking this down are never coming back.

#A different style

Instead of coding KimStyle, we decided to experiment with a new style of coding: TigerStyle.

Popularized by TigerBeetle, an absolutely bonkers database, it felt very attractive: an alternative way to code without bugs that actually increases our confidence that the system will work well with any input.

At the core of TigerStyle lies Deterministic Simulation Testing, or DST. The appeal of DST is that you can write the system in such a way that all I/O operations are abstracted. You can then plug a simulator into the I/O loop. The simulator has a seed, and given the same seed, you guarantee that every single thing will always execute in the same way, in the same order, and with the same side-effects.
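To make the idea concrete, here is a minimal sketch, with made-up names rather than Turso's actual interfaces: the server only talks to an Io trait, and the simulated implementation derives timing, failures and data from a seeded generator, so the same seed always replays the same run.

trait Io {
  fn now_ms(&mut self) -> u64;
  fn read(&mut self, key: &str) -> Option<Vec<u8>>;
}

// Deterministic simulator: every delay and every injected failure comes
// out of the seeded PRNG, so a given seed always produces the same run.
struct SimIo {
  state: u64, // tiny LCG state, good enough for a sketch
  clock_ms: u64,
}

impl SimIo {
  fn new(seed: u64) -> Self {
    Self { state: seed, clock_ms: 0 }
  }
  fn next_u64(&mut self) -> u64 {
    self.state = self.state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    self.state
  }
}

impl Io for SimIo {
  fn now_ms(&mut self) -> u64 {
    // simulated time only: the wall clock never leaks into the run
    self.clock_ms += self.next_u64() % 10;
    self.clock_ms
  }
  fn read(&mut self, key: &str) -> Option<Vec<u8>> {
    // randomly inject "not found" to exercise error paths
    if self.next_u64() % 4 == 0 { None } else { Some(key.as_bytes().to_vec()) }
  }
}

fn main() {
  // same seed, same trace: any failing run can be replayed at will
  let mut io = SimIo::new(42);
  for _ in 0..3 {
    println!("t={}ms read={:?}", io.now_ms(), io.read("users/1"));
  }
}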

And then comes the beautiful part: you can generate random traces, and get the system to do things so wicked that not even the most degenerate chat member on ThePrimeagen stream could think of. And if anything breaks, the seed will allow you to replay exactly what happened, as many times as you want.

#Rewrite everything

Once we decided to try DST, there were two choices: write an implementation of our server that uses DST from the ground up, and/or use software like Antithesis. Antithesis is a relatively new company that allows any distributed system to be simulated by running it on top of their deterministic hypervisor.

Rewriting things is scary, and before embarking on this journey we discussed this extensively with Joran, TigerBeetle’s CEO, to learn from his experience. In the interest of full disclosure, he ended up investing in our company as these discussions evolved. On the Antithesis question, on his advice, we decided to do both. A good analogy to explain why is integration vs unit tests.

Rewriting our stack to use DST from the ground up is the equivalent of unit testing: it allows us to find bugs and iterate a lot faster and cheaper in the long run. External simulation is like integration testing: slower and more expensive, but it can find bugs in the simulator itself, and it tests the system's interaction with the OS and other components.

Having our own DST server also allows us to codify that cases of excessive resource consumption are bugs, which is critical for our plan to achieve massive multitenancy.

I have a very high degree of trust in Joran’s work. In the early days of TigerBeetle, he reached out to me personally to let me know how much the work we have done on io_uring and asynchronous Direct I/O at our previous company (Scylla) inspired TigerBeetle’s I/O loop.

Pekka Enberg (@penberg) wrote on X:

"A while back, I was flattered when @jorandirkgreef approached me to say I had inspired him with my work around io_uring in designing Tigerbeetle. Now Joran, it's my turn to tell you how much you have inspired us. The more I look into deterministic simulation, the more I fall in…"

"You can see the full proof of concept in this repository: github.com/penberg/hiisi Looking forward to hearing what you folks think! 5/5"
Seeing Joran talk about how DST allowed TigerBeetle to move fast with confidence and no bugs, and with the problem we had at hand, we decided we had to try. And if it worked, we could join hands in a virtuous circle jerk of bug-free glory.

#Reinvent the wheel

If you are to write any significantly complex code that deals with I/O in Rust, you will likely reach for one of the async executors. In practice, that almost always means Tokio. After all, why would you reinvent the wheel?

For almost everybody out there, this is the way to go, a safe bet. But for our DST-based server, we decided to just use a manually crafted event loop with a callback system, and stick to sync Rust. Let’s evaluate some of the issues that arise from using async Rust, especially Tokio, for our specific use case.
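Before that, here is a toy sketch of the shape that design takes (not our real server, and the names are made up): completions come out of an I/O source and are dispatched to stored callbacks, with no async executor in between.

use std::collections::HashMap;

// callbacks receive the bytes of a completed I/O operation
type Callback = Box<dyn FnMut(&[u8])>;

struct EventLoop {
  next_token: u64,
  callbacks: HashMap<u64, Callback>,
  completions: Vec<(u64, Vec<u8>)>, // stand-in for an epoll/io_uring completion queue
}

impl EventLoop {
  fn new() -> Self {
    Self { next_token: 0, callbacks: HashMap::new(), completions: Vec::new() }
  }

  // register an operation and the callback to run when it completes
  fn submit(&mut self, on_complete: Callback) -> u64 {
    let token = self.next_token;
    self.next_token += 1;
    self.callbacks.insert(token, on_complete);
    token
  }

  // in the real loop this is where epoll_wait / io_uring_enter would block
  fn push_completion(&mut self, token: u64, data: Vec<u8>) {
    self.completions.push((token, data));
  }

  // one turn of the loop: drain completions and fire their callbacks
  fn run_once(&mut self) {
    for (token, data) in std::mem::take(&mut self.completions) {
      if let Some(cb) = self.callbacks.get_mut(&token) {
        cb(&data);
      }
    }
  }
}

fn main() {
  let mut el = EventLoop::new();
  let token = el.submit(Box::new(|data: &[u8]| {
    println!("query finished with {} bytes", data.len());
  }));
  el.push_completion(token, b"hello".to_vec());
  el.run_once();
}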

#The direct cost of the abstraction

Async Rust is not without cost. There is a great benchmark that compares the cost of the async abstraction to manually rolling your own event loop. You can read the whole thing here, but a great summary is found in these two lines:

manual                  time:   [26.262 µs 26.270 µs 26.279 µs]
...
box                     time:   [96.430 µs 96.447 µs 96.469 µs]

The first line is the cost per operation of rawdogging Rust the way The Lord intended, and the second, the cost of using the async abstraction, with a minimal custom runtime (not Tokio) and boxed futures. The async version is almost 4x slower. This is usually not much, since it's still in the microseconds. But a SQLite query is often in the microseconds too. A 70µs difference essentially means at least one fewer query served in that time, meaning we can now fit fewer databases in a box.

This is roughly in line with what we saw in our first benchmark on our first prototype, although the new DST server has other factors in play as well.

Times in milliseconds, for one individual SQLite query over local HTTP, with 30 concurrent connections on localhost:

Existing Server: [Mean: 20.377, StdDeviation: 22.408, Max: 135.000]
New DST Server: [Mean: 3.317, StdDeviation: 4.392, Max: 25.000]

#The indirect cost of the abstraction

In practice, the overhead that comes directly from the executor is not the biggest issue. The biggest issue is that async Rust forces everything that is moved into a spawned task to have a 'static lifetime. For those not initiated in the cult of Rust, that means that your variables don't have a well-defined scope and have to “live forever”. There are two kinds of data that live forever: data statically allocated at the beginning of the program, which has very limited uses, and data allocated on the heap.

Not only does data allocated on the heap require a memory allocation, the very resource we're trying to be very particular about, but those allocations are also very hard to control. As variables are passed down to continuations, they hold on to memory. An example of this is:

let a = Arc::new(something); // Arc rather than Rc: the spawned task must be Send
let inner = a.clone();
// the task now co-owns the allocation; it cannot be freed until every
// clone of the Arc is dropped, whenever that turns out to be
let handle = task::spawn(async move { inner_usage(inner); });
other_usage(a);

This memory is now almost impossible to account for, and will not be freed for as long as someone holds a reference.

#Inherent non-determinism

The async executor can also be a source of non-determinism itself. While it internally uses the epoll system call for most network operations, it uses a thread pool for other things, like file I/O. On top of that, the async executor has its own task scheduler, and with work-stealing scheduling enabled (Tokio's default) tasks can migrate between worker threads, meaning that you are now opening yourself up to issues that arise from the timing of scheduling.

While it is possible to use Tokio in a deterministic manner by disabling work stealing and bypassing the executor for the operations where it would use threads, that means you have to be extremely careful and audit your dependencies to make sure they follow this pattern. Most won't, so you have to rewrite a lot of your code anyway. Might as well go balls to the wall.
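For reference, the "disable work stealing" route amounts to running Tokio on its single-threaded scheduler, roughly as in the sketch below; file I/O and time would still have to be handled outside the executor to keep a run deterministic.

fn main() {
  // current_thread scheduler: one worker, no work stealing, tasks are
  // never migrated between threads
  let rt = tokio::runtime::Builder::new_current_thread()
    .enable_all()
    .build()
    .expect("failed to build runtime");

  rt.block_on(async {
    println!("running on a single-threaded, non-stealing runtime");
  });
}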

#Conclusion

We have decided to write a large chunk of our cloud service from scratch, to take advantage of Deterministic Simulation Testing and make sure massive multitenancy is part of the core design of what we do. Even with an imperfect simulator, we have already managed to find, in hours, a variety of bugs that would have otherwise puzzled us for weeks.

The benefit is already clear. In a short period of time, we have developed enough confidence in that server to finally release a version of our service that doesn't have any cold starts for anybody, in any situation. That server is not feature complete yet, but we can now rebuild the features on a solid multitenant foundation. It will be available as a beta next Wednesday, during our Launch week. Stay tuned at tur.so/scarygood.

#Appendix: Why not Zig?

Because we were committed to rewriting everything, a crucial question arose: why not Zig? Obviously, it would offer better control over memory allocations, which is vital for our model. Regardless of its benefits, Rust's ecosystem appeared more developed at this juncture. Opting for Rust wasn't just about its maturity, though. We had spent years coding in C and C++, and with our goal of rapid development in mind, choosing to let go of the borrow checker seemed unwise. Having this powerful feature at our disposal was too valuable to ignore. Especially considering our extensive background in languages prone to memory errors, Rust's safety guarantees were incredibly appealing. Consequently, we decided that Rust's borrow checker was a critical asset we couldn't overlook. Keeping it in our toolset aligns perfectly with our objectives of speed and reliability. Evaluating all these factors, we ultimately chose Rust for our rewrite project. Rust's unique features, particularly its borrow checker, made it the ideal choice for our needs.

Many thanks to Claude for helping me rewrite my original text in a way that highlights the point we wanted to make.
