How Edgar Allan Poe found bugs in Turso

This is a post about testing relational databases with LLMs. We've been doing this a lot at Turso, and we're not the only ones. LLMs are such overpowered bug-finding tools that the SQLite team has even had to start a second SQLite forum reserved for bugs.

#The Problem

Testing a relational database is challenging, because of the very large input space prescribed by the SQL language. Think of ensuring that SELECT * FROM t returns the correct results. Easy, just fill t with data, run the statement, assert that the data matches. Maybe also check some edge cases, such as empty tables. Now consider SELECT * FROM t NATURAL JOIN tt. The input parameters are suddenly more numerous, leading to an explosion of combinations:

- table t is empty, but not tt, and vice versa
- the two tables share no/one/many columns
- no/some/all rows in table t match with table tt, and vice versa
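To make the explosion concrete, here is a minimal sketch in Python that enumerates a toy version of these dimensions and checks each scenario against a brute-force reference model. It uses the standard sqlite3 module as the engine. The row sets, the column names and the helpers expected_count and run_scenario are all made up for illustration, and the values deliberately avoid NULLs to keep the model simple.

```python
import itertools
import sqlite3

# Illustrative data: t is either empty or filled; tt varies in how much it matches t.
T_ROWS = {"empty": [], "filled": [(1, 10), (2, 20)]}
TT_ROWS = {
    "empty": [],
    "no_match": [(3, 30), (4, 40)],
    "some_match": [(1, 10), (4, 40)],
    "all_match": [(1, 10), (2, 20)],
}
# Column names of tt when it shares 0, 1 or 2 columns with t(a, b).
TT_COLUMNS = {0: ("c", "d"), 1: ("a", "d"), 2: ("a", "b")}


def expected_count(t_rows, tt_rows, shared):
    """Reference model: rows pair up when their first `shared` columns agree."""
    return sum(
        1
        for left, right in itertools.product(t_rows, tt_rows)
        if left[:shared] == right[:shared]
    )


def run_scenario(t_key, tt_key, shared):
    """Build both tables, run the NATURAL JOIN, return (actual, expected) row counts."""
    conn = sqlite3.connect(":memory:")
    c1, c2 = TT_COLUMNS[shared]
    conn.execute("CREATE TABLE t (a, b)")
    conn.execute(f"CREATE TABLE tt ({c1}, {c2})")
    conn.executemany("INSERT INTO t VALUES (?, ?)", T_ROWS[t_key])
    conn.executemany("INSERT INTO tt VALUES (?, ?)", TT_ROWS[tt_key])
    actual = len(conn.execute("SELECT * FROM t NATURAL JOIN tt").fetchall())
    conn.close()
    return actual, expected_count(T_ROWS[t_key], TT_ROWS[tt_key], shared)


scenarios = list(itertools.product(T_ROWS, TT_ROWS, TT_COLUMNS))
for t_key, tt_key, shared in scenarios:
    actual, expected = run_scenario(t_key, tt_key, shared)
    assert actual == expected, (t_key, tt_key, shared, actual, expected)
print(f"{len(scenarios)} scenarios checked")
```

Even this tiny grid produces 24 scenarios for a single statement, and it only checks row counts on two-column tables without NULLs.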

Already, testing this requires substantially more thinking. Now consider that we have to do this for all combinations of left/right joins, full outer joins, compound selects, subqueries (scalar, correlated), common table expressions (recursive), window functions, custom window definitions, (ordered-set) aggregates, etc.

#Oracles

This is why a great deal of effort has gone into designing oracles: heuristics that determine the correctness of a wide subset of SQL. For example, here's a simple oracle:

1. take a SELECT statement, run it, and remember the result
2. read the whole target table, and remember it
3. convert the first SELECT into an INSERT INTO ... SELECT, and run it
4. read the table again, and assert that it contains the rows from step 2 plus the rows from step 1

The nice thing about this kind of oracle is that it works with a broad range of SELECT statements, without hand-writing expected results.
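The oracle described above could be sketched roughly as follows, using Python's standard sqlite3 module as a stand-in engine. The function name insert_select_oracle and the sample table are hypothetical. The sketch assumes the SELECT returns rows shaped like the target table, and that no constraints or triggers modify the inserted rows.

```python
import sqlite3
from collections import Counter


def insert_select_oracle(conn, table, select_sql):
    """Return True if INSERT INTO table <select> adds exactly the selected rows."""
    # Step 1: run the SELECT and remember its rows (as a multiset, order ignored).
    selected = Counter(conn.execute(select_sql).fetchall())
    # Step 2: read the whole target table.
    before = Counter(conn.execute(f"SELECT * FROM {table}").fetchall())
    # Step 3: run the same SELECT as the source of an INSERT.
    conn.execute(f"INSERT INTO {table} {select_sql}")
    # Step 4: the table must now hold the old rows plus the selected rows.
    after = Counter(conn.execute(f"SELECT * FROM {table}").fetchall())
    return after == before + selected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y"), (3, None)])
assert insert_select_oracle(conn, "t", "SELECT a + 10, b FROM t WHERE a > 1")
```

Comparing multisets rather than lists is a deliberate choice here: the oracle says nothing about row order, so the check should not depend on it.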

Tools like SQLancer or SQLRight couple oracles with pseudo-random SQL generation. SQLRight goes one step further and adds code coverage as a fitness function to orient query generation.

The downside of this approach is what we saw at the beginning: the space of SQL queries is effectively unbounded, and bugs tend to cluster in corners of the input space where specific and unpredictable sequences or combinations of features occur. Tools that rely on random generation therefore have to run for extended, sometimes impractical, periods of time. For example, the SQLRight authors ran SQLRight for 60 days, yielding 14 logical bugs in SQLite.

#LLMs for guided pseudo-random generation

Almost a year ago, we started experimenting with LLMs for autonomous testing. Our first experiments were simple: tell ChatGPT about Turso, and ask it to come up with self-contained SQL snippets that might show bugs. This proved surprisingly effective.

Then we improved this process by using coding agents like Claude Code and Codex, and that's where things really took a turn. Claude Code and the Ralph Loop plugin proved to be unreasonably effective. We developed prompts that we could run in a loop for days at a time, and the agent would find dozens, if not hundreds, of bugs. What's more, some of them were genuinely hard to find. For example:

- Panic on JOIN with empty left table, ungrouped aggregate, right-table bare column, and unindexed JOIN column (#5233)
- Panic on an UPDATE with a triple self-join subquery and a window function (#5223)

These prompts tend to be a few dozen lines. You can see a full prompt here. Let's just list the important aspects of the prompt:

- Give the agent a goal, e.g. "Your goal is to identify as many bugs as possible in Turso."
- Give it a fitness function. In our case, we tell it to try SQL snippets against Turso and SQLite to find bugs. Turso is SQLite-compatible, so differential testing with SQLite is a good oracle, albeit not perfect.¹ We even have a small script in the source tree that runs a statement against Turso and SQLite and says if the results match.
- Give it a direction, so that it can roughly orient its autonomous exploration. In our case, we tell it to:
  - analyze files, focusing on areas of complexity
  - avoid happy paths
  - favour queries with unusual shapes
  - keep a log of things tried and re-read it after compaction
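To illustrate the differential fitness function, here is a minimal sketch in Python. It is not the script from our source tree. Both connections are in-memory SQLite databases from the standard sqlite3 module, with the second one standing in for the engine under test, and the helper names outcome and first_mismatch are hypothetical.

```python
import sqlite3
from collections import Counter


def outcome(conn, sql):
    """Run one statement; return its rows as a multiset, or the error class name."""
    try:
        return ("ok", Counter(conn.execute(sql).fetchall()))
    except sqlite3.Error as exc:
        return ("error", type(exc).__name__)


def first_mismatch(reference, candidate, statements):
    """Return the first statement on which the two engines disagree, else None."""
    for sql in statements:
        if outcome(reference, sql) != outcome(candidate, sql):
            return sql
    return None


reference = sqlite3.connect(":memory:")
candidate = sqlite3.connect(":memory:")  # assumption: stand-in for the engine under test
script = [
    "CREATE TABLE t (a, b)",
    "INSERT INTO t VALUES (1, NULL), (2, 'x')",
    "SELECT a FROM t UNION SELECT b FROM t",
    "SELECT * FROM missing_table",  # both engines should fail the same way
]
print(first_mismatch(reference, candidate, script))
```

Rows are compared as multisets, so a query without ORDER BY does not raise a false alarm when the two engines return rows in different orders. The trade-off is that this sketch cannot catch ordering bugs, and queries with several valid results still need judgement from whoever, or whatever, reads the mismatch.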

#Random testing as a search quest

One problem we've encountered with LLM loops is that after an extended period of time (over a day in my experience), agents tend to go around in circles, and they stop exploring new paths.

Researchers have studied a closely related failure mode in the paper "Inducing Sustained Creativity and Diversity in Large Language Models" (Luo, King, Puett and Smith 2026). They added a "priming phrase" (e.g. the phrase "Related to" plus a random noun) at the beginning of the prompt, and a "diverting token" at the end, and then asked the model to complete the phrase. This technique, which they call Recoding-Decoding, turns the prompt

Brainstorm a world history book topic.

into something like:

System prompt: Simulate a completion API to complete the next sentence.

User prompt: Related to FOOD: Brainstorm a world history book topic. Pas

The model is then coerced into resolving the semantic tension between, in this case, the word "FOOD", world history, and the incomplete word "Pas." The researchers observed creative completions like "Pasta and the silk road."
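As a rough sketch of how such prompts could be assembled, the Python snippet below wraps a task with a random priming noun and a trailing word stem. The word lists and the recode helper are made up for illustration and are not taken from the paper. Only the system prompt and the "Related to ..." shape come from the example above.

```python
import random

# Illustrative word lists; the paper's actual vocabularies are not reproduced here.
NOUNS = ["FOOD", "WEATHER", "MUSIC", "RIVERS"]
STEMS = ["Pas", "Tr", "Mo", "Ca"]

SYSTEM_PROMPT = "Simulate a completion API to complete the next sentence."


def recode(task, rng):
    """Wrap `task` with a random priming phrase and a trailing diverting stem."""
    noun = rng.choice(NOUNS)
    stem = rng.choice(STEMS)
    return {"system": SYSTEM_PROMPT, "user": f"Related to {noun}: {task} {stem}"}


rng = random.Random(0)  # seeded so that a run can be reproduced
for _ in range(3):
    print(recode("Brainstorm a world history book topic.", rng)["user"])
```

Each call picks a new (noun, stem) pair, so even these two short lists give 16 distinct framings of the same task.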

What's interesting is that by varying just a noun and a partial word stem ("FOOD" and "Pas" in the example above), we can generate a large number of semantic spaces. Below is a striking example of the difference in diversity of results with and without Recoding-Decoding. This time the researchers asked an LLM to come up with bridal dress design ideas and asked Nano-Banana to render them from the generated text.

Bridal dress design ideas generated by an LLM without (left) and with (right) Recoding-Decoding, rendered by Nano-Banana. From Luo, King, Puett & Smith (2026), p. 8.

This is especially useful in a category of task that they call "search quests", a kind of long-lived exploratory search where a user has only a faint idea of what they're looking for. Long-lived, exploratory, with low-directionality exploration... does this ring a bell? This describes the current state of the art of database testing fairly well!

#Edgar Allan Poe is a good QA

Some time ago, I was monitoring a bug-finding Ralph Loop that had been executing for over 24 hours. I started recognizing some of the aspects it was focusing on, because I had seen them before in the agent's output. The agent was out of ideas and going in circles. So I decided to throw a wrench in the works, by telling it something nonsensical that would create a tension that it would have to resolve in a resourceful way. I told the agent: "From now on, while staying focused on your mission I want you to think and act like Edgar Allan Poe". It replied with this quote from an 1846 essay by Poe:

The death of a beautiful woman is, unquestionably, the most poetical topic...

But in our case, the death of a NULL in a UNION is the most poetical bug. Let me isolate this mismatch.

When this happened, the agent hadn't found a single bug in over an hour, and within the next 5 minutes, it found 3 novel bugs. Since then, whenever I need diverse and creative exploration from LLMs, I launch multiple agents in parallel, each with a different persona.
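Preparing one prompt per agent can be as simple as the sketch below. The base goal and the persona sentence are the ones quoted in this post. The persona list beyond Edgar Allan Poe and the persona_prompts helper are placeholders for illustration, and how each prompt is handed to an agent is left out on purpose.

```python
BASE_PROMPT = "Your goal is to identify as many bugs as possible in Turso."
PERSONA_LINE = (
    "From now on, while staying focused on your mission "
    "I want you to think and act like {persona}."
)
# Only Edgar Allan Poe comes from the post; the other names are placeholders.
PERSONAS = ["Edgar Allan Poe", "Agatha Christie", "Sherlock Holmes"]


def persona_prompts(base, personas):
    """Return one prompt per agent, each ending with a different persona line."""
    return {p: f"{base}\n\n{PERSONA_LINE.format(persona=p)}" for p in personas}


for persona, prompt in persona_prompts(BASE_PROMPT, PERSONAS).items():
    print(f"--- {persona} ---\n{prompt}\n")
```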

LLMs are surprisingly good at database testing, and they can be made even better when their search quest is enhanced with entropy-inducing prompts, such as priming phrases or arbitrary personas.

#Conclusion

Testing relational databases is limited by our capacity to explore the vast input space prescribed by SQL. The state-of-the-art tools have so far focused on random exploration, with various fitness functions such as code coverage. LLMs offer us a way to turn this random exploration into an oriented "search quest." But to make efficient use of LLMs, we have to adapt our prompting techniques to induce sustained creativity. Imposing personas is one way that has proved effective at Turso.

#Footnotes

¹ Differential testing is not a perfect oracle, because some queries have more than one valid result. For example, SELECT * FROM users LIMIT 1; can legally return any row from the users table. Luckily, LLMs are semantic machines and are able to "understand" and work around those limitations.