
Modern applications, with AI agents chief among them, benefit tremendously from Full Text Search (FTS). Even where Vector Search is available, FTS still provides much better results in many domains, such as searching for product SKUs, error codes, or names.
SQLite supports FTS through "FTS5", one of the most popular SQLite extensions. In our mission to provide a fully API-compatible alternative to SQLite, we intend to allow SQLite extensions to be loaded as-is (which is a work in progress). However, SQLite's FTS5 extension is also full of caveats and shortcomings. And since our mission is also to go beyond SQLite, we asked ourselves: what if we built a new FTS engine in Turso that is aligned with the current state of the art?
In this article we will explore in detail how we built Turso's native FTS on top of Tantivy: a fast, Apache Lucene-style search engine library written in Rust, powering awesome projects such as ParadeDB and Quickwit. Tantivy already gives us what you’d expect from a serious search engine: tokenizers, BM25 ranking, phrase queries, prefix queries, segment merges, and a battle-tested on-disk format.
At the time of writing, FTS is available on the main branch in the CLI, behind the --experimental-index-method flag. We expect it to be available for broader testing in the next release.
SQL surface area: DDL, queries, and maintenance
Because we are not tied to SQLite's FTS5 implementation, we are no longer forced to use SQLite's Virtual Table abstraction. This lets us use Postgres-style syntax for defining custom indexes and gives us much tighter, direct integration with the query planner.
From SQL, creating an index looks like:
CREATE INDEX idx_posts ON posts USING fts (title, body);
You can configure the tokenizer Turso will use via:
CREATE INDEX idx_products ON products
USING fts (name) WITH (tokenizer = 'ngram');
Available Tokenizers
| Tokenizer | Description | Use Case |
|---|---|---|
| default | Lowercase, punctuation split, 40 char limit | General English text |
| raw | No tokenization, exact match only | IDs, UUIDs, tags, categories |
| simple | Basic whitespace/punctuation split | Simple text without lowercasing |
| whitespace | Split on whitespace only | Space-separated tokens |
| ngram | 2-3 character n-grams | Autocomplete, substring matching |
For the following examples, we'll use:
CREATE TABLE articles (
  id INTEGER PRIMARY KEY,
  title TEXT,
  body TEXT,
  category TEXT
);
INSERT INTO articles VALUES
  (1, 'Rewriting SQLite in Rust',
   'We explore B-tree internals and WAL design in a modern database engine.',
   'databases'),
  (2, 'Full Text Search with Tantivy',
   'Using Tantivy to build a fast full text search engine inside a database.',
   'search'),
  (3, 'Async IO in Rust',
   'How async IO and event loops interact with a database storage engine.',
   'systems'),
  (4, 'Understanding WAL',
   'A deep dive into write-ahead logging and crash recovery.',
   'databases');
Weights
BM25 ranking can be influenced by assigning weights to indexed columns. For example, to give the title field more weight than the body field during search ranking, you can define weights like this:
-- Create an FTS index with weights
CREATE INDEX idx_articles_fts
ON articles USING fts (title, body)
WITH (weights = 'title=2.0,body=1.0');
Basic Query Examples
-- Get BM25 ranking for matching documents, ordered by relevance
SELECT
  fts_score(title, body, 'database wal') AS score,
  id,
  title
FROM articles ORDER BY score DESC;
-- Returns:
-- 2.91 | 1 | Rewriting SQLite in Rust
-- 2.10 | 4 | Understanding WAL
-- MATCH in WHERE clause
SELECT
  id,
  title
FROM articles
WHERE (title, body) MATCH 'database wal' -- MATCH targets the indexed column names, not an FTS5 virtual table
LIMIT 1;
-- Returns:
-- 1 | Rewriting SQLite in Rust
Highlighting Search Results
The fts_highlight function wraps matching query terms with custom tags for display:
-- Basic highlighting (single column)
SELECT fts_highlight(
  'Learn about database optimization',
  '<b>', '</b>',
  'database'
);
-- Returns: "Learn about <b>database</b> optimization"
-- Multiple columns - text is concatenated with spaces
SELECT
  fts_highlight(title, body, '<mark>', '</mark>', 'database') AS highlighted
FROM articles
WHERE (title, body) MATCH 'database';
-- If title='Database Design' and body='Learn about optimization',
-- Returns: "<mark>Database</mark> Design Learn about optimization"
The first problem we ran into is that, by default, Tantivy assumes it owns a directory of files; Turso (very intentionally) does not give it one.
The core idea is to store Tantivy’s “files” directly in Turso's internal B-Tree.
Tantivy persists an index as a directory full of files: global metadata files and per-segment immutable structures. Those are normally regular OS files, and Tantivy interacts with them via its Directory trait. So we implemented Directory, but instead of reading and writing real filesystem files, we store Tantivy’s bytes directly in our B-Tree storage. That buys us some nice properties for free:
Transactional guarantees: index updates participate in the same transaction as table writes.
Durability: index changes go through the same WAL.
Crash safety / rollback: if the SQL transaction rolls back, the FTS index rolls back too.
Single-file distribution: the full-text index lives inside the database file.
Each index is essentially stored like a normal table (simplified):
CREATE TABLE fts_dir_{idx_id} (
  path TEXT NOT NULL,
  chunk_no INTEGER NOT NULL,
  bytes BLOB NOT NULL
);
Each Tantivy “file” is identified by path and split into fixed-size (512KB) chunks stored under (path, chunk_no).
Tantivy’s segment files can be large, hundreds of megabytes on real datasets. You don’t want to load entire .idx or .store files just to satisfy a single range read.
Chunking lets us do three important things (a sketch of the range-to-chunk mapping follows this list):
Partial reads: if Tantivy asks for byte range [start..end), we only fetch the needed chunks.
Bounded caching: we can cache hot chunks and evict cold ones without treating each segment file as an all-or-nothing memory-resident blob.
Predictable IO: avoids large, unpredictable reads during query execution.
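To make the partial-read path concrete, here is a minimal sketch of the range-to-chunk arithmetic. It assumes only the 512 KB chunk size and the (path, chunk_no) layout described above; the constant and function names are illustrative, not Turso's actual internals.

```rust
/// Sketch only: maps a requested byte range onto fixed-size 512 KB chunks.
const CHUNK_SIZE: u64 = 512 * 1024;

/// Given a byte range [start..end) within one Tantivy "file", return the
/// chunk numbers that must be fetched plus the slice to take from each chunk.
fn chunks_for_range(start: u64, end: u64) -> Vec<(u64, std::ops::Range<usize>)> {
    assert!(start <= end);
    let mut reads = Vec::new();
    let first_chunk = start / CHUNK_SIZE;
    let last_chunk = if end == start { first_chunk } else { (end - 1) / CHUNK_SIZE };
    for chunk_no in first_chunk..=last_chunk {
        let chunk_start = chunk_no * CHUNK_SIZE;
        // Offsets of the requested range inside this particular chunk.
        let lo = start.saturating_sub(chunk_start) as usize;
        let hi = (end.min(chunk_start + CHUNK_SIZE) - chunk_start) as usize;
        reads.push((chunk_no, lo..hi));
    }
    reads
}

fn main() {
    // A 32-byte read straddling the first chunk boundary touches exactly two chunks.
    let reads = chunks_for_range(512 * 1024 - 16, 512 * 1024 + 16);
    assert_eq!(reads, vec![(0, 524272..524288), (1, 0..16)]);
    println!("{reads:?}");
}
```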
The solution we ended up with was HybridBTreeDirectory:
Catalog – always in memory
Hot cache – small, frequently accessed files
Chunk cache – bounded cache for large segment data
Tantivy's Directory trait is synchronous, while our pager's IO is asynchronous (not traditional async Rust, but driven by a stepper and our own internal event loop) and our B-Tree operations can yield.
The compromise we landed on was to keep metadata cheap and always resident, cache hot small files fully, and lazy-load big segment data in chunks.
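As a rough illustration, the shape of that directory looks something like the following. The field names and types are stand-ins rather than Turso's actual code, and the real caches are bounded and sit on top of the pager instead of plain hash maps.

```rust
use std::collections::HashMap;

/// Illustrative sketch of the HybridBTreeDirectory layout described above.
struct HybridBTreeDirectory {
    /// Catalog: every known "file" path and its total length, always in memory.
    catalog: HashMap<String, u64>,
    /// Hot cache: small, frequently accessed files kept fully resident
    /// (metadata, term dictionaries, fast fields).
    hot_cache: HashMap<String, Vec<u8>>,
    /// Chunk cache: bounded in practice (e.g. LRU), keyed by (path, chunk_no),
    /// holding 512 KB chunks of large segment data loaded on demand.
    chunk_cache: HashMap<(String, u64), Vec<u8>>,
}
```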
We classify files into categories based on Tantivy’s format:
Metadata: tiny, always resident
TermDictionary: very hot for query performance
FastFields: small, frequently touched
SegmentData: large, cold, demand-loaded
That classification directly drives our caching decisions:
hot_cache: caches entire files for Metadata/TermDictionary/FastFields
chunk_cache: caches individual chunks for SegmentData
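Here is a minimal sketch of that classification, keyed off Tantivy's on-disk file extensions. The category names mirror the list above; the extension-to-category mapping is simplified for illustration and is a Turso-side policy, not Tantivy API.

```rust
use std::path::Path;

/// Rough file classification driving the two-tier cache described above.
#[derive(Debug, PartialEq)]
enum FileCategory {
    Metadata,       // meta.json and friends: tiny, always resident
    TermDictionary, // .term: very hot for query performance
    FastFields,     // .fast: small, frequently touched
    SegmentData,    // postings, positions, doc store, ...: large, cold, demand-loaded
}

fn classify(path: &Path) -> FileCategory {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("json") => FileCategory::Metadata,
        Some("term") => FileCategory::TermDictionary,
        Some("fast") => FileCategory::FastFields,
        _ => FileCategory::SegmentData,
    }
}

/// Everything except SegmentData is cached as a whole file in hot_cache;
/// SegmentData only ever enters the chunk_cache, one chunk at a time.
fn goes_in_hot_cache(path: &Path) -> bool {
    classify(path) != FileCategory::SegmentData
}

fn main() {
    assert!(goes_in_hot_cache(Path::new("meta.json")));
    assert!(!goes_in_hot_cache(Path::new("00000000000000000000000000000000.store")));
}
```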
Tantivy’s default performance story assumes it’s writing to a filesystem directory that supports a few things you don’t really think about until you don’t have them:
multiple files being written concurrently
background threads flushing segments
merge threads reading old segments while writing new ones
independent file handles with OS-level buffering and cheap append/rename patterns
In our case, the “directory” is a B-Tree table inside the database file, and that drastically changes the concurrency model.
Even if Tantivy has multiple worker threads building postings in memory, eventually it needs to persist a segment. Because the Directory implementation funnels all persistence through a single pager and a single B-Tree root inside one write transaction, we effectively inherit “single writer” semantics at the storage layer. That has a direct implication: merges become dangerous.
Why merges are the real problem
Indexing new segments is simply “write new files + update meta.json”. Merging segments looks something like:
read multiple existing segments
write a new merged segment
update meta.json to swap segment sets
possibly delete old segment files
On a filesystem, that’s normal and highly optimized. Inside a single B-Tree with one writer, merges add long-running transactions, high write amplification, more churn in “files” (delete + rewrite chunk sets) and more opportunities for the Directory to be invoked in complex interleavings.
Most importantly: merges want to be background work. But our directory persistence is tied to the DB transaction semantics. Background merge work would either write outside the user’s SQL transaction (bad / inconsistent), or keep a write transaction open much longer than acceptable.
To keep correctness and the integration simple, we configure Tantivy with NoMergePolicy, which disables automatic background merges.
Depending on the workload (lots of updates), this can matter quite a bit: without merges, many small segments accumulate over time and query performance can drift significantly.
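For reference, this is roughly what wiring that up looks like against Tantivy's public API, assuming a recent Tantivy release (exact signatures may differ slightly between versions). The schema and writer setup exist only to make the snippet self-contained and are not our actual integration code.

```rust
use tantivy::merge_policy::NoMergePolicy;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Minimal standalone setup so the snippet is self-contained.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let index = Index::create_in_ram(schema_builder.build());

    // 50 MB writer memory budget; the number is arbitrary for this sketch.
    let mut writer = index.writer(50_000_000)?;

    // The important part: never schedule background segment merges.
    writer.set_merge_policy(Box::new(NoMergePolicy));

    writer.add_document(doc!(title => "Understanding WAL"))?;
    writer.commit()?;
    Ok(())
}
```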
We needed a database-native way to accomplish these merges intentionally, so we introduced a new SQL command:
Mitigation: OPTIMIZE INDEX
Instead of letting Tantivy merge in the background on its own, we expose a manual, explicit operation:
OPTIMIZE INDEX fts_articles;
-- or
OPTIMIZE INDEX; -- will optimize all FTS indexes
The semantics are deliberately similar to other maintenance operations such as ANALYZE and VACUUM: it runs under your control. You can schedule it, run it after bulk inserts or a long period of updates/deletes that created many segments, run it when query latency starts creeping up, or fold it into normal maintenance windows.
In our mission to go beyond SQLite, Turso v0.5 will ship with experimental support for native FTS. Together with our native support for Vector Search, it gives agents and developers the tools they need to find the information they need at the speed of an embedded database.
Having a native implementation of these primitives lets us extend the syntax and provide much better ergonomics than SQLite extensions. It is a lift, but the native integration buys us key properties like keeping index updates transactional. Excited to try the next generation of FTS on Turso? Check out our GitHub now!