Local RAG with Ollama, Mistral, and Turso

Jamie Barton

Ollama is an open-source tool for running large language models locally. Paired with a model like Mistral, it can handle tasks such as text generation, summarization, question answering, and embedding generation.

Whether you're building a document search system, a technical support chatbot, or a content recommendation engine, a local-first approach provides a strong foundation for secure, efficient, and cost-effective RAG implementations.

In this article, we'll take a look at how you can use Ollama with Turso to build a RAG (Retrieval-Augmented Generation) pipeline that runs locally and works offline.

#Why RAG?

RAG has become a crucial technique for enhancing LLM responses with relevant context. Traditional cloud-based RAG solutions come with several challenges that local RAG can address:

#Data Privacy & Security

Sending sensitive documents to cloud services exposes your data to potential security risks and compliance issues. Local RAG keeps your data entirely within your control, whether you use Turso's vector support on your own server or on device.

#Cost Efficiency

Cloud vector databases can become expensive as data grows. With Turso's libSQL, you can:

  • Perform embeddings locally
  • Sync only the vectors to Turso
  • Read data efficiently from local SQLite files using Embedded Replicas

#Reduced Network Latency

By keeping your database local, whether on-device or on your own server, you can:

  • Perform vector search queries without network round-trips
  • Read directly from SQLite files
  • Significantly improve RAG application performance

#Offline Capability

With Turso's Embedded Databases and Ollama running locally, your RAG architecture remains functional without an internet connection. This makes it ideal for:

  • Air-gapped devices
  • Unreliable network environments
  • Offline-first applications (offline writes in beta)

#Enhanced Simplicity

Turso's libSQL provides a unified solution for data storage:

  • Store both data and vector embeddings in the same table
  • Seamlessly query related data and embeddings
  • Maintain consistency across online and offline modes

For local embedding generation, Ollama provides efficient processing with popular models like Mistral, making it an excellent choice for local-first architectures.

#Working with libSQL and Ollama

Let's build a local RAG system that stores and retrieves movie data. You can adapt this to your own use case, such as a local search engine, a chatbot, or a PDF chat bot.

We'll first build the database and cover the retrieval with vector similarity search to show how easy that is, then adapt it to follow a RAG pattern.

#Prerequisites

First, install the following dependencies:

npm install @libsql/client ollama

Install Ollama locally, and then run:

ollama run mistral

#Creating the Vector Database

You'll notice below there is very little setup for the database — simply pass the libSQL client a file path, and that's it! There are no extensions to install.

We'll use libSQL's native vector support, which stores its data inside a SQLite file — the vector embeddings are stored as BLOBs, and libSQL's vector functions handle the serialization and deserialization for you.

import { createClient } from '@libsql/client';

const client = createClient({
  url: 'file:local.db',
});

// Initialize the database schema
await client.batch(
  [
    // Create table with a vector embedding column (F32_BLOB)
    'CREATE TABLE IF NOT EXISTS movies (id INTEGER PRIMARY KEY, title TEXT NOT NULL, description TEXT NOT NULL, embedding F32_BLOB(4096))',

    // Create a vector index for similarity search
    'CREATE INDEX IF NOT EXISTS movies_embedding_idx ON movies(libsql_vector_idx(embedding))',
  ],
  'write',
);

Let's break down the important parts:

  • F32_BLOB(4096) — This column type stores 32-bit floating-point vectors with 4096 dimensions (matching Mistral's embedding size).
  • libsql_vector_idx — Creates an index optimized for vector similarity search.
  • file:local.db — Stores everything in a local SQLite file.

#Generating Embeddings with Ollama

We'll use Ollama's local API to generate embeddings:

import ollama from 'ollama';

async function getEmbedding(prompt: string) {
  const response = await ollama.embeddings({
    model: 'mistral',
    prompt,
  });

  return response.embedding;
}
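
The movies table declared its embedding column as F32_BLOB(4096), so it's worth sanity-checking that the model you're running actually returns 4096-dimensional vectors. A quick check using the getEmbedding helper above:

// Verify the embedding dimension matches the F32_BLOB(4096) column
const sample = await getEmbedding('test sentence');

console.log(`Embedding dimension: ${sample.length}`);

if (sample.length !== 4096) {
  throw new Error(
    `Expected 4096 dimensions, got ${sample.length}; adjust the F32_BLOB size to match your model`,
  );
}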

#Storing Documents with Embeddings

Here's how you can insert documents (in our case, movies) with their embeddings:

async function insertMovie(title: string, description: string) {
  const embedding = await getEmbedding(description);

  await client.execute({
    sql: `
        INSERT INTO movies (title, description, embedding)
        VALUES (?, ?, vector(?))
        `,
    args: [title, description, JSON.stringify(embedding)],
  });
}

Note the vector() SQL function — this converts the JSON array of embeddings into libSQL's vector format.
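
If you want to confirm what was actually stored, libSQL also exposes a vector_extract() function that converts the BLOB back into its JSON text representation. A small sketch, assuming at least one movie has already been inserted:

const check = await client.execute(
  'SELECT title, vector_extract(embedding) AS embedding_json FROM movies LIMIT 1',
);

if (check.rows.length > 0) {
  // vector_extract returns the embedding as a JSON array string
  const vector = JSON.parse(check.rows[0].embedding_json as string);
  console.log(`${check.rows[0].title} has ${vector.length} dimensions`);
}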

#Semantic Search with Cosine Similarity

The real power comes from finding similar documents using vector similarity search:

async function findSimilarMovies(description: string, limit = 3) {
  const queryEmbedding = await getEmbedding(description);

  const results = await client.execute({
    sql: `
        WITH vector_scores AS (
          SELECT
            rowid as id,
            title,
            description,
            embedding,
            1 - vector_distance_cos(embedding, vector32(?)) AS similarity
          FROM movies
          ORDER BY similarity DESC
          LIMIT ?
        )
        SELECT id, title, description, similarity
        FROM vector_scores
        `,
    args: [JSON.stringify(queryEmbedding), limit],
  });

  return results.rows;
}

Let's break down the important parts of the search:

  • vector_distance_cos calculates the cosine distance between vectors.
  • 1 - distance converts that distance into a similarity score (higher is better).
  • vector32(?) converts the JSON embedding array to a vector.
  • Results are ordered by similarity and limited to the top matches.
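
The query above compares the query embedding against every row, which is fine for a handful of movies but doesn't make use of the index we created earlier. For larger tables, libSQL provides a vector_top_k() table-valued function that queries the approximate nearest neighbour index directly. A sketch of an index-backed variant, assuming the movies_embedding_idx index from the schema above:

async function findSimilarMoviesIndexed(description: string, limit = 3) {
  const queryEmbedding = await getEmbedding(description);

  // vector_top_k returns the rowids of the nearest vectors from the index,
  // which we join back to the movies table to get the full rows
  const results = await client.execute({
    sql: `
        SELECT m.id, m.title, m.description
        FROM vector_top_k('movies_embedding_idx', vector32(?), ?) AS v
        JOIN movies AS m ON m.rowid = v.id
        `,
    args: [JSON.stringify(queryEmbedding), limit],
  });

  return results.rows;
}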

#Inserting example data

// Insert some sample movies
const sampleMovies = [
  {
    title: 'Inception',
    description:
      'A thief who enters the dreams of others to steal secrets from their subconscious.',
  },
  {
    title: 'The Matrix',
    description:
      'A computer programmer discovers that reality as he knows it is a simulation created by machines.',
  },
  {
    title: 'Interstellar',
    description:
      'Astronauts travel through a wormhole in search of a new habitable planet for humanity.',
  },
];

for (const movie of sampleMovies) {
  await insertMovie(movie.title, movie.description);
  console.log(`Inserted: ${movie.title}`);
}

If you wanted to stop here and just focus on semantic search, you could use the code above to build a local search engine like this:

const query =
  'A sci-fi movie about virtual reality and artificial intelligence';
const similarMovies = await findSimilarMovies(query);

console.log('\nSimilar movies found:');

similarMovies.forEach((movie) => {
  console.log(`\nTitle: ${movie.title}`);
  console.log(`Description: ${movie.description}`);
  console.log(`Similarity: ${movie.similarity.toFixed(4)}`);
});

But let's take it a step further and make it RAG.

#Making it RAG

Now we'll use the findSimilarMovies function and pass the retrieved context to Ollama to generate a response to the user's prompt.

async function generateResponse(query: string) {
  const similarMovies = await findSimilarMovies(query);

  const context = similarMovies
    .map(
      (movie) =>
        `Title: ${movie.title}\nDescription: ${
          movie.description
        }\nSimilarity Score: ${movie.similarity.toFixed(4)}`,
    )
    .join('\n\n');

  const prompt = `
You are a knowledgeable movie expert.
Use the following movie information to answer the user's question.
Only use information from the provided context.
If the context doesn't contain enough information to answer fully, acknowledge this limitation.

Context:
${context}

User Question: ${query}

Instructions:
1. Base your response only on the movies provided in the context
2. Consider the similarity scores when weighing the relevance of each movie, but don't reference them in your response
3. If a movie is only tangentially related, mention this
4. If the context doesn't provide enough information, acknowledge this limitation

Response:`;

  const response = await ollama.chat({
    model: 'mistral',
    messages: [
      {
        role: 'user',
        content: prompt,
      },
    ],
  });

  return {
    response: response.message.content,
    sourceDocuments: similarMovies,
  };
}

The key difference is that RAG doesn't just find similar documents — it uses those documents as context for the LLM to generate a more informed response.

const result = await generateResponse(
  'What movies involve artificial intelligence?',
);
console.log('\nGenerated Response:', result.response);
console.log('\nSource Documents:', result.sourceDocuments);

The generated response should look something like this:

In the movies provided within the given context, "The Matrix" might be the one that hints at the concept of artificial intelligence. While not explicitly stating AI, it presents a reality that is a simulation created by machines, which could be interpreted as a form of advanced artificial intelligence.
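
If you're building an interactive chatbot on top of this, you may not want to wait for the full answer before showing anything. The Ollama client also supports streaming. A minimal sketch, assuming you pass in the same prompt string that generateResponse builds:

async function streamResponse(prompt: string) {
  // With stream: true, chat() yields chunks as the model generates them
  const stream = await ollama.chat({
    model: 'mistral',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const part of stream) {
    process.stdout.write(part.message.content);
  }
}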

#Conclusion

This local RAG implementation offers a powerful combination of features that make it an excellent choice for many applications:

  • Privacy and Security

    • Complete data sovereignty with all processing happening locally
    • Perfect for handling sensitive or confidential information
    • Compliant with air-gapped environments and strict security requirements
  • Performance Benefits

    • Lightning-fast queries with direct SQLite access
    • No network latency when reading vector embeddings
    • Efficient vector similarity search through libSQL's optimized indexing
  • Cost and Resource Efficiency

    • Fewer services to provision and pay for
    • Run locally on your own infrastructure with a single file, or with Turso for synchronization
  • Developer Experience

    • Single SQLite file for both data and vector embeddings
    • No complex cloud infrastructure to manage
    • Easy integration with existing applications using various libSQL SDKs
  • Flexibility and Extensibility

    • Easily adaptable for various use cases (document search, chatbots, recommendation systems)
    • Support for different embedding models through Ollama

By combining libSQL's vector capabilities with Ollama's local embeddings, you can build powerful, privacy-preserving applications that run entirely on your own infrastructure — without the need for external cloud providers.

#Going further with Turso

Think of Turso like iCloud for your vector database. Just as iCloud seamlessly syncs your Notes, Reminders, and Documents across all your Apple devices, Turso handles the synchronization of your vector embeddings and associated data across multiple instances of your application.

Everything in this post is built with the libSQL client locally, using a single SQLite file. Extending this guide to use Turso is straightforward, and comes with many benefits:

  1. Multi-Device Synchronization

    • Keep vector embeddings in sync across all user devices
    • Handle offline changes with automatic conflict resolution (soon with offline writes)
    • Maintain consistent search results across all devices
  2. Tenant Isolation

    • Create separate databases per user or organization
  3. Hybrid Architecture

    • Insert and query embeddings locally for performance
    • Sync vector embeddings to Turso for backup and sharing
    • Maintain offline capability while ensuring data consistency

Updating our RAG code to work with Turso is simple — keep the url pointing at your local SQLite file, and provide your Turso database URL via the syncUrl property, along with an authToken:

import { createClient } from '@libsql/client';

const client = createClient({
  // Local SQLite file used for reads
  url: 'file:local.db',

  // Remote Turso database to sync with
  syncUrl: 'libsql://[your-database].turso.io',
  authToken: '...',
});
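
With an embedded replica, reads are served from the local file, and you control when the replica pulls changes from Turso by calling sync() on the client. A minimal sketch; the interval shown here is just an example value:

// Pull the latest changes from Turso before querying
await client.sync();

// Optionally sync on an interval so local search results stay fresh
setInterval(() => {
  client.sync().catch((err) => console.error('Sync failed:', err));
}, 60_000);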

Your application now gets the best of both worlds:

  • Local-first performance with SQLite
  • Cloud-powered synchronization with Turso
  • Offline capability with automatic syncing
  • Multi-device support out of the box

#Building Offline-Capable AI Applications

This architecture opens up exciting possibilities for AI applications:

  • Chatbots that work offline but stay in sync across devices
  • Document search that maintains consistent results everywhere
  • Recommendation systems that work without constant internet access
  • Multi-user systems with isolated, synchronized vector stores

Sign up for free to start building your offline-capable AI applications today.
