How to Generate & Store OpenAI Vector Embeddings with Turso

Learn how to generate and store OpenAI Embeddings.


Previously we learned about using SQLite as your LLM Vector Database and what embedding models are and do. Choosing an embedding model is an important decision, as it will ultimately affect the quality of your search results.

Another important factor when choosing an embedding model is cost. Some models are open source and free to use, while others require a subscription or payment per request.

In this article, we'll be using the new and improved text-embedding-ada-002 model provided by OpenAI, which is optimized for creating text embeddings.

#Upgrade your group to use vectors

First, you will need a database in a Turso group that is running the vector version, because vector support is still in beta.

We'll be using the Turso CLI throughout this tutorial to create, connect, and query the database — install it if you haven't already.
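If you don't have the CLI yet, it can typically be installed with Homebrew or with Turso's install script, and then authenticated against your account (check the Turso docs if these commands have changed):

brew install tursodatabase/tap/turso
# or: curl -sSfL https://get.tur.so/install.sh | bash

turso auth signup

With the CLI installed and authenticated, create a group for your database if you don't already have one: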

turso group create <group-name>

Make sure to replace <group-name> with the name of your group. Typically default is used for the group name.

#Create a new database

Begin by creating a new database:

turso db create mydb

Next, fetch your database URL and auth token; we'll need them later:

turso db show mydb --url
turso db tokens create mydb

Now let's connect to the database shell and create a table to store the embeddings.
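Assuming your database is named mydb as above, you can open the shell with the Turso CLI:

turso db shell mydb

Once connected, run the following statement to create the table: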

CREATE TABLE IF NOT EXISTS documents (
  embedding F32_BLOB(1536)
);

Notice that the embedding column uses a special type, F32_BLOB(1536), that is optimized for storing embeddings. The number 1536 is the number of dimensions produced by the OpenAI embedding model we are using.

Next, let's create an index with the diskann type and cosine metric. This is a special vector index type that is optimized for cosine similarity operations:

CREATE INDEX IF NOT EXISTS documents_idx
ON documents (
    libsql_vector_idx(
        embedding,
        'type=diskann',
        'metric=cosine'
    )
);

To confirm everything is set up correctly, you can run the following command to see the table schema:

.schema

You should see the following output:

CREATE TABLE libsql_vector_meta_shadow ( name TEXT, metadata BLOB );
CREATE TABLE documents (embedding F32_BLOB(1536));
CREATE INDEX documents_idx ON documents ( libsql_vector_idx(embedding, 'type=diskann', 'metric=cosine') );
CREATE TABLE documents_idx_shadow (index_key INT, data BLOB);

You'll notice there are a few additional tables — those are managed internally by libSQL for your vector indexes.

#Install libSQL client

Begin by installing the @libsql/client package. If you're using something other than JavaScript/TypeScript, you will need to use the libSQL client for that language instead:

npm install @libsql/client

Next, create a client instance using the URL and auth token you fetched earlier:

import { createClient } from '@libsql/client';

const db = createClient({
  url: '...',
  authToken: '...',
});
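In practice you'll usually load the URL and token from environment variables rather than hardcoding them. Here's a minimal sketch assuming the variables are named TURSO_DATABASE_URL and TURSO_AUTH_TOKEN (use whatever names your setup provides):

import { createClient } from '@libsql/client';

const db = createClient({
  url: process.env.TURSO_DATABASE_URL!,
  authToken: process.env.TURSO_AUTH_TOKEN,
});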

#OpenAI API Setup

If you haven't already, sign up for OpenAI and create an API key.

Next, install the OpenAI package for your language of choice. Here we are using Node.js:

npm install openai

Now import and instantiate the OpenAI client with your API key:

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: '...',
});
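The SDK also reads the OPENAI_API_KEY environment variable by default, so if that variable is set you can construct the client without passing a key:

import OpenAI from 'openai';

// Falls back to process.env.OPENAI_API_KEY when no apiKey is provided.
const openai = new OpenAI();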

#Creating Embeddings

We'll be using the text-embedding-ada-002 model which is optimized for creating text embeddings:

async function createEmbeddings(
  input: string | Array<string> | Array<number> | Array<Array<number>>,
) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input,
  });

  return response.data.map((item) => item.embedding);
}

This function takes some input and returns an array of embeddings; each embedding is an array of numbers. The model we're using produces embeddings with 1536 dimensions, which matches the F32_BLOB(1536) column we created in the database.

Now let's create embeddings for some sample text:

async function main() {
  const embedding = await createEmbeddings(
    'The quick brown fox jumps over the lazy dog',
  );

  console.log(embedding);
}

main();

If you execute this code, you should see the embeddings:

[ -0.003653927, 0.008363503, -0.01428896, -0.0045085, -0.015432579, 0.018687496, -0.020446911,
  -0.010091499, -0.01285001, -0.028150633, 0.019554636, 0.017129159, 0.02172877, 0.0014962879, ....

#Storing Embeddings

Now let's write some code to process our embeddings in batches of 50 at a time. For each embedding, we create an SQL INSERT statement that adds it to the documents table:

const batchSize = 50;

async function storeEmbeddings(embeddings: number[][]) {
  for (let i = 0; i < embeddings.length; i += batchSize) {
    // Convert each number[] into raw Float32 bytes, since the F32_BLOB
    // column expects a binary vector rather than a JSON array.
    const batch = embeddings.slice(i, i + batchSize).map((embedding) => ({
      sql: `INSERT INTO documents (embedding) VALUES (?)`,
      args: [new Float32Array(embedding).buffer as ArrayBuffer],
    }));

    try {
      await db.batch(batch);
      console.log(`Stored embeddings ${i + 1} to ${i + batch.length}`);
    } catch (error) {
      console.error(
        `Error storing batch ${i + 1} to ${i + batch.length}:`,
        error,
      );
    }
  }
}

We use the db.batch method to execute multiple SQL statements in a single transaction. This is more efficient than executing each statement individually.
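The batch method also accepts an optional transaction mode as its second argument. For an insert-heavy job like this one you could request a write transaction explicitly (a small variation on the call above, not required):

// Same statements as before, but explicitly opening a write transaction.
await db.batch(batch, 'write');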

#Trying it out

We're now ready to put the pieces together to generate and store embeddings with libSQL:

async function main() {
  const embeddings = await createEmbeddings(
    'The quick brown fox jumps over the lazy dog',
  );

  await storeEmbeddings(embeddings);
}

main();

If you now query your database using turso db shell, you will see the embeddings stored as a BLOB in the embedding column:

SELECT * FROM documents;
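The embedding column will show up as raw binary data. If you want a human-readable view of a stored vector while poking around in the shell, libSQL provides a vector_extract function that converts the blob back to its text form (just a quick sanity check, assuming the table created earlier):

SELECT vector_extract(embedding) FROM documents LIMIT 1;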

The raw BLOBs aren't very useful on their own, but we can use the documents_idx index (based on the DiskANN algorithm) we created earlier to perform similarity searches on the embeddings.
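As a quick preview (a later post covers this in detail), a nearest-neighbour lookup with libSQL's vector_top_k table-valued function could look roughly like this; the query text and the value 5 are arbitrary examples, and createEmbeddings is the helper defined earlier:

// Embed the search text, then ask the DiskANN index for the 5 closest rows.
// vector_top_k returns the row ids of the nearest vectors in documents_idx.
const [queryEmbedding] = await createEmbeddings('a fast fox leaping over a dog');

const nearest = await db.execute({
  sql: `SELECT id FROM vector_top_k('documents_idx', vector32(?), 5)`,
  args: [JSON.stringify(queryEmbedding)],
});

console.log(nearest.rows);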

It's also useful to store the original content, and any other metadata you might have, alongside the embeddings so you can see what they represent — we'll learn about that in another post.

#Optimizing for the real world

So far we've only created embeddings for the text "The quick brown fox jumps over the lazy dog", but when working with larger documents (such as a PDF), you'll need to split them into smaller chunks to avoid hitting the OpenAI API limits.

We can then invoke createEmbeddings() for each chunk and store each resulting embedding in the database as a separate row.
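As a rough sketch of the idea (the chunkText and processDocument names here are just examples, and a real implementation would split on sentence or paragraph boundaries rather than fixed character counts):

// Naive fixed-size chunking; LangChain's text splitters do this far better.
function chunkText(text: string, chunkSize = 1000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

async function processDocument(text: string) {
  const chunks = chunkText(text);

  // createEmbeddings accepts an array of inputs, so all chunks can be
  // embedded in one request and stored as separate rows.
  const embeddings = await createEmbeddings(chunks);
  await storeEmbeddings(embeddings);
}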

We'll learn more in the next post about how to handle larger documents: splitting a PDF into chunks using the pdf-parse package together with LangChain's text splitter utility, generating embeddings with OpenAI, and storing each chunk in the database as a separate row.

If you're not already using Turso, you can sign up for free, and if you have any questions about SQLite, Vector Embeddings, or anything else — join us on Discord.
