Introduction: From Chunks to Searchable Knowledge

Welcome back! In the last lesson, you learned how to split documents into smaller, overlapping chunks. This step is important because it helps us find and use the right parts of a document when answering questions or searching for information.

Now, we are ready to take the next step: making these chunks searchable by computers. To do this, we need to turn each chunk of text into a special format called an embedding. Embeddings are a way for computers to understand the meaning of text and compare different pieces of text quickly.

By the end of this lesson, you will know how to:

  • Turn text chunks into embeddings using OpenAI's model.
  • Store these embeddings in a vector database (LibSQLVector) so you can search and compare them efficiently.

Let's get started!

Quick Recall: What Are Chunks?

Before we move on, let's quickly remind ourselves what chunks are. In the previous lesson, you learned how to split a document into smaller pieces called "chunks." For example, if you have a long email, you might break it into several sentences or paragraphs. This makes it easier to find and use the most relevant part of the document later.

Here's a simple example:

Suppose you have this document:

  "Hello, thank you for reaching out. I am currently out of the office and will reply when I return. For urgent matters, please contact my colleague."

After chunking, you might have:

  • Chunk 1: "Hello, thank you for reaching out."
  • Chunk 2: "I am currently out of the office and will reply when I return."
  • Chunk 3: "For urgent matters, please contact my colleague."

Now, let's see how we can turn these chunks into something a computer can search and compare.

What Are Embeddings and Why Do We Use Them?

When we want a computer to compare pieces of text by meaning (not just by exact words), we use embeddings. An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings will have embeddings that are close together, even if the words are different.

Real-world analogy:
Think of embeddings like coordinates on a map. If two cities are close together on the map, they are similar in location. If two text chunks have embeddings that are close together, they are similar in meaning.

For example:

  • "I am out of the office." and "I am not at work right now." will have embeddings that are close together.
  • "I am out of the office." and "The weather is sunny." will have embeddings that are far apart.

Embeddings make it possible to search for similar meanings, not just exact words.
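The "coordinates on a map" idea can be made concrete with a toy example. The vectors below are made up (real embeddings have hundreds or thousands of dimensions), but the comparison shown — cosine similarity — is the same measure vector databases actually use to decide which embeddings are "close together":

```typescript
// Toy "embeddings" (invented numbers, 3 dimensions instead of hundreds).
const outOfOffice = [0.9, 0.1, 0.0]; // "I am out of the office."
const notAtWork = [0.8, 0.2, 0.1]; // "I am not at work right now."
const sunnyWeather = [0.1, 0.1, 0.9]; // "The weather is sunny."

// Cosine similarity: close to 1 means similar meaning, close to 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

console.log(cosineSimilarity(outOfOffice, notAtWork).toFixed(2)); // → 0.98 (close together)
console.log(cosineSimilarity(outOfOffice, sunnyWeather).toFixed(2)); // → 0.12 (far apart)
```

The exact numbers don't matter; what matters is that the two sentences about being away from work score much higher than the unrelated one.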

Creating Embeddings with OpenAI

Let's see how to create embeddings for our text chunks using OpenAI's embedding model. We'll create a function called generateEmbeddings that handles this process.

Step 1: Prepare the Chunks

First, let's assume you already have your chunks in an array called chunks. Each chunk has a text property.
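Using the example document from earlier, the chunks array might look like this (the shape is what matters: an array of objects, each with a text property):

```typescript
// Chunks produced in the previous lesson: each object holds one piece
// of the original document in its `text` property.
const chunks = [
  { text: "Hello, thank you for reaching out." },
  { text: "I am currently out of the office and will reply when I return." },
  { text: "For urgent matters, please contact my colleague." },
];
```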

Step 2: Generate Embeddings

Now, let's create a function to turn each chunk into an embedding using OpenAI's model.

Explanation:

  • We extract the text from each chunk using map.
  • embedMany sends all the texts to the OpenAI model at once, which returns an array of embeddings.

What does an embedding look like?
Each embedding is a long list of numbers. For example, OpenAI's text-embedding-3-small model produces 1,536 numbers per text, something like (shortened):

  [0.0123, -0.0456, 0.0789, ..., 0.0012]

You don't need to understand the numbers themselves—just know that similar texts will have similar embeddings.

Storing Embeddings in a Vector Database (LibSQLVector)

Now that we have embeddings, we need a way to store and search them efficiently. This is where a vector database comes in. LibSQLVector is a lightweight choice that works with SQLite databases, perfect for local development and smaller applications.

Step 3: Store Embeddings in LibSQLVector

Let's create a function to store the embeddings along with the original text chunks:

Explanation:

  • We store both the embeddings (vectors) and the original text as metadata.
  • The indexName helps organize different types of embeddings in the same database.
  • The metadata allows us to retrieve the original text when we find similar embeddings later.

Step 4: Putting It All Together

Here's how everything works together in a complete example:

Note:
On CodeSignal, the required libraries are already installed, so you don't need to worry about setup here. On your own device, you would need to install these libraries and set up your database file.

Summary and What's Next

In this lesson, you learned how to:

  • Turn document chunks into embeddings using OpenAI's model.
  • Store these embeddings in a vector database (LibSQLVector) for fast and meaningful search.
  • Keep the original text as metadata so you can retrieve it later.

These steps are key to building a smart email assistant that can find and use the most relevant information from documents. In the next practice exercises, you'll get hands-on experience generating and storing embeddings yourself. Try out the code, and see how embeddings make searching by meaning possible!
