Welcome back! In the previous lesson, you learned how to load and split documents using LangChain in TypeScript, setting the foundation for more advanced document processing tasks. Today, we will take the next step in our journey by exploring embeddings, a crucial concept in document processing.
Embeddings are numerical representations of text data that capture the semantic meaning of words, phrases, or entire documents. They are essential for working with Large Language Models (LLMs) because they allow these models to understand and process text in a meaningful way. By converting text into embeddings, we can perform various tasks, such as similarity search, clustering, and classification.
In this lesson, we will focus on generating embeddings for document chunks using OpenAI and LangChain in TypeScript. This will enable us to enhance our document processing capabilities and prepare for context retrieval tasks in future lessons.
Embeddings play a vital role in context retrieval systems. Think of embeddings as a way to translate human language into a format that computers can understand and compare — like giving computers their own secret language decoder ring!
Imagine you have three sentences:
- "The Avengers assembled to fight Thanos."
- "Earth's mightiest heroes united against the Mad Titan."
- "My soufflé collapsed in the oven again."
Even though the first two sentences use completely different words, they're talking about the same superhero showdown. The third sentence? That's just my sad baking disaster. When we convert these sentences into embeddings (vectors of numbers), the vectors for the superhero sentences would be mathematically closer to each other than to my kitchen catastrophe.
Here's how embeddings work in a practical context retrieval system:
- Document Processing: First, we break down our documents into smaller chunks (like cutting a pizza into slices).
- Embedding Generation: We convert each chunk into an embedding vector (giving each slice its own unique flavor profile).
- Storage: These vectors are stored in a database or vector store (our digital pizza fridge).
- Query Processing: When a user asks a question, we convert that question into an embedding too.
- Similarity Search: We find the document chunks whose embeddings are most similar to our question's embedding (matching flavors).
- Response Generation: We use these relevant chunks as context for an LLM to generate an accurate answer.
For example, if you have a massive collection of movie scripts and someone asks, "Who said 'I'll be back'?", the system would find and retrieve chunks with embeddings similar to the question — likely passages from Terminator scripts, even if they contain phrases like "Arnold's famous catchphrase" or "Schwarzenegger's iconic line" instead of the exact words in the query.
This powerful technique forms the foundation of modern search engines, chatbots, and question-answering systems, allowing them to understand the meaning behind words rather than just matching keywords — kind of like how your friend knows you're talking about that "one movie with the guy who does the thing" even when you're being incredibly vague!
Before we dive into generating embeddings, let's revisit the code from the previous lesson to load and split a document, and then we'll generate embeddings for the resulting chunks.
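Here is a minimal sketch of that loading-and-splitting step. The file path, chunk size, and overlap below are placeholder values; use whatever settings you chose in the previous lesson.

```typescript
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import type { Document } from "@langchain/core/documents";

// Hypothetical path to the document from the previous lesson
const filePath: string = "./data/sample.txt";

// Load the raw text file into Document objects
const loader = new TextLoader(filePath);
const docs: Document[] = await loader.load();

// Split the document into smaller, slightly overlapping chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 200,
  chunkOverlap: 20,
});
const chunks: Document[] = await splitter.splitDocuments(docs);

console.log(`Loaded ${docs.length} document(s) and produced ${chunks.length} chunks.`);
```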
In this TypeScript code, we explicitly annotate types such as `string` for the file path and `Document[]` for the loaded documents. Note that top-level `await` is supported in modern TypeScript when using ES modules, but if your environment does not support it, you can wrap this logic inside an `async` function.
LangChain provides a consistent interface for working with various embedding models. An embedding model is a specialized AI model that converts text into numerical vectors, capturing semantic meaning. When using OpenAI's embedding models, you access them through your OpenAI API key, which is typically set as an environment variable.
Here's how to set up OpenAI embeddings with LangChain in TypeScript:
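A minimal setup looks like this, assuming your `OPENAI_API_KEY` environment variable is already set:

```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

// Uses the OPENAI_API_KEY environment variable by default
const embeddingModel = new OpenAIEmbeddings();
```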
In this code, we import and instantiate the `OpenAIEmbeddings` class. TypeScript's type inference handles the type of `embeddingModel`, but you can always annotate it explicitly if needed.
You can easily customize your OpenAI embeddings by adjusting a few simple settings. Here's how to set up your embedding model with different options in TypeScript:
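The specific values below are just examples; note that in the TypeScript package the batching option is called `batchSize`:

```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddingModel = new OpenAIEmbeddings({
  model: "text-embedding-3-small", // which embedding model to use
  dimensions: 1024,                // optional: request shorter vectors (text-embedding-3 models only)
  batchSize: 512,                  // how many texts to embed per API request
});
```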
Let's break down these settings:
- `model`: This is like choosing which brain to use for creating embeddings:
  - `"text-embedding-3-small"`: A faster, lighter option that works great for most projects
  - `"text-embedding-3-large"`: A more powerful option when you need extra accuracy
  - `"text-embedding-ada-002"`: An older model that's still commonly used
- `dimensions`: Think of this as the level of detail in your embeddings. Higher numbers mean more detail but take up more storage space; lowering it is only supported by the newer `text-embedding-3` models.
- `batchSize`: When you're processing lots of text at once, this controls how many texts are sent to the API in each batch. (The Python version of LangChain calls this option `chunk_size`; the TypeScript package uses `batchSize`.)
You don't need to worry about these settings when you're just starting out — the default values work perfectly fine for most projects! As you get more comfortable, you can experiment with these options to find what works best for your specific needs.
Now that we have our embeddings model set up, let's generate an embedding for a document chunk in TypeScript:
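Here is a small sketch, reusing the `chunks` and `embeddingModel` variables from the snippets above:

```typescript
// Take the text of the first chunk and embed it
const chunkText: string = chunks[0].pageContent;
const embedding: number[] = await embeddingModel.embedQuery(chunkText);

console.log(`Generated an embedding with ${embedding.length} dimensions.`);
```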
The `embedQuery()` method takes a text string and returns an embedding vector: a list of floating-point numbers that represents your text in a high-dimensional space. In TypeScript, we annotate the result as `number[]` to make it clear that this is an array of numbers. This vector captures the semantic meaning of your text, which will be essential for similarity search and other operations we'll explore in future lessons.
Let's take a closer look at the embedding vector we generated. These vectors are the mathematical representation of our text in a high-dimensional space.
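One quick way to peek at the vector, continuing from the `embedding` variable above:

```typescript
// Print the dimensionality and the first few values of the vector
console.log(`Vector length: ${embedding.length}`);
console.log(embedding.slice(0, 5));
```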
When you run this code, you might see output like:
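The exact numbers depend on the model and will differ from run to run; illustratively, with `text-embedding-3-small` (1536 dimensions by default) it could resemble:

```
Vector length: 1536
[ -0.0031, 0.0127, -0.0142, 0.0052, -0.0087 ]
```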
Each number in the vector represents a dimension in the embedding space. These seemingly random numbers actually contain rich semantic information. The pattern of values across all dimensions captures the meaning of our text in a way that allows for mathematical comparison.
Two texts with similar meanings will have embedding vectors that are close to each other in this high-dimensional space, even if they use different words to express the same idea. For example, the embeddings for "I love pizza" and "Pizza is my favorite food" would be much closer to each other than either would be to "I need to fix my car." This mathematical representation of meaning is what makes embeddings so powerful for search, recommendation systems, and other NLP applications.
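If you want to verify this yourself, one quick experiment (a sketch, reusing `embeddingModel` from above) is to embed the three sentences and compare them with cosine similarity, a measure we'll rely on heavily in upcoming lessons:

```typescript
// Cosine similarity: values near 1 mean very similar direction, values near 0 mean unrelated
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const [pizza1, pizza2, car] = await Promise.all([
  embeddingModel.embedQuery("I love pizza"),
  embeddingModel.embedQuery("Pizza is my favorite food"),
  embeddingModel.embedQuery("I need to fix my car"),
]);

console.log("pizza vs pizza:", cosineSimilarity(pizza1, pizza2)); // expected to be relatively high
console.log("pizza vs car:  ", cosineSimilarity(pizza1, car));    // expected to be noticeably lower
```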
In this lesson, you learned how to generate embeddings for document chunks using OpenAI and LangChain in TypeScript. We discussed the importance of embeddings in NLP and their role in context retrieval systems. You also saw a practical example of generating and inspecting embedding vectors using TypeScript's type system for clarity and safety.
As you move on to the practice exercises, you'll have the opportunity to apply these concepts by generating different embeddings and exploring their properties. This will reinforce your understanding and prepare you for the next unit, where we'll focus on retrieving relevant information using similarity search. Keep up the great work!
