Generating Document Embeddings in Go

Welcome back! In the previous lesson, you learned how to load and split documents using Go, setting the foundation for more advanced document processing tasks. Today, we will take the next step in our journey by exploring embeddings, a crucial concept in document processing.

Embeddings are numerical representations of text data that capture the semantic meaning of words, phrases, or entire documents. They are essential for working with Large Language Models (LLMs) because they allow these models to understand and process text in a meaningful way. By converting text into embeddings, we can perform various tasks such as similarity search, clustering, and classification.

In this lesson, we will focus on generating embeddings for document chunks using Go-compatible libraries and APIs. This will enable us to enhance our document processing capabilities and prepare for context retrieval tasks in future lessons.

Embeddings and Language Models

Embeddings play a vital role in context retrieval systems. Think of embeddings as a way to translate human language into a format that computers can understand and compare — like giving computers their own secret language decoder ring!

Imagine you have three sentences:

  • "The Avengers assembled to fight Thanos."
  • "Earth's mightiest heroes united against the Mad Titan."
  • "My soufflé collapsed in the oven again."

Even though the first two sentences use completely different words, they're talking about the same superhero showdown. The third sentence? That's just my sad baking disaster. When we convert these sentences into embeddings (vectors of numbers), the vectors for the superhero sentences would be mathematically closer to each other than to my kitchen catastrophe.
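
"Mathematically closer" typically means a higher cosine similarity between the two vectors. The following is a tiny, self-contained Go sketch using made-up three-dimensional vectors (real embeddings have thousands of dimensions) purely to illustrate the arithmetic:

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns a value between -1 and 1; higher means more similar.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Toy vectors standing in for real embeddings of the three sentences.
	avengers1 := []float64{0.9, 0.1, 0.0}
	avengers2 := []float64{0.8, 0.2, 0.1}
	souffle := []float64{0.0, 0.2, 0.9}

	fmt.Println(cosineSimilarity(avengers1, avengers2)) // high: similar meaning
	fmt.Println(cosineSimilarity(avengers1, souffle))   // low: unrelated
}
```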

Context Retrieval Systems

Here's how embeddings work in a practical context retrieval system:

  1. Document Processing: First, we break down our documents into smaller chunks (like cutting a pizza into slices).
  2. Embedding Generation: We convert each chunk into an embedding vector (giving each slice its own unique flavor profile).
  3. Storage: These vectors are stored in a database or vector store (our digital pizza fridge).
  4. Query Processing: When a user asks a question, we convert that question into an embedding too.
  5. Similarity Search: We find the document chunks whose embeddings are most similar to our question's embedding (matching flavors).
  6. Response Generation: We use these relevant chunks as context for an LLM to generate an accurate answer.

For example, if you have a massive collection of movie scripts and someone asks, "Who said 'I'll be back'?", the system would find and retrieve chunks with embeddings similar to the question — likely passages from Terminator scripts, even if they contain phrases like "Arnold's famous catchphrase" or "Schwarzenegger's iconic line" instead of the exact words in the query.

This powerful technique forms the foundation of modern search engines, chatbots, and question-answering systems, allowing them to understand the meaning behind words rather than just matching keywords — kind of like how your friend knows you're talking about that "one movie with the guy who does the thing" even when you're being incredibly vague!

Document Loading and Splitting

Before we dive into generating embeddings, let's revisit the code from the previous lesson that loads and splits a document; we'll then generate embeddings for the resulting chunks.
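
Here is a minimal sketch of that code, assuming the langchaingo library (github.com/tmc/langchaingo) and an example file named document.txt:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/tmc/langchaingo/documentloaders"
	"github.com/tmc/langchaingo/textsplitter"
)

func main() {
	// Open the source text file (the filename is just an example).
	file, err := os.Open("document.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	// Create a text loader and a recursive character splitter:
	// 500-character chunks with 100 characters of overlap.
	loader := documentloaders.NewText(file)
	splitter := textsplitter.NewRecursiveCharacter(
		textsplitter.WithChunkSize(500),
		textsplitter.WithChunkOverlap(100),
	)

	// Load the document and split it into chunks.
	chunks, err := loader.LoadAndSplit(context.Background(), splitter)
	if err != nil {
		log.Fatal(err)
	}

	// Print each chunk to the console.
	for i, chunk := range chunks {
		fmt.Printf("Chunk %d:\n%s\n\n", i+1, chunk.PageContent)
	}
}
```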

In this code, we open a text file, create a text loader, and initialize a recursive character splitter with a chunk size of 500 characters and an overlap of 100 characters. We then load and split the document, and print each chunk to the console.

Choosing Between RecursiveCharacter and TokenSplitter

LangChain provides different splitters for different use cases:

  • RecursiveCharacter (shown above): Your go-to choice for general text processing. It intelligently splits on natural boundaries (paragraphs, then sentences, then words) while respecting your character limits. This preserves readability and context better than arbitrary cuts. Use this for most document types: articles, books, reports, and general content.

  • TokenSplitter: Choose this when you need precise control over token counts for LLM context windows. It splits based on actual tokens (as counted by the model's tokenizer) rather than characters. This is essential when you're working with strict token limits (e.g., fitting chunks into a model's 4096-token context window) or when you're charged per token and need accurate cost estimates.

Rule of thumb: Use RecursiveCharacter for 95% of your document processing needs — it "just works" and produces human-readable chunks. Only switch to TokenSplitter when you're hitting token limits with your LLM or when you need to precisely budget tokens for cost management.
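
For reference, both splitters are constructed the same way in langchaingo; the sizes below are illustrative rather than recommendations:

```go
package main

import "github.com/tmc/langchaingo/textsplitter"

func main() {
	// Character-based splitting: the default choice for most documents.
	_ = textsplitter.NewRecursiveCharacter(
		textsplitter.WithChunkSize(500), // measured in characters
		textsplitter.WithChunkOverlap(100),
	)

	// Token-based splitting: use when chunks must fit a token budget.
	_ = textsplitter.NewTokenSplitter(
		textsplitter.WithChunkSize(256), // measured in tokens
		textsplitter.WithChunkOverlap(32),
	)
}
```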

Working with Embeddings in Go

To work with embeddings in Go, we can use APIs that provide embedding services. One such service is OpenAI's API, which can be accessed using the LangChain Go library. We'll demonstrate how to set up and use this API to generate embeddings.

Configuring Embedding Model Parameters

When using LangChain in Go, you can configure various parameters for embedding models to suit your specific needs. Let's see how to set up and use OpenAI's embedding models to generate embeddings for text.
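
The snippet below is a minimal sketch of that setup, assuming the langchaingo openai and embeddings packages and an API key provided through the OPENAI_API_KEY environment variable:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/tmc/langchaingo/embeddings"
	"github.com/tmc/langchaingo/llms/openai"
)

func main() {
	// Configure the OpenAI client to use the text-embedding-3-large model.
	// The API key is read from the OPENAI_API_KEY environment variable.
	llm, err := openai.New(
		openai.WithEmbeddingModel("text-embedding-3-large"),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Wrap the client in a LangChain embedder.
	embedder, err := embeddings.NewEmbedder(llm)
	if err != nil {
		log.Fatal(err)
	}

	texts := []string{
		"The Avengers assembled to fight Thanos.",
		"Earth's mightiest heroes united against the Mad Titan.",
		"My soufflé collapsed in the oven again.",
	}

	// Generate one embedding vector per input text.
	vectors, err := embedder.EmbedDocuments(context.Background(), texts)
	if err != nil {
		log.Fatal(err)
	}

	// Print the first few values of each embedding vector.
	for i, vec := range vectors {
		fmt.Printf("Text %d: %q\n", i+1, texts[i])
		fmt.Printf("First 5 values: %v\n\n", vec[:5])
	}
}
```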

In this code, we configure the OpenAI client with specific options, including the model to use for embeddings (text-embedding-3-large). We then generate embeddings for three example sentences: two about the Avengers and Thanos that are semantically similar, and one unrelated sentence about cooking.

When you run this code, you'll see the first few values of each embedding vector. These vectors represent the semantic meaning of each text in a high-dimensional space. Even though the first two sentences use different words, their embedding vectors will be mathematically closer to each other than to the third sentence, reflecting their semantic similarity.

The embedding dimension (the length of the vector) depends on the model used. For example, text-embedding-3-large produces embeddings with 3072 dimensions, while text-embedding-3-small produces 1536-dimensional vectors. These high-dimensional vectors capture the nuanced semantic information in the text, enabling powerful similarity comparisons.

Choosing the Right Embedding Model

OpenAI offers two primary embedding models with different tradeoffs:

  • text-embedding-3-small (1536 dimensions): Your default choice for most applications. It's faster to generate, costs less, and produces high-quality embeddings suitable for search, clustering, and classification tasks. Think of it as your reliable daily driver.

  • text-embedding-3-large (3072 dimensions): Choose this when you need the absolute best quality for complex, nuanced tasks like legal document analysis, advanced semantic search, or when distinguishing between very similar concepts matters. It costs several times more per token and takes longer to generate, but delivers measurably better results in demanding scenarios.

Rule of thumb: Start with text-embedding-3-small for your initial implementation. Only upgrade to text-embedding-3-large if you observe quality issues in your similarity searches or if your use case involves highly specialized domain knowledge where subtle distinctions matter.

Inspecting Embedding Vectors

Let's take a closer look at the embedding vector we generated. These vectors are the mathematical representation of our text in a high-dimensional space.
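
Reusing the embedder from the previous example, a short continuation like the one below prints the vector's dimension and its first few values (EmbedQuery embeds a single string):

```go
// Generate an embedding for a single chunk of text and inspect it.
vector, err := embedder.EmbedQuery(context.Background(),
	"The Avengers assembled to fight Thanos.")
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Embedding dimension: %d\n", len(vector))
fmt.Printf("First 10 values:     %v\n", vector[:10])
```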

When you run this code, you might see output like:
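
```
Embedding dimension: 3072
First 10 values:     [0.0123 -0.0456 0.0789 -0.0012 0.0345 -0.0678 0.0901 -0.0234 0.0567 -0.0890]
```

(The values above are placeholders for illustration; your actual numbers will differ, though the dimension of 3072 matches text-embedding-3-large.)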

Each number in the vector represents a dimension in the embedding space. These seemingly random numbers actually contain rich semantic information. The pattern of values across all dimensions captures the meaning of our text in a way that allows for mathematical comparison.

Vector Databases for Embedding Storage

While we've generated an embedding for a single document chunk, a complete retrieval system needs to efficiently store and search through thousands or millions of these embedding vectors. This is where vector databases come into play.

Vector databases are specialized storage systems optimized for high-dimensional vector data. Unlike traditional databases that excel at exact matching, vector databases are designed for similarity search — finding vectors that are "close" to each other in mathematical space.

Popular vector database options compatible with Go include:

  • Milvus: An open-source vector database designed for scalable similarity search.
  • Weaviate: An open-source vector search engine with classification capabilities.
  • Pinecone: A fully-managed vector database service built specifically for machine learning applications.

These databases use sophisticated indexing techniques like Approximate Nearest Neighbors (ANN) to make similarity searches lightning-fast, even with millions of vectors. Without these specialized techniques, finding the closest vectors would require comparing a query against every single vector in your collection — far too slow for large-scale applications.
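
To make the contrast concrete, here is what a naive brute-force search looks like, reusing the cosineSimilarity helper (and math import) sketched earlier in this lesson; every query scans every stored vector, which is exactly the linear work that ANN indexes are designed to avoid:

```go
// bruteForceSearch returns the index of the stored vector most similar to the
// query by scanning the entire collection: O(n) comparisons per query.
func bruteForceSearch(query []float64, stored [][]float64) int {
	bestIndex := -1
	bestScore := math.Inf(-1)
	for i, vec := range stored {
		if score := cosineSimilarity(query, vec); score > bestScore {
			bestIndex, bestScore = i, score
		}
	}
	return bestIndex
}
```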

Summary and Next Steps

In this lesson, you learned how to generate embeddings for document chunks using Go-compatible APIs. We discussed the importance of embeddings in NLP and their role in context retrieval systems. You also saw a practical example of generating and inspecting embedding vectors.

As you move on to the practice exercises, you'll have the opportunity to apply these concepts by generating different embeddings and exploring their properties. This will reinforce your understanding and prepare you for the next unit, where we'll focus on retrieving relevant information using similarity search. Keep up the great work!
