Introduction

Welcome back to Managing Data for GenAI with Bedrock Knowledge Bases! In this second lesson, with your vector storage infrastructure from the previous unit in place, you're ready to tackle the transformation process that turns ordinary documents into the mathematical representations that power intelligent search. As you may recall from our first lesson, we created a specialized S3 Vectors bucket and index configured to store 1024-dimensional embeddings using the cosine similarity metric; now it's time to populate that infrastructure with actual content.

Today, we'll bridge the gap between static document storage and dynamic AI-powered search capabilities. We'll explore how to load documents from your file system, convert them into embeddings using Amazon Bedrock's powerful Titan model, and store these vectorized representations in our prepared S3 Vectors infrastructure. By the end of this lesson, you'll understand the complete pipeline that transforms human-readable text into the mathematical language that enables semantic search, similarity matching, and the intelligent retrieval capabilities essential for modern AI applications.

Scenario: Tech Company Internal Docs

Throughout this course, we'll be working with a carefully curated collection of internal documents from Tech Company Inc., a fictitious AWS-first organization. This document collection, stored in our docs/ folder, includes a diverse mix of business requirements documents (BRDs), product requirements documents (PRDs), architectural decision records (ADRs), technical design specifications, operational runbooks, and internal policies. This realistic dataset provides an excellent foundation for understanding how to design and implement RAG-enabled systems, giving you hands-on experience with the kind of heterogeneous content you'll encounter when building knowledge bases for real organizations.

Understanding the Document-to-Vector Transformation Pipeline

Before diving into implementation details, let's build intuition around the transformation process that converts documents into searchable vectors. Think of this pipeline as a sophisticated translation system: just as human translators convert meaning from one language to another while preserving intent and context, our embedding pipeline converts textual content into mathematical representations while preserving semantic meaning and relationships.

The process involves several critical stages that work together to create a comprehensive vector database. First, we load documents from storage and prepare them with appropriate metadata for tracking and organization. Next, we send each document's content to Amazon Bedrock's Titan embedding model, which analyzes the text and generates a 1024-dimensional vector that captures the semantic essence of the content. Finally, we structure these embeddings according to S3 Vectors requirements and insert them into our index, where they become instantly searchable through similarity operations. This pipeline scales from handling single documents to processing thousands of files, making it the foundation for enterprise-grade knowledge management systems.

Loading and Preparing Documents

Our transformation journey begins with a robust document loading system that handles file processing, content extraction, and metadata preparation. Let's examine the utility function that manages this essential first step:
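As a minimal sketch, a loader along these lines would do the job; the function name load_documents, the docs folder path, and the source_file metadata field are illustrative choices rather than fixed requirements:

```python
from pathlib import Path


def load_documents(docs_folder: str = "docs") -> list[dict]:
    """Load every text file in the folder and prepare it for embedding."""
    documents = []
    for filepath in Path(docs_folder).iterdir():
        # Skip subdirectories and anything that isn't a regular file
        if not filepath.is_file():
            continue
        # utf-8 ensures international characters and special symbols are read correctly
        content = filepath.read_text(encoding="utf-8")
        documents.append({
            # filepath.stem drops the extension, e.g. "design-rag-nimbus-assist"
            "key": filepath.stem,
            "content": content,
            "metadata": {"source_file": filepath.name},
        })
    return documents
```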

This function implements several important design patterns for robust document processing. The filepath.is_file() check ensures we only process actual files rather than subdirectories, preventing unexpected errors during batch processing. The encoding specification (utf-8) guarantees proper handling of international characters and special symbols commonly found in technical documentation. Each document is structured as a dictionary containing three essential components: a key derived from the filename (without extension) that serves as the unique identifier for vector storage, the raw content that will be converted into embeddings, and metadata that preserves important file information for later retrieval and debugging purposes. The filepath.stem property elegantly extracts the filename without its extension, creating clean identifiers like design-rag-nimbus-assist from filenames like design-rag-nimbus-assist.txt.

Establishing the Embedding Pipeline Infrastructure

With our document loading capability established, we can now set up the core infrastructure needed to convert documents into embeddings. This involves configuring AWS clients and defining the constants that govern our embedding generation, much as we did in the previous lesson:
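A setup along these lines would work; the region, bucket, and index names below are placeholders standing in for the values you chose in the previous lesson:

```python
import boto3

# Region, bucket, and index names are placeholders -- use your own values
AWS_REGION = "us-east-1"
VECTOR_BUCKET_NAME = "tech-company-vectors"
INDEX_NAME = "internal-docs-index"

# Client for the specialized S3 Vectors storage operations
s3_vectors_client = boto3.client("s3vectors", region_name=AWS_REGION)

# Client for invoking Bedrock models, including the Titan embedding model
bedrock_runtime_client = boto3.client("bedrock-runtime", region_name=AWS_REGION)

# Titan Text Embeddings V2; the 1024 dimensions must match the index configuration
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"
EMBEDDING_DIMENSIONS = 1024
```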

The dual-client approach reflects the distributed nature of our pipeline: s3_vectors_client handles the specialized vector storage operations, while bedrock_runtime_client manages AI model invocations. The EMBEDDING_MODEL_ID specifically targets Amazon's Titan Text Embeddings V2 model, optimized for retrieval tasks and offering improved accuracy for semantic similarity applications. Notice how our EMBEDDING_DIMENSIONS constant (1024) aligns perfectly with the index configuration from our previous lesson, ensuring mathematical compatibility between the embedding model output and our storage infrastructure. This dimensional alignment is crucial — attempting to insert 512-dimensional vectors into a 1024-dimensional index would result in immediate failure, highlighting the importance of consistent configuration across your entire pipeline. Keep in mind that Bedrock charges for embedding generation based on the number of input tokens processed.

Generating Embeddings with Bedrock

The heart of our transformation pipeline lies in properly configuring requests to Amazon Bedrock's embedding model and processing the responses. Let's examine how we structure these operations to maximize the quality and utility of our generated embeddings:
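As a sketch, the request construction, model invocation, and response parsing can be wrapped in a small helper that reuses the clients and constants defined above (the name generate_embedding is an illustrative choice):

```python
import json


def generate_embedding(text: str) -> list[float]:
    """Convert a piece of text into a 1024-dimensional Titan embedding."""
    embedding_request = {
        "inputText": text,                   # the document content to embed
        "dimensions": EMBEDDING_DIMENSIONS,  # must match the index (1024)
        "normalize": True,                   # unit-length vectors for cosine similarity
    }

    response = bedrock_runtime_client.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        body=json.dumps(embedding_request),
        contentType="application/json",
        accept="application/json",
    )

    # The body is a streaming object, so read it fully before parsing the JSON
    response_body = json.loads(response["body"].read())
    return response_body["embedding"]
```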

The embedding_request dictionary contains three critical parameters that control how our text gets converted into mathematical representations. The inputText field carries the actual document content that will be analyzed by the Titan model, supporting up to 8,192 tokens or 50,000 characters of text. The dimensions parameter ensures our output vectors match our index configuration exactly, preventing dimensional mismatches that could cause insertion failures. Most importantly, setting normalize to True optimizes our embeddings for similarity search operations by ensuring all vectors have unit length, making cosine similarity calculations more accurate and meaningful for document retrieval scenarios.

The invoke_model operation returns a response object in which the actual embedding data is nested within a JSON structure in the response body. The response["body"].read() call is essential because Bedrock returns the body as a streaming object rather than direct JSON, so it must be read explicitly before parsing. The final extraction reveals an array of 1024 floating-point numbers that mathematically represent the semantic meaning of our document content — for instance, documents about "cloud computing" and "AWS services" would have vectors pointing in similar directions in this 1024-dimensional space, even if they share few common words.

Structuring and Storing Vectors in S3

The final transformation step involves formatting our embedding data according to S3 Vectors' specific requirements and executing the batch insertion:
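A sketch of this step, building on the helpers above, might look as follows; the embed_and_store name and the example source URI are illustrative, while the key, data, and metadata fields follow the S3 Vectors format described here:

```python
def embed_and_store(documents: list[dict]) -> None:
    """Embed each document and batch-insert the resulting vectors into S3 Vectors."""
    vectors = []
    for doc in documents:
        embedding = generate_embedding(doc["content"])
        vectors.append({
            "key": doc["key"],  # unique identifier, e.g. "design-rag-nimbus-assist"
            # S3 Vectors stores 32-bit floats; ensure plain Python floats in the payload
            "data": {"float32": [float(value) for value in embedding]},
            "metadata": {
                # Original text preserved so search results can surface the content
                "AMAZON_BEDROCK_TEXT": doc["content"],
                # Source tracking in the Bedrock Knowledge Bases style (URI is illustrative)
                "x-amz-bedrock-kb-source-uri": f"s3://tech-company-docs/{doc['metadata']['source_file']}",
                **doc["metadata"],  # keep any extra metadata captured at load time
            },
        })

    # Batch insertion into the index prepared in the previous lesson
    s3_vectors_client.put_vectors(
        vectorBucketName=VECTOR_BUCKET_NAME,
        indexName=INDEX_NAME,
        vectors=vectors,
    )
```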

The vector structure follows S3 Vectors' strict format requirements, with each vector containing three essential components. The key field provides the unique identifier that allows individual vector retrieval and updates. The data structure requires explicit float32 conversion to ensure numerical precision compatibility with the index's data type specifications — this conversion is critical because Python's default float type is 64-bit, and S3 Vectors optimizes storage and performance using 32-bit precision. The metadata section deserves special attention: AMAZON_BEDROCK_TEXT stores the original document content for retrieval purposes, while x-amz-bedrock-kb-source-uri provides source tracking compatible with Bedrock Knowledge Bases standards, enabling seamless integration with higher-level services. The dictionary-unpacking syntax (**doc["metadata"]) preserves any additional metadata from our document loading process, maintaining flexibility for custom attributes.
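Putting it all together, a full ingestion run under these assumptions simply chains the helpers from the earlier sketches:

```python
if __name__ == "__main__":
    docs = load_documents("docs")
    embed_and_store(docs)
    print(f"Embedded and stored {len(docs)} documents in index '{INDEX_NAME}'")
```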

Conclusion and Next Steps

Congratulations on mastering the complete document-to-vector transformation pipeline! You've successfully learned how to load documents systematically, generate high-quality embeddings using Amazon Bedrock's Titan model, and store vectorized content in S3 Vectors with proper metadata structures. This pipeline forms the technical foundation that enables semantic search, similarity matching, and intelligent document retrieval — the core capabilities that transform static document collections into dynamic, AI-powered knowledge systems.

In our next lesson, we'll explore how to query and retrieve information from your newly populated vector index, implementing search functionality that can understand user intent and find relevant documents based on meaning rather than just keywords. The upcoming practice section will challenge you to implement various aspects of this pipeline yourself, from handling different document formats to optimizing embedding generation for specific use cases. Prepare to see your vector transformation work come to life through powerful hands-on exercises that will test your understanding and help you build real-world document processing capabilities!
