Generating Embeddings in Pinecone

Introduction to Embeddings in Pinecone

Welcome back! In the previous lesson, you learned how to set up and initialize Pinecone, a managed vector database service. You also created or connected to an index, which is essential for storing and managing vector data. In this lesson, we will focus on embeddings, which are crucial for converting text into numerical representations that can be efficiently stored and queried in Pinecone. Our goal is to guide you through the process of generating these embeddings, a key step in managing vector data for applications like semantic search.

Preparing Data for Embedding

Before we can generate embeddings, we need to prepare our data. Consider a sample record with a unique ID, title, content, category, tags, and a date. The record's text will be converted into a numerical vector, or embedding, which Pinecone can index. Here's the sample record:

    data = [
        {
            "id": 1,
            "title": "Revolutionizing Computing with AI",
            "content": "Artificial intelligence is transforming the way we approach complex problems in computing. Recent breakthroughs in machine learning have enabled faster data processing and smarter algorithms. The future of technology is expected to integrate AI into every facet of life.",
            "category": "Technology",
            "tags": ["AI", "machine learning", "computing", "innovation"],
            "date": "2025-02-01"
        }
    ]

This record pairs a short text about technology with descriptive fields: a category, a list of tags, and a date. In this lesson, only the content field is embedded; the remaining fields are stored alongside the vector as metadata. Keeping your data in this structured format makes both steps straightforward.
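Before embedding, it can help to check that every record has the fields the later steps rely on. The following is a minimal sketch, not part of the lesson's code: the helper name find_invalid_records and the choice of required fields are assumptions made for illustration.

```python
# Assumed minimum set of fields; adjust to your own schema.
REQUIRED_FIELDS = ("id", "title", "content")


def find_invalid_records(records):
    """Return (position, problem) pairs for records that are not ready to embed."""
    problems = []
    seen_ids = set()
    for pos, record in enumerate(records):
        # Every required field must be present.
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append((pos, f"missing field: {field}"))
        # Whitespace-only content would produce a meaningless embedding.
        content = record.get("content")
        if isinstance(content, str) and not content.strip():
            problems.append((pos, "empty content"))
        # IDs must be unique, because an upsert overwrites vectors that share an ID.
        if "id" in record:
            if record["id"] in seen_ids:
                problems.append((pos, f"duplicate id: {record['id']}"))
            seen_ids.add(record["id"])
    return problems


sample = [
    {"id": 1, "title": "Revolutionizing Computing with AI",
     "content": "Artificial intelligence is transforming computing."},
    {"id": 1, "title": "Duplicate", "content": "   "},
]
print(find_invalid_records(sample))
```

An empty result means every record passed these checks; here the second record is reported twice, once for its blank content and once for its repeated ID.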

Generating Embeddings

Now that we have our data ready, let's convert the text into numerical vectors using an embedding model. Since we are using Pinecone locally, we will generate embeddings outside of Pinecone using the sentence-transformers library. Here's how you can generate embeddings:

    from sentence_transformers import SentenceTransformer

    # Load embedding model
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def batch_embed_texts(texts, batch_size=50):
        # Encode the texts in batches and convert the result to plain Python lists
        return model.encode(texts, batch_size=batch_size, show_progress_bar=True).tolist()

    # Generate embeddings for the content
    contents = [d["content"] for d in data]
    embeddings = batch_embed_texts(contents)

    print("Generated embeddings:")
    print(embeddings)

In this code snippet, we use the SentenceTransformer model to convert the content of each record into an embedding. The batch_embed_texts function passes the texts to model.encode in batches of batch_size and returns the vectors as plain Python lists, which is the form we will later send to Pinecone.
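A quick sanity check on the generated vectors can catch mismatches before anything is sent to Pinecone. The sketch below assumes vectors of length 384, which is the output size of all-MiniLM-L6-v2; the helper name check_embeddings is hypothetical, and the demo uses placeholder vectors rather than real model output.

```python
EXPECTED_DIM = 384  # output size of all-MiniLM-L6-v2


def check_embeddings(texts, embeddings, expected_dim=EXPECTED_DIM):
    """Verify there is one vector per text and that each has the expected length."""
    if len(texts) != len(embeddings):
        raise ValueError(
            f"got {len(embeddings)} embeddings for {len(texts)} texts"
        )
    for i, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {i} has length {len(vector)}, expected {expected_dim}"
            )
    return len(embeddings), expected_dim


# Placeholder vectors stand in for real model output in this demo.
demo_texts = ["first text", "second text"]
demo_embeddings = [[0.0] * 384, [0.0] * 384]
print(check_embeddings(demo_texts, demo_embeddings))
```

In the lesson's flow, the same call would be check_embeddings(contents, embeddings) right after batch_embed_texts returns.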

Connecting to the Index in Pinecone Local

When working with Pinecone Local, it's important to understand that two components are involved:

- The controller service (running on http://localhost:5080)
- The index service (running on a different port, such as localhost:5081, localhost:5082, etc.)

When we initialize Pinecone like this:

    from pinecone.grpc import PineconeGRPC

    pc = PineconeGRPC(api_key="pclocal", host="http://localhost:5080")

we are communicating with the controller. The controller is responsible for operations such as creating, listing, describing, and deleting indexes. However, once an index is created, vector operations like upsert, query, fetch, and update are handled by the index service itself. Because of this, we must explicitly connect to the index using its host. Here's how we do that:

    from pinecone.grpc import GRPCClientConfig

    # index_name is the name of an existing index (defined in the next section)
    index_host = pc.describe_index(name=index_name).host
    index = pc.Index(host=index_host, grpc_config=GRPCClientConfig(secure=False))

We retrieve the index host using describe_index, then connect directly to that host. The secure=False setting is required because Pinecone Local does not use TLS encryption. Without disabling TLS, the connection would fail. This distinction between the controller and the index service is specific to Pinecone Local and will be used consistently in all upcoming exercises.

Creating an Index in Pinecone

As you know from the previous lesson, creating an index in Pinecone is essential for storing and managing vector data. Here's a concise example using Pinecone locally:

    from pinecone.grpc import PineconeGRPC, GRPCClientConfig
    from pinecone import ServerlessSpec
    import time

    # Initialize Pinecone client for local usage
    pc = PineconeGRPC(
        api_key="pclocal",
        host="http://localhost:5080"
    )

    # Define a unique index name
    index_name = "vector-index"

    # Create index only if it doesn't exist
    if not pc.has_index(index_name):
        index_model = pc.create_index(
            name=index_name,
            vector_type="dense",
            dimension=384,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
            deletion_protection="disabled",
            tags={"environment": "development"}
        )
        print("Created index:\n", index_model)
        # Give the new index a few seconds to become available
        time.sleep(5)

This code checks for an existing index and creates one only if it doesn't exist. The dimension of 384 matches the length of the vectors produced by all-MiniLM-L6-v2, and the cosine metric tells Pinecone to compare vectors by the angle between them.
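To make the metric="cosine" setting more concrete, here is a small pure-Python sketch of cosine similarity. It is meant only as an illustration of the formula, not as a description of how Pinecone computes scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity 1.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Perpendicular vectors: similarity 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because the result depends only on direction, scaling a vector does not change its similarity to another vector, which is one reason cosine is a common choice for text embeddings.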

Inserting Data into Pinecone

Once you have generated the embeddings and created the index, the next step is to insert this data into your Pinecone index. This process involves preparing the records with their corresponding embeddings and metadata, and then upserting them into the index. Here's how you can do it:

    # Initialize index connection with insecure GRPC config
    index_host = pc.describe_index(name=index_name).host
    index = pc.Index(host=index_host, grpc_config=GRPCClientConfig(secure=False))

    # Prepare the records for upsert
    records = []
    for d, e in zip(data, embeddings):
        records.append({
            "id": str(d["id"]),
            "values": e,
            "metadata": {
                "title": d["title"],
                "content": d["content"],
                "category": d.get("category", "unknown"),
                # Store a list of tags as one comma-separated string
                "tags": ",".join(d.get("tags", [])) if isinstance(d.get("tags"), list) else str(d.get("tags", "")),
                "date": d.get("date", "unknown")
            }
        })

    # Upsert the records into the index
    index.upsert(
        vectors=records,
        namespace="example-namespace"
    )

    print("Upserted vectors.")

In this section, we first connect to the index through its host. We then build each record from three parts: a string ID, the embedding values, and a metadata dictionary holding the title, content, category, tags, and date. Finally, the upsert method writes these records to the example-namespace namespace, inserting new IDs and overwriting any existing vectors that share an ID.
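With a single record, one upsert call is enough. For larger datasets, a common pattern is to send records in smaller batches. The sketch below is an illustration under assumptions: the helper names chunked and upsert_in_batches and the batch size of 100 are hypothetical, and a stand-in function replaces the real index so the example runs without Pinecone.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_in_batches(upsert_fn, records, namespace, batch_size=100):
    """Call upsert_fn once per batch and return the number of records sent."""
    total = 0
    for batch in chunked(records, batch_size):
        upsert_fn(vectors=batch, namespace=namespace)
        total += len(batch)
    return total


# Stand-in for index.upsert so the sketch runs without a Pinecone connection.
calls = []

def fake_upsert(vectors, namespace):
    calls.append((len(vectors), namespace))


fake_records = [{"id": str(i), "values": [0.0, 0.0, 0.0]} for i in range(5)]
sent = upsert_in_batches(fake_upsert, fake_records, "example-namespace", batch_size=2)
print(sent, calls)
```

With the index from this lesson, the call would look like upsert_in_batches(index.upsert, records, "example-namespace"), since index.upsert accepts the same vectors and namespace keyword arguments.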

Monitoring Indexing Status

After upserting the records, it's important to ensure that the vectors are properly indexed. We can achieve this by polling the index until the vectors appear or a timeout is reached. Here's how you can monitor the indexing status:

    def wait_for_indexing(index, namespace="example-namespace", expected_count=0, max_wait=30, interval=2):
        elapsed = 0
        while elapsed < max_wait:
            stats = index.describe_index_stats()
            vector_count = stats.get("namespaces", {}).get(namespace, {}).get("vector_count", 0)
            print(f"[{elapsed}s] Waiting... current vector count: {vector_count}")
            if vector_count >= expected_count:
                print("Vectors are indexed!")
                return
            time.sleep(interval)
            elapsed += interval
        print("Warning: Timeout reached before vectors appeared in index.")

    # Call the function after upserting
    wait_for_indexing(index, namespace="example-namespace", expected_count=len(data))

In this section, we define a function wait_for_indexing that calls the describe_index_stats method every interval seconds and checks whether the namespace holds at least expected_count vectors. If the count is not reached within max_wait seconds, the function prints a warning instead of waiting forever. The describe_index_stats method provides metadata about the index, such as the dimension of the vectors, the similarity metric used, and the total number of vectors indexed.
By validating that our vectors have been indexed, we confirm that our index contains the expected data, ensuring the integrity and readiness of our vector data for further operations.
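The polling pattern can also be written against a wall-clock deadline, so that time spent inside each stats call counts toward the timeout. This is a variation offered as a sketch, not the lesson's function: the name wait_until_count and the injectable get_count, sleep, and clock parameters are assumptions chosen so the example runs without a Pinecone connection.

```python
import time


def wait_until_count(get_count, expected_count, max_wait=30.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_count() until it reaches expected_count or max_wait seconds pass.

    Returns True if the count was reached and False on timeout.
    """
    deadline = clock() + max_wait
    while True:
        if get_count() >= expected_count:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated counts: the vector becomes visible on the third poll.
simulated = iter([0, 0, 1])
reached = wait_until_count(lambda: next(simulated), expected_count=1,
                           max_wait=5.0, interval=0.0)
print(reached)
```

Returning a boolean rather than printing lets the caller decide what to do on timeout, for example retrying the upsert or stopping the script.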

Summary and Next Steps

In this lesson, you learned how to generate embeddings using an external model and manage them in Pinecone locally. We prepared a sample dataset, generated embeddings with the sentence-transformers library, created an index, connected to it through its host, upserted the vectors with their metadata, and polled the index to confirm they were stored. As you move forward, you'll have the opportunity to practice these concepts through exercises that reinforce what you've learned. In the next lesson, we'll explore querying and searching in Pinecone, building on the skills you've developed here.
