Welcome back! In the previous lesson, you learned how to set up and initialize Pinecone, a managed vector database service. You also created or connected to an index, which is essential for storing and managing vector data. In this lesson, we will focus on embeddings, which are crucial for converting text into numerical representations that can be efficiently stored and queried in Pinecone. Our goal is to guide you through the process of generating these embeddings, a key step in managing vector data for applications like semantic search.
Before we can generate embeddings, we need to prepare our data. Consider a sample observation with a unique ID, title, content, category, tags, and a date. The text of this observation will be converted into a numerical vector, or embedding, which Pinecone can index. Here's a sample observation:
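A minimal sketch of such a record as a Python dictionary. Only the field names come from the description above; every value, and the choice to embed the title together with the content, is an illustrative assumption:

```python
# Hypothetical sample observation; every value is illustrative.
sample_observation = {
    "id": "doc-001",
    "title": "How Vector Databases Power Semantic Search",
    "content": (
        "Vector databases store embeddings, numerical representations of text, "
        "so that similar items can be found by comparing vectors."
    ),
    "category": "technology",
    "tags": ["vector-database", "embeddings", "search"],
    "date": "2024-01-15",
}

# The text to embed later; joining title and content is one possible choice.
text_to_embed = f"{sample_observation['title']}. {sample_observation['content']}"
```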
This observation includes text about technology, categorized into different aspects. Preparing your data in this structured format is crucial for generating meaningful embeddings.
Now that we have our data ready, let's convert the text into numerical vectors using an embedding model. Since we are running Pinecone Local, we will generate embeddings outside of Pinecone using the sentence-transformers library. Here's how you can generate embeddings:
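A minimal sketch of this step. The model name all-MiniLM-L6-v2 (which produces 384-dimensional vectors), the batch size of 32, the load_embedding_model helper, and the exact signature of batch_embed_texts are all assumptions. The model is passed in as an argument so the batching logic can be tried without downloading anything:

```python
from typing import Any, List, Sequence


def load_embedding_model(model_name: str = "all-MiniLM-L6-v2") -> Any:
    """Load a sentence-transformers model (the model name is an assumption)."""
    # Imported lazily so the batching helper below works without the library.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def batch_embed_texts(
    model: Any, texts: Sequence[str], batch_size: int = 32
) -> List[List[float]]:
    """Embed texts in fixed-size batches and return plain lists of floats."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # encode() returns one vector per input text.
        for vector in model.encode(batch):
            embeddings.append([float(value) for value in vector])
    return embeddings
```

With the real model, calling batch_embed_texts(load_embedding_model(), texts) would return one vector per input text, so under this assumption the index dimension would have to match the model's output size.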
In this code snippet, we use the SentenceTransformer model to convert the text into embeddings. The batch_embed_texts function handles the batch processing of text data to generate embeddings efficiently.
When working with Pinecone Local, it’s important to understand that there are two components involved:
- The controller service (running on http://localhost:5080)
- The index service (running on a different port, such as localhost:5081, localhost:5082, etc.)
When we initialize the Pinecone client with the controller address (http://localhost:5080), we are communicating with the controller. The controller is responsible for operations such as creating, listing, describing, and deleting indexes.
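The controller connection described above might look like the following minimal sketch. It assumes the gRPC client from the Pinecone Python SDK; the connect_to_controller wrapper is hypothetical and the API key is a placeholder value:

```python
from typing import Any


def connect_to_controller(
    host: str = "http://localhost:5080", api_key: str = "pclocal"
) -> Any:
    """Return a client that talks to the Pinecone Local controller."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from pinecone.grpc import PineconeGRPC

    # The API key is a placeholder value (an assumption for local use).
    return PineconeGRPC(api_key=api_key, host=host)
```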
However, once an index is created, vector operations like upsert, query, fetch, and update are handled by the index service itself. Because of this, we must explicitly connect to the index using its host.
Here’s how we do that:
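A minimal sketch, assuming pc is a client already connected to the controller and that the gRPC client from the Pinecone Python SDK is in use. The connect_to_index helper name is hypothetical:

```python
from typing import Any


def connect_to_index(pc: Any, index_name: str) -> Any:
    """Look up the index host via the controller, then connect without TLS."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from pinecone.grpc import GRPCClientConfig

    # The controller reports which host (and port) serves this index.
    index_host = pc.describe_index(name=index_name).host
    # secure=False disables TLS for the connection to the index service.
    return pc.Index(host=index_host, grpc_config=GRPCClientConfig(secure=False))
```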
We retrieve the index host using describe_index, then connect directly to that host.
The secure=False setting is required because Pinecone Local does not use TLS encryption. Without disabling TLS, the connection would fail.
This distinction between the controller and the index service is specific to Pinecone Local and will be used consistently in all upcoming exercises.
As you know from the previous lesson, creating an index in Pinecone is essential for storing and managing vector data. Here's a concise example using Pinecone locally:
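A minimal sketch, assuming pc is the controller client. The ensure_index helper, the dimension of 384 (chosen to match an all-MiniLM-L6-v2 embedding model), the cosine metric, and the serverless spec values are assumptions for illustration:

```python
from typing import Any, Optional


def ensure_index(
    pc: Any,
    index_name: str,
    dimension: int = 384,
    metric: str = "cosine",
    spec: Optional[Any] = None,
) -> bool:
    """Create the index if it is missing; return True when it was created."""
    if pc.has_index(index_name):
        return False
    if spec is None:
        # Imported lazily; the cloud and region values are placeholders.
        from pinecone import ServerlessSpec

        spec = ServerlessSpec(cloud="aws", region="us-east-1")
    pc.create_index(name=index_name, dimension=dimension, metric=metric, spec=spec)
    return True
```

The dimension must equal the length of the vectors produced by the embedding model, otherwise later upserts will be rejected.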
This code checks for an existing index and creates one if it doesn't exist, using a specified dimension and similarity metric.
Once you have generated the embeddings and created the index, the next step is to insert this data into your Pinecone index. This process involves preparing the records with their corresponding embeddings and metadata, and then upserting them into the index. Here's how you can do it:
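A minimal sketch, assuming index is a handle already connected to the index host and that each item carries id, content, and category fields. The build_records and upsert_records helpers and the batch size of 100 are hypothetical:

```python
from typing import Any, Dict, List, Sequence


def build_records(
    items: Sequence[Dict[str, Any]], embeddings: Sequence[Sequence[float]]
) -> List[Dict[str, Any]]:
    """Pair each item with its embedding and attach text/category metadata."""
    if len(items) != len(embeddings):
        raise ValueError("items and embeddings must have the same length")
    return [
        {
            "id": item["id"],
            "values": list(vector),
            "metadata": {"text": item["content"], "category": item["category"]},
        }
        for item, vector in zip(items, embeddings)
    ]


def upsert_records(
    index: Any, records: List[Dict[str, Any]], batch_size: int = 100
) -> int:
    """Upsert records in batches and return how many were sent."""
    for start in range(0, len(records), batch_size):
        index.upsert(vectors=records[start:start + batch_size])
    return len(records)
```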
In this step, we first target the index through its host. We then prepare each record by combining its ID and embedding values with the original text and category as metadata. Finally, the upsert method inserts these records into the index, overwriting any existing record that has the same ID.
After upserting the records, it's important to ensure that the vectors are properly indexed. We can achieve this by polling the index until the vectors appear or a timeout is reached. Here's how you can monitor the indexing status:
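A minimal sketch of such a polling loop, assuming index is a connected index handle whose describe_index_stats result exposes a total_vector_count attribute. The parameter names and default values are assumptions:

```python
import time
from typing import Any


def wait_for_indexing(
    index: Any,
    expected_count: int,
    timeout: float = 30.0,
    poll_interval: float = 2.0,
) -> bool:
    """Poll index stats until expected_count vectors appear or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        stats = index.describe_index_stats()
        if stats.total_vector_count >= expected_count:
            return True
        if time.monotonic() >= deadline:
            # Give up; the caller decides how to handle a timeout.
            return False
        time.sleep(poll_interval)
```

Returning a boolean rather than raising lets the caller choose whether a timeout should stop the program or simply trigger a retry.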
In this section, we define a function wait_for_indexing that uses the describe_index_stats method to poll the index every few seconds, checking whether the vectors have been indexed. This method provides metadata about the index, such as the dimension of the vectors, the similarity metric used, and the total number of vectors indexed. Once the reported vector count matches the number of records we upserted, we know the index contains the expected data and is ready for further operations such as querying.
In this lesson, you learned how to generate embeddings using an external model and manage them in Pinecone Local. We started by preparing a sample dataset and generated embeddings using the sentence-transformers library. We then connected to the index through its host, upserted the records with their metadata, and polled the index statistics to confirm that the vectors were indexed. As you move forward, you'll have the opportunity to practice these concepts through exercises that reinforce what you've learned. In the next lesson, we'll explore querying and searching in Pinecone, building on the skills you've developed here.
