In the previous lesson, you explored the data stored in your products
table, including the embedding vectors for each product. You learned how to view these embeddings and understand their role in representing product information in a way that is useful for machine learning and search tasks. Now that you are comfortable with how embeddings are stored and displayed, it is time to take the next step: using these embeddings to find products that are similar to each other.
This lesson will show you how to run nearest neighbor queries in PostgreSQL using the pgvector extension. These queries allow you to search for products that are most similar to a given product or embedding, which is a key feature in building recommendation systems and semantic search tools. The main idea is to compare embeddings using different distance metrics, which measure how "close" or "similar" two products are in the embedding space. By the end of this lesson, you will know how to use several types of distance metrics to find the most relevant products for any given query.
When searching for similar products using embeddings, the way you measure "distance" between vectors is very important. In pgvector, there are four main distance metrics you can use: Euclidean (L2), Inner Product, Cosine, and L1. Each metric has its own way of comparing vectors, and the choice of metric can affect which products are considered most similar.
- Euclidean (L2) Distance measures the straight-line distance between two points in space. It is often used when you want to find items that are closest in terms of overall position.
- Inner Product Distance is related to the dot product of two vectors. It is useful when you care about the direction and magnitude of the vectors.
- Cosine Distance measures the angle between two vectors, focusing on their direction rather than their length. This is helpful when you want to find items that are similar in meaning, regardless of their scale.
- L1 Distance (also called Manhattan distance) sums the absolute differences of each dimension. It can be useful when you want to measure similarity in a way that is less sensitive to outliers.
In the next sections, you will see how to use each of these metrics in a SQL query to find the nearest neighbors for a given embedding.
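Before looking at full table queries, it can help to see each operator applied to two small literal vectors. The following sketch assumes the pgvector extension is already installed (`CREATE EXTENSION vector;`); the expected values in the comments are worked out by hand from the metric definitions:

```sql
-- Compare two small literal vectors with each pgvector distance operator.
SELECT
  '[1,2,3]'::vector <-> '[3,2,1]'::vector AS euclidean,       -- sqrt(4+0+4) ≈ 2.828
  '[1,2,3]'::vector <#> '[3,2,1]'::vector AS neg_inner_prod,  -- -(3+4+3) = -10
  '[1,2,3]'::vector <=> '[3,2,1]'::vector AS cosine,          -- 1 - 10/14 ≈ 0.286
  '[1,2,3]'::vector <+> '[3,2,1]'::vector AS l1;              -- 2+0+2 = 4
```

Note that `<#>` returns the *negative* inner product, so a more similar pair produces a smaller (more negative) value and sorts first under an ascending `ORDER BY`.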
Let’s start with the most common distance metric: Euclidean (L2) distance. Suppose you have a query embedding (for example, representing a product description you want to match against your catalog). You can use the following SQL query to find the 10 products in your table that are closest to this embedding:
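A query of this shape, sketched here assuming a `products` table with `id`, `name`, and `embedding` (type `vector`) columns, might look like:

```sql
-- Return the 10 products closest to the query embedding by Euclidean (L2)
-- distance. ${QUERY_EMBEDDING} is a placeholder for a vector literal,
-- e.g. '[0.12, -0.04, ...]'.
SELECT id, name, embedding <-> '${QUERY_EMBEDDING}' AS distance
FROM products
ORDER BY embedding <-> '${QUERY_EMBEDDING}'
LIMIT 10;
```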
In this query, `${QUERY_EMBEDDING}` should be replaced with the actual embedding vector you want to search with. The `<->` operator tells pgvector to use Euclidean distance to compare the stored embeddings with your query embedding. The results are ordered so that the most similar products (those with the smallest distance) appear first.
For example, if your query embedding represents a "wireless mouse," the query returns the ten products whose embeddings are closest to it by Euclidean distance, typically other mice and related accessories, with the smallest distances listed first.
Besides Euclidean distance, pgvector supports three other distance metrics. You can use a different operator in your SQL query to switch between them.
To use Inner Product Distance, you can write:
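Under the same assumed schema (a `products` table with `id`, `name`, and `embedding` columns), the query is identical except for the operator:

```sql
-- Order by (negative) inner product; the most similar products sort first.
SELECT id, name, embedding <#> '${QUERY_EMBEDDING}' AS neg_inner_product
FROM products
ORDER BY embedding <#> '${QUERY_EMBEDDING}'
LIMIT 10;
```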
The `<#>` operator tells pgvector to use the inner product as the distance metric. Note that `<#>` actually returns the *negative* inner product, so smaller values still mean "more similar" and an ascending `ORDER BY` works as expected. This metric can be useful if you want to prioritize products that are similar not only in direction but also in magnitude.
For Cosine Distance, the query looks like this:
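Again assuming the same hypothetical schema, only the operator changes:

```sql
-- Order by cosine distance (1 - cosine similarity); smallest angle first.
SELECT id, name, embedding <=> '${QUERY_EMBEDDING}' AS cosine_distance
FROM products
ORDER BY embedding <=> '${QUERY_EMBEDDING}'
LIMIT 10;
```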
The `<=>` operator uses cosine distance, which focuses on the angle between vectors. This is often used in semantic search, where you care more about the meaning than the scale of the vectors.
Finally, for L1 Distance, you can use:
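As a sketch against the same assumed schema:

```sql
-- Order by L1 (Manhattan) distance: the sum of absolute per-dimension
-- differences between the stored and query embeddings.
SELECT id, name, embedding <+> '${QUERY_EMBEDDING}' AS l1_distance
FROM products
ORDER BY embedding <+> '${QUERY_EMBEDDING}'
LIMIT 10;
```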
The `<+>` operator uses L1 (Manhattan) distance, which can be more robust to outliers in some cases. Note that `<+>` was added in pgvector 0.7.0, so it requires that version or later.
Each of these queries will return a list of the top 10 products that are most similar to your query embedding, but the exact products and their order may change depending on the distance metric you choose. This is because each metric measures similarity in a slightly different way.
In this lesson, you learned how to use pgvector to run nearest neighbor queries with different distance metrics. You saw how to use Euclidean, Inner Product, Cosine, and L1 distances to find the most similar products to a given embedding. Each distance metric has its own strengths and is suited to different types of search problems.
Now that you know how to write these queries, you are ready to practice running them yourself. In the upcoming exercises, you will get hands-on experience using each distance metric and see how the results change depending on your choice. This will help you build a deeper understanding of how vector search works and how to choose the right metric for your application.
