Inspecting Distances and Similarity Scores in pgvector

Introduction: Why Look at Distances and Similarity Scores?

Now that you have learned how to run nearest neighbor queries using different distance metrics in pgvector, it is important to go a step further and understand the actual numbers behind those results. In the previous lesson, you saw how to retrieve the most similar products to a given embedding, but the queries only showed you the product IDs and names, ordered by similarity. While this is useful, sometimes you need to see the raw distance or similarity scores themselves. These scores can help you understand how close or far apart items are in the embedding space, set thresholds for filtering results, or debug your search system.

In this lesson, you will learn how to modify your queries to display these distance and similarity values directly in your results. This will give you more insight into how your vector search is working and help you make better decisions about which products to show or recommend.

Viewing Raw L2 (Euclidean) Distance Values in SQL

Let’s start by looking at how to view the actual L2 (Euclidean) distance values in your search results. As a reminder, the <-> operator in pgvector is used to calculate the Euclidean distance between two vectors. In the previous lesson, you used this operator to order your results, but you did not display the distance values themselves.
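To get a feel for what the operator returns, it can help to try it on two small literal vectors first. The following is a minimal sketch that assumes only that the pgvector extension is installed; the vectors are made up for illustration and are not part of the lesson's dataset.

```sql
-- Illustrative vectors only: the difference is (3, 4, 0),
-- so the Euclidean distance is sqrt(3^2 + 4^2 + 0^2) = 5.
SELECT '[1,2,3]'::vector <-> '[4,6,3]'::vector AS distance;
```

The result is a single row with distance equal to 5, the straight-line distance between the two points.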

To include the distance in your output, you can add an extra column to your SELECT statement. Here is an example query that shows the product_id, product_name, and the L2 distance from your query embedding:
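A minimal sketch of such a query follows. It assumes the table is named products and that five results are wanted; neither detail is stated in the lesson text, so adjust both to your schema.

```sql
-- Assumptions: table name "products" and LIMIT 5 are illustrative.
-- ${QUERY_EMBEDDING} is a placeholder that must be substituted before running.
SELECT
    product_id,
    product_name,
    embedding <-> ${QUERY_EMBEDDING} AS distance  -- L2 (Euclidean) distance
FROM products
ORDER BY distance   -- smallest distance (most similar) first
LIMIT 5;
```

When running this by hand, the placeholder would typically be replaced with a pgvector literal such as '[0.1, 0.2, 0.3]' that has the same number of dimensions as the embedding column.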

In this query, ${QUERY_EMBEDDING} should be replaced with the embedding vector you want to compare against. The embedding <-> ${QUERY_EMBEDDING} part calculates the Euclidean distance between each product’s embedding and your query embedding, and the result is shown in a column called distance. The results are ordered so that the products with the smallest distance (i.e., most similar) appear first.

For example, your output might look like this:

product_id	product_name	distance
1	Wireless Mouse	0.231
7	Bluetooth Mouse	0.245
12	Ergonomic Mouse	0.260
23	USB Mouse	0.275
34	Gaming Mouse	0.290

Here, you can see not only which products are most similar to your query but also how close they are in the embedding space. A smaller distance means a higher similarity.

Calculating and Interpreting Cosine Similarity Scores

Another common way to measure similarity between vectors is cosine similarity. In pgvector, the <=> operator gives you the cosine distance, defined as 1 minus the cosine similarity. It is a value between 0 and 2, where 0 means the vectors point in the same direction and 2 means they point in opposite directions. However, in many applications, it is more useful to see the cosine similarity itself, which ranges from 1 (most similar) to -1 (most dissimilar).
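The range of cosine distance can be checked on a few literal vectors. This is a small sketch that assumes only that pgvector is installed; the vectors are made up, and their lengths differ on purpose to show that only direction matters.

```sql
-- Illustrative vectors only; magnitudes differ to show they are ignored.
SELECT
    '[1,0]'::vector <=> '[2,0]'::vector  AS same_direction,      -- 0 (similarity  1)
    '[1,0]'::vector <=> '[0,3]'::vector  AS orthogonal,          -- 1 (similarity  0)
    '[1,0]'::vector <=> '[-4,0]'::vector AS opposite_direction;  -- 2 (similarity -1)
```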

To convert cosine distance to cosine similarity, you can subtract the distance from 1. Here is how you can write a query to show the cosine similarity for each product:
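A minimal sketch of such a query follows. As before, the table name products and the limit of five rows are assumptions rather than details given in the lesson text.

```sql
-- Assumptions: table name "products" and LIMIT 5 are illustrative.
-- ${QUERY_EMBEDDING} is a placeholder that must be substituted before running.
SELECT
    product_id,
    product_name,
    1 - (embedding <=> ${QUERY_EMBEDDING}) AS cosine_similarity
FROM products
ORDER BY embedding <=> ${QUERY_EMBEDDING}  -- ascending distance = descending similarity
LIMIT 5;
```

Ordering by ascending cosine distance yields the same order as ORDER BY cosine_similarity DESC, since one value is simply 1 minus the other.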

In this query, embedding <=> ${QUERY_EMBEDDING} calculates the cosine distance, and 1 - (...) converts it to cosine similarity. The results are ordered so that the products with the highest similarity appear first.

A sample output might look like this:

product_id	product_name	cosine_similarity
1	Wireless Mouse	0.982
7	Bluetooth Mouse	0.975
12	Ergonomic Mouse	0.970
23	USB Mouse	0.965
34	Gaming Mouse	0.960

Here, a higher cosine similarity means the product is more similar to your query in terms of direction in the embedding space. This is especially useful in semantic search, where you care about the meaning or context rather than the exact values.

Comparing and Understanding the Results

When you compare the results of the two queries, the products usually appear in the same or a very similar order, but the scores differ. If the embeddings are normalized to unit length, the two orderings are identical, because the squared L2 distance then equals 2 × (1 − cosine similarity). L2 distance measures how far apart two embeddings are in the embedding space, while cosine similarity measures how closely their directions align, regardless of their magnitudes.
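One way to compare the two metrics is to return both scores in a single result set. The sketch below assumes the same hypothetical products table and an illustrative limit of five rows.

```sql
-- Assumptions: table name "products" and LIMIT 5 are illustrative.
SELECT
    product_id,
    product_name,
    embedding <-> ${QUERY_EMBEDDING}       AS l2_distance,
    1 - (embedding <=> ${QUERY_EMBEDDING}) AS cosine_similarity
FROM products
ORDER BY l2_distance
LIMIT 5;
```

If cosine_similarity does not decrease steadily as l2_distance increases, the two metrics disagree on the ranking for those rows, which usually indicates that the embeddings differ in magnitude.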

For example, a product with a very small L2 distance or a very high cosine similarity is likely to be highly relevant to your query, while a large distance or a low similarity suggests that it is less relevant. These scores can help you set thresholds for filtering results, such as only showing products with a cosine similarity above 0.95.
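A threshold like this can be expressed in a WHERE clause. The following is a sketch under the same assumptions as before (a hypothetical products table); the 0.95 cutoff is the example value from the text, not a recommended setting.

```sql
-- Assumption: table name "products" is illustrative.
-- PostgreSQL does not allow a SELECT alias in WHERE, so the expression is repeated.
SELECT
    product_id,
    product_name,
    1 - (embedding <=> ${QUERY_EMBEDDING}) AS cosine_similarity
FROM products
WHERE 1 - (embedding <=> ${QUERY_EMBEDDING}) > 0.95  -- same as: cosine distance < 0.05
ORDER BY embedding <=> ${QUERY_EMBEDDING};
```

Unlike a LIMIT, this filter can return any number of rows, including none, depending on how many products clear the cutoff.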

It is also helpful to compare the actual values to get a feel for what is considered "close" or "similar" in your specific dataset. Over time, you will develop an intuition for what these numbers mean in practice.

Summary and What’s Next

In this lesson, you learned how to inspect the actual distance and similarity scores behind your nearest neighbor search results in pgvector. You saw how to display L2 (Euclidean) distance values and how to calculate and interpret cosine similarity scores in your SQL queries. Understanding these numbers will help you make better decisions about which products to show, set thresholds for filtering, and debug your search system.

Next, you will get a chance to practice writing and running these queries yourself. This hands-on experience will help you become more comfortable with interpreting distance and similarity scores and using them to improve your vector search results.
