Introduction

In this lesson, we will explore vector embeddings and how they can be used to compare different embedding models. Vector embeddings are numerical representations of text that capture semantic meaning, allowing us to perform tasks such as similarity comparison. We will focus on cosine similarity, a key metric for measuring how similar two vector embeddings are. By the end of this lesson, you will know how to calculate cosine similarity and understand its role in evaluating the effectiveness of embedding models.

Understanding Cosine Similarity

You are already familiar with cosine similarity from a previous practice, so let's recall and deepen this concept. Cosine similarity is a measure of similarity between two non-zero vectors: it is the cosine of the angle between them, producing a value between -1 and 1. A cosine similarity of 1 means the vectors point in the same direction, 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions. In the context of vector embeddings, cosine similarity lets us quantify how similar or different two pieces of text are based on their embeddings.

The mathematical formula for cosine similarity between two vectors $\mathbf{A}$ and $\mathbf{B}$ is:

$$\text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}$$

Where:

  • $\mathbf{A} \cdot \mathbf{B}$ is the dot product of the vectors.
  • $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ are the magnitudes (Euclidean norms) of the vectors.
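
For intuition, here is a minimal sketch that computes this formula directly with NumPy. The vectors are made-up toy values, not real embeddings:

```python
import numpy as np

def cosine_similarity_manual(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vectors' Euclidean norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors (hypothetical values, not real embeddings)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # points in the same direction as a
c = np.array([-1.0, 0.0, 1.0])

print(cosine_similarity_manual(a, b))  # 1.0 -- same direction
print(cosine_similarity_manual(a, c))  # ~0.378 -- only partially similar
```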

In the plot below, you can visually observe how the cosine of the angle between vectors indicates their similarity. A small angle results in a cosine value close to 1, signifying similar meanings. An angle around 90 degrees yields a cosine value near 0, indicating no similarity. Conversely, an angle close to 180 degrees results in a negative cosine value, reflecting opposite meanings.

Cosine similarity is widely used in natural language processing (NLP) for tasks such as semantic search, recommendation systems, and clustering. It allows us to compare the semantic meaning of different texts by evaluating the angle between their embeddings. This makes it a powerful tool for evaluating the performance of embedding models and selecting the most suitable one for specific applications.

Generating OpenAI Embeddings and Calculating Cosine Similarity

To demonstrate the use of cosine similarity, we will generate embeddings for three sentences using a pre-trained OpenAI embedding model and calculate their cosine similarity. This will help us understand how similar or different the sentences are based on their embeddings.
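
A sketch of how this might look in code is shown below. It assumes the openai Python client, cosine_similarity from scikit-learn, and a 1536-dimensional model such as text-embedding-3-small; the exact model name is an assumption, not taken from the lesson:

```python
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

sentences = {
    "anchor": "I love pizza.",
    "similar": "I enjoy pizza a lot.",
    "different": "Penguins are cute animals.",
}

def get_embedding(text: str) -> list[float]:
    # Request an embedding from a 1536-dimensional OpenAI model
    # (model name assumed; the lesson may use a different one)
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

embeddings = {name: get_embedding(text) for name, text in sentences.items()}

# cosine_similarity expects 2D inputs, so wrap each vector in a list
anchor_vs_similar = cosine_similarity([embeddings["anchor"]], [embeddings["similar"]])[0][0]
anchor_vs_different = cosine_similarity([embeddings["anchor"]], [embeddings["different"]])[0][0]

print(f"Anchor vs. Similar: {anchor_vs_similar:.4f}")
print(f"Anchor vs. Different: {anchor_vs_different:.4f}")
```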

Output:

We use cosine_similarity from sklearn.metrics.pairwise to calculate the similarity scores between the embeddings. The [0][0] indexing is used because cosine_similarity returns a 2D array (a matrix) even when comparing two single vectors. The [0][0] accesses the first element of this matrix, which contains the similarity score between the two vectors.

The output shows that for the OpenAI model, the "Anchor vs. Similar" pair has a higher cosine similarity score compared to the "Anchor vs. Different" pair, indicating that the sentences "I love pizza." and "I enjoy pizza a lot." are more semantically similar than "I love pizza." and "Penguins are cute animals."

Generating Hugging Face Embeddings and Calculating Cosine Similarity

Next, we will use a pre-trained Hugging Face model to generate embeddings for the same three sentences and calculate their cosine similarity. This will allow us to compare how well the two models capture semantic similarities and differences.
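
Here is a comparable sketch for the Hugging Face side. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model, which outputs 384-dimensional embeddings; the specific model name is an assumption consistent with the dimensions discussed below:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Model name assumed; all-MiniLM-L6-v2 produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

anchor = model.encode("I love pizza.")
similar = model.encode("I enjoy pizza a lot.")
different = model.encode("Penguins are cute animals.")

anchor_vs_similar = cosine_similarity([anchor], [similar])[0][0]
anchor_vs_different = cosine_similarity([anchor], [different])[0][0]

print(f"Anchor vs. Similar: {anchor_vs_similar:.4f}")
print(f"Anchor vs. Different: {anchor_vs_different:.4f}")
```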

Output:

For the Hugging Face model, the output also shows that the "Anchor vs. Similar" pair has a higher cosine similarity score compared to the "Anchor vs. Different" pair. It's important to note that the dimensions of the embeddings need to be the same to perform cosine similarity calculations. This is why we cannot directly compare the embeddings from the two models for a given text, as the OpenAI model produces a 1536-dimensional output, while the Hugging Face model produces a 384-dimensional output.
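
As a quick sanity check, you can inspect the embedding dimensions produced by the two sketches above; mixing them in a single cosine similarity call would fail because the vector lengths differ:

```python
# Using the variables from the two sketches above:
print(len(embeddings["anchor"]))  # 1536 -- OpenAI embedding length
print(anchor.shape)               # (384,) -- Hugging Face embedding shape

# Mixing the two would raise a ValueError, since cosine similarity
# requires both vectors to have the same number of dimensions:
# cosine_similarity([embeddings["anchor"]], [anchor])
```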

Conclusion

In this lesson, we explored the concept of cosine similarity and its importance in comparing vector embeddings. By calculating cosine similarity scores, we can effectively evaluate the semantic similarities and differences between sentences. This understanding allows us to assess the performance of different embedding models and choose the best one for our needs. As you move forward, consider experimenting with other models and datasets to deepen your understanding of embeddings and their applications. Great job on completing this lesson—let's move on to the practices and apply what you've learned!
