Introduction

Welcome back! This is the second lesson in our “Text Representation Techniques for RAG Systems” series. In our previous lesson, we introduced the Bag-of-Words (BOW) approach to converting text into numerical representations. Although BOW is intuitive and lays a solid foundation, it does not capture word order or deeper context.

Picture a helpdesk system that retrieves support tickets. Without a solid way to represent text contextually, customers searching for “account locked” might miss relevant entries labeled “login blocked” because the system can't recognize these phrases as related. This gap in understanding could lead to frustrated users and unresolved queries.

Today, we'll take a big step forward by learning to generate more expressive text embeddings — vectors that represent the semantic meaning of entire sentences. By the end of this lesson, you will know how to produce these embeddings and compare them with each other using cosine similarity.

Understanding Sentence Embeddings

Imagine you have sentences scattered across a high-dimensional space, where each sentence is a point, and closeness in this space reflects semantic similarity. Unlike BOW — which only counts word occurrences — sentence embeddings capture the relationship between words, making similar sentences land near each other in this space. This powerful feature is vital for Retrieval-Augmented Generation (RAG) systems, where retrieving text that is closest in meaning to a query drives more accurate responses.

Sentence embeddings provide a more nuanced understanding of linguistic context, going beyond simple counts of word frequencies. For example, while a BOW model might treat the sentences "I enjoy apples" and "He likes oranges" as quite different, embeddings can capture that both sentences express a personal preference for fruit. This richer representation is especially helpful in complex applications such as semantic search, recommendation engines, and advanced conversational systems, where subtle differences in meaning can greatly impact the results.

Understanding The Cosine Similarity Function

When we start turning words or sentences into vectors, we need a way to measure how similar they are. Cosine similarity is a standard approach that measures how aligned two vectors are by looking at the angle between them. In simple terms:

  • A value of 1 indicates that the vectors point in exactly the same direction (maximally similar).
  • A value of 0 means they're orthogonal (no shared direction).
  • A value of -1 shows they point in completely opposite directions.

Mathematically, cosine similarity between vectors A and B is:

\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}

where A · B is the dot product of A and B, and ‖A‖ and ‖B‖ are the magnitudes (norms) of A and B.

Translating this into code, using numpy:

Python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors.
    Range: -1 (opposite directions) to 1 (same direction).
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

Because cosine similarity is insensitive to overall vector magnitude, it measures how close two sentence embeddings are in meaning rather than in raw length or counts. Moreover, in practical text-embedding scenarios, embeddings are trained or normalized so that their cosine similarities typically fall between 0 and 1, indicating varying degrees of semantic closeness rather than exact opposites. This makes cosine similarity an ideal choice for tasks like semantic search and document retrieval.
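As a quick sanity check of that scale insensitivity, here is a minimal sketch using two made-up vectors (not embeddings from any model): scaling one of them changes its magnitude but leaves the cosine similarity untouched.

Python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

# Two small, made-up vectors used purely for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 1.0])

# Scaling a vector changes its length but not its direction,
# so the cosine similarity is unchanged.
print(cosine_similarity(a, b))       # ~0.7857
print(cosine_similarity(10 * a, b))  # same value, despite the much larger magnitude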

Loading the Sentence Transformers Library

We will use a popular tool for embedding sentences: the Sentence Transformers library. It allows us to quickly load pre-trained models that produce high-quality, semantically meaningful embeddings. These models are typically built on top of popular Transformer architectures like BERT or RoBERTa.

Below, we initialize a pre-trained model called 'sentence-transformers/all-MiniLM-L6-v2', known for producing compact, high-quality sentence embeddings:

Python
from sentence_transformers import SentenceTransformer

# Initialize a pre-trained embedding model from Sentence Transformers.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

The all-MiniLM-L6-v2 model is a compact variant of Microsoft's MiniLM. This model strikes a good balance between size and performance, making it especially useful for real-time or large-scale applications. Larger models from the transformers ecosystem may produce even more robust embeddings, but they can be more resource-intensive.
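If you want to double-check what the loaded model will produce before encoding anything, one option (assuming the model object created above) is to query it directly; for all-MiniLM-L6-v2 this typically reports 384-dimensional embeddings.

Python
# Inspect the loaded model: embedding size and maximum input length (in tokens).
print(model.get_sentence_embedding_dimension())  # e.g., 384
print(model.max_seq_length)                      # e.g., 256 tokens per input text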

Encoding Sentences into Embeddings

Now let's define some sentences, encode them into numerical vectors, and see what the resulting embeddings look like. This approach produces vectors that capture deeper semantic information compared to a BOW representation:

Python
sentences = [
    "RAG stands for Retrieval Augmented Generation.",
    "A Large Language Model is a Generative AI model for text generation.",
    "RAG enhance text generation of LLMs by incorporating external data",
    "Bananas are yellow fruits.",
    "Apples are good for your health.",
    "What's monkey's favorite food?"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g., (6, 384), depending on the model
print(embeddings[0])     # A sample embedding for the first sentence

The model.encode() function transforms each sentence into a vector, where each dimension (e.g., 384 in this example) captures some aspect of its semantic meaning. This is in stark contrast to a BOW representation, where the dimensionality corresponds to the size of the vocabulary, and crucial contextual information (like word order) may be lost. The shape (6, 384) indicates that we have 6 sentence embeddings, each of length 384. This high-dimensional space helps cluster semantically similar sentences, even if their words differ.
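To make the contrast with the previous lesson concrete, here is a small optional sketch, assuming scikit-learn is installed, that builds a Bag-of-Words matrix for the same sentences and compares its shape with the embedding matrix:

Python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-Words: one column per vocabulary word, so the width grows with the corpus vocabulary.
bow_matrix = CountVectorizer().fit_transform(sentences)
print(bow_matrix.shape)   # (6, vocabulary_size)

# Sentence embeddings: a fixed-size vector per sentence, regardless of vocabulary.
print(embeddings.shape)   # (6, 384) for all-MiniLM-L6-v2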

Comparing Sentence Embeddings with Cosine Similarity

Finally, let's compare sentences according to how similar they are. We can compute similarity scores for each pair of sentences using the cosine_similarity function introduced before:

Python
for i, sent_i in enumerate(sentences):
    # Start the inner loop from the next sentence to avoid redundant comparisons
    for j, sent_j in enumerate(sentences[i+1:], start=i+1):
        sim_score = cosine_similarity(embeddings[i], embeddings[j])
        print(f"Similarity('{sent_i}' , '{sent_j}') = {sim_score:.4f}")

Below is an example of what the output looks like. Recall that larger similarity values indicate sentences that are more closely related in meaning (sorted from highest similarity to lowest similarity, not all pairs shown):

Plain text
Similarity('A Large Language Model is a Generative AI model for text generation.' , 'RAG enhance text generation of LLMs by incorporating external data') = 0.4983
Similarity('Bananas are yellow fruits.' , 'What's monkey's favorite food?') = 0.4778
Similarity('RAG stands for Retrieval Augmented Generation.' , 'RAG enhance text generation of LLMs by incorporating external data') = 0.4630
Similarity('Bananas are yellow fruits.' , 'Apples are good for your health.') = 0.3568
...
Similarity('A Large Language Model is a Generative AI model for text generation.' , 'Bananas are yellow fruits.') = 0.0042
Similarity('RAG enhance text generation of LLMs by incorporating external data' , 'Apples are good for your health.') = 0.0025

In these results, pairs referencing similar topics (e.g., RAG and LLMs, or apples and bananas) often produce higher similarity scores, while pairs referencing different topics yield lower scores. Notice how “Bananas are yellow fruits.” and “What's monkey's favorite food?” have no overlapping words yet still show a relatively high similarity, highlighting how embeddings capture semantic meaning more effectively than a simpler BOW model, which primarily relies on word overlap.
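The sorted listing above was prepared for readability; if you want to reproduce it yourself, one straightforward approach is to collect every pair's score first and sort before printing. This sketch reuses the sentences, embeddings, and cosine_similarity function defined earlier:

Python
# Collect (score, sentence pair) tuples, then print them from most to least similar.
pairs = []
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = cosine_similarity(embeddings[i], embeddings[j])
        pairs.append((score, sentences[i], sentences[j]))

for score, sent_i, sent_j in sorted(pairs, key=lambda p: p[0], reverse=True):
    print(f"Similarity('{sent_i}' , '{sent_j}') = {score:.4f}")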

Conclusion And Next Steps

By moving beyond simpler models like Bag-of-Words, you can capture meaningful relationships between words and phrases in a sophisticated way. Sentence embeddings play a central role in Retrieval-Augmented Generation systems because they enable more precise and flexible retrieval of relevant information.

In the upcoming practice section, you will have the opportunity to apply the concepts covered in this lesson. You'll practice setting up embedding models and experimenting with sentence embeddings to understand how they capture semantic meaning. This hands-on experience will help solidify your understanding of sentence embeddings. Keep up the good work, and happy coding!
