Introduction

Welcome back! This is the second lesson in our “Text Representation Techniques for RAG Systems” series. In our previous lesson, we introduced the Bag-of-Words (BOW) approach to converting text into numerical representations. Although BOW is intuitive and lays a solid foundation, it does not capture word order or deeper context.

Picture a helpdesk system that retrieves support tickets. Without a solid way to represent text contextually, customers searching for “account locked” might miss relevant entries labeled “login blocked” because the system can't recognize these phrases as related. This gap in understanding could lead to frustrated users and unresolved queries.

Today, we'll take a big step forward by learning to generate more expressive text embeddings — vectors that represent the semantic meaning of entire sentences. By the end of this lesson, you will know how to produce these embeddings and compare them with each other using cosine similarity.

Understanding Sentence Embeddings

Imagine you have sentences scattered across a high-dimensional space, where each sentence is a point, and closeness in this space reflects semantic similarity. Unlike BOW — which only counts word occurrences — sentence embeddings capture the relationship between words, making similar sentences land near each other in this space. This powerful feature is vital for Retrieval-Augmented Generation (RAG) systems, where retrieving text that is closest in meaning to a query drives more accurate responses.

Sentence embeddings provide a more nuanced understanding of linguistic context, going beyond simple counts of word frequencies. For example, while a BOW model might treat the sentences "I enjoy apples" and "He likes oranges" as quite different, embeddings can capture that both sentences express a personal preference for fruit. This richer representation is especially helpful in complex applications such as semantic search, recommendation engines, and advanced conversational systems, where subtle differences in meaning can greatly impact the results.

Understanding The Cosine Similarity Function

When we start turning words or sentences into vectors, we need a way to measure how similar they are. Cosine similarity is a standard approach that measures how aligned two vectors are by looking at the angle between them. In simple terms:

  • A value of 1 indicates that the vectors point in exactly the same direction (maximally similar).
  • A value of 0 means they're orthogonal (no shared direction).
  • A value of -1 shows they point in completely opposite directions.

Mathematically, cosine similarity between vectors A and B is:

\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}

where A · B is the dot product of A and B, and ‖A‖ and ‖B‖ are the magnitudes (norms) of A and B.

Translating this into code, using numpy:

Python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors.
    Range: -1 (opposite directions) to 1 (same direction).
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

Because cosine similarity is insensitive to overall vector magnitude, it measures how close two sentence embeddings are in meaning rather than in raw length or counts. Moreover, in practical text-embedding scenarios, embeddings are trained or normalized so that their cosine similarities typically fall between 0 and 1, indicating varying degrees of semantic closeness rather than exact opposites. This makes cosine similarity an ideal choice for tasks like semantic search and document retrieval.
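As a quick sanity check of that scale insensitivity, here is a minimal sketch using two made-up vectors (not embeddings from any model): scaling one of them changes its magnitude but leaves the cosine similarity untouched.

Python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

# Two small, made-up vectors used purely for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 1.0])

# Scaling a vector changes its length but not its direction,
# so the cosine similarity is unchanged.
print(cosine_similarity(a, b))       # ~0.7857
print(cosine_similarity(10 * a, b))  # same value, despite the much larger magnitude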

Loading the Sentence Transformers Library

We will use a popular tool for embedding sentences: the Sentence Transformers library. It allows us to quickly load pre-trained models that produce high-quality, semantically meaningful embeddings. These models are typically built on top of popular Transformer architectures like BERT or RoBERTa.

Below, we initialize a pre-trained model called 'sentence-transformers/all-MiniLM-L6-v2', known for producing compact, high-quality sentence embeddings:

Python
from sentence_transformers import SentenceTransformer

# Initialize a pre-trained embedding model from Sentence Transformers.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

The all-MiniLM-L6-v2 model is a compact variant of Microsoft's MiniLM. This model strikes a good balance between size and performance, making it especially useful for real-time or large-scale applications. Larger models from the transformers ecosystem may produce even more robust embeddings, but they can be more resource-intensive.
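If you want to double-check what the loaded model will produce before encoding anything, one option (assuming the model object created above) is to query it directly; for all-MiniLM-L6-v2 this typically reports 384-dimensional embeddings.

Python
# Inspect the loaded model: embedding size and maximum input length (in tokens).
print(model.get_sentence_embedding_dimension())  # e.g., 384
print(model.max_seq_length)                      # e.g., 256 tokens per input text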

Encoding Sentences into Embeddings

Now let's define some sentences, encode them into numerical vectors, and see what the resulting embeddings look like. This approach produces vectors that capture deeper semantic information compared to a BOW representation:

Python
sentences = [
    "RAG stands for Retrieval Augmented Generation.",
    "A Large Language Model is a Generative AI model for text generation.",
    "RAG enhance text generation of LLMs by incorporating external data",
    "Bananas are yellow fruits.",
    "Apples are good for your health.",
    "What's monkey's favorite food?"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g., (6, 384), depending on the model
print(embeddings[0])     # A sample embedding for the first sentence

The model.encode() function transforms each sentence into a vector, where each dimension (e.g., 384 in this example) captures some aspect of its semantic meaning. This is in stark contrast to a BOW representation, where the dimensionality corresponds to the size of the vocabulary, and crucial contextual information (like word order) may be lost. The shape (6, 384) indicates that we have 6 sentence embeddings, each of length 384. This high-dimensional space helps cluster semantically similar sentences, even if their words differ.
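To make the contrast with the previous lesson concrete, here is a small optional sketch, assuming scikit-learn is installed, that builds a Bag-of-Words matrix for the same sentences and compares its shape with the embedding matrix:

Python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-Words: one column per vocabulary word, so the width grows with the corpus vocabulary.
bow_matrix = CountVectorizer().fit_transform(sentences)
print(bow_matrix.shape)   # (6, vocabulary_size)

# Sentence embeddings: a fixed-size vector per sentence, regardless of vocabulary.
print(embeddings.shape)   # (6, 384) for all-MiniLM-L6-v2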

Comparing Sentence Embeddings with Cosine Similarity

Finally, let's compare sentences according to how similar they are. We can compute similarity scores for each pair of sentences using the cosine_similarity function introduced before:

Python
for i, sent_i in enumerate(sentences):
    # Start the inner loop from the next sentence to avoid redundant comparisons
    for j, sent_j in enumerate(sentences[i+1:], start=i+1):
        sim_score = cosine_similarity(embeddings[i], embeddings[j])
        print(f"Similarity('{sent_i}' , '{sent_j}') = {sim_score:.4f}")

Below is an example of what the output looks like. Recall that larger similarity values indicate sentences that are more closely related in meaning (sorted from highest similarity to lowest similarity, not all pairs shown):

Plain text
Similarity('A Large Language Model is a Generative AI model for text generation.' , 'RAG enhance text generation of LLMs by incorporating external data') = 0.4983
Similarity('Bananas are yellow fruits.' , 'What's monkey's favorite food?') = 0.4778
Similarity('RAG stands for Retrieval Augmented Generation.' , 'RAG enhance text generation of LLMs by incorporating external data') = 0.4630
Similarity('Bananas are yellow fruits.' , 'Apples are good for your health.') = 0.3568
...
Similarity('A Large Language Model is a Generative AI model for text generation.' , 'Bananas are yellow fruits.') = 0.0042
Similarity('RAG enhance text generation of LLMs by incorporating external data' , 'Apples are good for your health.') = 0.0025

In these results, pairs referencing similar topics (e.g., RAG and LLMs, or apples and bananas) often produce higher similarity scores, while pairs referencing different topics yield lower scores. Notice how “Bananas are yellow fruits.” and “What's monkey's favorite food?” have no overlapping words yet still show a relatively high similarity, highlighting how embeddings capture semantic meaning more effectively than a simpler BOW model, which primarily relies on word overlap.
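The sorted listing above was prepared for readability; if you want to reproduce it yourself, one straightforward approach is to collect every pair's score first and sort before printing. This sketch reuses the sentences, embeddings, and cosine_similarity function defined earlier:

Python
# Collect (score, sentence pair) tuples, then print them from most to least similar.
pairs = []
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = cosine_similarity(embeddings[i], embeddings[j])
        pairs.append((score, sentences[i], sentences[j]))

for score, sent_i, sent_j in sorted(pairs, key=lambda p: p[0], reverse=True):
    print(f"Similarity('{sent_i}' , '{sent_j}') = {score:.4f}")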

Conclusion And Next Steps

By moving beyond simpler models like Bag-of-Words, you can capture meaningful relationships between words and phrases in a sophisticated way. Sentence embeddings play a central role in Retrieval-Augmented Generation systems because they enable more precise and flexible retrieval of relevant information.

In the upcoming practice section, you will have the opportunity to apply the concepts covered in this lesson. You'll practice setting up embedding models and experimenting with sentence embeddings to understand how they capture semantic meaning. This hands-on experience will help solidify your understanding of sentence embeddings. Keep up the good work, and happy coding!
