Introduction

Welcome back! This is the second lesson in our “Text Representation Techniques for RAG Systems” series. In our previous lesson, we introduced the Bag-of-Words (BOW) approach to converting text into numerical representations. Although BOW is intuitive and lays a solid foundation, it does not capture word order or deeper context.

Picture a helpdesk system that retrieves support tickets. Without a solid way to represent text contextually, customers searching for “account locked” might miss relevant entries labeled “login blocked” because the system can't recognize these phrases as related. This gap in understanding could lead to frustrated users and unresolved queries.

Today, we'll take a big step forward by learning to generate more expressive text embeddings — vectors that represent the semantic meaning of entire sentences. By the end of this lesson, you will know how to produce these embeddings and compare them with each other using cosine similarity.

Understanding Sentence Embeddings

Imagine you have sentences scattered across a high-dimensional space, where each sentence is a point, and closeness in this space reflects semantic similarity. Unlike BOW — which only counts word occurrences — sentence embeddings capture the relationship between words, making similar sentences land near each other in this space. This powerful feature is vital for Retrieval-Augmented Generation (RAG) systems, where retrieving text that is closest in meaning to a query drives more accurate responses.

Sentence embeddings provide a more nuanced understanding of linguistic context, going beyond simple counts of word frequencies. For example, while a BOW model might treat the sentences "I enjoy apples" and "He likes oranges" as quite different, embeddings can capture that both sentences express a personal preference for fruit. This richer representation is especially helpful in complex applications such as semantic search, recommendation engines, and advanced conversational systems, where subtle differences in meaning can greatly impact the results.

Understanding The Cosine Similarity Function

When we start turning words or sentences into vectors, we need a way to measure how similar they are. Cosine similarity is a standard approach that measures how aligned two vectors are by looking at the angle between them. In simple terms:

  • A value of 1 indicates that the vectors point in exactly the same direction (maximally similar).
  • A value of 0 means they're orthogonal (no shared direction).
  • A value of -1 shows they point in completely opposite directions.

Mathematically, cosine similarity between vectors A and B is:

\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\|\|B\|}

where A \cdot B is the dot product of A and B, and \|A\| and \|B\| are the magnitudes (norms) of A and B.

Translating this into code, using Java and Apache Commons Math:
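
A minimal sketch of such a helper is shown below. The class name SimilarityUtils and the toDoubles conversion are illustrative choices rather than part of any library; the method accepts float arrays because that is the format our sentence embeddings will take later in this lesson.

    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.RealVector;

    public class SimilarityUtils {

        // Cosine similarity: dot(A, B) / (||A|| * ||B||).
        public static double cosineSimilarity(float[] a, float[] b) {
            RealVector vectorA = new ArrayRealVector(toDoubles(a));
            RealVector vectorB = new ArrayRealVector(toDoubles(b));
            return vectorA.dotProduct(vectorB) / (vectorA.getNorm() * vectorB.getNorm());
        }

        // Apache Commons Math vectors are backed by doubles, so widen the float values first.
        private static double[] toDoubles(float[] values) {
            double[] result = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                result[i] = values[i];
            }
            return result;
        }
    }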

Because cosine similarity is insensitive to overall vector magnitude, it's especially useful in measuring how close two sentence embeddings are in terms of meaning rather than raw length or counts. Moreover, in practical text-embedding scenarios, embeddings are trained or normalized so that their cosine similarities typically stay between 0 and 1, indicating varying degrees of semantic closeness rather than exact opposites. This makes it an ideal choice for tasks like semantic search and document retrieval.

Loading a Pre-trained Model for Sentence Embeddings

Cosine similarity cannot be applied directly to raw text. Instead, the text must first be converted into numerical vectors, which means computing a sentence embedding for each sentence.

In Java, we can use the easy-bert library to load pre-trained models for sentence embeddings; it provides a straightforward interface for working with BERT models.

Below, we initialize a pre-trained model using easy-bert:
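
The sketch below shows one minimal setup. It assumes one of the pre-trained easy-bert model artifacts (here, the uncased English base model) is available on the classpath; adjust the resource path to match the model you have installed.

    import com.robrua.nlp.bert.Bert;

    public class LoadModelExample {

        public static void main(String[] args) throws Exception {
            // Load a pre-trained BERT model bundled on the classpath.
            // The resource path assumes the uncased base model artifact; change it if you use another model.
            try (Bert bert = Bert.load("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12")) {
                System.out.println("Model loaded and ready to produce sentence embeddings.");
            }
        }
    }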

Encoding Sentences into Embeddings

Now let's define some sentences, encode them into numerical vectors, and see what the resulting embeddings look like. This approach produces vectors that capture deeper semantic information compared to a BOW representation:
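
A minimal sketch is shown below, reusing the model loading from the previous step. The exact wording of the RAG- and LLM-related sentences is illustrative; each call returns a fixed-length float array, one per sentence.

    import com.robrua.nlp.bert.Bert;

    import java.util.Arrays;

    public class EncodeSentencesExample {

        public static void main(String[] args) throws Exception {
            String[] sentences = {
                "RAG systems retrieve relevant documents before generating an answer.",
                "Large language models generate text from a prompt.",
                "Bananas are yellow fruits.",
                "What's monkey's favorite food?"
            };

            try (Bert bert = Bert.load("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12")) {
                // embedSequences returns one embedding (a float array) per input sentence.
                float[][] embeddings = bert.embedSequences(sentences);

                for (int i = 0; i < sentences.length; i++) {
                    System.out.println(sentences[i]);
                    System.out.println("  embedding length: " + embeddings[i].length);
                    // Peek at the first few dimensions of the vector.
                    System.out.println("  first values: " + Arrays.toString(Arrays.copyOf(embeddings[i], 5)));
                }
            }
        }
    }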

Comparing Sentence Embeddings with Cosine Similarity

Finally, let's compare sentences according to how similar they are. We can compute a similarity score for each pair of sentences using the cosineSimilarity function introduced earlier:
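
A sketch of this comparison is shown below. It combines the SimilarityUtils helper from earlier with the embeddings we just produced; the class names remain illustrative, and the exact scores you see will depend on the model you loaded.

    import com.robrua.nlp.bert.Bert;

    public class CompareSentencesExample {

        public static void main(String[] args) throws Exception {
            String[] sentences = {
                "RAG systems retrieve relevant documents before generating an answer.",
                "Large language models generate text from a prompt.",
                "Bananas are yellow fruits.",
                "What's monkey's favorite food?"
            };

            try (Bert bert = Bert.load("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12")) {
                float[][] embeddings = bert.embedSequences(sentences);

                // Score every unique pair of sentences.
                for (int i = 0; i < sentences.length; i++) {
                    for (int j = i + 1; j < sentences.length; j++) {
                        double score = SimilarityUtils.cosineSimilarity(embeddings[i], embeddings[j]);
                        System.out.printf("\"%s\" vs. \"%s\" -> %.3f%n", sentences[i], sentences[j], score);
                    }
                }
            }
        }
    }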

In these results, pairs referencing similar topics (e.g., RAG and LLMs, or monkeys and bananas) often produce higher similarity scores, while pairs referencing different topics yield lower scores. Notice how “Bananas are yellow fruits.” and “What's monkey's favorite food?” have no overlapping words yet still show a relatively high similarity, highlighting how embeddings capture semantic meaning more effectively than a simpler BOW model, which primarily relies on word overlap.

Conclusion And Next Steps

By moving beyond simpler models like Bag-of-Words, you can capture meaningful relationships between words and phrases in a sophisticated way. Sentence embeddings play a central role in Retrieval-Augmented Generation systems because they enable more precise and flexible retrieval of relevant information.

In the upcoming practice section, you will have the opportunity to apply the concepts covered in this lesson. You'll practice setting up embedding models and experimenting with sentence embeddings to understand how they capture semantic meaning. This hands-on experience will help solidify your understanding of sentence embeddings. Keep up the good work, and happy coding!
