Welcome to the third lesson in our course on Text Representation Techniques for RAG systems! In our previous lesson, we explored how to generate sentence embeddings and saw how these richer representations capture semantic meaning better than the classic Bag-of-Words.
Now, we will build on that knowledge to visualize these embeddings in a two-dimensional space using t-SNE (t-distributed Stochastic Neighbor Embedding). By the end of this lesson, you'll have an interactive way to see how thematically similar sentences group closer together, reinforcing the idea that embeddings preserve meaningful relationships between sentences.
t-SNE helps us visualize high-dimensional embeddings by compressing them into a lower-dimensional space (usually 2D or 3D for visualization) while preserving relative similarities:
- Similarity First: t-SNE prioritizes keeping similar sentences close. It calculates pairwise similarities in the original space (using a probability distribution) so nearby embeddings get higher similarity scores than distant ones.
- Local Structure: It preserves neighborhoods of related points rather than exact distances. This means clusters you see reflect genuine thematic groupings (e.g., NLP vs. Food), but axis values themselves have no intrinsic meaning.
- Perplexity Matters: This parameter (~5–50) controls neighborhood size. Lower values emphasize tight clusters (good for spotting subtopics), while higher values show broader trends (useful for separating major categories), as illustrated in the sketch after this list.
- Tradeoffs: While powerful for visualization, t-SNE is computationally expensive for large datasets (as it compares all sentence pairs). For RAG systems, this makes it better suited for exploratory analysis of smaller samples than production-scale data.
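To make the perplexity tradeoff concrete, here's a minimal, self-contained sketch that runs scikit-learn's `TSNE` with a low and a high perplexity value; it uses random vectors as a stand-in for real sentence embeddings, purely for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for sentence embeddings: 60 "sentences" in a 384-dimensional space
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(60, 384))

# Lower perplexity emphasizes tight local neighborhoods; higher perplexity favors broader structure
for perplexity in (5, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    points_2d = tsne.fit_transform(fake_embeddings)
    print(f"perplexity={perplexity} -> 2D points with shape {points_2d.shape}")
```

Both runs return one 2D coordinate pair per input vector; only the arrangement of the points changes. Later in the lesson, we'll apply the same idea to real sentence embeddings.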
You may be asking yourself, why does this matter for RAG? Seeing embeddings cluster by topic validates they're capturing semantic relationships – a prerequisite for effective retrieval. If NLP sentences scattered randomly, we'd question the embedding quality before even building the RAG pipeline, prompting us to reevaluate the choice of the embedding model.
To demonstrate how t-SNE reveals natural groupings, we'll gather sentences on four different topics: NLP, ML, Food, and Weather. Then, we assign each sentence a category so we can later color-code and shape-code the points in our 2D visualization.
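Here's a minimal sketch of how this setup might look; the specific sentences and the helper names `get_sentences_and_categories` and `get_category_styles` are illustrative assumptions rather than a definitive implementation:

```python
def get_sentences_and_categories():
    """Return example sentences and a parallel list of topic labels."""
    sentences = [
        # NLP
        "Tokenization splits raw text into smaller units.",
        "Word embeddings map tokens to dense vectors.",
        "GPT models generate fluent natural language text.",
        # ML
        "Gradient descent iteratively minimizes a loss function.",
        "Cross-validation estimates how well a model generalizes.",
        "Overfitting happens when a model memorizes its training data.",
        # Food
        "Fresh basil gives tomato sauce a fragrant finish.",
        "Slow-roasted vegetables caramelize and turn sweet.",
        "A pinch of salt balances the bitterness of dark chocolate.",
        # Weather
        "A cold front will bring heavy rain tonight.",
        "The forecast predicts sunny skies all weekend.",
        "Strong winds and hail are expected in the afternoon.",
    ]
    categories = ["NLP"] * 3 + ["ML"] * 3 + ["Food"] * 3 + ["Weather"] * 3
    return sentences, categories


def get_category_styles():
    """Return per-category colors and marker shapes for the scatter plot."""
    colors = {"NLP": "red", "ML": "blue", "Food": "green", "Weather": "purple"}
    markers = {"NLP": "o", "ML": "s", "Food": "^", "Weather": "x"}
    return colors, markers
```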
Here's what's going on:
- The first function returns two lists: one with sentences, another labeling each sentence's category.
- The second function creates two dictionaries that tell the plotting function which colors and marker shapes to use per category (e.g., “red circles” for NLP).
Next, we encode the sentences into embeddings and then reduce them to two dimensions using t-SNE:
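A sketch of this step, assuming the widely used `all-MiniLM-L6-v2` model and illustrative hyperparameter values (in recent scikit-learn releases the `n_iter` argument is named `max_iter`):

```python
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

sentences, categories = get_sentences_and_categories()

# Load a pre-trained sentence embedding model (downloads the weights on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each sentence becomes a high-dimensional vector (384 dimensions for this model)
embeddings = model.encode(sentences)

# Configure t-SNE to compress the embeddings down to two dimensions
tsne = TSNE(
    n_components=2,   # target dimensionality for plotting
    perplexity=5,     # small neighborhood size, since we only have a handful of sentences
    n_iter=1000,      # number of optimization iterations (max_iter in newer scikit-learn)
    random_state=42,  # fixed seed for a reproducible layout
)
embeddings_2d = tsne.fit_transform(embeddings)
```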
Let's break down the process:
- First, we instantiate a `SentenceTransformer` model, which downloads the neural network weights needed to generate embeddings.
- Next, we call `model.encode(sentences)` to convert each sentence into a high-dimensional vector. Each dimension captures some aspect of the sentence's meaning.
- We then create an instance of `TSNE` (from scikit-learn) and configure hyperparameters like `perplexity` (which influences how the algorithm balances local and global aspects of the data), `n_iter` (the number of optimization iterations), and `random_state` (for reproducible results).
- Finally, we call `tsne.fit_transform(embeddings)` to reduce these high-dimensional vectors into a 2D representation that attempts to preserve the distances between similar points.
Once we have these 2D points, it's time to plot them so we can see how topics naturally cluster. We'll also annotate each point with a short identifier to make it easy to see what sentence it represents.
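A sketch of the plotting step, reusing the styling helpers and the 2D coordinates from above (the figure size and label format are illustrative choices):

```python
import matplotlib.pyplot as plt

colors, markers = get_category_styles()

plt.figure(figsize=(10, 7))
for i, (x, y) in enumerate(embeddings_2d):
    category = categories[i]
    # Place each sentence at its 2D t-SNE coordinates, styled by category
    plt.scatter(x, y, c=colors[category], marker=markers[category])
    # Annotate the point with a short identifier, e.g. "NLP-2"
    plt.annotate(f"{category}-{i}", (x, y), textcoords="offset points", xytext=(5, 5), fontsize=8)

# Draw empty points (no data) purely to build a readable legend, one entry per category
for category, color in colors.items():
    plt.scatter([], [], c=color, marker=markers[category], label=category)
plt.legend()

plt.title("t-SNE Visualization of Sentence Embeddings")
plt.show()
```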
Here's how it works:
- We use `plt.scatter` to place each sentence in the plot.
- The text label helps you identify each point's approximate topic.
- We build a readable legend by drawing empty points for each category.
The resulting t-SNE visualization beautifully reveals how sentence embeddings capture semantic similarity. Each topic forms a distinct cluster: NLP (red circles), ML (blue squares), Food (green triangles), and Weather (purple X's). Related sentences naturally group together, while unrelated ones sit farther apart, confirming that embeddings effectively preserve meaning.
Some overlap may occur where topics share common words or contexts, such as a sentence about "GPT models" landing close to ML-related points. Also notice how the ML and NLP clusters are generally closer to each other than, say, ML and Weather, or Weather and Food. This highlights how embeddings capture not just topic boundaries but also the degree of relatedness between concepts. Overall, this plot provides an intuitive way to explore text data, offering a glimpse into the underlying structure that makes modern NLP so powerful.
You've now seen how to represent text data with embeddings, reduce those embeddings to reveal a simpler underlying structure, and visualize them to uncover meaningful relationships.
Equipped with this knowledge, you can now create plots where related sentences cluster closely, confirming that embeddings capture meaningful relationships. This visualization is often quite revealing when debugging or exploring text data. In the next practice session, you'll get the chance to experiment with the code and see how changing its parameters affects the final plot.
Give it a try, and have fun discovering the hidden patterns in your text!
