Welcome to the third lesson in our course on Text Representation Techniques for RAG systems! In our previous lesson, we explored how to generate sentence embeddings and saw how these richer representations capture semantic meaning better than the classic Bag-of-Words.
Now, we will build on that knowledge to visualize these embeddings in a two-dimensional space using t-SNE (t-distributed Stochastic Neighbor Embedding). By the end of this lesson, you'll have an interactive way to see how thematically similar sentences group closer together, reinforcing the idea that embeddings preserve meaningful relationships between sentences.
t-SNE helps us visualize high-dimensional embeddings by compressing them into a lower-dimensional space (usually 2D or 3D for visualization) while preserving relative similarities:
- Similarity First: t-SNE prioritizes keeping similar sentences close. It calculates pairwise similarities in the original space (using a probability distribution) so nearby embeddings get higher similarity scores than distant ones (a toy sketch of this computation follows the list).
- Local Structure: It preserves neighborhoods of related points rather than exact distances. This means clusters you see reflect genuine thematic groupings (e.g., NLP vs. Food), but axis values themselves have no intrinsic meaning.
- Perplexity Matters: This parameter (~5–50) controls neighborhood size. Lower values emphasize tight clusters (good for spotting subtopics), while higher values show broader trends (useful for separating major categories).
- Tradeoffs: While powerful for visualization, t-SNE is computationally expensive for large datasets (as it compares all sentence pairs). For RAG systems, this makes it better suited for exploratory analysis of smaller samples than production-scale data.
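To make the "Similarity First" idea concrete, here is a toy Rust sketch (not the lesson's code) of how the input-space affinities can be computed for a single point. In real t-SNE, `sigma` is tuned per point by binary search so the resulting distribution matches the chosen perplexity; here it is a fixed assumption:

```rust
// Toy sketch of t-SNE's input-space similarities: Gaussian affinities from
// point i to every other point, normalized into a probability distribution.
// `sigma` is assumed fixed; real t-SNE tunes it per point to match perplexity.
fn conditional_similarities(points: &[Vec<f32>], i: usize, sigma: f32) -> Vec<f32> {
    let dist2 = |a: &[f32], b: &[f32]| -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
    };
    // Unnormalized Gaussian affinity to every other point (zero to itself).
    let mut p: Vec<f32> = points
        .iter()
        .enumerate()
        .map(|(j, pj)| {
            if j == i {
                0.0
            } else {
                (-dist2(&points[i], pj) / (2.0 * sigma * sigma)).exp()
            }
        })
        .collect();
    // Normalize so nearby points get high probability and distant ones low.
    let total: f32 = p.iter().sum();
    if total > 0.0 {
        for v in &mut p {
            *v /= total;
        }
    }
    p
}
```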
You may be asking yourself: why does this matter for RAG? Seeing embeddings cluster by topic validates that they're capturing semantic relationships – a prerequisite for effective retrieval. If NLP sentences were scattered randomly, we'd question the embedding quality before even building the RAG pipeline, prompting us to reevaluate our choice of embedding model.
We’ll construct a dataset of 32 sentences, divided evenly among four topics: NLP, ML, Food, and Weather. We’ll also assign each sentence a category label so that we can color and shape each point on our final plot.
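A minimal sketch of this construction is below. The `build_dataset` name and the sentences themselves are illustrative stand-ins; the lesson's dataset has eight sentences per topic, so only two per topic are shown here:

```rust
// Hypothetical builder for the lesson's dataset: returns parallel vectors
// of sentences and their category labels (shortened to two per topic).
fn build_dataset() -> (Vec<String>, Vec<String>) {
    let topics: [(&str, [&str; 2]); 4] = [
        ("NLP", [
            "Tokenization splits text into smaller units.",
            "Word embeddings map tokens to dense vectors.",
        ]),
        ("ML", [
            "Gradient descent minimizes a loss function.",
            "Overfitting hurts generalization to unseen data.",
        ]),
        ("Food", [
            "Fresh basil brightens a simple tomato sauce.",
            "Slow roasting brings out vegetables' sweetness.",
        ]),
        ("Weather", [
            "A cold front will bring rain tomorrow.",
            "High humidity makes the afternoon feel hotter.",
        ]),
    ];

    let mut sentences = Vec::new();
    let mut categories = Vec::new();
    for (category, examples) in &topics {
        for sentence in examples {
            sentences.push(sentence.to_string());
            categories.push(category.to_string());
        }
    }
    (sentences, categories)
}
```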
This function outputs two parallel vectors: one with sentences and one with their associated category. These categories will later control the color and shape used for plotting.
To turn our text into a numerical form, we'll first encode each sentence into an embedding, then reduce it using t-SNE. Here's how we do that in Rust:
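The sketch below assumes that `SentenceEmbedder` is a thin wrapper over rust-bert's sentence-embeddings pipeline and that the `bhtsne` crate's builder API (0.5-style) is used for the reduction; the wrapper and the `reduce_to_2d` helper are illustrative, not the lesson's exact code:

```rust
use rust_bert::pipelines::sentence_embeddings::{
    SentenceEmbeddingsBuilder, SentenceEmbeddingsModel, SentenceEmbeddingsModelType,
};

// Assumed wrapper matching the lesson's `SentenceEmbedder`.
struct SentenceEmbedder {
    model: SentenceEmbeddingsModel,
}

impl SentenceEmbedder {
    fn new() -> Result<Self, Box<dyn std::error::Error>> {
        // Downloads the all-MiniLM-L6-v2 weights on first use.
        let model =
            SentenceEmbeddingsBuilder::remote(SentenceEmbeddingsModelType::AllMiniLmL6V2)
                .create_model()?;
        Ok(Self { model })
    }

    // Encodes each sentence into a 384-dimensional vector.
    fn embed_texts(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, Box<dyn std::error::Error>> {
        Ok(self.model.encode(texts)?)
    }
}

// Reduces the 384-dimensional embeddings to 2D with t-SNE; bhtsne returns
// a flat buffer [x0, y0, x1, y1, ...].
fn reduce_to_2d(sentences: &[String]) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    let embedder = SentenceEmbedder::new()?;
    let embeddings = embedder.embed_texts(sentences)?;

    let samples: Vec<&[f32]> = embeddings.iter().map(|e| e.as_slice()).collect();
    let reduced: Vec<f32> = bhtsne::tSNE::new(&samples)
        .embedding_dim(2) // target space: 2D for plotting
        .perplexity(10.0) // neighborhood size
        .epochs(3000) // optimization iterations
        .barnes_hut(0.5, |a, b| {
            // Euclidean distance in the original 384-dimensional space;
            // theta = 0.5 is an assumed approximation accuracy.
            a.iter()
                .zip(b.iter())
                .map(|(x, y)| (x - y).powi(2))
                .sum::<f32>()
                .sqrt()
        })
        .embedding();
    Ok(reduced)
}
```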
What's happening here:

- First, we instantiate a `SentenceEmbedder`, which downloads the all-MiniLM-L6-v2 model weights. This neural network has been trained to convert sentences into meaningful vector representations.
- We encode our sentences using `embed_texts()`, which transforms each sentence into a 384-dimensional vector. Each dimension captures some semantic aspect of the sentence's meaning, like topic, sentiment, or syntax.
- We create a t-SNE instance from the `bhtsne` crate and configure key hyperparameters:
  - `perplexity`: controls how the algorithm balances local and global structure (set to 10.0)
  - `epochs`: the number of optimization iterations (set to 3000)
  - `barnes_hut`: uses a tree-based approximation for faster computation
- Finally, we call `embedding()` to perform the dimensionality reduction. This transforms our 384-dimensional vectors into 2D coordinates while trying to preserve the relationships between similar sentences.
Once we've computed the 2D coordinates, we want to visualize them by category using color and shape. This visualization helps us understand the thematic groupings of our sentences. Here's one way to do that using the `plotters` crate:
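First, a small helper (the name `prepare_points` is illustrative) can split bhtsne's flat output buffer into (x, y) points and compute padded bounds for the axes:

```rust
// Sketch: convert bhtsne's flat [x0, y0, x1, y1, ...] output into points
// and compute padded axis bounds for the chart.
fn prepare_points(flat: &[f32]) -> (Vec<(f32, f32)>, (f32, f32), (f32, f32)) {
    let points: Vec<(f32, f32)> = flat.chunks(2).map(|c| (c[0], c[1])).collect();

    let (mut min_x, mut max_x) = (f32::MAX, f32::MIN);
    let (mut min_y, mut max_y) = (f32::MAX, f32::MIN);
    for &(x, y) in &points {
        min_x = min_x.min(x);
        max_x = max_x.max(x);
        min_y = min_y.min(y);
        max_y = max_y.max(y);
    }

    // Pad the ranges so no marker sits exactly on the plot border.
    (points, (min_x - 2.0, max_x + 2.0), (min_y - 2.0, max_y + 2.0))
}
```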
This setup prepares the data for plotting by extracting the x and y coordinates and computing padded axis bounds.
With the coordinates ready, we proceed to set up the chart for visualization. The chart is configured to display the t-SNE visualization of sentence embeddings, with axes and labels set up for clarity.
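One possible configuration is sketched below; the output file name, canvas size, fonts, and margins are illustrative choices rather than values fixed by the lesson:

```rust
use plotters::prelude::*;

// Sketch of the chart setup: renders an empty, fully configured chart.
// The drawing of the actual points follows in the next step.
fn setup_chart_example(
    (min_x, max_x): (f32, f32),
    (min_y, max_y): (f32, f32),
) -> Result<(), Box<dyn std::error::Error>> {
    let root = BitMapBackend::new("tsne.png", (960, 720)).into_drawing_area();
    root.fill(&WHITE)?;

    let mut chart = ChartBuilder::on(&root)
        .caption("t-SNE Visualization of Sentence Embeddings", ("sans-serif", 24).into_font())
        .margin(20)
        .x_label_area_size(40)
        .y_label_area_size(40)
        .build_cartesian_2d(min_x..max_x, min_y..max_y)?;

    // t-SNE axis values carry no intrinsic meaning, so the labels are generic.
    chart
        .configure_mesh()
        .x_desc("t-SNE dimension 1")
        .y_desc("t-SNE dimension 2")
        .draw()?;

    root.present()?;
    Ok(())
}
```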
This configuration ensures the plot is well-framed and ready for drawing the embeddings.
Finally, we draw each sentence using a different shape and color depending on its category. Each shape is positioned using its 2D coordinates and labeled using the start of the sentence to keep the plot readable.
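Below is a sketch of that drawing step; it repeats the chart setup so the function compiles on its own. Since plotters ships circle, cross, and triangle markers out of the box (squares and diamonds would need custom composed elements), this version approximates the lesson's four shapes with the built-in markers plus per-category colors; `PURPLE` is taken from plotters' `full_palette` module:

```rust
use plotters::prelude::*;
use plotters::style::full_palette::PURPLE;

// Sketch: draw each sentence as a colored marker at its 2D coordinates,
// labeled with the start of the sentence. `render_plot` is an illustrative name.
fn render_plot(
    points: &[(f32, f32)],
    categories: &[String],
    sentences: &[String],
    (min_x, max_x): (f32, f32),
    (min_y, max_y): (f32, f32),
) -> Result<(), Box<dyn std::error::Error>> {
    let root = BitMapBackend::new("tsne.png", (960, 720)).into_drawing_area();
    root.fill(&WHITE)?;
    let mut chart = ChartBuilder::on(&root)
        .caption("t-SNE Visualization of Sentence Embeddings", ("sans-serif", 24).into_font())
        .margin(20)
        .x_label_area_size(40)
        .y_label_area_size(40)
        .build_cartesian_2d(min_x..max_x, min_y..max_y)?;
    chart.configure_mesh().draw()?;

    // One series per category, each with its own marker shape and color.
    let by_cat = |cat: &'static str| {
        points
            .iter()
            .zip(categories.iter())
            .filter(move |(_, c)| c.as_str() == cat)
            .map(|(&p, _)| p)
    };
    chart.draw_series(by_cat("NLP").map(|(x, y)| Circle::new((x, y), 4, RED.filled())))?;
    chart.draw_series(by_cat("ML").map(|(x, y)| Cross::new((x, y), 4, BLUE.filled())))?;
    chart.draw_series(by_cat("Food").map(|(x, y)| TriangleMarker::new((x, y), 5, GREEN.filled())))?;
    chart.draw_series(by_cat("Weather").map(|(x, y)| Circle::new((x, y), 4, PURPLE.filled())))?;

    // Label each point with the first characters of its sentence.
    chart.draw_series(points.iter().zip(sentences.iter()).map(|(&(x, y), s)| {
        let label: String = s.chars().take(15).collect();
        Text::new(label, (x, y), ("sans-serif", 12).into_font())
    }))?;

    root.present()?;
    Ok(())
}
```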
This function completes the visualization process, providing a clear and informative plot of the sentence embeddings.
The resulting plot helps us judge whether our embedding model clusters semantically similar sentences. If it does, related sentences group together while unrelated ones sit farther apart, confirming that the embeddings preserve meaning.
We expect to see four main clusters: red circles for NLP, blue squares for ML, green triangles for Food, and purple diamonds for Weather:
Some overlap may occur where topics share common words or contexts, such as a sentence about "GPT models" sitting close to ML-related points. Also notice how the ML and NLP clusters are generally closer to each other than, say, ML and Weather, or Weather and Food. This highlights how embeddings sometimes capture subtle, unexpected connections between concepts. Overall, this plot provides an intuitive way to explore text data, offering a glimpse into the underlying structure that makes modern NLP so powerful.
You've now seen how to represent text data with embeddings, reduce those embeddings to reveal a simpler underlying structure, and visualize them to uncover meaningful relationships.
Equipped with this knowledge, you can now create plots where related sentences cluster closely, confirming that embeddings capture meaningful relationships. This kind of visualization is often quite revealing when debugging or exploring text data. In the next practice session, you'll get the chance to experiment with the code (for example, by changing the perplexity) and see how those changes affect the final plot.
Give it a try, and have fun discovering the hidden patterns in your text!
