Introduction

Welcome to the third lesson in our course on Text Representation Techniques for RAG systems! In our previous lesson, we explored how to generate sentence embeddings and saw how these richer representations capture semantic meaning better than the classic Bag-of-Words.

Now, we will build on that knowledge to visualize these embeddings in a two-dimensional space using t-SNE (t-distributed Stochastic Neighbor Embedding). By the end of this lesson, you'll have an interactive way to see how thematically similar sentences group closer together, reinforcing the idea that embeddings preserve meaningful relationships between sentences.

Understanding t-SNE

t-SNE helps us visualize high-dimensional embeddings by compressing them into a lower-dimensional space (usually 2D or 3D for visualization) while preserving relative similarities:

  • Similarity First: t-SNE prioritizes keeping similar sentences close. It calculates pairwise similarities in the original space (using a probability distribution) so nearby embeddings get higher similarity scores than distant ones.
  • Local Structure: It preserves neighborhoods of related points rather than exact distances. This means clusters you see reflect genuine thematic groupings (e.g., NLP vs. Food), but axis values themselves have no intrinsic meaning.
  • Perplexity Matters: This parameter (~5–50) controls neighborhood size. Lower values emphasize tight clusters (good for spotting subtopics), while higher values show broader trends (useful for separating major categories).
  • Tradeoffs: While powerful for visualization, t-SNE is computationally expensive for large datasets (as it compares all sentence pairs). For RAG systems, this makes it better suited for exploratory analysis of smaller samples than production-scale data.

You may be asking yourself: why does this matter for RAG? Seeing embeddings cluster by topic validates that they're capturing semantic relationships – a prerequisite for effective retrieval. If NLP sentences were scattered randomly, we'd question the embedding quality before even building the RAG pipeline, prompting us to reevaluate our choice of embedding model.

Building Our Data

We’ll construct a dataset of 32 sentences, divided evenly among four topics: NLP, ML, Food, and Weather. We’ll also assign each sentence a category label so that we can color and shape each point on our final plot.
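
Below is a minimal sketch of such a function. The sentences shown are illustrative placeholders (the real dataset has 8 per topic, 32 in total), and the name build_dataset is just a stand-in:

```rust
// Build the sentences and a parallel vector of category labels.
// Abridged: only two placeholder sentences per topic are shown here;
// the full dataset uses 8 per topic (32 sentences in total).
fn build_dataset() -> (Vec<String>, Vec<String>) {
    let topics: &[(&str, &[&str])] = &[
        ("NLP", &[
            "Tokenization splits text into smaller units.",
            "Transformer models power modern language understanding.",
        ]),
        ("ML", &[
            "Gradient descent minimizes the loss function.",
            "Overfitting happens when a model memorizes the training data.",
        ]),
        ("Food", &[
            "Fresh basil makes tomato soup taste brighter.",
            "Sourdough bread needs a long, slow fermentation.",
        ]),
        ("Weather", &[
            "A cold front will bring rain this weekend.",
            "Humidity makes summer afternoons feel hotter.",
        ]),
    ];

    let mut sentences = Vec::new();
    let mut categories = Vec::new();
    for (category, topic_sentences) in topics {
        for sentence in *topic_sentences {
            sentences.push(sentence.to_string());
            categories.push(category.to_string());
        }
    }
    (sentences, categories)
}
```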

This function outputs two parallel vectors: one with sentences and one with their associated category. These categories will later control the color and shape used for plotting.

Generating and Reducing Embeddings

To turn our text into numerical form, we’ll first encode each sentence into an embedding and then reduce those embeddings to two dimensions with t-SNE. Here's how we do that in Rust:
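
The sketch below assumes the SentenceEmbedder helper from the previous lesson (its exact constructor and the embed_texts() signature may differ in your setup) and the builder-style API of the bhtsne crate; it also assumes the code runs inside a main that returns Result<(), Box<dyn std::error::Error>> so that the ? operator works:

```rust
// Assumed to run inside: fn main() -> Result<(), Box<dyn std::error::Error>>

// 32 sentences plus their category labels, from the function sketched above.
let (sentences, categories) = build_dataset();

// Load all-MiniLM-L6-v2 and encode every sentence into a 384-dimensional vector.
// `SentenceEmbedder` is the helper from the previous lesson; its exact API is assumed here.
let embedder = SentenceEmbedder::new()?;
let embeddings: Vec<Vec<f32>> = embedder.embed_texts(&sentences)?;

// bhtsne expects one flat slice per sample.
let samples: Vec<&[f32]> = embeddings.iter().map(|e| e.as_slice()).collect();

// Reduce 384 dimensions down to 2 with Barnes-Hut t-SNE.
let embedding: Vec<f32> = bhtsne::tSNE::new(&samples)
    .embedding_dim(2)   // target space: 2D for plotting
    .perplexity(10.0)   // neighborhood size
    .epochs(3000)       // optimization iterations
    .barnes_hut(0.5, |a, b| {
        // Euclidean distance between two sentence embeddings.
        a.iter()
            .zip(b.iter())
            .map(|(x, y)| (x - y).powi(2))
            .sum::<f32>()
            .sqrt()
    })
    .embedding();
```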

What's happening here:

  1. First, we instantiate a SentenceEmbedder which downloads the all-MiniLM-L6-v2 model weights. This neural network has been trained to convert sentences into meaningful vector representations.

  2. We encode our sentences using embed_texts(), which transforms each sentence into a 384-dimensional vector. Each dimension captures some semantic aspect of the sentence's meaning, like topic, sentiment, or syntax.

  3. We create a t-SNE instance from the bhtsne crate and configure key hyperparameters:

    • perplexity: Controls how the algorithm balances local and global structure (set to 10.0)
    • epochs: Number of optimization iterations (set to 3000)
    • barnes_hut: Uses tree-based approximation for faster computation
  4. Finally, we call embedding() to perform the dimensionality reduction. This transforms our 384-dimensional vectors into 2D coordinates while trying to preserve the relationships between similar sentences.

Plotting the Embeddings

Once we’ve computed the 2D coordinates, we want to visualize them by category using color and shape. This visualization helps us understand the thematic groupings of our sentences. Here’s one way we can do that using the plotters crate:
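
First, a small sketch that regroups the flat output of the bhtsne step into one (x, y) pair per sentence (the variable name points is just for illustration):

```rust
// bhtsne returns the 2D embedding as a flat Vec<f32> of interleaved
// x and y values; regroup it into one (x, y) tuple per sentence.
let points: Vec<(f32, f32)> = embedding
    .chunks(2)
    .map(|pair| (pair[0], pair[1]))
    .collect();
```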

This setup prepares the data for plotting by extracting x and y coordinates.

Setting Up the Chart

With the coordinates ready, we proceed to set up the chart for visualization. The chart is configured to display the t-SNE visualization of sentence embeddings, with axes and labels set up for clarity.
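
A sketch of that setup with the plotters crate might look like the following; the output file name, figure size, margins, and the extra padding around the axis ranges are all arbitrary choices:

```rust
use plotters::prelude::*;

// Find the bounding box of the 2D points so the axes frame all of them.
let (x_min, x_max) = points
    .iter()
    .fold((f32::MAX, f32::MIN), |(lo, hi), (x, _)| (lo.min(*x), hi.max(*x)));
let (y_min, y_max) = points
    .iter()
    .fold((f32::MAX, f32::MIN), |(lo, hi), (_, y)| (lo.min(*y), hi.max(*y)));

// Draw onto a PNG file with a white background.
let root = BitMapBackend::new("tsne_plot.png", (900, 700)).into_drawing_area();
root.fill(&WHITE)?;

// Build a 2D chart with a caption, labeled axes, and a little padding
// beyond the data range so points near the edge stay visible.
let mut chart = ChartBuilder::on(&root)
    .caption("t-SNE visualization of sentence embeddings", ("sans-serif", 28))
    .margin(20)
    .x_label_area_size(40)
    .y_label_area_size(40)
    .build_cartesian_2d((x_min - 2.0)..(x_max + 2.0), (y_min - 2.0)..(y_max + 2.0))?;

chart
    .configure_mesh()
    .x_desc("t-SNE dimension 1")
    .y_desc("t-SNE dimension 2")
    .draw()?;
```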

This configuration ensures the plot is well-framed and ready for drawing the embeddings.

Drawing the Embeddings

Finally, we draw each sentence using a different shape and color depending on its category. Each shape is positioned using its 2D coordinates and labeled using the start of the sentence to keep the plot readable.
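
Here is one possible sketch of that drawing loop. For brevity it uses colored circles for every category rather than four distinct marker shapes (plotters also offers TriangleMarker and Cross if you want per-category shapes), and the 18-character label length is an arbitrary choice:

```rust
// One series per category: filter the points, then draw a colored marker
// and a short text label (the start of the sentence) at each 2D position.
let palette = [
    ("NLP", RED),
    ("ML", BLUE),
    ("Food", GREEN),
    ("Weather", MAGENTA),
];

for (name, color) in palette {
    chart
        .draw_series(
            points
                .iter()
                .zip(categories.iter())
                .zip(sentences.iter())
                .filter(|((_, category), _)| category.as_str() == name)
                .map(|(((x, y), _), sentence)| {
                    // Truncate the sentence so labels stay readable.
                    let label: String = sentence.chars().take(18).collect();
                    EmptyElement::at((*x, *y))
                        + Circle::new((0, 0), 4, color.filled())
                        + Text::new(label, (6, -6), ("sans-serif", 12))
                }),
        )?
        .label(name)
        .legend(move |(x, y)| Circle::new((x, y), 4, color.filled()));
}

// Draw the legend and write the finished image to disk.
chart
    .configure_series_labels()
    .border_style(&BLACK)
    .draw()?;
root.present()?;
```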

This step completes the visualization process, producing a clear and informative plot of the sentence embeddings.

Interpreting the Plot

The resulting plot helps us judge whether our embedding model places semantically similar sentences near one another. When it does, related sentences group together while unrelated ones sit farther apart, confirming that the embeddings preserve meaning.

We expect to see four main clusters: red circles for NLP, blue squares for ML, green triangles for Food, and purple diamonds for Weather:

Some overlap may occur where topics share common words or contexts, such as a sentence about "GPT models" being pretty close to ML-related points. Also notice how the ML and NLP clusters are generally closer than, say, ML and Weather, or Weather and Food. This highlights how embeddings sometimes capture subtle, unexpected connections between concepts. Overall, this plot provides an intuitive way to explore text data, offering a glimpse into the underlying structure that makes modern NLP so powerful.

Conclusion and Next Steps

You've now seen how to represent text data with embeddings, reduce those embeddings to reveal a simpler underlying structure, and visualize them to uncover meaningful relationships.

Equipped with this knowledge, you can now create plots where related sentences cluster closely, confirming that embeddings capture meaningful relationships. This visualization is often quite revealing when debugging or exploring text data. In the next practice session, you'll get the chance to experiment with the code and see how your changes affect the final plot.

Give it a try, and have fun discovering the hidden patterns in your text!
