Welcome to the third lesson in our course on Text Representation Techniques for RAG systems! In our previous lesson, we explored how to generate sentence embeddings and saw how these richer representations capture semantic meaning better than the classic Bag-of-Words.
Now, we will build on that knowledge to visualize these embeddings in a two-dimensional space using t-SNE (t-distributed Stochastic Neighbor Embedding). By the end of this lesson, you'll have an interactive way to see how thematically similar sentences group closer together, reinforcing the idea that embeddings preserve meaningful relationships between sentences.
t-SNE helps us visualize high-dimensional embeddings by compressing them into a lower-dimensional space (usually 2D or 3D) while preserving relative similarities:
- Similarity First: t-SNE prioritizes keeping similar sentences close. It calculates pairwise similarities in the original space (using a probability distribution) so nearby embeddings get higher similarity scores than distant ones.
- Local Structure: It preserves neighborhoods of related points rather than exact distances. This means clusters you see reflect genuine thematic groupings (e.g., NLP vs. Food), but axis values themselves have no intrinsic meaning.
- Perplexity Matters: This parameter (~5–50) controls neighborhood size. Lower values emphasize tight clusters (good for spotting subtopics), while higher values show broader trends (useful for separating major categories); the sketch after this list shows how to compare a few values side by side.
- Tradeoffs: While powerful for visualization, t-SNE is computationally expensive for large datasets (as it compares all sentence pairs). For RAG systems, this makes it better suited for exploratory analysis of smaller samples than production-scale data.
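If you'd like to see the effect of perplexity for yourself, here is a minimal sketch that encodes a list of sentences once and projects them with a few different perplexity values, plotted side by side. The helper name `compare_perplexities` and the output file name are just illustrative; the model is assumed to be the same `all-MiniLM-L6-v2` used later in this lesson.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def compare_perplexities(sentences, perplexities=(5, 15, 30)):
    # Encode the sentences once, then project them with several perplexity values.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)

    fig, axes = plt.subplots(1, len(perplexities), figsize=(15, 5))
    for ax, perp in zip(axes, perplexities):
        # Note: perplexity must be smaller than the number of sentences.
        reduced = TSNE(n_components=2, perplexity=perp,
                       random_state=42).fit_transform(embeddings)
        ax.scatter(reduced[:, 0], reduced[:, 1])
        ax.set_title(f"perplexity={perp}")

    plt.tight_layout()
    plt.savefig("perplexity_comparison.png")  # Hypothetical output file name.
```

With a small corpus like the one below, lower perplexities tend to produce tighter, more fragmented clusters, while higher values merge them into broader groups.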
You may be asking yourself: why does this matter for RAG? Seeing embeddings cluster by topic validates that they're capturing semantic relationships – a prerequisite for effective retrieval. If NLP sentences were scattered randomly, we'd question the embedding quality before even building the RAG pipeline, prompting us to reevaluate the choice of embedding model.
To demonstrate how t-SNE reveals natural groupings, we'll gather sentences on four different topics: NLP, ML, Food, and Weather. Then, we assign each sentence a category so we can later color-code and shape-code the points in our 2D visualization.
```python
def get_sentences_and_categories():
    """
    Return the sentences and their corresponding categories.
    """
    sentences = [
        # Topic: NLP
        "RAG stands for Retrieval-Augmented Generation.",
        "Retrieval is a crucial aspect of modern NLP systems.",
        "Generating text with correct facts is challenging.",
        "Large language models can generate coherent text.",
        "GPT models have billions of parameters.",
        "Natural Language Processing enables computers to understand human language.",
        "Word embeddings capture semantic relationships between words.",
        "Transformer architectures revolutionized NLP research.",

        # Topic: Machine Learning
        "Machine learning benefits from large datasets.",
        "Supervised learning requires labeled data.",
        "Reinforcement learning is inspired by behavioral psychology.",
        "Neural networks can learn complex functions.",
        "Overfitting is a common problem in ML.",
        "Unsupervised learning uncovers hidden patterns in data.",
        "Feature engineering is critical for model performance.",
        "Cross-validation helps in assessing model generalization.",

        # Topic: Food
        "Bananas are commonly used in smoothies.",
        "Oranges are rich in vitamin C.",
        "Pizza is a popular Italian dish.",
        "Cooking pasta requires boiling water.",
        "Chocolate can be sweet or bitter.",
        "Fresh salads are a healthy and refreshing meal.",
        "Sushi combines rice, fish, and seaweed in a delicate balance.",
        "Spices can transform simple ingredients into gourmet dishes.",

        # Topic: Weather
        "It often rains in the Amazon rainforest.",
        "Summers can be very hot in the desert.",
        "Hurricanes form over warm ocean waters.",
        "Snowstorms can disrupt transportation.",
        "A sunny day can lift people's mood.",
        "Foggy mornings are common in coastal regions.",
        "Winter brings frosty nights and chilly winds.",
        "Thunderstorms can produce lightning and heavy rain."
    ]

    categories = (["NLP"] * 8 + ["ML"] * 8 + ["Food"] * 8 + ["Weather"] * 8)
    return sentences, categories


def get_color_and_shape_maps():
    """
    Return color and marker maps for each category.
    """
    color_map = {
        "NLP": "red",
        "ML": "blue",
        "Food": "green",
        "Weather": "purple"
    }
    shape_map = {
        "NLP": "o",
        "ML": "s",
        "Food": "^",
        "Weather": "X"
    }
    return color_map, shape_map
```
Here's what's going on:
- The first function returns two lists: one with sentences, another labeling each sentence's category.
- The second function creates two dictionaries that tell the plotting function which colors and marker shapes to use per category (e.g., “red circles” for NLP).
Next, we encode the sentences into embeddings and then reduce them to two dimensions using t-SNE:
```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def compute_tsne_embeddings(sentences, model_name="sentence-transformers/all-MiniLM-L6-v2",
                            perplexity=10, n_iter=3000, random_state=42):
    """
    Compute and return t-SNE reduced embeddings for the given sentences.
    """
    # 1. Initialize a SentenceTransformer model that balances speed and performance.
    model = SentenceTransformer(model_name)

    # 2. Convert each sentence into a high-dimensional embedding.
    embeddings = model.encode(sentences)

    # 3. Configure t-SNE with chosen parameters and reduce embeddings to 2D.
    #    Note: recent scikit-learn versions (>= 1.5) rename `n_iter` to `max_iter`.
    tsne = TSNE(n_components=2, random_state=random_state,
                perplexity=perplexity, n_iter=n_iter)

    # 4. Fit t-SNE on the embeddings and return a 2D representation.
    return tsne.fit_transform(embeddings)
```
Let's break down the process:
- First, we instantiate a `SentenceTransformer` model, which downloads the neural network weights needed to generate embeddings.
- Next, we call `model.encode(sentences)` to convert each sentence into a high-dimensional vector. Each dimension captures some aspect of the sentence's meaning.
- We then create an instance of `TSNE` (from `scikit-learn`) and configure hyperparameters like `perplexity` (which influences how the algorithm balances local and global aspects of the data), `n_iter` (the number of optimization iterations), and `random_state` (for reproducible results).
- Finally, we call `tsne.fit_transform(embeddings)` to reduce these high-dimensional vectors into a 2D representation that keeps similar sentences close together.
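As a quick sanity check on the dimensionality reduction, you can compare the shape of the raw embeddings with the shape of the t-SNE output. This is just a sketch; the exact embedding width depends on the model (`all-MiniLM-L6-v2` produces 384-dimensional vectors).

```python
from sentence_transformers import SentenceTransformer

sentences, _ = get_sentences_and_categories()

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(sentences)
print(embeddings.shape)  # (32, 384): one 384-dimensional vector per sentence

reduced = compute_tsne_embeddings(sentences)
print(reduced.shape)     # (32, 2): one 2D point per sentence
```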
Once we have these 2D points, it's time to plot them so we can see how topics naturally cluster. We'll also annotate each point with a short identifier to make it easy to see what sentence it represents.
```python
def plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map,
                    xlim=(-125, 150), ylim=(-175, 125)):
    """
    Plot the 2D embeddings with labels and a legend.
    """
    # 1. Create a figure to hold the scatter plot.
    plt.figure(figsize=(10, 8))

    # 2. Plot each sentence:
    #    - Use the category to decide color and marker shape.
    #    - Use the first 20 characters as a short text label.
    for i, (sentence, category) in enumerate(zip(sentences, categories)):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color=color_map[category], marker=shape_map[category])
        plt.text(x - 2.5, y - 7.5, sentence[:20] + "...", fontsize=9)

    # 3. Construct a legend by plotting empty points, one for each category.
    for cat, color in color_map.items():
        plt.scatter([], [], color=color, label=cat, marker=shape_map[cat])
    plt.legend(loc="best")

    # 4. Add labels, set boundaries, and save the final plot.
    plt.title("t-SNE Visualization of Sentence Embeddings", fontsize=14)
    plt.xlabel("t-SNE Dimension 1", fontsize=12)
    plt.ylabel("t-SNE Dimension 2", fontsize=12)
    plt.tight_layout()
    plt.xlim(*xlim)
    plt.ylim(*ylim)
    plt.savefig('your_plot_image.png')  # Saving to an image file of your choice.
```
Here's how it works:
- We use `plt.scatter` to place each sentence in the plot.
- The short text label (the first 20 characters of each sentence) helps you identify each point's approximate topic.
- We build a readable legend by drawing empty points, one for each category.
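To put everything together end to end, a minimal driver script could look like the following. This is a sketch that simply wires up the functions defined above; the `main` wrapper itself is just for illustration.

```python
def main():
    # Gather data and styling maps, compute the 2D projection, then plot it.
    sentences, categories = get_sentences_and_categories()
    color_map, shape_map = get_color_and_shape_maps()
    reduced_embeddings = compute_tsne_embeddings(sentences)
    plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map)

if __name__ == "__main__":
    main()
```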
The resulting t-SNE visualization beautifully reveals how sentence embeddings capture semantic similarity. Each topic forms a distinct cluster: NLP (red circles), ML (blue squares), Food (green triangles), and Weather (purple X's). Related sentences naturally group together, while unrelated ones sit farther apart, confirming that embeddings effectively preserve meaning.
Some overlap may occur where topics share common words or contexts, such as a sentence about "GPT models" being pretty close to ML-related points. Also notice how the ML and NLP clusters are generally closer than, say, ML and Weather, or Weather and Food. This highlights how embeddings sometimes capture subtle, unexpected connections between concepts. Overall, this plot provides an intuitive way to explore text data, offering a glimpse into the underlying structure that makes modern NLP so powerful.
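If you want to quantify these impressions rather than eyeball them, you can compare cosine similarities of the original (pre-t-SNE) embeddings directly. Here's a rough sketch using `util.cos_sim` from `sentence-transformers`; the exact numbers will vary, but the NLP–ML pair would typically score higher than the NLP–Weather pair.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

nlp_sentence = "GPT models have billions of parameters."
ml_sentence = "Neural networks can learn complex functions."
weather_sentence = "Snowstorms can disrupt transportation."

# Encode all three sentences, then compare the NLP sentence to the other two.
emb = model.encode([nlp_sentence, ml_sentence, weather_sentence])
print("NLP vs. ML:     ", util.cos_sim(emb[0], emb[1]).item())
print("NLP vs. Weather:", util.cos_sim(emb[0], emb[2]).item())
```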
You've now seen how to represent text data with embeddings, reduce those embeddings to reveal a simpler underlying structure, and visualize them to uncover meaningful relationships.
Equipped with this knowledge, you can now create plots where related sentences cluster closely, confirming that embeddings capture meaningful relationships. This visualization is often quite revealing when debugging or exploring text data. In the next practice session, you'll get the chance to experiment with the code and see how your changes affect the final plot.
Give it a try, and have fun discovering the hidden patterns in your text!
