Introduction

Embark on a journey into non-linear dimensionality reduction, with a specific focus on t-Distributed Stochastic Neighbor Embedding (t-SNE). Our goal is to understand the theory behind t-SNE and apply it using R's Rtsne package. This journey will take us through an understanding of the difference between linear and non-linear dimensionality reduction, a grasp of the core concepts of t-SNE, an implementation of t-SNE using the Rtsne package, and a discussion of potential pitfalls of t-SNE.

Linear vs. Non-Linear Dimensionality Reduction

Dimensionality reduction is a pragmatic exercise which seeks to condense the number of random variables under consideration, thus obtaining a set of principal variables. Understanding whether the structure in our data is linear or non-linear helps us select the technique that best suits our needs.

Imagine having a dataset that contains a person's height in inches and centimeters. These two measurements convey the same information, so one can be removed. This is an example of linear dimensionality reduction. Linear techniques like PCA capture structure through linear combinations of the original variables; non-linear techniques like t-SNE adopt a different approach, capturing complex relationships by preserving local distances and separations between points, irrespective of the dimension of the space.
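For contrast with the non-linear approach we'll explore, here is a minimal sketch of linear dimensionality reduction using PCA via base R's prcomp on the Iris measurements (the variable names here are illustrative, not from a fixed API):

```r
# Linear dimensionality reduction with PCA (base R).
# PCA projects the data onto orthogonal linear combinations
# of the original variables, so it captures only linear structure.
features <- as.matrix(iris[, 1:4])           # the four numeric measurements
pca <- prcomp(features, center = TRUE, scale. = TRUE)
pca_2d <- pca$x[, 1:2]                       # keep the first two principal components
print(dim(pca_2d))                           # 150 rows, 2 columns
```

Each row of `pca_2d` is the 2-D linear projection of one flower; t-SNE will instead build a non-linear map of the same data.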

Understanding t-SNE: High-dimensional Space Calculations

t-SNE aims to keep similar data points close and dissimilar ones far apart in a lower-dimensional space. It achieves this by minimizing a cost function over the locations of the points in the lower-dimensional space.

In the high-dimensional space, t-SNE converts pairwise distances into conditional probabilities using a Gaussian kernel; the bandwidth σᵢ is chosen per point so that the resulting distribution matches a user-specified perplexity. The conditional probability is mathematically defined as:

p_{j|i} = \frac{e^{-\|x_{i}-x_{j}\|^{2}/2\sigma_{i}^{2}}}{\sum_{k \neq i} e^{-\|x_{i}-x_{k}\|^{2}/2\sigma_{i}^{2}}}

Understanding t-SNE: Low-dimensional Space Calculations

In the lower-dimensional map, t-SNE employs a Student's t-distribution with one degree of freedom. Its heavier tails allow moderately dissimilar points to be placed farther apart, which helps alleviate the crowding problem. The joint probabilities in the low-dimensional space are defined as:

q_{ij} = \frac{(1+\|y_{i}-y_{j}\|^{2})^{-1}}{\sum_{k \neq l}(1+\|y_{k}-y_{l}\|^{2})^{-1}}
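With both sets of probabilities defined, the cost function that t-SNE minimizes (by gradient descent over the low-dimensional points) is the Kullback-Leibler divergence between the high- and low-dimensional distributions, where the conditional probabilities are first symmetrized as p_{ij} = (p_{j|i} + p_{i|j})/2n:

C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

Because the KL divergence penalizes using a small q_{ij} to model a large p_{ij} much more heavily than the reverse, the optimization focuses on keeping nearby points nearby, which is why t-SNE preserves local structure so well.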

Implementing t-SNE: R Implementation

Now, let's see how to implement t-SNE in R using the Rtsne package, a popular tool for t-SNE in R. We'll use the classic Iris dataset, which contains measurements of iris flowers from three different species. We'll build a t-SNE model using Rtsne, apply it to our data, and visualize both the original and reduced data using ggplot2 with facets for a clear side-by-side comparison.

R Sample code for t-SNE and Analysis

Let's walk through the process step by step, starting with a look at the original dataset, then applying t-SNE, and finally visualizing and comparing the results. We'll also include print statements to help you understand how the data transforms at each stage.

In this code, we:

  • Load the Iris dataset and display the first few rows to remind you of its structure.
  • Run t-SNE on the four numeric feature columns to reduce them to two dimensions.
  • Print the first rows of the resulting embedding to show how the data has transformed.
  • Visualize the original and reduced data together with ggplot2, using facets for a side-by-side comparison.
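The steps above can be sketched as follows (variable names are illustrative; note that iris contains one duplicate row, which Rtsne rejects by default, so we de-duplicate first):

```r
library(Rtsne)    # t-SNE implementation for R
library(ggplot2)  # visualization

# Look at the original dataset
print(head(iris))

# Rtsne errors on duplicate rows, and iris contains one duplicate pair
iris_unique <- unique(iris)
features <- as.matrix(iris_unique[, 1:4])

# Fix the random seed so the embedding is reproducible
set.seed(42)
tsne_out <- Rtsne(features, dims = 2, perplexity = 30)

# The embedding is returned in the Y component: one 2-D point per row
embedding <- data.frame(
  Dim1 = tsne_out$Y[, 1],
  Dim2 = tsne_out$Y[, 2],
  Species = iris_unique$Species,
  view = "t-SNE (2-D embedding)"
)
print(head(embedding))

# Side-by-side comparison: two original measurements vs. the t-SNE map
original <- data.frame(
  Dim1 = iris_unique$Sepal.Length,
  Dim2 = iris_unique$Sepal.Width,
  Species = iris_unique$Species,
  view = "Original (Sepal.Length vs. Sepal.Width)"
)
combined <- rbind(original, embedding)

ggplot(combined, aes(x = Dim1, y = Dim2, color = Species)) +
  geom_point() +
  facet_wrap(~ view, scales = "free")
```

In the resulting plot, the t-SNE facet typically shows the three species as well-separated clusters, whereas the raw sepal measurements leave versicolor and virginica overlapping.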
Pitfalls when Using t-SNE

Though modern and effective, t-SNE comes with its share of pitfalls. Firstly, the global structure of a t-SNE map can be misleading: the algorithm prioritizes preserving local neighborhoods, so distances between well-separated clusters in the map are not meaningful. Secondly, reproducibility presents a challenge, because the random initialization can lead to varied results across different t-SNE runs unless the random seed is fixed. Finally, t-SNE is sensitive to hyperparameters such as perplexity and learning rate, whose tuning will be covered in later lessons.
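The reproducibility pitfall in particular has a simple remedy: fix R's random seed before each call. A small sketch, again using the de-duplicated Iris features:

```r
library(Rtsne)

features <- as.matrix(unique(iris)[, 1:4])

# Two runs with the same seed produce the same embedding...
set.seed(7)
run1 <- Rtsne(features, perplexity = 30)$Y
set.seed(7)
run2 <- Rtsne(features, perplexity = 30)$Y
print(identical(run1, run2))  # TRUE

# ...while a run without resetting the seed will generally differ
run3 <- Rtsne(features, perplexity = 30)$Y
```

This doesn't make t-SNE's output "correct" in any absolute sense; it only ensures that the same code yields the same map each time it is run.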

Lesson Summary and Practice

Great job! We've distinguished between linear and non-linear dimensionality reduction, explored the theory behind t-SNE, implemented it with R's Rtsne package, and discussed potential pitfalls that might arise. In future lessons, we will focus on visualizing t-SNE results, delving into t-SNE's parameter tuning, and exploring its application with real-world examples. Let's continue to deepen your understanding in the next stage of this educational journey!
