
How does tSNE work?

  • Writer: Gamze Bulut
  • Mar 17
  • 2 min read

A Deep Dive into Stochastic Neighbor Embedding


Introduction


In high-dimensional data visualization, t-Distributed Stochastic Neighbor Embedding (t-SNE) has become one of the most popular techniques. It allows us to map complex datasets into two or three dimensions while preserving local structures. But how does it work? In this post, we will break down t-SNE step by step, drawing insights from its mathematical foundations and practical applications.


1. What is t-SNE Trying to Solve?


When we work with high-dimensional data (e.g., images, gene expression data, or word embeddings), understanding its structure becomes difficult. We need a way to:

✅ Reduce the data to 2D or 3D for visualization.

✅ Preserve clusters and local relationships.

✅ Capture non-linear structure that simple linear projections (like PCA) can miss.


Unlike PCA, which is a linear method, t-SNE is non-linear and focuses on preserving local neighborhoods. This makes it ideal for detecting clusters and substructures in data.
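To see the difference in practice, here is a minimal sketch comparing the two methods with scikit-learn. The digits dataset is just a stand-in for any high-dimensional data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional image vectors

# PCA: a linear projection onto the two directions of highest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: a non-linear embedding that preserves local neighborhoods.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
```

Plotting X_pca and X_tsne side by side typically shows t-SNE separating the digit classes into much tighter, better-isolated clusters.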


2. What Does "Stochastic" Mean in t-SNE?


"Stochastic" means random but with structure. In t-SNE:

  • We start by randomly positioning the data points in low dimensions.

  • We then iteratively adjust their positions, using probabilities to match their high-dimensional relationships.

  • This randomness helps avoid bad solutions and ensures better local clustering.


Because of this, t-SNE’s output can vary slightly across runs—unlike PCA, which always gives the same result:

[Figure: t-SNE with seed 42]
[Figure: t-SNE with seed 100]
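The variation between these runs is easy to reproduce. Below is a minimal sketch, again assuming scikit-learn and the digits dataset as a stand-in for your own data: the same data is embedded twice, changing only the random seed.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# Embed the same data twice with different seeds; cluster shapes tend
# to persist, but rotation and placement of clusters can differ.
emb_42 = TSNE(n_components=2, random_state=42).fit_transform(X)
emb_100 = TSNE(n_components=2, random_state=100).fit_transform(X)
# Scatter-plot emb_42 and emb_100 side by side to compare the layouts.
```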

3. The Key Idea Behind t-SNE


At its core, t-SNE works by modeling similarities between points in both high-dimensional and low-dimensional spaces:


Step 1: Compute Pairwise Similarities in High-Dimensional Space

  • Each data point gets a probability score for how similar it is to other points.

  • This is done using a Gaussian distribution centered at each point.

  • Similar points get high probability, while distant points get low probability (see the sketch below).
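To make this concrete, here is a minimal NumPy sketch of Step 1. The function name high_dim_similarities is my own, and sigma is fixed for simplicity; the real algorithm tunes a separate sigma for each point so that its neighborhood matches a target perplexity.

```python
import numpy as np

def high_dim_similarities(X, sigma=1.0):
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel centered at each point: close points score high.
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)           # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)  # each row becomes p(j|i)
    return P
```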


Step 2: Compute Pairwise Similarities in Low-Dimensional Space

  • Instead of a Gaussian, we use a Student t-distribution (its heavier tails give moderately distant points more room, which avoids overcrowding in the center of the map).

  • The goal is to make the low-dimensional probabilities match the high-dimensional ones as closely as possible (see the sketch below).
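Here is the matching sketch for Step 2, again with a hypothetical function name. t-SNE uses a t-distribution with one degree of freedom, so the kernel is simply 1 / (1 + distance²):

```python
import numpy as np

def low_dim_similarities(Y):
    # Squared Euclidean distances in the low-dimensional map.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    # Heavy-tailed Student-t kernel (1 degree of freedom).
    Q = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()  # normalize over all pairs into a joint distribution
    return Q
```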


Step 3: Minimize the Difference (KL Divergence)

  • The "error" function t-SNE minimizes is called Kullback-Leibler (KL) Divergence.

  • This function tells us how different two probability distributions are.

  • Using gradient descent, t-SNE moves points in 2D space until the low-dimensional structure resembles the high-dimensional one.
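Putting the pieces together, here is a minimal sketch of the loss itself. Note that the full algorithm first symmetrizes the conditional p(j|i) values into a joint distribution P before comparing it to Q; that detail is noted in the comments rather than implemented here.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    # KL(P || Q) = sum over pairs of P * log(P / Q).
    # In full t-SNE, P is the symmetrized joint distribution
    # P_ij = (p(j|i) + p(i|j)) / (2n), and Q comes from the t-kernel.
    mask = P > 0                      # 0 * log(0) is treated as 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))
```

Gradient descent then repeatedly nudges each map point in the direction that lowers this divergence.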


4. Why Does t-SNE Sometimes Look Different Every Time?


Since t-SNE is stochastic, each run may produce a slightly different layout. This happens because:

  1. The algorithm starts with a random initialization of points.

  2. The optimization process can get stuck in different local minima.

  3. Different perplexity values (a hyperparameter that roughly sets the effective number of neighbors per point) change how local and global structure are balanced (see the sketch below).
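To see the effect of reason 3 directly, you can re-embed the same data at a few perplexity values. A minimal sketch, once more assuming scikit-learn and the digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# Same data, same seed, three perplexity settings.
embeddings = {
    p: TSNE(n_components=2, perplexity=p, random_state=42).fit_transform(X)
    for p in (5, 30, 50)
}
# Plot each embedding: small perplexity emphasizes tight local clusters,
# larger perplexity pulls in more global structure.
```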


🔹 Solution: Run t-SNE multiple times and look for patterns that are stable across runs. You can also set a seed (e.g., random_state in scikit-learn) for reproducibility.
