Dimensionality Reduction · Visualization · Research

Seeing Through the Projection

Why every 2D embedding of your data is a lie — and how to think about which lies are acceptable.

November 10, 2024 · 8 min read + interactive

Suppose you have 1,000 cells described by the expression levels of 20,000 genes. You want to plot them. You pick t-SNE, run it overnight, and produce a scatter plot showing six beautiful clusters. You call them cell types and publish.

The clusters are real. But the distances between them? Largely fictitious. The tight spacing between cluster 3 and cluster 5 tells you almost nothing about whether those cells are biologically similar. t-SNE optimises local neighbourhood structure; global distances are sacrificed as a deliberate trade-off. Most researchers know this in the abstract. Far fewer internalise what it means for their particular analysis.

Every projection is a choice

When you project high-dimensional data to 2D, you're choosing a plane to look at. Everything on that plane is visible; everything orthogonal to it is collapsed to a point. The structure you see — clusters, gradients, outliers — reflects the structure that exists in those two chosen directions, not necessarily the structure in the full space.

This isn't a t-SNE problem specifically. It applies to PCA, UMAP, MDS, and every other method. Each makes different choices about which structural properties to preserve and which to distort. The distortion is not a bug. It's an inevitable consequence of reducing dimensions.

Try the demo below. The data is 50 points in 3D, split evenly between two Gaussian clusters separated along the x-axis. A single slider controls the projection direction. At 0°, you project onto the x-axis and the clusters separate cleanly. At 90°, you project onto the y-axis and the clusters completely overlap. The data didn't change. Only the viewer's perspective did.

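Interactive Demo: Projection Angle

[Interactive demo: a slider sets the projection angle from 0° (projects onto the x-axis) through 90° (projects onto the y-axis) to 180°. A readout reports the separability of Cluster A and Cluster B at the chosen angle.]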

Notice that at intermediate angles — around 40–60° — you get a partial separation: the clusters look like they might overlap but haven't fully merged. If you saw this scatter plot without knowing the ground truth, you might conclude the clusters are poorly defined. That conclusion would be wrong.
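For readers without the interactive version, the experiment can be approximated in a few lines of Python. This is a minimal sketch, not the demo's actual code: the cluster separation of 6.0, the unit variances, the 1D projection onto a direction in the x-y plane, and the separability score (gap between cluster means in pooled standard deviations) are all assumptions chosen for illustration.

```python
import math
import random


def make_clusters(n_per_cluster=25, separation=6.0, seed=0):
    """Two unit-variance Gaussian clusters in 3D, offset along the x-axis."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, centre_x in ((0, -separation / 2), (1, separation / 2)):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(centre_x, 1.0),
                           rng.gauss(0.0, 1.0),
                           rng.gauss(0.0, 1.0)))
            labels.append(label)
    return points, labels


def project(points, angle_deg):
    """Project each 3D point onto the unit vector at angle_deg in the x-y plane."""
    theta = math.radians(angle_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    return [x * ux + y * uy for x, y, _z in points]


def separability(values, labels):
    """Gap between cluster means, in units of the pooled standard deviation.

    Hypothetical score for this sketch; the demo's own metric is not specified.
    """
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return abs(mean_a - mean_b) / math.sqrt((var_a + var_b) / 2)


points, labels = make_clusters()
scores = {angle: separability(project(points, angle), labels)
          for angle in (0, 45, 90)}
for angle, score in scores.items():
    print(f"{angle:>3} deg  separability = {score:.2f}")
```

The same 50 points produce a high score at 0°, a lower one at 45°, and a score near zero at 90°. Nothing about the data changes between the three lines of output.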

The picture you see is not the data. It is the data, as seen from a particular direction, through a particular loss function.

What distortion actually looks like

In linear projections (PCA, random projections), the distortion is at least mathematically tractable. You can compute how much variance is retained, examine which eigenvectors span the projected plane, and reason about what's been lost.
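To make "how much variance is retained" concrete, here is a small sketch for any linear projection onto orthonormal directions. The function name and the toy data are hypothetical. PCA would pick the directions that maximise this fraction; the sketch simply scores directions you hand it, and it assumes they are orthonormal.

```python
import random


def variance_retained(points, directions):
    """Fraction of total variance kept by projecting onto orthonormal directions."""
    dim = len(points[0])
    means = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    centred = [[p[i] - means[i] for i in range(dim)] for p in points]
    # Total variance is the sum of squared centred coordinates; the 1/(n-1)
    # factor cancels in the ratio, so it is omitted.
    total = sum(c * c for p in centred for c in p)
    # Variance along each direction is the sum of squared projections.
    kept = sum(
        sum(c * d for c, d in zip(p, direction)) ** 2
        for direction in directions
        for p in centred
    )
    return kept / total


rng = random.Random(1)
# Toy data (assumption): standard deviations 3.0, 1.0 and 0.5 along x, y and z.
data = [(rng.gauss(0, 3.0), rng.gauss(0, 1.0), rng.gauss(0, 0.5))
        for _ in range(200)]

xy_plane = [(1, 0, 0), (0, 1, 0)]
yz_plane = [(0, 1, 0), (0, 0, 1)]
print(f"x-y plane keeps {variance_retained(data, xy_plane):.1%} of the variance")
print(f"y-z plane keeps {variance_retained(data, yz_plane):.1%} of the variance")
```

Both planes are valid 2D views of the same data, but one keeps most of the variance and the other discards most of it.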

Non-linear methods are trickier. t-SNE and UMAP balance attractive forces between neighbours against repulsive forces that push other points apart. As a result, inter-cluster distances are largely arbitrary: two clusters that appear close in the embedding might be far apart in the original space, or vice versa. UMAP is often reported to preserve more global structure than t-SNE, but any improvement is relative, not absolute.

A useful way to measure distortion is a Shepard diagram: plot pairwise distances in the original space against the corresponding pairwise distances in the embedding. A distance-preserving projection would put every pair on the diagonal (or, if the embedding is uniformly rescaled, on a straight line through the origin). In practice, you'll see a fan-shaped scatter with significant divergence, especially at large distances. The fan tells you which distance regime the method sacrifices.
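The data behind a Shepard diagram is straightforward to compute. The sketch below is illustrative only: the random 5D data and the "embedding" (simply keeping the first two coordinates) are stand-ins for a real dataset and a real method, and the Pearson correlation is one possible one-number summary of the scatter, not a standard prescribed here.

```python
import math
import random
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(2)
original = [tuple(rng.gauss(0, 1) for _ in range(5)) for _ in range(60)]
embedding = [p[:2] for p in original]  # stand-in embedding: drop three axes

d_orig = pairwise_distances(original)
d_emb = pairwise_distances(embedding)
shepard = list(zip(d_orig, d_emb))  # scatter these pairs to draw the diagram

print(f"{len(shepard)} pairs, correlation = {pearson(d_orig, d_emb):.2f}")
```

Passing the `shepard` pairs to any plotting library gives the diagram itself. For an orthogonal projection like this stand-in, every pair falls on or below the diagonal, because dropping coordinates can only shrink a distance.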

The indicatrix idea

Cartographers faced this problem centuries ago. Any flat map of the Earth distorts shapes, areas, distances, or directions, and usually some combination of them. In 1859, Nicolas Auguste Tissot proposed a diagnostic: draw a tiny circle at every point on the globe, then project it onto the map. The resulting ellipses, known as the Tissot indicatrix, reveal local distortion. Circles that stay circular and keep their size indicate low distortion; elongated ellipses indicate angular distortion, and changes in area indicate areal distortion.

The same idea applies to dimensionality reduction. Instead of circles on a sphere, you consider small balls in high-dimensional space and ask what shape they take in the embedding. Regions where they inflate, deflate, or shear are regions of high distortion. This is the core idea behind Hypertrix — a tool for computing an analogous indicatrix for arbitrary high-dimensional projections.
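A very rough flavour of this can be had without any special tooling. The sketch below is not Hypertrix's algorithm, whose internals are not described here. It is a crude, hypothetical proxy that captures only inflation and deflation, not shear. For each point it compares the average distance to its nearest original-space neighbours before and after embedding. The toy data uses a hand-made distortion so the result is easy to check.

```python
import math
import random


def local_scale_ratios(original, embedding, k=5):
    """Per-point inflation factor of the embedding.

    For each point: mean embedded distance to its k nearest original-space
    neighbours, divided by the mean original distance to the same neighbours.
    Assumes no duplicate points (otherwise the denominator can be zero).
    """
    ratios = []
    for i, p in enumerate(original):
        by_distance = sorted(
            (math.dist(p, q), j) for j, q in enumerate(original) if j != i
        )
        neighbours = by_distance[:k]
        mean_orig = sum(d for d, _ in neighbours) / k
        mean_emb = sum(
            math.dist(embedding[i], embedding[j]) for _, j in neighbours
        ) / k
        ratios.append(mean_emb / mean_orig)
    return ratios


rng = random.Random(3)
cluster_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
cluster_b = [(rng.gauss(100, 1), rng.gauss(0, 1)) for _ in range(20)]
original = cluster_a + cluster_b
# Hand-made distortion: stretch cluster A by a factor of 3, leave B unchanged.
embedding = [(3 * x, 3 * y) for x, y in cluster_a] + cluster_b

ratios = local_scale_ratios(original, embedding)
print(f"cluster A mean ratio: {sum(ratios[:20]) / 20:.2f}")
print(f"cluster B mean ratio: {sum(ratios[20:]) / 20:.2f}")
```

Colouring each embedded point by its ratio gives a first, coarse map of where the embedding inflates or deflates local neighbourhoods. A proper indicatrix would also recover the direction and amount of shear.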

Practical implications

None of this means you should stop using t-SNE or UMAP. They're genuinely useful for hypothesis generation and for communicating rough cluster structure to audiences. The problem is not the tools — it's the lack of explicit distortion reporting alongside them.

When you publish a scatter plot, consider also reporting: (a) what method you used and its key hyperparameters, (b) a distortion metric showing how faithfully local and global structure is preserved, and (c) what specific claims are and aren't supported by the embedding. "There are three clusters" is probably fine. "Cluster A is twice as far from cluster B as from cluster C" almost certainly is not.
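For item (b), one simple local metric is the share of each point's nearest neighbours that survive the embedding. The sketch below is one possible choice among many; the function names, the value k=5, and the toy data are assumptions for illustration. A score of 1.0 means every neighbourhood is intact, and a score near zero means neighbourhoods are essentially scrambled.

```python
import math
import random


def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding itself."""
    ranked = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return {j for _, j in ranked[:k]}


def knn_preservation(original, embedding, k=5):
    """Mean fraction of each point's k nearest neighbours shared by both spaces."""
    n = len(original)
    shared = sum(
        len(knn_indices(original, i, k) & knn_indices(embedding, i, k))
        for i in range(n)
    )
    return shared / (n * k)


rng = random.Random(4)
original = [tuple(rng.gauss(0, 1) for _ in range(5)) for _ in range(60)]
projected = [p[:2] for p in original]  # stand-in embedding: drop three axes
shuffled = projected[:]
rng.shuffle(shuffled)                  # baseline: neighbours assigned at random

print(f"projection: {knn_preservation(original, projected):.2f}")
print(f"shuffled:   {knn_preservation(original, shuffled):.2f}")
```

Reporting a number like this next to the scatter plot, together with a global measure such as a Shepard diagram, tells readers which kinds of claims the picture can carry.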

The goal isn't to make visualisation harder. It's to make the implicit choices explicit — to see through the projection rather than accepting it as a transparent window onto the data.