Sparse autoencoders and feature directions

Sparse autoencoders (SAEs) are trained on a model's internal activations to reconstruct them from a wider, sparse latent code. The hope is that the learned dictionary elements align with interpretable feature directions rather than polysemantic mush.

Each example activates only a few latent units (sparsity), which makes attribution easier than reading dense superposition directly. Reconstruction error trades off against the strength of the sparsity penalty: too sparse and you lose information; too dense and features stay entangled.
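The trade-off above can be sketched as a two-term loss. This is a minimal illustration, assuming a ReLU encoder, a linear decoder, and an L1 sparsity penalty (common choices, but not the only ones); the names `sae_forward`, `sae_loss`, and `l1_coeff` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into a non-negative latent code, then decode."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps latents non-negative
    x_hat = z @ W_dec + b_dec               # linear reconstruction
    return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff):
    """Reconstruction error plus an L1 penalty on the latent code."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + l1_coeff * sparsity, recon, sparsity

# Toy setup: 8 activation vectors of width 16, latent code 4x wider.
rng = np.random.default_rng(0)
d_model, d_latent, n = 16, 64, 8
x = rng.normal(size=(n, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_enc = np.zeros(d_latent)
b_dec = np.zeros(d_model)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
total, recon, sparsity = sae_loss(x, z, x_hat, l1_coeff=1e-3)
```

Raising `l1_coeff` pushes training toward fewer active latents at the cost of a larger `recon` term; lowering it does the reverse.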

Dictionary learning analogies are apt: activations are approximated as sparse combinations of basis directions in activation space. SAEs are research-grade tools, not a complete safety solution.
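To make the dictionary picture concrete, the sketch below reconstructs a vector from a handful of active dictionary directions. It assumes a unit-norm dictionary with one direction per row; the sizes, indices, and coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 8, 32

# Hypothetical dictionary: each row is one feature direction, scaled to unit norm.
dictionary = rng.normal(size=(d_latent, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# Sparse code: only 3 of 32 latents are active.
z = np.zeros(d_latent)
z[[3, 17, 25]] = [2.0, 0.5, 1.2]

# The approximation is a weighted sum of the active directions only.
x_hat = z @ dictionary
active = np.flatnonzero(z)
x_hat_from_active = sum(z[i] * dictionary[i] for i in active)
```

Because only the active rows contribute, attributing the reconstruction to features reduces to reading off the nonzero entries of `z`.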

SAEs reveal partial structure in activations but do not guarantee robust behavior under adversarial prompts or distribution shift.

SAE training requires a careful choice of layer and token position: features learned on the mid-layer residual stream differ from those learned on post-MLP activations. Researchers judge SAEs by reconstruction quality and by how well features match human concept lists, not by deployment safety guarantees.

Increasing the sparsity penalty too aggressively yields dead latents that never fire; monitor the number of active latents per token when tuning SAEs.
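One way to monitor this is to track the mean number of active latents per token alongside the set of latents that never fire on a batch. The helper below is a hypothetical sketch; `latent_activity_stats` and the `eps` threshold are illustrative names, and the latent codes are a hand-written toy batch.

```python
import numpy as np

def latent_activity_stats(z, eps=0.0):
    """Summarize latent activity for codes z of shape (n_tokens, d_latent)."""
    active = z > eps                      # boolean mask of firing latents
    l0_per_token = active.sum(axis=1)     # active latents for each token
    fire_rate = active.mean(axis=0)       # fraction of tokens each latent fires on
    dead = np.flatnonzero(fire_rate == 0) # latents that never fired in this batch
    return l0_per_token.mean(), fire_rate, dead

# Toy batch: 3 tokens, 4 latents; latents 0 and 2 never fire.
z = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.7, 0.0, 0.0],
])
mean_l0, fire_rate, dead = latent_activity_stats(z)
```

A latent that is silent on one batch is not necessarily dead, so in practice these statistics would be accumulated over many batches before drawing conclusions.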

Open-source SAE releases on mid-layer residuals invite replication: compare sparse feature dashboards across prompts before claiming a feature is universal.
