Vision Transformers Don't Need Trained Registers.

UC Berkeley

*Equal contribution

We discover a sparse set of neurons, present across various models, that is responsible for high-norm tokens and noisy attention maps, harming downstream visual processing. At test time, we can either shift the outlier activations to arbitrary positions ("Shifted") or outside of the image ("w/ Test-time Register") to mimic register tokens (Darcet et al.) without any retraining.

Abstract

We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers: the emergence of high-norm tokens that lead to noisy attention maps (Darcet et al.). We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to anomalous attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. Our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream tasks, and achieves results comparable to models explicitly trained with register tokens. This suggests that our approach effectively takes on the role of register tokens at test time, offering a training-free solution for any pre-trained, off-the-shelf model released without them.

What Causes High Norm Tokens?

Neuron Analysis
We hypothesize that a small, consistent set of sparsely activating neurons controls where outliers emerge. Our main analysis is applied to OpenCLIP ViT-B/16 (see the paper for other models). Above, we track the maximum norm of image patches after the attention blocks and MLPs across 1,000 ImageNet images and find that outlier patches in the residual stream appear after the MLP of layer 6 in OpenCLIP. Subsequently, the max attention score increases dramatically, suggesting that the MLP increases the norms of certain patches and leads to anomalous attention patterns. We find that an extremely small set of neurons with corresponding outlier activations appears in the preceding layers, before outlier patches form. For instance, we find 10 out of ~37,000 neurons in OpenCLIP ViT-B/16 with this property. Given their importance for the formation of outliers, we call them "register neurons." We present a subset of activation patterns arising from register neurons in the bottom row.
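For illustration, this kind of norm tracking can be reproduced with a few PyTorch forward hooks. The following is a minimal sketch, assuming a timm-style ViT whose transformer blocks live in model.blocks and return the full residual stream; the module layout and names are assumptions, not the authors' exact instrumentation.

import torch

@torch.no_grad()
def max_patch_norms_per_block(model, images):
    # Track the max image-patch norm in the residual stream after each
    # transformer block; outlier patches show up as a sharp jump in norm.
    max_norms = {}
    hooks = []

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            # output: (batch, tokens, dim); drop the CLS token at index 0
            patch_norms = output[:, 1:, :].norm(dim=-1)
            max_norms[layer_idx] = max(max_norms.get(layer_idx, 0.0),
                                       patch_norms.max().item())
        return hook

    for i, block in enumerate(model.blocks):
        hooks.append(block.register_forward_hook(make_hook(i)))

    model(images)  # e.g., a batch of ImageNet images

    for h in hooks:
        h.remove()
    return max_norms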

Finding Register Neurons

To identify register neurons, we propose an algorithm that selects neurons with consistently high activations at outlier positions across multiple images. We define a top_layer parameter that restricts the search up to the layer where outliers appear. The outlier positions are defined as indices where the patch norm in the residual stream after top_layer is above a predefined threshold. We then compute the activations from all neurons in the model up to and including top_layer at these positions, averaged over a small set of images. The neurons with the highest average activations at these positions across the dataset are selected as register neurons.
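A minimal sketch of this selection procedure is shown below. It assumes the MLP hidden activations and the post-top_layer patch norms have already been collected (e.g., with forward hooks); the tensor shapes, norm threshold, and choice of k are illustrative, not the paper's exact settings.

import torch

def find_register_neurons(acts, patch_norms, top_layer, norm_threshold=100.0, k=10):
    # acts: dict mapping layer index -> tensor of MLP hidden activations,
    #       shape (num_images, num_patches, hidden_dim), pre-collected with hooks.
    # patch_norms: tensor (num_images, num_patches) of residual-stream patch
    #       norms after top_layer.
    # Returns the top-k (layer, neuron) pairs ranked by average activation
    # at outlier positions across the image set.
    outlier_mask = patch_norms > norm_threshold        # (num_images, num_patches)
    scores = []
    for layer in range(top_layer + 1):
        a = acts[layer]                                 # (num_images, num_patches, hidden_dim)
        mean_at_outliers = a[outlier_mask].mean(dim=0)  # (hidden_dim,)
        for neuron, score in enumerate(mean_at_outliers.tolist()):
            scores.append((score, layer, neuron))
    scores.sort(reverse=True)
    return [(layer, neuron) for _, layer, neuron in scores[:k]]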

Creating Registers at Test-Time

We mimic the effect of learned register tokens (Darcet et al.) without training by shifting the activations arising from register neurons to a dummy token at test time. Specifically, we initialize the test-time register to a vector of zeros, which is fed through the model along with the rest of the tokens. During the forward pass, we intervene on the register neurons by shifting the max activation to the test-time register and zeroing out the register neurons' activations for the image tokens.
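This intervention can be sketched as a forward hook on the MLP hidden activations of each layer that contains register neurons. The sketch below assumes a timm-style ViT where block.mlp.act emits activations of shape (batch, tokens, hidden_dim) and that the zero-initialized register token has already been appended as the last token of the sequence; these details are illustrative assumptions rather than the authors' exact implementation.

import torch

def add_test_time_register_hooks(model, register_neurons):
    # register_neurons: list of (layer, neuron) pairs found previously.
    # For each register neuron, move its peak activation onto the extra
    # register token (last position) and zero it out on the image tokens.
    neurons_by_layer = {}
    for layer, neuron in register_neurons:
        neurons_by_layer.setdefault(layer, []).append(neuron)

    hooks = []
    for layer, neurons in neurons_by_layer.items():
        def hook(module, inputs, output, neurons=neurons):
            out = output.clone()
            for n in neurons:
                out[:, -1, n] = output[:, :-1, n].max(dim=1).values
                out[:, :-1, n] = 0.0
            return out  # returning a tensor replaces the module's output
        hooks.append(model.blocks[layer].mlp.act.register_forward_hook(hook))
    return hooks  # call h.remove() on each hook to undo the intervention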
Below, we present attention maps averaged over all heads in each layer for DINOv2 and OpenCLIP. For DINOv2, which has a variant trained from scratch with registers, we also compare against trained registers. Qualitatively, test-time registers yield cleaner attention maps, resembling those produced by models trained explicitly with registers. Quantitative results are provided in the paper, where we show comparable performance to trained registers.

[Figure: input images and per-layer attention-map grids for OpenCLIP ViT-B/16, DINOv2 ViT-L/14, CLIP, and EVA-CLIP.]

Adding Test-Time Registers to Vision-Language Models

We add a test-time register to the CLIP ViT-L/14 vision encoder of LLaVA-Llama-3-8B, improving its interpretability while maintaining benchmark performance. Below, we visualize the patch norm map of the last layer of the vision encoder before projection into the language model's input space. Next, we visualize the average attention across all layers and heads of the language model from the response token to the visual tokens. Without a test-time register, the outlier visual tokens lead to artifacts in the language model's attention. Adding a test-time register removes the outliers, as seen in the patch norm map, resulting in more interpretable cross-modal attention. This provides clearer insights into the model's behavior (e.g., attention is placed on the parking spot "32," which looks like "28" upside down).
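The cross-modal attention visualization can be sketched as follows, assuming the language model is run with output_attentions=True (HuggingFace-style per-layer tensors of shape (batch, heads, seq, seq)) and that the index range of the projected visual tokens in the input sequence is known; the 24x24 patch grid corresponds to a 336px ViT-L/14 encoder and is an illustrative default, not a detail taken from the paper.

import torch

def response_to_image_attention(attentions, image_token_range, response_idx=-1,
                                grid_size=24):
    # attentions: tuple of per-layer tensors (batch, heads, seq, seq).
    # image_token_range: (start, end) indices of the visual tokens in the
    #     language model's input sequence (assumed known).
    start, end = image_token_range
    per_layer = [
        attn[:, :, response_idx, start:end].mean(dim=1)   # average over heads
        for attn in attentions
    ]
    attn_map = torch.stack(per_layer, dim=0).mean(dim=0)  # average over layers
    return attn_map.reshape(-1, grid_size, grid_size)     # one map per batch item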

Citation


@misc{jiang2025visiontransformersdontneed,
  title={Vision Transformers Don't Need Trained Registers},
  author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
  year={2025},
  eprint={2506.08010},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.08010},
}