We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers: the emergence of
high-norm tokens that lead to noisy attention maps (Darcet et al.). We observe that in multiple models (e.g., CLIP, DINOv2), a sparse
set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to anomalous attention
patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned
register tokens, we build on our findings to devise a
training-free approach that mitigates these artifacts.
By shifting the high-norm activations from our discovered
register neurons into an additional untrained token,
we can mimic the effect of register tokens on a model already trained without registers.
Our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream
tasks, and achieves results comparable to models explicitly trained with register tokens. This suggests that our approach
effectively takes on the role of register tokens at test time, offering a training-free solution for any pre-trained
off-the-shelf model released without them.
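A minimal sketch of the test-time register idea described above, assuming the channel indices of the high-norm "register neurons" have already been identified for a given layer; the function name, tensor shapes, and neuron indices below are hypothetical and the activation-shifting rule (move the peak activation of each flagged channel into one extra untrained token and zero it on patch tokens) is a simplified interpretation, not the paper's exact procedure:

```python
import torch

def shift_register_neurons(hidden_states: torch.Tensor,
                           register_neuron_idx: torch.Tensor) -> torch.Tensor:
    """Move activations of assumed 'register neuron' channels from patch tokens
    into one extra, untrained token appended to the sequence.

    hidden_states: (batch, num_tokens, dim) activations at some layer.
    register_neuron_idx: 1-D tensor of channel indices presumed to carry
        the high-norm outlier activations (model-specific, found by probing).
    """
    batch, _, dim = hidden_states.shape

    # Append one zero-initialised token that acts as a test-time register.
    register_token = hidden_states.new_zeros(batch, 1, dim)
    tokens = torch.cat([hidden_states, register_token], dim=1)

    # For each flagged channel, take the largest activation across patch tokens,
    # store it in the register token, and zero that channel on the patch tokens.
    patch_part = tokens[:, :-1, register_neuron_idx]   # (batch, num_tokens, k)
    peak_vals, _ = patch_part.max(dim=1)                # (batch, k)
    tokens[:, -1, register_neuron_idx] = peak_vals      # absorb into register token
    tokens[:, :-1, register_neuron_idx] = 0.0           # remove from patch tokens
    return tokens

# Example with fake ViT activations: 197 tokens, width 768, two flagged channels.
acts = torch.randn(2, 197, 768)
neuron_idx = torch.tensor([42, 123])  # hypothetical high-norm channels
out = shift_register_neurons(acts, neuron_idx)
print(out.shape)  # torch.Size([2, 198, 768])
```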