Transformers without Normalization

Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu (Project Lead)
CVPR 2025
Dynamic Tanh (DyT) as a replacement for normalization in Transformers
Left: original Transformer block. Right: block with our proposed Dynamic Tanh (DyT) layer.
DyT is a straightforward replacement for commonly used Layer Norm or RMSNorm layers.
Transformers with DyT match or exceed the performance of their normalized counterparts.

Abstract

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $$\mathrm{DyT}(\boldsymbol{x}) = \tanh(\alpha \boldsymbol{x}),$$ as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.

Implementation

The DyT module can be implemented in a few lines of PyTorch code:

import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, num_features, alpha_init_value=0.5):
        super().__init__()
        # Learnable scalar controlling the steepness of the tanh.
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        # Per-channel affine scale and shift, as in LayerNorm.
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        x = torch.tanh(self.alpha * x)
        return x * self.weight + self.bias
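
Used as a drop-in replacement, DyT simply takes the place of the LayerNorm or RMSNorm layers inside a Transformer block. Below is a minimal sketch of a pre-norm block with both normalization layers swapped for DyT; the block structure, module names, and hyperparameters here are illustrative rather than taken from the paper's codebase, and it reuses the DyT class and imports above.

class TransformerBlockWithDyT(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = DyT(dim)  # was: nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = DyT(dim)  # was: nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        # Pre-norm residual connections, with DyT in place of LayerNorm.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x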

Key Findings

Layer Normalization Behaves Like a Scaled Tanh Function

Our analysis shows that layer normalization (LN) in Transformers generates input-output mappings that closely resemble scaled tanh functions. In the earlier layers, these mappings are mostly linear. However, in deeper layers, they take on distinct S-shaped curves characteristic of tanh functions.

Output vs. input of selected LN layers in Vision Transformer (ViT), wav2vec 2.0 (a Transformer model for speech), and Diffusion Transformer (DiT). We plot the input/output values of four LN layers in each model. The S-shaped curves closely resemble the curve of a tanh function.
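
These mappings can be inspected directly by recording the inputs and outputs of LN layers in a pretrained Transformer with forward hooks. Below is a minimal sketch of such a probe; it assumes a torchvision ViT and a random input tensor rather than the paper's exact models and data, and is meant only to illustrate the procedure.

import torch
import matplotlib.pyplot as plt
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a pretrained ViT (any Transformer with LayerNorm works the same way).
model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT).eval()

# Record the (input, output) tensors of every LayerNorm module via forward hooks.
records = {}

def make_hook(name):
    def hook(module, args, output):
        records[name] = (args[0].detach().cpu(), output.detach().cpu())
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.LayerNorm):
        module.register_forward_hook(make_hook(name))

# The paper plots activations on real samples; a random tensor is used here
# only to keep the sketch self-contained.
with torch.no_grad():
    model(torch.randn(8, 3, 224, 224))

# Element-wise scatter of output vs. input for one of the deeper LN layers.
name = list(records)[-1]
x, y = records[name]
plt.scatter(x.flatten().numpy(), y.flatten().numpy(), s=1)
plt.xlabel(f"{name} input")
plt.ylabel(f"{name} output")
plt.show()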

Evaluation

We present a comprehensive evaluation of DyT across a diverse range of architectures and tasks, highlighting its effectiveness and generalizability. Our experiments cover supervised learning in vision (ViT and ConvNeXt), self-supervised learning in vision (MAE and DINO), diffusion models (DiT), large language models (LLaMA), self-supervised learning in speech (wav2vec 2.0), and DNA sequence modeling (HyenaDNA and Caduceus). In every case, Transformers with DyT achieve similar or better performance than their normalized counterparts. For detailed results and comparisons, please refer to our paper.

Resources

Paper

Download our paper for all the details about our research.

Code

Check out our repository for implementation details.

Summary

Read a concise summary of our research findings on X.


BibTeX

@inproceedings{Zhu2025DyT,
  title={Transformers without Normalization},
  author={Zhu, Jiachen and Chen, Xinlei and He, Kaiming and LeCun, Yann and Liu, Zhuang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}

Correspondence

jiachen [dot] zhu [at] nyu [dot] edu

zhuangl [at] princeton [dot] edu