This repository contains the JAX, TensorFlow, and PyTorch implementations of the Lion optimizer discovered by symbolic program search in the paper Symbolic Discovery of Optimization Algorithms. The symbolic program search space (codenamed "Hero") is open sourced here.
Lion is available in multiple codebases, including Praxis, Optax, Keras, Timm, and T5X, as well as a popular PyTorch implementation by lucidrains.
- Simple, memory efficient, fast runtime
- Superior performance on various architectures, tasks, and domains
- Instructions for hyperparameter tuning and batch size choices
- Citation
Compared to AdamW and other adaptive optimizers that need to store both the first and second moments, Lion only keeps track of the momentum, halving the additional memory footprint. This is beneficial when training large models and/or using a large batch size. As an example, AdamW needs at least 16 TPU v4 chips to train a ViT-B/16 with image size 224 and batch size 4,096, while Lion only needs eight. Another practical benefit is that Lion has a faster runtime (steps/sec) in our experiments thanks to its simplicity, usually a 2-15% speedup over AdamW and Adafactor depending on the task, codebase, and hardware.
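For reference, the core update is compact enough to state in a few lines. Below is a minimal, illustrative PyTorch-style sketch of one Lion step for a single tensor (a sketch for exposition, not the optimized implementation shipped in this repo); note that only a single momentum buffer is kept, versus two moment estimates in AdamW.

```python
import torch

def lion_update(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """Sketch of one Lion step for a single tensor (illustrative only)."""
    # interpolate between the momentum and the current gradient, then take the sign
    update = torch.sign(beta1 * m + (1 - beta1) * grad)
    # decoupled weight decay (as in AdamW), followed by the sign update
    param.mul_(1 - lr * wd).add_(update, alpha=-lr)
    # maintain the single momentum buffer
    m.mul_(beta2).add_(grad, alpha=1 - beta2)
    return param, m
```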
- Lion outperforms AdamW on various architectures trained from scratch on ImageNet or pre-trained on ImageNet-21K.
- Lion saves up to 5x the pre-training cost on JFT-300M.
- After fine-tuning with higher resolution and Polyak averaging, our ViT-L/16 matches the previous ViT-H/14 results trained with AdamW while being 2x smaller.
- On LiT, Lion beats AdamW on several zero-shot image classification and image-text retrieval benchmarks.
- On BASIC-L, Lion achieves 88.3% zero-shot and 91.1% fine-tuning ImageNet accuracy, surpassing the previous SOTA results by 2% and 0.1%, respectively.
- On diffusion models, Lion exceeds AdamW in terms of the FID score and saves up to 2.3x the training compute. Left to right: 64x64, 128x128, 256x256 image generation trained on ImageNet.
- Lion saves up to 2x compute to reach the same validation perplexity on language modeling (left: Wiki-40B, right: PG-19), achieving larger gains on larger Transformers.
- Lion achieves better average in-context learning ability when training LMs compared to Adafactor.
- Lion outperforms AdamW when fine-tuning T5 on GLUE.
Lion is simple and has fewer hyperparameters than AdamW and Adafactor, as it does not require $\epsilon$ or the factorization-related ones. To ensure a fair comparison, we tune the peak learning rate $lr$ and the decoupled weight decay $\lambda$ for both AdamW (Adafactor) and our Lion using a logarithmic scale. The default values for $\beta_1$ and $\beta_2$ in AdamW are set as 0.9 and 0.999, respectively, with an $\epsilon$ of 1e-8, while in Lion the default values for $\beta_1$ and $\beta_2$ are discovered through the program search process and set as 0.9 and 0.99, respectively. We only tune those hyperparameters in Section 4.4 of the paper, where $\beta_1=0.9$, $\beta_2=0.99$ in AdamW, and $\beta_1=0.95$, $\beta_2=0.98$ in Lion. In our experience, reducing $\beta_2$ results in shorter memorization of historical information and enhanced training stability. Additionally, the $\epsilon$ in AdamW is set as 1e-6 instead of the default 1e-8, as it improves stability in our experiments, similar to the observations in RoBERTa.
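As a concrete illustration of these defaults, a hypothetical PyTorch setup might look as follows. The `Lion` constructor in the commented line is an assumption modeled on a PyTorch-style implementation such as the one in this repo; argument names may differ across codebases.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder model for illustration

# AdamW baseline: default betas=(0.9, 0.999); eps raised from 1e-8 to 1e-6 for stability.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-6)

# Lion defaults discovered by the program search: betas=(0.9, 0.99); there is no eps term.
# lion = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
```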
The update generated by Lion is an element-wise binary $\pm 1$ as a result of the sign operation, so it has a larger norm than the updates produced by other optimizers. Based on our experience, a suitable learning rate for Lion is typically 3-10x smaller than that for AdamW. Note that the initial value, peak value, and end value of the learning rate should be changed simultaneously by the same ratio compared to AdamW. We do not modify other training settings such as the learning rate schedule and gradient / update clipping. Since the effective weight decay is $lr * \lambda$, the value of $\lambda$ used for Lion is 3-10x larger than that for AdamW in order to maintain a similar strength.
For instance,
- $lr$=1e-4, $\lambda$=10.0 in Lion and $lr$=1e-3, $\lambda$=1.0 in AdamW when training ViT-B/16 on ImageNet with strong augmentations,
- a 3-10x smaller $lr$ and correspondingly larger $\lambda$ in Lion compared to AdamW for the diffusion models,
- a 3-10x smaller $lr$ and correspondingly larger $\lambda$ in Lion compared to Adafactor for the 7.5B language modeling.

Please see our paper for all hyperparameters.
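The rule of thumb above can be written down directly. This is a minimal sketch under the assumption that you already have a working AdamW recipe; the helper name is hypothetical, and the scaling factor is task-dependent, with `factor=10.0` used below purely as an example.

```python
def adamw_to_lion_hparams(adamw_lr, adamw_wd, factor=10.0):
    """Scale the learning rate down and the decoupled weight decay up by the
    same factor, so the effective weight decay (lr * wd) stays roughly constant."""
    return adamw_lr / factor, adamw_wd * factor

# e.g. an AdamW recipe with lr=1e-3 and wd=1.0 maps to roughly lr=1e-4, wd=10.0 for Lion
print(adamw_to_lion_hparams(1e-3, 1.0))  # (0.0001, 10.0)
```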
Apart from the peak performance, the sensitivity to hyperparameters and the difficulty of tuning them are also critical for the adoption of an optimizer in practice. In the figure below, we alter both $lr$ and $\lambda$ when training ViT-B/16 from scratch on ImageNet. As suggested by the heatmaps, Lion is more robust to different hyperparameter choices than AdamW.
Some may question whether Lion requires a large batch size to accurately determine the update direction due to the noise added by the sign operation. To address this concern, we train a ViT-B/16 model on ImageNet with various batch sizes while keeping the total training epochs at 300 and applying RandAug and Mixup. As shown in the figure below, the optimal batch size for AdamW is 256, while for Lion it is 4,096. This indicates that Lion indeed prefers a larger batch size, but its performance remains robust even with a batch size as small as 64. Furthermore, when the batch size grows to 32K, leading to only 11K training steps, Lion achieves a significant 2.5% accuracy gain over AdamW (77.9% vs. 75.4%), demonstrating its effectiveness in the large-batch training setting.
Left: ablation for the effect of the batch size; Lion prefers a larger batch than AdamW. Right: ImageNet accuracy of ViT-B/16 trained from scratch when we vary $lr$ and $\lambda$.
If you find this work helpful, please cite:
@misc{chen2023symbolic,
title={Symbolic Discovery of Optimization Algorithms},
author={Xiangning Chen and Chen Liang and Da Huang and Esteban Real and Kaiyuan Wang and Yao Liu and Hieu Pham and Xuanyi Dong and Thang Luong and Cho-Jui Hsieh and Yifeng Lu and Quoc V. Le},
year={2023},
eprint={2302.06675},
archivePrefix={arXiv},
primaryClass={cs.LG}
}