12/21/2023
By Vladislav Lialin
The Kennedy College of Sciences, Miner School of Computer & Information Sciences, invites you to attend a doctoral dissertation defense by Vladislav Lialin on "Efficient Training and Fine-Tuning of Transformer Networks for NLP."
Date: Monday, Jan. 8, 2024
Time: 1 to 3 p.m.
Location: DAN 309 (to be updated) and via Zoom
Committee Members:
- Anna Rumshisky (Advisor), Professor, Miner School of Computer & Information Sciences
- Luke Zettlemoyer (Member), Professor, University of Washington, Allen School of Computer Science & Engineering, Meta AI Research
- Reza Ahmadzadeh (Member), Assistant Professor, Miner School of Computer & Information Sciences
- Tingjian Ge (Member), Professor, Miner School of Computer & Information Sciences, CHORDS
Scaling laws have revealed that model performance can be improved reliably and predictably by increasing the amount of pre-training data or the model size. As a result, training costs have grown from a single GPU running for a day to thousands of GPUs running for multiple months. Unlike in the 2012–2018 period, neural network architecture is no longer the key factor; good scaling properties and training efficiency are.
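As a concrete illustration of this predictability (a standard result from the scaling-laws literature, not a claim of this abstract), Kaplan et al. (2020) fit the language-modeling loss of transformers as a power law in the non-embedding parameter count $N$:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13},$$

so the loss attainable at a larger scale can be extrapolated from fits at smaller scales before committing the compute.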
In this thesis, we focus on computational efficiency in contemporary NLP and on what models learn during training. We first present our early work on a practical application of continual learning (CL) to neural semantic parsing. Then, we study at scale how different pre-training objectives, amounts of pre-training data, and architectural variations affect a model's linguistic capabilities. To ground the development of novel efficient training methods, we then systematize, categorize, and survey state-of-the-art methods for parameter-efficient fine-tuning; a minimal sketch of one such method, LoRA, is given below.
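To make the parameter-efficient fine-tuning setting concrete, here is a minimal PyTorch sketch of low-rank adaptation (LoRA), the method that ReLoRA builds on. The class name, rank, and scaling factor are illustrative choices, not details taken from the thesis.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen base weight plus a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, where only A and B are
    trained. With r much smaller than the layer dimensions, the number of
    trainable parameters is a small fraction of the full layer's.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.scale = alpha / r
        # A starts small and random, B starts at zero, so at step 0 the
        # module computes exactly the same function as the base layer.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```

Only lora_A and lora_B receive gradients, so optimizer state is kept for roughly 2·r·(in_features + out_features) numbers per layer instead of in_features·out_features.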
We introduce ReLoRA, a first-of-its-kind method that uses low-rank updates to train high-rank networks.
ReLoRA consistently outperforms LoRA in both fine-tuning and pre-training of large transformer models with up to 1.3B parameters, and it becomes more effective as model size grows. Our largest experiment demonstrates a reduction in RAM usage of 5 GB per GPU and up to a 40% reduction in wall-clock time, depending on the hardware setup. Further, our results show performance similar to regular training, making ReLoRA a promising candidate for improving the efficiency of large-model training. A schematic of its training loop follows.
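As we understand it from the abstract, the core mechanism is to periodically merge the learned low-rank update into the frozen weights and restart with fresh factors. The loop below is a simplified sketch building on the LoRALinear class above; the reset interval, the Hugging-Face-style `model(**batch).loss` call, and the full optimizer-state reset are illustrative simplifications, not the thesis's exact procedure.

```python
import torch

@torch.no_grad()
def merge_and_reinit(layer: LoRALinear) -> None:
    """Fold the current low-rank update into the frozen base weight,
    then reset the factors so a fresh low-rank update can be learned."""
    layer.base.weight += layer.scale * (layer.lora_B @ layer.lora_A)
    layer.lora_A.normal_(std=0.01)
    layer.lora_B.zero_()

def relora_train(model, optimizer, scheduler, data_loader, reset_every=2000):
    """Train with periodic merge-and-reset of all LoRA factors."""
    for step, batch in enumerate(data_loader, start=1):
        loss = model(**batch).loss  # assumes a HF-style model output
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        if step % reset_every == 0:
            for module in model.modules():
                if isinstance(module, LoRALinear):
                    merge_and_reinit(module)
            # ReLoRA also partially resets optimizer state for the LoRA
            # parameters and uses a jagged learning-rate schedule with warm
            # restarts; clearing all optimizer state is a crude stand-in.
            optimizer.state.clear()
```

Because each merge folds a rank-r product into the full weight before the factors are reinitialized, successive updates can span different subspaces, so the cumulative change to the weights is not limited to rank r; this is how a sequence of low-rank updates can approximate full-rank training.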
We conclude that our novel approach, ReLoRA, is a significant advancement in reducing the computational and memory costs of training large-scale NLP models. Limitations of this study include its focus on larger models, where model weights and optimizer states consume a significant share of GPU memory, and the open question of how well neural network training can be approximated by a sequence of low-rank updates. However, our research demonstrates promising results for models with up to 1.3B parameters, and we expect ReLoRA to bring even larger improvements at greater scale.
Overall, the methods and insights presented in this thesis have the potential to significantly influence future developments in NLP and machine learning, pushing the boundaries of what is computationally feasible.
For more information, please contact Vlad Lialin.