07/27/2021
By Karen Volis

The Kennedy College of Sciences, Department of Computer Science, invites you to attend a doctoral dissertation defense by Olga Kovaleva on "Transformer Models in Natural Language Understanding: Strengths, Weaknesses and Limitations."

Ph.D. Candidate: Olga Kovaleva
Defense Date: Wednesday, Aug. 11, 2021
Time: noon EDT
Location: Virtual defense via Zoom

Committee Chair (Advisor): Anna Rumshisky, Professor, Computer Science Department, University of Massachusetts Lowell
Committee Members:

  • Hong Yu, Professor, Computer Science Department, University of Massachusetts Lowell
  • Tingjian Ge, Professor, Computer Science Department, University of Massachusetts Lowell
  • Byron Wallace (external member), Assistant Professor, Khoury College of Computer Sciences, Northeastern University

Abstract:
The recently proposed Transformer neural network architecture has revolutionized the field of Natural Language Processing (NLP). Transformer-based architectures currently achieve state-of-the-art performance on many NLP benchmark tasks, yet little is known about the exact mechanisms that contribute to their outstanding success. This dissertation aims to address some of the existing gaps in our understanding of how Transformer-based models work, with a particular focus on the model that first demonstrated their success in natural language understanding: BERT.

Using a subset of natural language understanding tasks and a set of handcrafted features of interest, we first propose a methodology and carry out a qualitative and quantitative analysis of the information encoded within BERT's self-attention mechanism. Our findings suggest that a limited set of attention patterns is repeated across different model components, indicating overall model overparametrization. We manually inspect, visualize, and propose a classification system for the observed self-attention patterns, and we identify the heads that exhibit semantically important linguistic signals. We also show that manually disabling attention in certain components improves performance over regularly fine-tuned BERT models.
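
As a rough illustration of this kind of head-level analysis, the sketch below uses the Hugging Face transformers library to extract BERT's per-head attention maps and to silence one head through the head_mask argument. The checkpoint, input sentence, and the particular head disabled are illustrative placeholders, not the ones studied in the dissertation.

    # Minimal sketch: extract BERT self-attention maps and disable one head.
    # The head masked below is an arbitrary placeholder.
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

    inputs = tokenizer("The defense is scheduled for August 11.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.attentions is a tuple with one tensor per layer, each of shape
    # (batch, num_heads, seq_len, seq_len) -- the raw material for inspecting
    # and classifying attention patterns.
    print(len(outputs.attentions), outputs.attentions[0].shape)

    # Disable selected heads with a head mask: 1.0 keeps a head, 0.0 zeroes it.
    head_mask = torch.ones(model.config.num_hidden_layers,
                           model.config.num_attention_heads)
    head_mask[3, 5] = 0.0  # hypothetical choice: silence head 5 in layer 3
    with torch.no_grad():
        masked_outputs = model(**inputs, head_mask=head_mask)

Running the masked and unmasked models on a downstream task and comparing scores is the basic shape of the disabling experiments described above.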

Furthermore, we examine the space of hidden representations computed by BERT-like models and present a heuristic for detecting their most fragile parts. Extending our methodology to other architectures, we confirm that a wide range of Transformer-based models are sensitive to the removal of a few high-magnitude weights, which emerge early in the pre-training stage and significantly affect the geometry of the embedding space.
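
A rough sketch of such a sensitivity probe, again with the Hugging Face transformers library: zero out a handful of the largest-magnitude weights in a single encoder parameter and compare hidden representations before and after. The parameter targeted and the top-k criterion are illustrative assumptions, not the dissertation's exact heuristic.

    # Sketch: zero a few high-magnitude weights and measure the representation
    # shift. The layer chosen and k=5 are illustrative assumptions only.
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    inputs = tokenizer("A probe sentence for the embedding space.",
                       return_tensors="pt")

    with torch.no_grad():
        baseline = model(**inputs).last_hidden_state

    # Pick one parameter tensor and zero its largest-magnitude entries.
    param = dict(model.named_parameters())["encoder.layer.0.output.dense.weight"]
    k = 5  # hypothetical: remove just a handful of weights
    topk = torch.topk(param.detach().abs().flatten(), k).indices
    with torch.no_grad():
        param.view(-1)[topk] = 0.0
        perturbed = model(**inputs).last_hidden_state

    # A large drop in similarity after removing only k weights signals fragility.
    sim = torch.nn.functional.cosine_similarity(
        baseline.flatten(0, 1), perturbed.flatten(0, 1)).mean()
    print(sim)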