07/28/2022
By Saurabh Kulshreshtha

The Kennedy College of Sciences, Department of Computer Science, invites you to attend a doctoral dissertation defense by Saurabh Kulshreshtha on “Identifying and Overcoming Limitations of Pre-trained Transformers with Strong Inductive Biases.”

Ph.D. Candidate: Saurabh Kulshreshtha
Defense Date: Thursday, Aug. 4, 2022
Time: 3 to 5 p.m.
Location: This will be a virtual defense via Zoom. Those interested in attending should contact the student (Saurabh_Kulshreshtha@student.uml.edu) and/or committee advisor Prof. Anna Rumshisky (Anna_Rumshisky@uml.edu) at least 24 hours prior to the defense to request access to the meeting.

Committee Chair (Advisor): Anna Rumshisky, Associate Professor, Computer Science, University of Massachusetts Lowell

Committee Members:

  • Hong Yu, Ph.D., Professor, Computer Science, University of Massachusetts Lowell
  • José Luis Redondo García, Applied Scientist, Amazon Alexa, Cambridge, UK

Abstract:

Pre-trained transformers have massively improved the state of the art for many natural language processing tasks under large-scale supervision. However, the vector-space representations learned by these models remain poorly understood, and although most simpler tasks show good performance under large-scale supervision, a host of more complex language tasks and low-supervision regimes remain challenging, with even transformer models performing poorly. This dissertation aims to identify task and supervision settings where pre-trained transformers perform poorly, to analyze the causes, and to offer solutions by inducing suitable biases in the form of alternative forms of supervision.

We conduct several studies:

  • (a) We focus on zero-shot cross-lingual transfer in pre-trained transformers, where downstream task data is available only for a source language (such as English) and the objective is to produce a performant system in a target language (such as Turkish) with limited access to labeled target-language data for training. We study and compare various methods for improving cross-language alignment in multilingual transformers such as mBERT, along with various sources of cross-lingual supervision such as dictionaries and parallel translated corpora. We evaluate on word-level tasks, which are known to transfer cross-lingual signal poorly compared to sentence-level classification tasks, even when adopting larger transformers pre-trained on more data.
  • (b) In our investigations, we find that certain dimensions of pre-trained transformer embeddings consistently carry values orders of magnitude larger than the rest. We find that these high-magnitude dimensions are caused by a very small number of parameters associated with the LayerNorm layers, and that selectively pruning those parameters severely degrades downstream task performance. We track their emergence during pre-training and present recommendations on how to reduce their damaging effect when pruning or serving the model.
  • (c) We formulate a new language task of solving crossword clues and entire puzzles. These puzzles are quite challenging given the variety of reasoning and knowledge requirements posed by the clues, and we propose to release a large-scale dataset for the task. We find that vanilla generative transformers perform extremely poorly on this task, and we identify stronger performance from transformers augmented with the additional bias of a retrieval module that searches large sources of world knowledge.
  • (d) We find that transformers with a few billion parameters perform poorly on tasks that require multiple steps of reasoning in the few-shot supervision regime. In particular, we focus on few-shot multi-hop question generation and propose a novel approach based on a few structured, human-written explanations, which improves control over the difficulty of the generated questions and, as we show through human and automatic evaluations, improves performance.

In the first study, we analyze how different forms of cross-lingual supervision and alignment methods influence transfer in the multilingual BERT (mBERT) transformer across eight language pairs. We restrict ourselves to token-level downstream tasks, for which cross-lingual transfer is known to improve little with increasing parameter count and pre-training data. Through systematic analysis, we find that supervision from parallel corpora is typically better than supervision from dictionaries for aligning languages. We characterize how alignment methods are biased toward particular task types and toward languages close to the transfer language. We propose a novel method for normalizing language sub-spaces in the vector spaces of multilingual transformers, which consistently improves transfer performance across all language pairs when used in tandem with the linear-transform alignment mechanism.
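
To make the linear-transform alignment mechanism concrete, below is a minimal sketch in the classic Procrustes style, assuming paired source- and target-language embeddings extracted for translation pairs; the function names and toy data are illustrative, not the dissertation's actual implementation.

    import numpy as np

    def learn_linear_alignment(src_embs: np.ndarray, tgt_embs: np.ndarray) -> np.ndarray:
        """Illustrative sketch, not the dissertation's method: learn an
        orthogonal map W minimizing ||src @ W - tgt||_F (the Procrustes
        solution), given (n_pairs, dim) arrays of embeddings for
        translation pairs from a dictionary or parallel corpus."""
        m = src_embs.T @ tgt_embs        # (dim, dim) cross-covariance
        u, _, vt = np.linalg.svd(m)      # SVD yields the optimal rotation
        return u @ vt

    # Toy usage: align five 8-dimensional embedding pairs.
    rng = np.random.default_rng(0)
    src, tgt = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
    w = learn_linear_alignment(src, tgt)
    aligned_src = src @ w                # now comparable to tgt embeddings

Constraining W to be orthogonal preserves distances within the source space, one common reason this family of alignment methods is preferred over unconstrained linear maps.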

In the second study, we examine the space of hidden representations computed by BERT-like models and propose a heuristic for detecting certain high-magnitude parameters that, we find, also detrimentally affect downstream task performance. We extend our methodology to other transformer architectures and confirm that a wide range of transformer-based models are sensitive to the removal of a few high-magnitude weights, which emerge early in the pre-training stage and significantly affect the geometry of the embedding space. We further discuss implications for pruning and for white-box attacks on transformers.
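
As a concrete illustration, here is a minimal sketch of one way to flag such high-magnitude dimensions in hidden states from a Hugging Face BERT model; the mean-plus-three-standard-deviations threshold and the probe sentences are illustrative assumptions, not the heuristic proposed in the study.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Illustrative probe, not the study's procedure.
    name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    sentences = ["The quick brown fox jumps over the lazy dog.",
                 "Pre-trained transformers learn surprising geometry."]

    with torch.no_grad():
        batch = tokenizer(sentences, return_tensors="pt", padding=True)
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, dim)

    # Mean absolute activation per embedding dimension, pooled over tokens.
    per_dim = hidden.abs().mean(dim=(0, 1))

    # Flag dimensions whose magnitude sits far above the rest.
    threshold = per_dim.mean() + 3 * per_dim.std()
    outliers = torch.nonzero(per_dim > threshold).squeeze(-1)
    print("candidate outlier dimensions:", outliers.tolist())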

In the third study, we introduce the task of solving crossword puzzles. Solving crossword puzzles requires diverse capabilities such as world knowledge and reasoning to satisfy the constraints imposed by the puzzle grid. Clue types include historical, factual, word-meaning, synonym/antonym, fill-in-the-blank, abbreviation, prefix/suffix, wordplay, and cross-lingual clues, as well as clues that depend on the answers to other clues. Unlike all prior work, we further constrain every model we experiment with to have no access to large databases of historical clue-answer pairs. We divide the task into two sub-tasks: (1) answering individual clues and (2) solving the entire crossword grid. For the clue-answering sub-task, our baselines include several sequence-to-sequence transformers, which perform significantly worse than retrieval-augmented generative transformers in which a separate retrieval module pulls information relevant to the clue from sources of world knowledge such as entire dictionaries, thesauri, and Wikipedia. We also introduce a non-parametric constraint-satisfaction baseline that solves the entire crossword puzzle from the candidate answers produced, as sketched below.
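
The constraint-satisfaction idea can be illustrated with a small backtracking search over grid slots; the slot and candidate representations below are hypothetical stand-ins, not the dissertation's actual interfaces.

    from typing import Dict, List, Optional, Tuple

    Cell = Tuple[int, int]

    def solve(slots: Dict[str, List[Cell]],
              candidates: Dict[str, List[str]],
              order: Optional[List[str]] = None,
              grid: Optional[Dict[Cell, str]] = None) -> Optional[Dict[Cell, str]]:
        """Toy sketch: fill every slot with one of its candidate answers so
        that crossing slots agree on shared cells; returns None on failure."""
        grid = grid or {}
        if order is None:  # try the most constrained slots first
            order = sorted(slots, key=lambda s: len(candidates[s]))
        if not order:
            return grid
        slot, rest = order[0], order[1:]
        for answer in candidates[slot]:
            cells = slots[slot]
            if len(answer) != len(cells):
                continue
            # Reject answers contradicting letters fixed by crossing slots.
            if any(grid.get(c, ch) != ch for c, ch in zip(cells, answer)):
                continue
            trial = dict(grid)
            trial.update(zip(cells, answer))
            filled = solve(slots, candidates, rest, trial)
            if filled is not None:
                return filled
        return None  # backtrack

    # Toy 2x2 grid: two across slots crossing two down slots.
    slots = {"1A": [(0, 0), (0, 1)], "2A": [(1, 0), (1, 1)],
             "1D": [(0, 0), (1, 0)], "2D": [(0, 1), (1, 1)]}
    candidates = {"1A": ["NO", "GO"], "2A": ["ON", "AT"],
                  "1D": ["NO", "GO"], "2D": ["ON", "OX"]}
    print(solve(slots, candidates))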

Finally, in the fourth study, we focus on few-shot multi-hop question generation, wherein, given only a few examples to learn from, the model must generate questions that require the reader to reason over and combine information spread across multiple passages. This task takes several steps of reasoning for humans to accomplish, and without access to intermediate reasoning steps, large transformers struggle to perform it. Inspired by chain-of-thought rationale generation, we introduce a new framework based on structured rationales, for which we obtain a small number of human-written rationale annotations. We treat each step of reasoning as a separate task to be performed by a generative transformer. We find that our proposed framework leads to improved control over the difficulty of the generated questions and to better performance than transformer baselines without rationale supervision, both on automatic evaluation metrics and in human evaluation.
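
To illustrate the step-wise idea, here is a minimal sketch that decomposes multi-hop question generation into separate generation calls, one per reasoning step, using an off-the-shelf sequence-to-sequence model; the prompts, step names, and model choice are illustrative assumptions rather than the dissertation's exact rationale schema.

    from transformers import pipeline

    # A generic instruction-following seq2seq model stands in for the
    # few-shot-prompted generator used in the study (illustrative only).
    generator = pipeline("text2text-generation", model="google/flan-t5-base")

    def step(instruction: str, context: str) -> str:
        out = generator(f"{instruction}\n\n{context}", max_new_tokens=64)
        return out[0]["generated_text"]

    def generate_multihop_question(passage_a: str, passage_b: str) -> str:
        # Step 1: identify the bridging entity shared by the two passages.
        bridge = step("Name the entity that links these two passages:",
                      passage_a + "\n" + passage_b)
        # Step 2: write a single-hop sub-question grounded in the first passage.
        sub_q = step(f"Write a question about '{bridge}' answerable from this passage:",
                     passage_a)
        # Step 3: compose a question that needs both passages to answer.
        return step(f"Rewrite '{sub_q}' so answering it also requires this passage:",
                    passage_b)

Treating each reasoning step as its own generation task is what gives such a framework its handle on question difficulty: adding or removing steps directly changes how many hops the final question requires.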