07/08/2022
By Saurabh Kulshreshtha
The Kennedy College of Sciences, Department of Computer Science, invites you to attend a Doctoral Dissertation Proposal defense by Saurabh Kulshreshtha on “Identifying Limitations and Fragility in Pre-trained Transformers and Overcoming Them with Strong Inductive Biases For Solving Challenging Language Tasks.”
Ph.D. Candidate: Saurabh Kulshreshtha
Defense Date: Wednesday, July 20, 2022
Time: 1:30 to 3 p.m. EDT
Location: This will be a virtual defense via Zoom. Those interested in attending should contact saurabh_kulshreshtha@student.uml.edu or the committee advisor, anna_rumshisky@uml.edu, at least 24 hours prior to the defense to request access to the meeting.
Committee:
- Committee Chair (Advisor): Anna Rumshisky, Associate Professor, Department of Computer Science
- Hong Yu, Professor, Department of Computer Science
- José Luis Redondo García, Applied Scientist, Alexa AI, Amazon
Abstract:
Pre-trained Transformers have improved performance on many natural language processing tasks given large-scale supervision. However, there is limited understanding of the vector-space representations these models learn, and although most simpler tasks show good performance under large-scale supervision, a host of more complex language tasks and low-supervision regimes remain challenging to solve, and even transformer models do not perform well on them. This dissertation proposal aims to address task and supervision settings where pre-trained transformers perform poorly and to identify the causes; it further aims to offer solutions by inducing suitable biases, in the form of alternative sources of supervision, to mitigate the poor performance.
We conduct several studies wherein (a) we systematically compare several language alignment methods and sources of supervision to improve zero-shot cross-lingual transfer in pre-trained transformers for word-level tasks, which are known to improve little from simply adopting larger transformers pre-trained on more data; (b) we investigate the emergence of high-magnitude outlier weights that appear throughout the hidden vector spaces learned by pre-trained transformers, and the detrimental impact on downstream tasks when these weights are selectively pruned; (c) we formulate a challenging new language task, solving crossword puzzles, on which transformers perform extremely poorly, and identify stronger baselines that enable transformers to retrieve relevant information from sources of world knowledge for this task; and (d) we find that transformers with a few billion parameters still perform poorly on tasks that require multiple steps of reasoning in the few-shot supervision regime; focusing on the task of few-shot multi-hop question generation, we propose a novel approach based on a few structured human-written explanations that improves control over the difficulty of the generated questions.
In the first study, we analyze how different forms of cross-lingual supervision and alignment methods influence transfer in the multilingual BERT (mBERT) transformer for eight language pairs. We restrict ourselves to token-level downstream tasks on which cross-lingual transfer is known to improve little with increasing parameter count and pre-training data. Through systematic analysis, we find that supervision from parallel corpora is typically better than supervision from dictionaries for aligning languages. We characterize the biases of alignment methods with respect to task type and proximity to the transfer language. We also propose a novel method for normalizing language sub-spaces in the vector spaces of multilingual transformers, which consistently improves transfer performance across all language pairs.
In the second study, we examine the space of hidden representations computed by BERT-like models and propose a heuristic for detecting certain high-magnitude parameters that, we find, also detrimentally affect downstream task performance. We propose to extend our methodology to other transformer architectures and to confirm that a wide range of transformer-based models are sensitive to the removal of a few high-magnitude weights that emerge early in the pre-training stage and significantly affect the geometry of the embedding space.
In the third study, we propose to introduce the task of solving crossword puzzles. Solving a crossword requires diverse capabilities, such as world knowledge and reasoning, to satisfy the constraints imposed by the puzzle grid. Clue types include historical, factual, word-meaning, synonym/antonym, fill-in-the-blank, abbreviation, prefix/suffix, wordplay, and cross-lingual clues, as well as clues that depend on the answers to other clues. Unlike all prior work, we further constrain every model we experiment with to have no access to large databases of historical clue-answer pairs. We divide the task into two sub-tasks: (1) answering individual clues and (2) solving the entire crossword grid. For the clue-answering sub-task, our baselines include several sequence-to-sequence transformers, which perform significantly worse than retrieval-augmented generative transformers, in which a separate retrieval module pulls information relevant to the clue from a source of world knowledge such as a dictionary, thesaurus, or Wikipedia. We also introduce a non-parametric constraint-satisfaction baseline that solves the entire crossword puzzle from the candidate answers produced for each clue.
Finally, in the fourth study, we focus on few-shot multi-hop question generation: given only a few examples to learn from, the model must generate questions that require the reader to reason over and combine information spread across multiple passages. This task requires several steps of reasoning even for humans. Without access to intermediate reasoning, large transformers struggle to perform it. Inspired by chain-of-thought rationale generation, we introduce a new framework based on structured rationales, for which we obtain a small number of human-written rationale annotations. We treat each step of reasoning as a separate task to be performed by a generative transformer. We find that our proposed framework leads to improved control over the difficulty of the generated questions and to better performance than transformer baselines without rationale supervision, both on automatic evaluation metrics and in human evaluation.
All interested students and faculty members are invited to attend the online defense via remote access.