11/27/2025
By Mohamed Elgaar
The Kennedy College of Science, Richard A. Miner School of Computer & Information Sciences, invites you to attend Mohamed Elgaar's defense of his doctoral dissertation, "Linguistic Knowledge for Steering Learning and Generation Dynamics in Large Language Models (LLMs)."
Ph.D. Candidate: Mohamed Elgaar
Date: Dec. 11, 2025
Time: 1 - 2 p.m. EST
Location: Dandeneau 309 or via Zoom
Committee members:
- Hadi Amiri (Advisor), Assistant Professor, Miner School of Computer and Information Sciences, UMass Lowell
- Hong Yu, Professor, Miner School of Computer and Information Sciences, UMass Lowell
- Anna Rumshisky, Associate Professor, Miner School of Computer and Information Sciences, UMass Lowell
- Wei Xu, Associate Professor, College of Computing, Georgia Institute of Technology
Abstract:
Large Language Models (LLMs) have transformed natural language processing, yet two fundamental challenges persist: the computational expense of training and the difficulty of precisely controlling their outputs. This dissertation demonstrates that linguistic complexity, quantified using measures of lexical, syntactic, and discourse sophistication, provides an effective solution to both problems. We show that linguistic complexity serves a bidirectional function: it predicts what models find difficult to learn, and it specifies what we want models to generate. This enables us to build models that learn more efficiently and generate more precisely.
The thesis is organized into two parts. Part I establishes linguistic complexity as a training signal that helps us understand model behavior and improve training efficiency. We introduce curriculum discovery frameworks that reveal that optimal training schedules are often non-monotonic and transferable across model scales. We then show that curricula based on linguistic complexity reveal which linguistic features matter most for specific tasks, deepening our understanding of model behavior.
Part II inverts this framework: if linguistic complexity metrics can predict what a model finds difficult to learn, they can also be used to control what a model generates. By injecting linguistic knowledge, we can steer the model's latent representations to produce text with desired characteristics. We introduce three systems that implement this principle of fine-grained linguistic controllability; the same machinery also enables new forms of linguistic stress-testing, revealing which linguistic attributes models find easiest or hardest to control. First, we develop a model that achieves precise control through a novel architecture featuring a dedicated attribute embedding network and an inference-time quality control mechanism that iteratively refines outputs to match target attributes while preserving semantic meaning. Second, we develop a power-law-based masking strategy for robust control over variable subsets of attributes. Finally, we develop an interactive system that makes these capabilities accessible through a user-friendly interface.
The above contributions establish that linguistic complexity provides a unified principle for steering LLMs: used as a curriculum signal, it improves training efficiency and our understanding of model behavior; used as a generation target, it enables fine-grained linguistic controllability and stress-testing.
The proposal concludes by outlining the remaining thesis work, which extends these principles to large-scale pretraining and specialized application domains. First, we will scale our approach to LLM pretraining, investigating how linguistic complexity metrics can be integrated into the pretraining pipeline to improve its computational efficiency. Second, we will investigate methods to identify and mitigate linguistic blind spots: weaknesses in how LLMs respond to core linguistic queries about their input text. Specifically, we will characterize regions of linguistic space where LLMs underperform and develop models that reduce these deficiencies. Finally, we will extend curriculum learning to the healthcare domain, applying our models to extract structured information from unstructured clinical notes; as part of this effort, we will organize the MedExACT shared task at the BioNLP 2026 workshop, benchmarking the detection and labeling of medical decisions in discharge summaries.