Colloquium Abstracts
Fall 2025
- September 12 — Hua Xu — "Large Language Models for Biomedical Applications"
- September 19 — Marcos Zampieri — "LLMs in Education: Integrating Code and Text Generation Models in Educational Applications"
- September 22 — Weishen Pan — "Identification of Predictive Subphenotypes for Clinical Outcomes Using Real-World Data and Machine Learning"
- September 26 — Diane Litman — "eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback"
Spring 2025
- February 7 — Diyi Yang — "Enabling and Evaluating Human-Agent Collaboration"
- February 14 — Rada Mihalcea — "Why AI Is W.E.I.R.D. And Shouldn't Be This Way"
- February 28 — Wei Xu — "Human-AI Collaboration in Evaluating Large Language Models"
- March 7 — Emily Prud’hommeaux — "Overcoming Obstacles in NLP for Endangered Languages"
- March 21 — Byron Wallace — "LLMs for healthcare: Risks and interpretability methods to (possibly) mitigate them"
- March 28 — Tom McCoy — "Understanding the abilities of AI systems: Memorization, generalization, and points in between"
- April 4 — Jiawei Han — "A Retrieval and Structuring Approach for LLM-Enhanced, Theme-Focused Science Discovery"
- April 11 — Greg Durrett — "Specializing LLMs for Reliability"
- April 18 — Jessy Li — "Discourse models with language models"
- April 25 — Mihai Surdeanu — "Neuro-symbolic Approaches for Explainable Natural Language Processing"
Fall 2024
- October 25 — Weiyan Shi — "Persuasion for Social Good: How to Build and Break Persuasive Chatbots"
- November 1 — Ankush Das — "Programming Language Principles for Distributed Systems"
- November 8 — Tianyi Zhou — "Synthetic Data for Self-Evolving AI"
- November 15 — Marco Gaboardi — "Reasoning about Programs’ Adaptivity, with applications to Adaptive Data Analysis"
- November 20 — Liang Zhao — "Graph Representation Learning for Network Generation, Optimization, and Verbalization"
- November 22 — Danielle S. Bitterman — "Bridging the AI Translational Gap in Oncology"
- December 6 — Sijia Liu — "Machine Unlearning for Generative AI: A Model-Based Perspective"
Fall 2025
Large Language Models for Biomedical Applications
Date: September 12, 2025
Recent breakthroughs in Large Language Models (LLMs) have sparked an unprecedented transformation across diverse fields, particularly in the biomedical domain. These advanced models are not only accelerating research but also revolutionizing clinical practices by enabling more efficient data analysis, improving decision-making processes, and facilitating innovative discoveries. In this talk, I'll share our cutting-edge methodologies and practical software solutions leveraging state-of-the-art LLMs such as GPT and LLaMA. We will highlight their applications in real-world evidence generation, medical diagnoses, and literature-based discovery. Additionally, we will discuss the compelling insights, challenges, and real-world experiences gained from applying these transformative technologies, illustrating how LLMs are reshaping the future of biomedical research and healthcare.
LLMs in Education: Integrating Code and Text Generation Models in Educational Applications
Date: September 19, 2025
Recent advances in Generative AI and Large Language Models (LLMs) have the potential to transform education. LLMs are becoming an important part of various educational applications including intelligent tutoring systems capable of handling text, images, and programming code. In this talk, I present recent work on the use of LLMs in education. The talk is divided into two parts. In the first part, I present research on LLMs in Computer Science (CS) education. In particular, I describe use cases of LLMs in CS education and code generation, including recent benchmark work on introductory programming assignments and low-resource programming languages. In the second part of this talk, I describe LLMs applied to Natural Language Processing (NLP) within educational applications. I describe research on tasks such as lexical complexity prediction and text simplification.
Identification of Predictive Subphenotypes for Clinical Outcomes Using Real-World Data and Machine Learning
Date: September 22, 2025
Predicting treatment response is an important problem in real-world applications, where the heterogeneity of the treatment response remains a significant challenge in practice. The growing availability of real-world data (RWD), such as electronic health records (EHRs), provides opportunities to address this challenge by clustering patients based on RWD. In this talk, I will review traditional unsupervised machine learning methods for subphenotyping and highlight their limitation of not ensuring coherent outcomes within identified groups. I will then introduce our proposed Graph-Encoded Mixture Survival (GEMS) framework, a general machine learning approach designed to identify predictive subphenotypes that simultaneously ensure coherent survival outcomes and consistent baseline characteristics. I will present results from applying GEMS to a large real-world dataset of advanced non-small cell lung cancer (aNSCLC) patients, demonstrating its effectiveness in predicting overall survival (OS) and uncovering clinically interpretable subgroups. I will conclude by discussing future opportunities and challenges in extending this framework to other disease contexts.
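The contrast drawn above between traditional unsupervised subphenotyping and predictive subphenotyping can be made concrete with a toy check: cluster on baseline features, then ask whether the clusters actually separate survival outcomes. The sketch below is illustrative only and is not the GEMS implementation; the synthetic data, feature count, and use of a log-rank test are assumptions made for the example.

```python
# Toy illustration (not GEMS): cluster patients on baseline features, then
# test whether the resulting groups are coherent with respect to survival.
import numpy as np
from sklearn.cluster import KMeans
from lifelines.statistics import multivariate_logrank_test

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                # baseline EHR-style features (synthetic)
time = rng.exponential(scale=12.0, size=500)  # follow-up time in months (synthetic)
event = rng.integers(0, 2, size=500)          # 1 = death observed, 0 = censored

# Traditional unsupervised subphenotyping: cluster on features alone.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Predictive subphenotyping additionally requires the groups to separate
# outcomes; a log-rank test across clusters is one simple diagnostic.
test = multivariate_logrank_test(time, labels, event)
print(f"log-rank p-value across clusters: {test.p_value:.3f}")
```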
eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback
Date: September 26, 2025
The ability to revise essays in response to feedback is important for students’ writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. In this talk, I will first present the NLP technology behind eRevise+RF, an enhanced AWE system for assessing student essay revisions (i.e., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. Next, I will present evaluation results from a system deployment with 406 students in Pennsylvania and Louisiana, confirming its effectiveness in 1) assessing student essays in terms of evidence usage, 2) extracting evidence and reasoning revisions across essays, 3) determining revision success in responding to feedback, and 4) helping students improve their argumentative writing skills through revision and feedback. Finally, I will present a method for efficient layer-wise LLM fine-tuning developed for low-resource scenarios such as revision classification; the method fine-tunes a subset of important LLM layers, dynamically selected based on their gradient norm distribution, while freezing the redundant layers. Experiments using revision data from both eRevise+RF and a community benchmark show that our method surpasses several layer-wise PEFT baselines across diverse text revisions, while achieving fast convergence and low GPU memory consumption.
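As a rough illustration of the layer-selection idea described above, the sketch below ranks transformer layers by gradient norm on one batch and freezes the rest. The helper names, the BERT-style encoder, and the choice of k are assumptions for illustration, not details of the method presented in the talk.

```python
# Hedged sketch of gradient-norm-based layer selection for fine-tuning.
# Function names and model choice are illustrative, not the paper's code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def select_layers_by_grad_norm(model, batch, k=4):
    """One backward pass; rank encoder layers by total parameter gradient
    norm and return the indices of the top-k layers."""
    model.zero_grad()
    model(**batch).loss.backward()  # batch must include `labels`
    norms = []
    for i, layer in enumerate(model.base_model.encoder.layer):
        total = sum(p.grad.norm().item() for p in layer.parameters()
                    if p.grad is not None)
        norms.append((total, i))
    model.zero_grad()
    return {i for _, i in sorted(norms, reverse=True)[:k]}

def freeze_unselected_layers(model, selected):
    """Keep only the selected encoder layers trainable; freeze the rest."""
    for i, layer in enumerate(model.base_model.encoder.layer):
        for p in layer.parameters():
            p.requires_grad = i in selected

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
batch = tok(["The student added two pieces of evidence."], return_tensors="pt")
batch["labels"] = torch.tensor([1])
freeze_unselected_layers(model, select_layers_by_grad_norm(model, batch))
```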
Spring 2025
Enabling and Evaluating Human-Agent Collaboration
Date: February 7, 2025
Recent advances in large language models (LLMs) have revolutionized human-AI interaction, but their success depends on addressing key challenges like privacy and effective collaboration. In this talk, we first explore PrivacyLens, a general framework to evaluate privacy leakage in LLM agents’ actions, by extending privacy-sensitive seeds into agent trajectories. By evaluating state-of-the-art models, PrivacyLens reveals contextual and long-tail privacy vulnerabilities, even under privacy-enhancing instructions. We then introduce Co-Gym, a novel framework for studying and enhancing human-agent collaboration across various tasks. Our findings reveal that collaborative agents consistently outperform their fully autonomous counterparts in task performance. Via PrivacyLens and Co-Gym, this talk highlights how to develop AI systems that are trustworthy and capable of fostering meaningful collaboration with human users.
Why AI Is W.E.I.R.D. And Shouldn't Be This Way
Date: February 14, 2025
Recent years have witnessed remarkable advancements in AI, with language and vision models that have enabled progress in numerous applications and opened the door to the integration of AI in areas such as communication, transportation, healthcare, and the arts. Yet, many of these models and their corresponding datasets are W.E.I.R.D. (Western, Educated, Industrialized, Rich, Democratic) and are reflective of a small fraction of the population.(*) In this talk, I will show some of the limitations and lack of representation of current AI models, and highlight the need for cross-cultural language and vision models that can capture the diversity of behaviors, beliefs, and language expressions across different groups. I will also explore ways in which we can address these limitations by developing models that are re-centered around people and their unique characteristics. (*) W.E.I.R.D. is an acronym widely used in psychology to indicate the limitations of many of the studies carried out in the field.
Human-AI Collaboration in Evaluating Large Language Models
Date: February 28, 2025
To support real-world applications more responsibly and further improve large language models (LLMs), it is essential to design reliable and reusable frameworks for their evaluation. In this talk, I will discuss three forms of human-AI collaboration for evaluation that combine the strengths of both: (1) the reliability and user-centric aspect of human evaluation, and (2) the cost efficiency and reproducibility offered by automatic evaluation. The first part focuses on systematically assessing LLMs’ favoritism towards Western culture, using a hybrid approach of manual effort and automated analysis. The second part will showcase an LLM-powered privacy preservation tool, designed to safeguard users against the disclosure of personal information. I will share some interesting findings from an HCI user study involving real Reddit users utilizing our tool, which in turn informs our ongoing efforts to improve the design of NLP models. Lastly, we will delve into the evaluation of LLM-generated texts, where human judgments can be used to train automatic evaluation metrics to detect errors. We also highlight the opportunity to engage both laypeople and experts in evaluating LLM-generated simplified medical texts in high-stakes healthcare applications.
Overcoming Obstacles in NLP for Endangered Languages
Date: March 7, 2025
A majority of the world's languages lack sufficient resources to train the state-of-the-art NLP models we've come to expect for high-resource languages like English or Mandarin. The situation is particularly dire for endangered languages, which could benefit enormously from these technologies but will never have abundant high-quality training resources. In this talk, I will discuss some approaches for addressing these challenges in automatic speech recognition and machine translation, with a focus on several different endangered and under-resourced languages.
LLMs for healthcare: Risks and interpretability methods to (possibly) mitigate them
Date: March 21, 2025
Large Language Models (LLMs) are poised to transform specialist fields like healthcare. Such models promise to free domain experts, including physicians, from drudgery, enabling better care to be delivered at scale. But the use of LLMs in healthcare—and similar high-stakes, specialized domains—brings real risks. Used naively, such models may worsen existing biases in practice. They might also result in medical errors owing to "hallucinations". In this talk I will discuss a few recent efforts designing and critically evaluating LLMs for medical language processing tasks, e.g., summarizing clinical notes in patient electronic health records (EHRs). I will highlight current limitations and associated risks of LLMs in the context of these applications, particularly related to robustness and bias. Finally, I will discuss recent work on adopting "mechanistic" interpretability methods in the space of healthcare as a potential means of mitigating these issues.
Understanding the abilities of AI systems: Memorization, generalization, and points in between
Date: March 28, 2025
Large language models (LLMs) can perform a wide range of tasks impressively well. To what extent are these abilities driven by shallow heuristics vs. deeper abstractions? I will argue that, to answer this question, we must view LLMs through the lens of generalization. That is, we should consider the data that LLMs were trained on so that we can identify whether and how their abilities go beyond their training data. In the analyses of LLMs that I will discuss, this perspective reveals both impressive strengths and surprising limitations. For instance, LLMs often produce sentence structures that are well-formed but that never appeared in their training data, yet they also struggle on some seemingly simple algorithmic tasks (e.g., decoding simple ciphers) in ways that are well-explained by training data statistics. In sum, to understand what AI systems are, we must understand what we have trained them to be.
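To make the cipher example above concrete: a shift cipher is algorithmically trivial, which is exactly what makes LLM failures on it informative. The snippet below decodes a shift cipher directly; the broader point, as the abstract notes, is that model accuracy on such tasks tends to be well explained by how often the particular variant (e.g., rot-13 versus a rarer shift) appears in training data.

```python
# A shift (Caesar) cipher decoder: the task itself is trivial to program,
# which is why LLM failures on uncommon shift values are revealing.
def shift_decode(ciphertext: str, shift: int) -> str:
    out = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(shift_decode("Uryyb, jbeyq!", 13))  # -> Hello, world!
```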
A Retrieval and Structuring Approach for LLM-Enhanced, Theme-Focused Science Discovery
Date: April 4, 2025
Large Language Models (LLMs) may bring unprecedented power to scientific discovery. However, current LLMs may still encounter major challenges in effective scientific exploration due to their lack of in-depth, theme-focused data and knowledge. Retrieval-augmented generation (RAG) has recently become an interesting approach for augmenting LLMs with grounded, theme-specific datasets. We discuss the challenges of RAG and propose a retrieval and structuring (RAS) approach, which enhances RAG by improving retrieval quality and mining structures (e.g., extracting entities and relations and building knowledge graphs) to ensure the effective integration of theme-specific data with LLMs. We show the promise of this approach in augmenting LLMs and discuss its potential power for LLM-enabled science exploration.
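A minimal sketch of the retrieval-and-structuring idea follows, under the assumption that "structuring" means converting retrieved passages into (head, relation, tail) triples supplied to the LLM alongside the raw text. The corpus, the TF-IDF retriever, and the placeholder triple extractor are illustrative stand-ins, not the RAS system itself.

```python
# Illustrative retrieval-and-structuring (RAS) style pipeline, not the
# speaker's system: retrieve theme-specific passages, then "structure" them
# into knowledge-graph triples that are passed to the LLM with the text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "CRISPR-Cas9 enables targeted genome editing in human cells.",
    "Metformin is a first-line treatment for type 2 diabetes.",
    "AlphaFold predicts protein structures from amino-acid sequences.",
]

def retrieve(query, k=2):
    vec = TfidfVectorizer().fit(corpus + [query])
    scores = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def extract_triples(passage):
    # Placeholder: a real system would run entity and relation extraction here.
    if "AlphaFold" in passage:
        return [("AlphaFold", "predicts", "protein structure")]
    return []

query = "How are protein structures predicted?"
passages = retrieve(query)
triples = [t for p in passages for t in extract_triples(p)]
prompt = f"Context: {passages}\nKnowledge-graph triples: {triples}\nQuestion: {query}"
print(prompt)
```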
Specializing LLMs for Reliability
Date: April 11, 2025
Large language models (LLMs) have advanced the frontiers of AI reasoning: they can synthesize information from multiple sources, derive new conclusions, and explain those conclusions to their users. However, LLMs do not do this reliably. They hallucinate facts, convincingly state incorrect deductions, and exhibit logical fallacies like confirmation bias. In this talk, I will describe my lab's work on making LLM systems reliable by introspecting their behavior. First, I will argue that automating fine-grained evaluation of LLM output provides a level of understanding necessary for further progress. I will describe the ingredients of effective automated evaluators and a state-of-the-art factuality evaluation system, MiniCheck, showing that analyzing the nature of hallucinations can help reduce them. Second, I will demonstrate that better understanding of LLMs' internal reasoning processes helps us train them to be more reliable. Our work shows that model interpretation techniques can advance training methodology and dataset curation for reasoning models. Finally, I will describe how deeper understanding of LLMs will let us tackle their most fundamental limitations, such as their inconsistency when given different inputs. I will propose how these pieces might soon be combined to form reliable AI systems.
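For a flavor of fine-grained factuality evaluation, a generic entailment check can score each generated sentence against its source document. This is only a stand-in for illustration, not MiniCheck (which is a dedicated, purpose-trained evaluator); the NLI checkpoint used below is simply a publicly available model.

```python
# Generic NLI-based stand-in for sentence-level factuality checking.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def entailment_prob(source: str, claim: str) -> float:
    """Probability that `source` entails `claim` under the NLI model."""
    inputs = tok(source, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1)[0]
    return probs[model.config.label2id["ENTAILMENT"]].item()

source = "The trial enrolled 120 patients and ran for 24 weeks."
for claim in ["The trial enrolled 120 patients.", "The trial lasted two years."]:
    print(f"{claim!r}: entailment probability {entailment_prob(source, claim):.2f}")
```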
Discourse models with language models
Date: April 18, 2025
How are sentences in a document connected, and why do they make the document feel “coherent”? Computational models of discourse aim to answer these questions by recovering the structural organization of texts, through which writers convey intent and meaning. In the first part of this talk, I will discuss our efforts on modeling human curiosity through question generation, and on understanding its connection with discourse representations based on the linguistic theory of Questions Under Discussion. We show that LLMs, with appropriate design and training, can resurface curiosity-driven questions and ground their elicitation and answers in text. Next, I will demonstrate how such generative discourse models can be used to measure discourse similarities in LLM-generated texts, as well as to derive explainable measures of information salience in LLMs using summarization as a behavioral probe.
Neuro-symbolic Approaches for Explainable Natural Language Processing
Date: April 25, 2025
Deep learning approaches to natural language processing (NLP) such as GPT* have achieved tremendous successes recently. However, these systems are difficult to understand, augment, or maintain as needs shift. In this talk I will discuss two of our recent efforts that aim to bring explainability back into deep learning methods for NLP. In the first part of the talk, I will introduce an explainable approach for information extraction (IE), an important language understanding task that focuses on finding structured information in text, such as who did what to whom, when, and where. Our approach mitigates the tension between generalization and explainability by jointly training for the two goals. The proposed method uses a multi-task learning architecture that jointly trains a classifier for information extraction and a sequence model that labels words in the context which explain the decisions of the classifier. We show that, even with minimal guidance for what makes a good explanation, the sequence model learns to provide accurate explanations. Further, we show that the joint training generally improves the performance of the IE classifier. In the second part of the talk, I will discuss a neuro-symbolic architecture for information extraction that preserves the advantages of both directions, i.e., the generalization power of neural methods and the pliability of symbolic approaches. Our modular approach contains two components: a declarative rule-based model and a neural component. The former implements information extraction with a set of explainable rules that rely on syntax; the latter increases the generalizability of the rules by semantically matching them over text. I'll show that the proposed approach outperforms all neural models on a challenging IE task. More importantly, I'll show that the underlying symbolic representation can be locally modified to correct model mistakes without retraining the neural component.
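As a toy illustration of pairing a symbolic rule with a neural matcher (this is not the speaker's architecture; the embedding model, trigger phrase, and threshold are assumptions): the rule declares a trigger and a relation, and the neural component lets the rule fire on semantically similar wordings rather than exact matches.

```python
# Toy neuro-symbolic rule matching: a declarative rule supplies the relation
# and a trigger phrase; an embedding model matches paraphrases of the trigger.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

RULE = {"trigger": "was hired by", "relation": "works_for"}

def apply_rule(sentence: str, threshold: float = 0.4):
    sim = util.cos_sim(model.encode(sentence), model.encode(RULE["trigger"])).item()
    return RULE["relation"] if sim >= threshold else None

print(apply_rule("Maria joined Acme Corp as an engineer last spring."))
print(apply_rule("The committee met on Tuesday to review the budget."))
```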
Fall 2024
Persuasion for Social Good: How to Build and Break Persuasive Chatbots
Date: October 25, 2024
AI research has so far focused on modeling common human skills, such as building systems to see, read, or talk. As these systems gradually reach a human level on standard benchmarks, it is increasingly important to develop next-generation interactive AI systems with more advanced human skills, to function in realistic and critical applications such as providing personalized emotional support. In this talk, I will cover (1) how to build such expert-like AI systems specialized in social influence that can persuade, negotiate, and cooperate with other humans during conversations. (2) I will also discuss how humans perceive such specialized AI systems; this study validates the necessity of Autobot Law and proposes guidance to regulate such systems. (3) As these systems become more powerful, AI safety problems become more important, so I will also describe how to persuade AI models to jailbreak them and study AI safety problems. Finally, I will conclude with my long-term vision to build a natural interface between human intelligence and machine intelligence via dialogue, using a multi-angle approach that combines Artificial Intelligence, Human-Computer Interaction, and the social sciences, to develop expert AI systems for everyone.
Programming Language Principles for Distributed Systems
Date: November 1, 2024
With the proliferation of distributed systems, the design of safe, secure, and efficient software has become an ever more complex task. The heterogeneous nature of these distributed systems has further introduced domain-specific programming requirements such as inferring execution cost, accounting for randomized behavior, and preventing communication errors. To develop programming languages and reasoning tools for such multi-threaded environments, we need two main ingredients: concurrency and domain-specific support. In this talk, I will use session types as a base type system that already comes equipped with reasoning capabilities for message-passing concurrent systems. On top of it, I will introduce domain-specific support for three different domains: digital transactions, randomized systems, and program verification. Programming smart contracts comes with its own unique challenges, which include enforcing protocols of interaction, tracking linear assets, and analyzing execution cost. To address these challenges, the talk introduces Nomos, which employs linear session types to enforce protocols and prevent assets from being duplicated or discarded. To predict execution cost, Nomos uses resource-aware types and automatic amortized resource analysis, a type-based technique for inferring cost bounds. For randomized systems, Nomos is further enhanced with probabilistic types that track the probability distribution of message exchanges in a distributed system. Finally, to verify concurrent programs, I will introduce dependent refinement session types that can naturally track intrinsic properties, such as sizes and values, in the type of messages, which can then be used for lightweight verification. The talk concludes with my future plans on exploring how programming languages can aid in the specification, verification, and possibly synthesis of cryptographic protocols.
Synthetic Data for Self-Evolving AI
Date: November 8, 2024
Data is the new oil for training large AI models. However, the "oil" created by humans may run out someday, or grow much more slowly than the rate at which AI consumes it. Moreover, human-created data are less controllable in terms of quality, opinions, format, style, etc., and may lead to biases or privacy concerns when used for model training. Can we leverage the power of Generative AI to automatically create synthetic data in a more efficient, controllable, and safe manner, for training or benchmarking purposes? How can we avoid the model collapse caused by continuously training a model on self-generated synthetic data? In this talk, I will present our recent works that investigate whether and how synthetic data can be created to improve large language models (LLMs) and vision-language models (VLMs), especially when the real data is imperfect. These works include Mosaic-IT (compositional data augmentation for instruction tuning), DEBATunE (data generation by LLM debate), Diffusion Curriculum (generative curriculum learning of low-quality images), and AutoHallusion (hallucination benchmark generation via automatic image editing). These projects are led by Ming Li, Yijun Liang, Xiyang Wu, and Tianrui Guan.
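To give a concrete feel for compositional instruction augmentation in the spirit of Mosaic-IT (the exact recipe below is an assumption, not the authors' implementation): several existing instruction-response pairs are combined into one multi-part synthetic training example, with no additional human annotation.

```python
# Hedged sketch of compositional instruction augmentation: "mosaic" several
# seed instruction-response pairs into a single multi-task training sample.
import random

seed_data = [
    {"instruction": "Translate 'good morning' to French.", "response": "Bonjour."},
    {"instruction": "Give an antonym of 'cold'.", "response": "Hot."},
    {"instruction": "What is 7 * 6?", "response": "42."},
]

def mosaic(examples, k=2, seed=0):
    random.seed(seed)
    picked = random.sample(examples, k)
    instruction = "Answer each numbered task.\n" + "\n".join(
        f"{i + 1}. {ex['instruction']}" for i, ex in enumerate(picked))
    response = "\n".join(f"{i + 1}. {ex['response']}" for i, ex in enumerate(picked))
    return {"instruction": instruction, "response": response}

print(mosaic(seed_data))
```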
Reasoning about Programs’ Adaptivity, with applications to Adaptive Data Analysis
Date: November 15, 2024
An adaptive program is a program that interacts with other components and whose choice for the next interaction depends on the results of previous interactions. Adaptive programs find applications in many areas of computer science, such as adaptive data analysis, the analysis of interactive protocols in security and privacy, and database systems. In many of these applications it is important to quantify the level of adaptivity of a program.
In my talk, I will focus on adaptive programs in the context of adaptive data analysis. In this area, one is interested in guaranteeing that the result of a data analysis run on sample data does not differ too much from the result one would achieve by running the same analysis over the entire population. To achieve this goal, one can use several techniques designed to control the generalization error of a data analysis, but in order to choose well among the different techniques one has to know the adaptivity of a program. I will show how program analysis can help with this task.
Concretely, I will first present a programming model for adaptive data analyses based on a simple imperative programming language that is suitable for integrating the different techniques that can be used to control the generalization error. I will then introduce a program analysis for this language that, given an input program implementing an adaptive data analysis, generates an upper bound on the total number of queries the data analysis will run and, more interestingly, an upper bound on the depth of the chain of queries implemented by the program. These two measures can be used to select the right technique to guarantee a bound on the generalization error of the input data analysis. I will then discuss limitations and potential future work.
Graph Representation Learning for Network Generation, Optimization, and Verbalization
Date: November 20, 2024
Bridging the AI Translational Gap in Oncology
Date: November 22, 2024
Machine Unlearning for Generative AI: A Model-Based Perspective
Date: December 6, 2024
In this talk, I will introduce the concept of Machine Unlearning (MU), a transformative approach to removing undesirable data influence or associated model capabilities from learned discriminative and generative models. To bridge the gap between exact and approximate unlearning, I will present a novel model-based perspective that integrates model sparsity, gradient-based weight saliency, and weight influence attribution. This model-centric approach achieves significant advancements in MU for vision and language models, balancing effectiveness, preserved utility, and enhanced efficiency. Additionally, I will explore the practical implications of MU in addressing critical challenges in AI safety.
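A minimal sketch of the gradient-based weight-saliency idea, assuming the common recipe of restricting unlearning updates to the weights most salient on the forget set; the function names and the sparsity level below are illustrative, not the speaker's exact method.

```python
# Hedged sketch of gradient-based weight saliency for machine unlearning:
# only weights with large gradient magnitude on the forget set are allowed
# to change during the subsequent unlearning fine-tune.
import torch

def saliency_mask(model, forget_loader, loss_fn, sparsity=0.9):
    """Return {param_name: binary mask} keeping the top (1 - sparsity)
    fraction of weights ranked by gradient magnitude on the forget data."""
    model.zero_grad()
    for inputs, targets in forget_loader:
        loss_fn(model(inputs), targets).backward()
    grads = {n: p.grad.abs() for n, p in model.named_parameters() if p.grad is not None}
    threshold = torch.quantile(torch.cat([g.flatten() for g in grads.values()]), sparsity)
    mask = {n: (g >= threshold).float() for n, g in grads.items()}
    model.zero_grad()
    return mask

def masked_unlearning_step(model, mask, optimizer):
    """After loss.backward() on the unlearning objective, zero out gradients
    of non-salient weights so the optimizer only updates the salient ones."""
    for n, p in model.named_parameters():
        if p.grad is not None and n in mask:
            p.grad.mul_(mask[n])
    optimizer.step()
```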