07/30/2024
By You Zhou
The Kennedy College of Sciences, Miner School of Computer & Information Sciences, invites you to attend a doctoral dissertation proposal defense by You Zhou on "Detecting AI-Generated Texts in Cross-Domains."
Ph.D. Candidate: You Zhou
Date: Monday, Aug. 5, 2024
Time: 2 p.m. ET
Location: This will be a virtual defense via Zoom. Meeting ID: 2904839748
Committee Members:
- Jie Wang (advisor), Professor, Miner School of Computer and Information Sciences
- Benyuan Liu (member), Professor, Miner School of Computer and Information Sciences
- Li Feng (member), Instructional Design Manager, The TJX Companies
Abstract
Identifying texts generated by Large Language Models (LLMs) is crucial in education, research, and many other domains to prevent misuse such as cheating on assignments or spreading misinformation. Existing detection tools, while effective on texts similar to those in their training data, struggle with texts from other domains because of content variability and complexity. AI-generated texts can mimic human styles in literature, maintain precision in scientific writing, and achieve contextual accuracy in journalism, making detection challenging. To address these challenges, we introduce RoBERTa-Ranker, a deep classifier that modifies RoBERTa with a margin ranking loss function and a mean-pooling layer. Trained on a dataset augmented with texts generated by various LLMs, RoBERTa-Ranker labels each text as Human or LLM. To improve cross-domain detection accuracy, we propose a fine-tuning method that uses a small set of labeled texts together with their predicted labels and confidence levels. By computing a content-significance distribution vector for each article and determining its domain, our method outperforms existing tools on cross-domain texts and on texts generated by unseen LLMs.
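The proposal itself does not include code, but the two components named above, a mean-pooling layer over token embeddings and a margin ranking loss, can be illustrated with a minimal NumPy sketch. The function names, tensor shapes, and margin value here are illustrative assumptions, not the candidate's actual implementation:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, :, None]               # broadcast over the embedding dim
    summed = (token_embeddings * mask).sum(axis=1)  # sum only unmasked tokens
    counts = mask.sum(axis=1)                       # number of real tokens per sequence
    return summed / counts

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Penalize pairs where the positive score does not exceed
    the negative score by at least `margin`."""
    return np.maximum(0.0, margin - (pos_scores - neg_scores)).mean()
```

In a RoBERTa-style model, `token_embeddings` would be the encoder's last hidden states, and the pooled vector would feed a scoring head whose outputs are trained with the ranking loss.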
Additionally, we explore the significance of sentence placement in different article types, hypothesizing that skilled writers assign varying importance to sentences at specific locations and to specific subsequences. For instance, news articles often follow an inverted-pyramid structure, where sentences at the beginning carry the most weight. We present the first quantitative analysis of this phenomenon, which has previously been characterized only qualitatively through structures such as the inverted pyramid, hourglass, diamond, and narrative. Using existing datasets of argument, narrative, news, and scholarly research articles, we investigate the content significance distribution (CSD) over sentence locations and text blocks, building on prior research that assigned ad hoc weights to sentence locations for ranking purposes. Our findings aim to enhance text mining tasks by providing these quantitative descriptions.
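To make the idea of a content significance distribution concrete, here is a toy sketch that scores each sentence by its lexical overlap with the whole document and normalizes the scores into a distribution over sentence positions. The overlap heuristic is a stand-in assumption for illustration only, not the significance measure developed in the proposal:

```python
from collections import Counter

def sentence_significance(sentences):
    """Return a normalized significance score per sentence position.

    Each sentence is scored by the average corpus frequency of its words,
    a crude proxy for how central the sentence is to the document.
    """
    doc_counts = Counter(w for s in sentences for w in s.lower().split())
    scores = []
    for s in sentences:
        words = s.lower().split()
        scores.append(sum(doc_counts[w] for w in words) / max(len(words), 1))
    total = sum(scores)
    return [x / total for x in scores]  # distribution over sentence positions
```

Averaging such per-position distributions across many articles of one type (news, narrative, scholarly, etc.) would yield an empirical CSD for that article type.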
To enhance the base model's accuracy, we will gather diverse datasets from various domains, balancing them to minimize bias. Key methods such as TF-IDF, stylistic features, and machine learning techniques will be evaluated for their effectiveness in domain-specific data analysis. The process includes training n-gram tokenizers, cross-validation, and simulating a Mixture of Experts model for efficient resource allocation. Interpretability and explainability will be prioritized through visualization tools and user studies. Tailoring the model to specific domains in collaboration with domain experts will ensure optimal performance and reliability.
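Of the methods listed above, TF-IDF weighting is the most self-contained, and its standard form can be sketched in a few lines of plain Python. This version uses the basic `tf * log(N/df)` formulation without smoothing; in practice a library implementation such as scikit-learn's `TfidfVectorizer` (which smooths and normalizes differently) would likely be used instead:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a corpus.

    docs: list of documents, each a list of tokens.
    Returns one {word: weight} dict per document.
    """
    n = len(docs)
    df = Counter()  # document frequency: how many docs contain each word
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            w: (count / len(doc)) * math.log(n / df[w])  # tf * idf
            for w, count in tf.items()
        })
    return weighted
```

Note that a word appearing in every document gets weight zero under this formulation, which is why smoothed variants are common in practice.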