07/01/2025
By Zhang Zhang
The Kennedy College of Science, Richard A. Miner School of Computer & Information Sciences, invites you to attend a doctoral dissertation defense by Zhang Zhang titled, "Enhancing Endoscopic Lesion Detection with Lightweight Transformers, Multi-Scale Attention, and Vision-Language Models."
Candidate Name: Zhang Zhang
Date: Thursday, July 17th, 2025
Time: 10 a.m. to noon EDT
Location: This will be a virtual defense via Zoom.
Committee Members:
- Yu Cao (Advisor), Professor, Director, Miner School of Computer & Information Sciences, UMass Center for Digital Health (CDH)
- Benyuan Liu (Advisor), Professor, Director, Miner School of Computer & Information Sciences, UMass Center for Digital Health (CDH), Computer Networking Lab, CHORDS
- Hengyong Yu (Member), Professor, FIEEE, FAAPM, FAIMBE, FAAIA, FAIIA, Department of Electrical and Computer Engineering
- Ming Shao (Member), Associate Professor, Department of Engineering, University of Massachusetts Boston
Abstract:
Gastrointestinal (GI) endoscopy is crucial for diagnosing digestive tract disorders, but its efficacy is often limited by human factors. This dissertation investigates advanced artificial intelligence (AI) techniques to enhance endoscopic image analysis.
First, we propose an enhanced lesion-detection framework that integrates a lightweight transformer head and a novel multi-level and multi-scale attention (MLMSA) module into the YOLOX architecture, demonstrating superior detection accuracy and computational efficiency on both gastroscopy and colonoscopy datasets. We also introduce a new large-scale gastroscopy dataset covering erosions and ulcers.
Second, we further explore attention mechanisms by developing the MLMSA module as a dedicated neck component for deep learning object-detection networks. The module is designed to handle the complexities of GI abnormalities, including multiple co-occurring lesion types and subtle early-stage cancers, by effectively fusing and adaptively weighting multi-level and multi-scale features. Experiments on a comprehensive custom gastroscopy dataset spanning five distinct lesion categories, including early-stage cancer, show that integrating MLMSA into YOLOv7 and YOLOv8 significantly improves mean Average Precision (mAP@0.5).
Finally, we introduce Endo-RVL, a novel framework leveraging Reinforcement Learning with Verifiable Rewards (RLVR) to train large vision-language models (VLMs) for nuanced endoscopic image interpretation. Using a strategically constructed dataset and the Group Relative Policy Optimization (GRPO) strategy with the Qwen2.5-VL-3B model, Endo-RVL demonstrates competitive performance on domain-specific tasks, particularly referring expression comprehension, while maintaining general multimodal reasoning capabilities. This multi-faceted research aims to bridge the gap between advanced AI research and its impactful application in clinical endoscopic practice, paving the way for more reliable, interpretable, and efficient diagnostic tools.