06/09/2025
By Xiaolong Liang

The Kennedy College of Science, Richard A. Miner School of Computer & Information Sciences, invites you to attend a doctoral dissertation defense by Xiaolong Liang titled, "Advanced Deep Learning Approaches for Real-time Scene Classification, Polyp Detection, and Intestine Localization in Endoscopy Videos."

Ph.D. Candidate: Xiaolong Liang
Date: Thursday, June 19, 2025
Time: 1-2 p.m. ET
Location: This will be a virtual defense via Zoom

Committee Members:

  • Yu Cao (Advisor), Professor, Miner School of Computer & Information Sciences; Director, UMass Center for Digital Health (CDH)
  • Benyuan Liu (Advisor), Professor, Miner School of Computer & Information Sciences; Director, UMass Center for Digital Health (CDH), Computer Networking Lab, and CHORDS
  • Hengyong Yu (Member), Professor, FIEEE, FAAPM, FAIMBE, FAAIA, FAIIA, Department of Electrical and Computer Engineering
  • Honggang Zhang (Member), Professor, Department of Engineering, University of Massachusetts Boston

Abstract:
Deep learning and computer vision have become pivotal technologies in advancing medical applications, particularly in the analysis of endoscopy videos for the early detection and diagnosis of gastrointestinal diseases. Despite significant progress, challenges persist in achieving accurate, real-time scene classification and efficient polyp detection within the diverse and complex visual environments of endoscopic procedures. This thesis aims to address these challenges to enhance the performance of scene classification, polyp detection, and intestine localization in endoscopy videos.

Endoscopy serves as a vital diagnostic tool in medical imaging, particularly for examining the esophagus, stomach, and intestines. This thesis introduces a two-stage system for the automated classification of scene categories (Colonoscopy, Gastroscopy, Extracorporeal, Blur) in endoscopy videos. The first stage applies the Clear-Blur model to determine whether a frame is blurred; if it is not, the second stage applies the Three-Scene model to classify the frame. The classification results are then verified against the video's label. The integrated system achieves 97% average classification accuracy on 197 clinical endoscopy video clips. Additionally, the system incorporates a temporal label accumulation algorithm that reaches over 90% average classification accuracy within 50±15 seconds of the endoscope entering the gastrointestinal tract.
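The two-stage pipeline with temporal label accumulation can be sketched as follows. This is a minimal illustration, not the thesis implementation: the model functions, the frame representation, and the vote threshold are all hypothetical stand-ins for the actual Clear-Blur and Three-Scene CNN classifiers.

```python
from collections import Counter

# Hypothetical stand-ins for the thesis's CNN classifiers.
def clear_blur_model(frame):
    return frame["sharpness"] > 0.5  # True if the frame is non-blurred

def three_scene_model(frame):
    return frame["scene"]  # "Colonoscopy", "Gastroscopy", or "Extracorporeal"

def classify_video(frames, min_votes=30):
    """Two-stage pipeline with temporal label accumulation:
    blurred frames are skipped, per-frame labels are accumulated,
    and a video-level label is emitted once enough votes agree."""
    votes = Counter()
    for frame in frames:
        if not clear_blur_model(frame):       # stage 1: drop blurred frames
            continue
        votes[three_scene_model(frame)] += 1  # stage 2: scene label vote
        label, count = votes.most_common(1)[0]
        if count >= min_votes:                # enough accumulated evidence
            return label
    return votes.most_common(1)[0][0] if votes else "Blur"

frames = [{"sharpness": 0.9, "scene": "Colonoscopy"}] * 40
print(classify_video(frames))  # -> Colonoscopy
```

Accumulating votes rather than trusting single frames is what allows the system to converge on a stable label shortly after the endoscope enters the gastrointestinal tract.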

Colorectal cancer (CRC) poses a significant global health challenge, ranking among the leading causes of cancer-related mortality. Colonoscopy, the most effective means of preventing CRC, is used for the early detection and removal of precancerous growths. However, although many deep learning-based approaches to automatic polyp detection have been proposed, false positive rates during colonoscopy remain high due to the diverse characteristics of polyps and the presence of various artifacts. This research proposes a novel framework incorporating a cross-channel self-attention fusion unit to improve polyp detection accuracy in colonoscopy video frames. This unit plays an important role in refining prediction quality, yielding more precise detections in complex medical imaging scenarios. Thorough experiments and ablation studies assess the performance of the proposed approach; the results demonstrate that the framework, featuring these key technical innovations, significantly reduces false detections and achieves a higher recall rate.
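To make the idea of cross-channel self-attention concrete, here is a minimal NumPy sketch under the assumption that each feature-map channel acts as one attention token and the attended result is fused back with a residual connection; the actual unit in the thesis is a learned module and may differ substantially.

```python
import numpy as np

def cross_channel_attention(feat):
    """Hypothetical sketch of a cross-channel self-attention fusion unit:
    channels attend to one another over flattened spatial positions,
    and the attended features are fused back via a residual add."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)                 # (C, HW): one token per channel
    scores = x @ x.T / np.sqrt(h * w)          # (C, C) channel affinities
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over channels
    fused = attn @ x                           # reweight each channel by the others
    return (x + fused).reshape(c, h, w)        # residual fusion

feat = np.random.rand(8, 16, 16).astype(np.float32)
out = cross_channel_attention(feat)
print(out.shape)  # (8, 16, 16)
```

Letting channels exchange information in this way is one plausible mechanism for suppressing artifact-driven responses that cause false positives.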

Accurate localization of intestinal anatomical sections is crucial for enhancing the efficacy of colonoscopy by enabling precise navigation and improved lesion detection. This study proposes a deep learning-based framework that integrates spatially robust models with a temporal majority-voting strategy applied via a sliding-window mechanism. Experimental results demonstrate that this combined approach significantly improves classification stability and reduces frame-level noise, particularly in morphologically similar anatomical regions. The proposed window-threshold method effectively enhances temporal consistency, leading to substantial gains in segment-level accuracy. Notably, configurations such as DenseNet161 (5-3) and YOLOv8m (3-2) consistently achieved the lowest error rates across the test video.
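The window-threshold majority voting described above can be sketched as follows. This is a simplified illustration, assuming (window, threshold) pairs like the "(5-3)" notation used for the reported configurations; the function name and exact tie-breaking rules are hypothetical.

```python
from collections import Counter, deque

def smooth_labels(frame_labels, window=5, threshold=3):
    """Window-threshold majority voting over a sliding window:
    the stable segment label changes only when a label wins at least
    `threshold` of the last `window` frame-level predictions."""
    recent = deque(maxlen=window)   # sliding window of recent predictions
    stable = None                   # current stable segment label
    out = []
    for label in frame_labels:
        recent.append(label)
        top, count = Counter(recent).most_common(1)[0]
        if count >= threshold:      # majority reached within the window
            stable = top
        out.append(stable if stable is not None else label)
    return out

# A single-frame misclassification ("B") is absorbed by the window.
noisy = ["A"] * 5 + ["B"] + ["A"] * 4
print(smooth_labels(noisy, window=5, threshold=3))
```

Because a lone outlier never reaches the threshold inside the window, frame-level noise is suppressed while genuine segment transitions, which persist across frames, still flip the stable label.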

Polyp detection plays a pivotal role in early colorectal cancer diagnosis, as the accurate identification of polyps during endoscopy is critical for timely intervention. The Modified YOLOv8 Extra Large (YOLOv8xm) model, incorporating weight2 and Centerness, achieved a superior F1-score, demonstrating its robust detection capabilities. Comparative analyses and ablation studies further confirm that the YOLOv8xm model offers competitive performance, rivaling the best-performing VAN-Base model.

Large Vision Models (LVMs) have emerged as a transformative tool in the field of medical imaging. In the context of endoscopy, LVMs are particularly advantageous due to their ability to capture subtle features and detect small anomalies, which are critical for early diagnosis. This study compares the performance of two LVMs, the pre-trained Endo-FM and the self-enriched Endo-Xy, across classification, segmentation, and detection tasks. The results demonstrate that Endo-Xy, augmented with a private dataset, outperforms Endo-FM in segmentation and detection, showing enhanced performance in tasks requiring high spatial precision. While Endo-FM retains better stability for classification tasks, Endo-Xy’s improved performance across a variety of tasks underscores the value of integrating large-scale datasets for more robust and accurate endoscopic image analysis.