08/13/2025
By Dipika Boro
The Richard A. Miner School of Computer & Information Sciences in the Kennedy College of Sciences invites you to attend a doctoral proposal defense by Dipika Boro entitled: "Advancing Endoscopic Polyp Detection, Segmentation and Visualization with Pretrained Deep Architectures and Self-Supervised Representation Learning."
Candidate Name: Dipika Boro
Date: Thursday, Aug. 21, 2025
Time: 11 a.m. - noon ET
Location: This will be a virtual defense via Zoom.
Committee Members:
- Yu Cao (Advisor), Professor, Director, Miner School of Computer & Information Sciences, UMass Center for Digital Health (CDH)
- Benyuan Liu (Advisor), Professor, Director, Miner School of Computer & Information Sciences, UMass Center for Digital Health (CDH), Computer Networking Lab, CHORDS
- Hengyong Yu (Member), FIEEE, FAAPM, Professor, Department of Electrical & Computer Engineering
- Qilei Chen (Member), Research Scientist, Miner School of Computer & Information Sciences
Abstract:
Colorectal cancer is the third most common cancer worldwide, accounting for roughly 10% of all cancer cases and ranking as the second leading cause of cancer-related deaths. Routine endoscopic screening can prevent the progression of benign lesions into malignancies, while early detection of cancerous lesions greatly improves treatment outcomes and patient survival. The video data generated during endoscopic procedures represent a vast resource for training modern deep learning models. However, annotating this data for tasks such as polyp boundary delineation or diagnostic assessment requires the expertise of trained medical professionals. Unlike natural image domains, where large-scale crowd-sourced labeling is feasible, medical imaging demands specialized clinical knowledge, making large, high-quality labeled datasets rare. Moreover, strict privacy regulations further exacerbate data scarcity, limiting the development of robust medical image analysis models. The annotation process also adds to the workload of physicians and healthcare staff, who are already burdened by their routine clinical responsibilities.
Traditionally, this challenge has been addressed by adapting models pretrained on annotated datasets such as ImageNet or labeled medical data via transfer learning. More recently, advances in self-supervised learning (SSL) within computer vision have driven its growing adoption in medical imaging. Another persistent issue is that, despite strong research results, the clinical adoption of such techniques remains minimal due to limited interpretability. The scarcity of labeled data, coupled with the slow translation of AI research into clinical practice, continues to limit the development and deployment of robust medical image analysis systems. In this thesis proposal, we explore approaches that address both the scarcity of labeled data and the gap between research and real-world clinical use, with a focus on the endoscopic imaging domain.
First, we present an exploratory study of cross-domain transfer learning for polyp segmentation, aiming to identify effective starting points when only limited labeled endoscopic data are available. We conduct a comprehensive set of experiments comparing models pretrained on large-scale natural image datasets with those pretrained on diverse medical imaging modalities, including CT, MRI, and histopathology. We evaluate both convolutional and transformer-based architectures, specifically ResNet-50 and ViT-Small. These backbones are coupled with a DeepLabV3+ decoder for semantic polyp segmentation. Experiments are performed on three public datasets: CVC-ClinicDB, Kvasir-SEG, and SUN-SEG. Across all settings, ImageNet-pretrained models consistently outperform those pretrained on medical datasets. These results indicate that medical-domain pretraining is not universally advantageous and highlight the importance of modality alignment when selecting pretrained models for polyp segmentation and other medical imaging tasks.
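To make the comparison concrete, the following is a minimal sketch of such a fine-tuning setup, assuming the segmentation_models_pytorch library; the medical-domain checkpoint file named below is a hypothetical placeholder, not the proposal's actual training code.

```python
# Sketch: fine-tuning an ImageNet-pretrained ResNet-50 encoder with a
# DeepLabV3+ decoder for binary polyp segmentation.
import torch
import segmentation_models_pytorch as smp

model = smp.DeepLabV3Plus(
    encoder_name="resnet50",
    encoder_weights="imagenet",   # the natural-image starting point
    in_channels=3,
    classes=1,                    # single-channel polyp mask
)

# To compare against a medical-domain starting point, one would instead load
# an encoder checkpoint pretrained on e.g. CT/MRI/histopathology:
# state = torch.load("medical_pretrained_resnet50.pth")  # hypothetical file
# model.encoder.load_state_dict(state, strict=False)

loss_fn = smp.losses.DiceLoss(mode="binary")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images, masks):
    """One fine-tuning step on a batch of endoscopic frames and binary masks."""
    optimizer.zero_grad()
    logits = model(images)        # (B, 1, H, W)
    loss = loss_fn(logits, masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```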
Then, to address the scarcity of labeled endoscopic data, we leverage self-supervised learning (SSL) and introduce EndoMAE, a foundation model for endoscopic image analysis. We construct a dataset of over 6.5 million high-quality unlabeled frames extracted from clinical endoscopy videos and combine it with 3.5 million publicly available endoscopic images, resulting in more than 10 million frames for pretraining EndoMAE. To our knowledge, this is the largest dataset to date used for pretraining an endoscopic foundation model. EndoMAE follows the Masked Autoencoder (MAE) framework, employing masked image modeling to reconstruct missing patches of input endoscopic images, enabling the network to learn rich, domain-specific representations without manual labels or prompt engineering. We evaluate EndoMAE on fully disjoint endoscopic benchmarks for downstream image classification and polyp segmentation tasks. Our model demonstrates strong generalization to unseen data and outperforms both supervised and self-supervised baselines. These results highlight the potential of SSL-based pretraining to create robust foundation models for endoscopic image analysis, reducing reliance on extensive manual annotation in downstream clinical tasks.
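For illustration, the sketch below shows the masked-image-modeling objective that the MAE framework (and hence EndoMAE) builds on: random patch masking followed by a reconstruction loss computed only on the masked patches. The 75% mask ratio is a common MAE default assumed here, not necessarily the proposal's exact configuration.

```python
# Sketch: MAE-style random masking and masked-patch reconstruction loss.
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patches; return kept patches and the binary mask.

    patches: (B, N, D) sequence of flattened image patches.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)              # random permutation per image
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)                 # 0 = visible, 1 = masked
    return kept, mask

def mae_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor):
    """Mean-squared reconstruction error, averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum()
```

The encoder sees only the kept patches, which is what makes pretraining on millions of unlabeled frames computationally feasible.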
Finally, we plan to address the challenge of translating research advances into clinically relevant tools by extending our work from two-dimensional segmentation to three-dimensional visualization. In this final project, we propose a multistage framework that sequentially leverages state-of-the-art deep learning models to progress from 2D polyp segmentation to detailed 3D reconstruction of polyp surfaces in endoscopic images. The pipeline integrates a YOLO-based polyp detector, SAM2 for segmentation via bounding-box prompting, MiDaS for depth estimation, and a final reconstruction stage that fuses depth and mask data to produce realistic three-dimensional visualizations. Such reconstructions can help physicians better assess polyp morphology, plan resections, and communicate findings to patients. By augmenting conventional polyp detection with spatial depth information and intuitive 3D views, this work aims to bridge the gap between detection accuracy and clinical interpretability, providing richer spatial context to support future integration into endoscopic workflows. This is a critical step toward the next generation of computer-aided endoscopic systems.
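As an illustration of the final fusion stage, the sketch below back-projects a depth map through a simple pinhole camera model, restricted to the predicted polyp mask, to form a 3D point cloud of the polyp surface. The camera intrinsics are hypothetical, and the YOLO, SAM2, and MiDaS stages are assumed to have already produced the mask and depth inputs; the proposal's actual reconstruction method may differ.

```python
# Sketch: fusing a depth map and a polyp mask into a 3D point cloud.
import numpy as np

def polyp_point_cloud(depth: np.ndarray, mask: np.ndarray,
                      fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Return an (M, 3) array of 3D points for pixels inside the polyp mask.

    depth: (H, W) per-pixel depth (e.g. MiDaS output rescaled to usable units).
    mask:  (H, W) boolean polyp mask (e.g. from SAM2 with a YOLO box prompt).
    fx, fy, cx, cy: assumed pinhole camera intrinsics.
    """
    v, u = np.nonzero(mask)          # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx            # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```

The resulting point cloud can then be meshed or rendered to give clinicians an intuitive 3D view of polyp morphology.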