09/11/2024
By Zinan Xiong

The Richard A. Miner School of Computer & Information Sciences invites you to attend a doctoral dissertation defense by Zinan Xiong on "Advancing Healthcare Through Deep Learning: From Disease Recognition to Human Pose Estimation in Imaging and Video Data."

Ph.D. Candidate: Zinan Xiong
Date: Monday, Sept. 23, 2024
Time: 9:30 a.m. EDT
Location: This will be a virtual defense via Zoom.

Committee Members:

  • Yu Cao (advisor), Professor, Miner School of Computer & Information Sciences; Director, UMass Center for Digital Health (CDH)
  • Benyuan Liu (advisor), Professor, Miner School of Computer & Information Sciences; Director, UMass Center for Digital Health (CDH), Computer Networking Lab, CHORDS
  • Hengyong Yu (member), FIEEE, FAAPM, Professor, Department of Electrical & Computer Engineering
  • Yan Luo (member), Professor, Department of Electrical & Computer Engineering

Abstract:

Over the past decade, rapid advances in deep learning have transformed a wide range of industries, from autonomous vehicles to immersive gaming and, critically, healthcare. With their strong computational and adaptive learning capabilities, these algorithms have become indispensable tools, relieving practitioners of laborious, resource-intensive tasks that once demanded substantial human involvement.

In the context of enhancing patient comfort and evaluating physician proficiency, particularly for novice practitioners, it is essential to minimize the duration of the endoscope's passage from the oral cavity to the throat. To address this, Chapter 2 proposes a deep learning-based image classification framework that automatically measures oral-pharyngeal transit time. The framework identifies distinct anatomical landmarks along the endoscope's path, automatically recognizing each segment and computing the time differentials between these points.
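The timing step described above can be sketched in a few lines: given per-frame labels from a classifier, locate the first frame of each anatomical segment and convert frame gaps into elapsed time. This is an illustrative sketch under stated assumptions — the segment names, frame rate, and function names are hypothetical, not those used in the dissertation.

```python
def transit_times(frame_labels, fps):
    """Return the time (in seconds) at which each segment first appears."""
    first_seen = {}
    for idx, label in enumerate(frame_labels):
        if label not in first_seen:
            first_seen[label] = idx / fps  # frame index -> seconds
    return first_seen

# Toy per-frame predictions at 30 fps: 30 oral frames, 45 pharynx, 25 throat.
labels = ["oral"] * 30 + ["pharynx"] * 45 + ["throat"] * 25
times = transit_times(labels, fps=30)

# The oral-pharyngeal transit time is the difference between entry times.
transit = times["pharynx"] - times["oral"]
```

The same differencing extends to any pair of recognized segments, so a single pass over the classifier's output yields all the timing measurements at once.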

The advent of the Transformer model has revolutionized natural language processing (NLP), significantly enhancing its capabilities and paving the way for today's large language models. Adapting the Transformer architecture to computer vision has brought similarly substantial advances. To evaluate the Transformer's performance in healthcare, particularly in human pose estimation, Chapter 3 introduces a model adapted from the Swin Transformer. Our findings indicate that this Transformer-based approach achieves performance competitive with traditional CNN-based models, suggesting promising applications for Transformer architectures in medical image analysis and healthcare diagnostics.

Atrophic gastritis is a common pathology associated with an elevated risk of gastric cancer. However, frame-by-frame image classification often produces discontinuous results when applied to actual video scenarios, complicating diagnosis for healthcare professionals. To address this inconsistency between frames, Chapter 4 introduces the Adapify algorithm, which uses a main model and an auxiliary model to analyze the video content separately, performs a weighted summation of their outputs, and adjusts the final classification results accordingly.
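The weighted-fusion idea above can be illustrated with a minimal sketch: combine the per-frame class probabilities of a main model and an auxiliary model by weighted summation, then take the argmax of the fused vector. The weight value and probability vectors here are illustrative assumptions, not Adapify's actual parameters or outputs.

```python
def fuse(main_probs, aux_probs, w=0.7):
    """Weighted sum of two probability vectors; w is the main model's weight."""
    return [w * m + (1 - w) * a for m, a in zip(main_probs, aux_probs)]

def classify(main_probs, aux_probs, w=0.7):
    """Final class index after fusing the two models' predictions."""
    fused = fuse(main_probs, aux_probs, w)
    return max(range(len(fused)), key=fused.__getitem__)

# A frame where the two models disagree: fusion smooths the decision.
main = [0.45, 0.55]  # main model slightly favors class 1
aux = [0.80, 0.20]   # auxiliary model clearly favors class 0
label = classify(main, aux)  # fused vector [0.555, 0.445] -> class 0
```

Because the fused vector is a convex combination of two probability distributions, it remains a valid distribution, and the auxiliary model can veto low-confidence flips between consecutive frames.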

As we look to the future of medical imaging and diagnostics, we cannot ignore the rapid advancement of large language models such as ChatGPT and LLaMA. Inspired by their success, the computer vision community is intensifying efforts to develop a visual foundation model of comparable prominence in the visual domain. In response to this challenge, Meta recently proposed the Segment Anything Model (SAM), a visual foundation model capable of segmenting objects and scenes in real-world images. Trained on an extremely large dataset, SAM demonstrates strong zero-shot generalization capabilities. Building upon this groundbreaking work, Chapter 5 introduces a novel model based on SAM. Our approach fine-tunes the image encoder using a Masked Autoencoder strategy specifically on medical images. The results are promising, showing strong performance on medical image segmentation tasks and bridging the gap between general-purpose visual AI and specialized medical applications.

In conclusion, this thesis presents a comprehensive exploration of cutting-edge deep learning techniques applied to critical areas of medical imaging and diagnostics. From enhancing endoscopic procedures and improving pose estimation to refining video-based diagnostics and adapting state-of-the-art visual AI for medical use, our work demonstrates the immense potential of these technologies to revolutionize healthcare. By pushing the boundaries of what's possible in medical image analysis, we aim to contribute to a future where AI-assisted healthcare is more accurate, efficient, and accessible, ultimately leading to improved patient outcomes and a transformation in medical practice.