Dual-modality visual feature flow for medical report generation

Med Image Anal. 2024 Dec 1:101:103413. doi: 10.1016/j.media.2024.103413. Online ahead of print.

Abstract

Medical report generation is a cross-modal task that generates medical text, aiming to provide professional descriptions of medical images in clinical language. Although existing methods have made progress, they still have limitations, including insufficient focus on lesion areas, omission of internal edge features, and difficulty in aligning cross-modal data. To address these issues, we propose Dual-Modality Visual Feature Flow (DMVF) for medical report generation. Firstly, we introduce region-level features alongside grid-level features to enhance the method's ability to identify lesions and key areas. Then, we enhance each of the two feature flows according to its attributes to prevent the loss of key information. Finally, we align the visual mappings from the different visual features with the report's textual embeddings through a feature fusion module to perform cross-modal learning. Extensive experiments on four benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in both natural language generation and clinical efficacy metrics.
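The pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the shapes, projection matrices, and the simple mean-pool fusion and cosine alignment are all hypothetical stand-ins for the paper's feature-flow enhancement and fusion modules. It only shows the overall data flow: grid-level and region-level visual features are projected into a shared space, fused, and scored against a report text embedding for cross-modal alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 49 grid cells (a 7x7 feature map) and 5 detected
# lesion regions, each a 256-dim visual descriptor; shared space is 128-dim.
d_vis, d_emb = 256, 128
grid_feats = rng.standard_normal((49, d_vis))    # grid-level feature flow
region_feats = rng.standard_normal((5, d_vis))   # region-level feature flow

# Separate learned projections for each flow (random here, for illustration).
w_grid = rng.standard_normal((d_vis, d_emb)) / np.sqrt(d_vis)
w_region = rng.standard_normal((d_vis, d_emb)) / np.sqrt(d_vis)

# Project each flow into the shared space, pool, and fuse by averaging
# (a stand-in for the paper's feature fusion module).
grid_emb = (grid_feats @ w_grid).mean(axis=0)
region_emb = (region_feats @ w_region).mean(axis=0)
fused_visual = (grid_emb + region_emb) / 2.0

# Report text embedding (e.g. mean of token embeddings), same shared space.
text_emb = rng.standard_normal(d_emb)

def cosine(a, b):
    """Cosine similarity, a toy cross-modal alignment score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cosine(fused_visual, text_emb)
print(fused_visual.shape, -1.0 <= score <= 1.0)
```

In a real system the fusion and alignment would be trained end to end (e.g. with a contrastive or generation loss), but the sketch captures the dual-flow structure the abstract describes.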

Keywords: Feature fusion; Medical report generation; Multi-modal learning; Region feature.