Vision Transformers have recently emerged as a competitive architecture in image classification. The tremendous popularity of this model and its variants comes from its high performance and its ability to produce interpretable predictions. However, both of these characteristics remain to be assessed in depth on retinal images. This study proposes a thorough performance evaluation of several Transformers compared to traditional Convolutional Neural Network (CNN) models for retinal disease classification. Special attention is given to multi-modality imaging (fundus and OCT) and generalization to external data. In addition, we propose a novel mechanism to generate interpretable predictions via attribution maps. Existing attribution methods from Transformer models have the disadvantage of producing low-resolution heatmaps. Our contribution, called Focused Attention, uses iterative conditional patch resampling to tackle this issue. By means of a survey involving four retinal specialists, we validated both the superior interpretability of Vision Transformers compared to the attribution maps produced from CNNs and the relevance of Focused Attention as a lesion detector.
Keywords: Attention; Classification; Fundus; Interpretability; OCT; Ophthalmology; Transformers.
Copyright © 2022 Elsevier B.V. All rights reserved.