Purpose: One of the recent advances in surgical AI is the recognition of surgical activities as triplets of ⟨instrument, verb, target⟩. Although such triplets provide detailed information for computer-assisted intervention, current triplet recognition approaches rely only on single-frame features. Exploiting temporal cues from earlier frames would improve the recognition of surgical action triplets from videos.
Methods: In this paper, we propose Rendezvous in Time (RiT), a deep learning model that extends the state-of-the-art model, Rendezvous, with temporal modeling. Focusing primarily on the verbs, our RiT exploits the connectedness of current and past frames to learn temporal attention-based features for enhanced triplet recognition.
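As a purely illustrative sketch of the idea of attention-weighted temporal fusion described above (not the paper's actual RiT architecture), the following hypothetical snippet fuses per-frame feature vectors by attending from the current frame to the past frames:

```python
import numpy as np

def temporal_attention_fusion(frame_feats):
    """Fuse per-frame features via attention, using the current
    (most recent) frame as the query.

    frame_feats: (T, D) array of T frame feature vectors (hypothetical).
    Returns a single (D,) fused feature vector.
    """
    query = frame_feats[-1]                        # current-frame feature as query
    scores = frame_feats @ query                   # similarity of each frame to the query
    scores = scores / np.sqrt(frame_feats.shape[1])  # scale by sqrt(D)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()              # softmax over the time axis
    return weights @ frame_feats                   # attention-weighted temporal average

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))                # e.g. 4 frames, 8-dim features
fused = temporal_attention_fusion(feats)           # (8,) fused feature
```

The function names and dimensions here are assumptions for illustration only; the actual model learns such attention jointly with the triplet classifier.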
Results: We validate our proposal on the challenging surgical triplet dataset, CholecT45, demonstrating improved recognition of the verb and the full triplet, along with other interactions involving the verb, such as ⟨instrument, verb⟩. Qualitative results show that RiT produces smoother predictions for most triplet instances than the state-of-the-art methods.
Conclusion: We present a novel attention-based approach that leverages the temporal fusion of video frames to model the evolution of surgical actions and exploit their benefits for surgical triplet recognition.
Keywords: Action triplet; Attention model; Laparoscopic surgery; Surgical triplet recognition; Temporal modeling.
© 2023. CARS.