Parallel sequence tagging for concept recognition

Lenz Furrer; Joseph Cornelius; Fabio Rinaldi

doi:10.1186/s12859-021-04511-y

BMC Bioinformatics. 2022 Mar 24;22(Suppl 1):623. doi: 10.1186/s12859-021-04511-y.

Authors

Lenz Furrer^{1,2}, Joseph Cornelius^{3,2}, Fabio Rinaldi^{4,5,6,7}

Affiliations

¹ Department of Computational Linguistics, University of Zurich, Zurich, Switzerland.
² Swiss Institute of Bioinformatics, Zurich, Switzerland.
³ Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI/SUPSI), Lugano, Switzerland.
⁴ Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI/SUPSI), Lugano, Switzerland. fabio.rinaldi@idsia.ch.
⁵ Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland. fabio.rinaldi@idsia.ch.
⁶ Swiss Institute of Bioinformatics, Zurich, Switzerland. fabio.rinaldi@idsia.ch.
⁷ Fondazione Bruno Kessler, Trento, Italy. fabio.rinaldi@idsia.ch.

Abstract

Background: Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined in a serial way, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture, where both NER and NEN are modeled as a sequence-labeling task, operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence.

Results: We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task, a competition of the BioNLP Open Shared Tasks 2019. We further refine the systems from the shared task by optimising the harmonisation strategy separately for each annotation set.

Conclusions: Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This calibration makes it possible to achieve a good trade-off between established knowledge (training set) and novel information (unseen concepts).
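
The parallel set-up and the per-annotation-set calibration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the strategy names (`ids-only`, `require-both`, `either`), the `UNKNOWN` placeholder, the token-level F1 used for selection, and the concept IDs `C1` and `C2` are all assumptions made for illustration. The sketch only assumes that a span tagger (NER) and a concept-ID tagger (NEN) each emit one label per token of the same text.

```python
from typing import List, Sequence, Tuple

OUTSIDE = "O"        # label for tokens outside any concept mention
UNKNOWN = "UNKNOWN"  # hypothetical placeholder: span detected, no ID predicted

# Hypothetical strategy names, not taken from the paper.
STRATEGIES = ("ids-only", "require-both", "either")


def harmonise(span_tags: Sequence[str], id_tags: Sequence[str],
              strategy: str) -> List[str]:
    """Merge token-level NER (span) and NEN (ID) predictions into one sequence."""
    if len(span_tags) != len(id_tags):
        raise ValueError("both taggers must label the same tokens")
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    merged = []
    for span, cid in zip(span_tags, id_tags):
        in_span = span != OUTSIDE
        has_id = cid != OUTSIDE
        if strategy == "ids-only":
            # Trust the ID tagger alone.
            merged.append(cid)
        elif strategy == "require-both":
            # Keep an ID only where the span tagger agrees there is a mention.
            merged.append(cid if in_span and has_id else OUTSIDE)
        else:  # "either"
            # Accept a mention if either tagger predicts one.
            merged.append(cid if has_id else UNKNOWN if in_span else OUTSIDE)
    return merged


def token_f1(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Token-level F1 over concept labels (a simplification for this sketch)."""
    tp = sum(p == g != OUTSIDE for p, g in zip(pred, gold))
    fp = sum(p != OUTSIDE and p != g for p, g in zip(pred, gold))
    fn = sum(g != OUTSIDE and p != g for p, g in zip(pred, gold))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


Example = Tuple[Sequence[str], Sequence[str], Sequence[str]]  # spans, ids, gold


def calibrate(dev_set: Sequence[Example]) -> str:
    """Pick the strategy with the best pooled F1 on one annotation set's dev data."""
    def score(strategy: str) -> float:
        pred, gold = [], []
        for span_tags, id_tags, gold_tags in dev_set:
            pred.extend(harmonise(span_tags, id_tags, strategy))
            gold.extend(gold_tags)
        return token_f1(pred, gold)
    return max(STRATEGIES, key=score)


# Toy development example with placeholder concept IDs.
spans = ["O", "B", "I", "O", "B"]
ids = ["O", "C1", "C1", "C2", "O"]
gold = ["O", "C1", "C1", "O", "O"]
best = calibrate([(spans, ids, gold)])
```

Running `calibrate` once per annotation set mirrors the idea of tuning the harmonisation separately for each set; a real system would score complete annotations rather than single tokens.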

Keywords: Concept recognition; Named entity recognition and normalization; Neural network; Sequence tagging; Text mining.

MeSH terms

Data Mining*
