COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter

Martin Müller; Marcel Salathé; Per E Kummervold

doi:10.3389/frai.2023.1023281

Front Artif Intell. 2023 Mar 14:6:1023281. doi: 10.3389/frai.2023.1023281. eCollection 2023.

Authors

Martin Müller¹, Marcel Salathé¹, Per E Kummervold²

Affiliations

¹ Digital Epidemiology Lab, EPFL, Geneva, Switzerland.
² FISABIO-Public Health, Vaccine Research Department, Valencia, Spain.

Abstract

Introduction: This study presents COVID-Twitter-BERT (CT-BERT), a transformer-based model that is pre-trained on a large corpus of COVID-19 related Twitter messages. CT-BERT is specifically designed to be used on COVID-19 content, particularly from social media, and can be utilized for various natural language processing tasks such as classification, question-answering, and chatbots. This paper aims to evaluate the performance of CT-BERT on different classification datasets and compare it with BERT-LARGE, its base model.

Methods: CT-BERT is pre-trained on a large corpus of COVID-19 related Twitter messages. The authors evaluate its performance on five classification datasets, one of which is in the target domain, and compare it with its base model, BERT-LARGE, to measure the marginal improvement. They also provide detailed information on the training process and the technical specifications of the model.

Results: The results indicate that CT-BERT outperforms BERT-LARGE with a marginal improvement of 10-30% on all five classification datasets. The largest improvements are observed in the target domain. The authors provide detailed performance metrics and discuss the significance of these results.
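The abstract does not define "marginal improvement". One plausible reading, stated here as an assumption rather than as the authors' formula, is the share of the baseline's remaining headroom (1 minus its score) that the new model closes. The sketch below illustrates that reading; the function name and the scores are hypothetical and are not taken from the paper.

```python
def marginal_improvement(baseline_score: float, new_score: float) -> float:
    """Share of the baseline's remaining headroom (1 - score) closed by the new model.

    ASSUMPTION: this definition is one reading of "marginal improvement";
    the abstract itself does not state the formula.
    """
    if not 0.0 <= baseline_score < 1.0:
        raise ValueError("baseline_score must lie in [0, 1)")
    return (new_score - baseline_score) / (1.0 - baseline_score)


# Hypothetical scores, for illustration only (not results from the paper).
baseline_f1 = 0.80  # e.g. a base model such as BERT-LARGE
improved_f1 = 0.85  # e.g. a domain-specific model such as CT-BERT

print(f"{marginal_improvement(baseline_f1, improved_f1):.0%}")
```

Under this reading, a 5-point absolute gain over a baseline of 0.80 counts as a 25% marginal improvement, because a quarter of the remaining error is removed. The same absolute gain over a weaker baseline would count for less.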

Discussion: The study demonstrates the potential of pre-trained transformer models, such as CT-BERT, for COVID-19 related natural language processing tasks. The results indicate that CT-BERT can improve the classification performance on COVID-19 related content, especially on social media. These findings have important implications for various applications, such as monitoring public sentiment and developing chatbots to provide COVID-19 related information. The study also highlights the importance of using domain-specific pre-trained models for specific natural language processing tasks. Overall, this work provides a valuable contribution to the development of COVID-19 related NLP models.

Keywords: BERT; COVID-19; language model (LM); natural language processing (NLP); text classification.

Grants and funding

PK received funding from the European Commission for the call H2020-MSCA-IF-2017 and the funding scheme MSCA-IF-EF-ST for the VACMA project (grant agreement ID: 797876). MM and MS received funding through the Versatile Emerging infectious disease Observatory grant as a part of the European Commission's Horizon 2020 framework programme (grant agreement ID: 874735). The research was supported with Cloud TPUs from Google's TensorFlow Research Cloud and Google Cloud credits in the context of COVID-19-related research.