Domain-specific vocabularies, which are crucial in fields such as Information Retrieval and Natural Language Processing, require continuous updates to remain effective. Incremental learning, unlike conventional methods, updates existing knowledge without retraining from scratch. This paper presents an incremental learning algorithm for updating domain-specific vocabularies. It introduces DocLib, an archive that captures a compact footprint of previously seen data and vocabulary terms. Task-based evaluation measures the effectiveness of the updated vocabulary by using its terms in a downstream text-classification task: classification accuracy gauges how well the vocabulary discerns unseen documents related to the domain. Experiments show that multiple incremental updates maintain vocabulary relevance without compromising its effectiveness. Unlike conventional approaches, the proposed algorithm ensures bounded memory and processing requirements. Novel algorithms are introduced to assess the stability and plasticity of the approach, demonstrating its ability to assimilate new knowledge while retaining previously acquired knowledge. The generalizability of the vocabulary is tested across datasets, achieving 97.89% accuracy in identifying domain-related data. A comparison with state-of-the-art techniques on a benchmark dataset confirms the effectiveness of the proposed approach. Importantly, the approach extends beyond classification tasks and could benefit other research fields.
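The abstract does not detail the update mechanism, so the Python sketch below is only a minimal illustration of the bounded-memory idea: a capped archive of unigram and bigram counts that is folded forward with each batch of new documents instead of being rebuilt. The class name VocabularyArchive, the capacity parameter, and the frequency-based pruning rule are assumptions made for illustration; they are not the authors' DocLib implementation.

    from collections import Counter

    class VocabularyArchive:
        """Bounded archive of n-gram counts (illustrative stand-in, not DocLib)."""

        def __init__(self, capacity=10_000):
            self.capacity = capacity   # hard bound on stored terms, keeps memory bounded
            self.counts = Counter()    # compact footprint of previously seen data

        def _ngrams(self, tokens, n):
            # Yield n-gram tuples from a token list.
            return zip(*(tokens[i:] for i in range(n)))

        def update(self, documents):
            """Incrementally fold a new batch into the existing vocabulary."""
            for doc in documents:
                tokens = doc.lower().split()
                for n in (1, 2):  # unigrams and bigrams, matching the paper's keywords
                    self.counts.update(" ".join(g) for g in self._ngrams(tokens, n))
            # Prune to capacity so repeated updates never grow memory unboundedly.
            if len(self.counts) > self.capacity:
                self.counts = Counter(dict(self.counts.most_common(self.capacity)))

        def vocabulary(self):
            return set(self.counts)

    # Usage: two incremental updates, no retraining from scratch.
    archive = VocabularyArchive(capacity=5_000)
    archive.update(["neural retrieval models rank documents"])
    archive.update(["incremental updates refresh the vocabulary"])
    print(len(archive.vocabulary()))

Keeping only the most frequent terms at a fixed capacity is one simple way to trade plasticity (admitting new terms) against stability (retaining established ones); the paper's own stability and plasticity analyses presumably formalize this trade-off.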
Keywords: Bigrams; Incremental learning; Natural language processing; Text analytics; Unigrams; n-grams.