Improved self-training-based distant label denoising method for cybersecurity entity extractions

PLoS One. 2024 Dec 17;19(12):e0315479. doi: 10.1371/journal.pone.0315479. eCollection 2024.

Abstract

Named entity recognition (NER) plays a crucial role in extracting cybersecurity-related information. Existing approaches to cybersecurity entity extraction rely predominantly on manually labelled data, which is labour-intensive to produce because no cybersecurity-specific corpus is available. In this paper, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction. First, we create two cybersecurity domain dictionaries. Second, we propose an algorithm that combines reverse maximum matching with part-of-speech tagging restrictions to generate distant labels for the cybersecurity domain corpus. Finally, we propose a high-confidence text selection method and an improved self-training algorithm that incorporates a teacher-student model and weight update constraints. A model trained on high-confidence text is used to explore the true labels of low-confidence text, thereby reducing the noise in the distantly annotated data. Experimental results demonstrate that the resulting distantly labelled cybersecurity data is of high quality. Additionally, the proposed constrained self-training algorithm effectively improves the F1 score of several state-of-the-art NER models on this dataset, yielding a 3.5% improvement for the Vendor class and a 3.35% improvement for the Product class.
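
As an illustration of the distant-labelling step, the following is a minimal sketch of dictionary-based reverse maximum matching with a part-of-speech restriction. It is not the authors' implementation. The function name, the BIO tag scheme, the `blocked_pos` set, the `max_len` window, and the toy dictionary are all assumptions made for this example. The abstract does not specify the exact form of the restriction, so the sketch simply rejects a dictionary match when any token in the span carries a blocked tag.

```python
def reverse_max_match(tokens, dictionary, pos_tags=None,
                      blocked_pos=frozenset({"VERB", "ADV"}), max_len=4):
    """Assign BIO labels by scanning right to left and preferring the
    longest dictionary phrase that ends at the current position.

    dictionary maps a lowercased, space-joined phrase to an entity type.
    pos_tags, if given, holds one tag per token; a match is rejected when
    any token in the span has a tag in blocked_pos (assumed rule).
    """
    labels = ["O"] * len(tokens)
    end = len(tokens)
    while end > 0:
        matched = False
        # Try the longest window ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            start = end - size
            phrase = " ".join(tokens[start:end]).lower()
            entity_type = dictionary.get(phrase)
            if entity_type is None:
                continue
            # Part-of-speech restriction (hypothetical form, see lead-in).
            if pos_tags is not None and any(
                tag in blocked_pos for tag in pos_tags[start:end]
            ):
                continue
            labels[start] = "B-" + entity_type
            for i in range(start + 1, end):
                labels[i] = "I-" + entity_type
            end = start  # continue scanning to the left of the match
            matched = True
            break
        if not matched:
            end -= 1  # no phrase ends here; move one token to the left
    return labels


if __name__ == "__main__":
    # Toy dictionary and sentence, invented for illustration only.
    toy_dictionary = {
        "google": "Vendor",
        "google chrome": "Product",
        "chrome": "Product",
        "access": "Product",
    }
    sentence = ["Attackers", "access", "Google", "Chrome", "remotely"]
    tags = ["NOUN", "VERB", "PROPN", "PROPN", "ADV"]
    print(reverse_max_match(sentence, toy_dictionary, pos_tags=tags))
```

In this toy sentence the longest match "Google Chrome" is preferred over the single-token entries, and the verb "access" is left unlabelled even though the dictionary contains a product of that name. Without the tag list, the same call would label "access" as a product, which is the kind of distant-label noise a part-of-speech restriction is meant to reduce.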

MeSH terms

  • Algorithms*
  • Computer Security*
  • Data Mining / methods
  • Humans

Grants and funding

This work was supported in part by the Joint Innovation Fund of Sichuan University and the Nuclear Power Institute of China (Grant No. HG2022143), as well as the Sichuan Province Science and Technology Plan Key Research and Development Project (Grant No. 2023YFG0294).