Machine Learning Using Template-Based-Predicted Structure of Haemagglutinin Predicts Pathogenicity of Avian Influenza

J Microbiol Biotechnol. 2024 Oct 28;34(10):2033-2040. doi: 10.4014/jmb.2405.05022. Epub 2024 Aug 6.

Abstract

Deep learning is a promising approach to complex biological classification, provided that well-curated datasets are available. This study addresses the challenge of analyzing three-dimensional protein structures by introducing a novel pipeline that uses open-source tools to convert protein structures into a format amenable to computational analysis. We applied a two-dimensional convolutional neural network (CNN) to a dataset of 12,143 avian influenza virus genomes from 64 countries, encompassing 119 haemagglutinin (HA) and neuraminidase (NA) types. Pathogenicity was assigned based on the presence of the H5 or H7 subtype. Among models with zero to six mid-layers, the four-layer model identified highly pathogenic strains most effectively, with accuracies over 0.9. To strengthen the approach, we added Principal Component Analysis (PCA) for dimensionality reduction and a one-class SVM for abnormality detection, and used bootstrapping to improve model robustness. The K-nearest neighbor (K-NN) algorithm, tuned via hyperparameter optimization, was used to corroborate the findings. PCA showed distinct clustering for pathogenic HA, yielding an AUC of up to 0.85, and the optimized K-NN model reached an accuracy between 0.96 and 0.97. Together, these methods show that the framework can identify pathogenic avian influenza strains rapidly and accurately, providing a tool for managing global avian influenza threats.
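The labelling rule and the K-NN tuning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The function names, the toy two-dimensional feature vectors, the candidate values of k, and the use of leave-one-out cross-validation as the hyperparameter search are all assumptions made for illustration. The abstract does not state which hyperparameters were tuned or how.

```python
import math
import re
from collections import Counter


def label_from_subtype(subtype):
    """Return 1 if the HA type is H5 or H7 (the abstract's rule), else 0."""
    match = re.fullmatch(r"H(\d+)N(\d+)", subtype.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised subtype: {subtype!r}")
    return 1 if match.group(1) in {"5", "7"} else 0


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(xs, ys, k):
    """Leave-one-out accuracy of k-NN for one value of k."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)


def tune_k(xs, ys, candidates=(1, 3, 5)):
    """Grid search over k; returns the best k and the score for every candidate."""
    scores = {k: loo_accuracy(xs, ys, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical toy records: (subtype, made-up 2-D structural feature vector).
records = [
    ("H5N1", (0.9, 1.1)), ("H5N8", (1.0, 0.8)),
    ("H7N9", (1.2, 1.0)), ("H7N3", (0.8, 0.9)),
    ("H9N2", (-1.0, -0.9)), ("H3N8", (-1.1, -1.2)),
    ("H1N1", (-0.8, -1.0)), ("H6N2", (-0.9, -1.1)),
]
features = [vec for _, vec in records]
labels = [label_from_subtype(name) for name, _ in records]

best_k, scores = tune_k(features, labels)
print("best k:", best_k, "scores:", scores)
```

In a real setting the feature vectors would come from the structure-derived representation rather than hand-picked points, and the toy clusters here are deliberately well separated, so the scores say nothing about the accuracies reported in the study.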

Keywords: Convolutional neural network; abnormality detection; avian influenza; haemagglutinin; machine learning; principal component analysis.

MeSH terms

  • Algorithms
  • Animals
  • Birds* / virology
  • Computational Biology / methods
  • Deep Learning
  • Genome, Viral / genetics
  • Hemagglutinin Glycoproteins, Influenza Virus* / genetics
  • Hemagglutinins / genetics
  • Influenza A virus* / genetics
  • Influenza A virus* / pathogenicity
  • Influenza in Birds* / virology
  • Machine Learning*
  • Neural Networks, Computer
  • Neuraminidase* / genetics
  • Principal Component Analysis

Substances

  • Hemagglutinin Glycoproteins, Influenza Virus
  • Neuraminidase
  • Hemagglutinins