ML-GAP: machine learning-enhanced genomic analysis pipeline using autoencoders and data augmentation

Front Genet. 2024 Sep 25:15:1442759. doi: 10.3389/fgene.2024.1442759. eCollection 2024.

Abstract

Introduction: The advent of RNA sequencing (RNA-Seq) has significantly advanced our understanding of the transcriptomic landscape, revealing intricate gene expression patterns across biological states and conditions. However, the complexity and volume of RNA-Seq data pose challenges in identifying differentially expressed genes (DEGs), critical for understanding the molecular basis of diseases like cancer.

Methods: We introduce a novel Machine Learning-Enhanced Genomic Data Analysis Pipeline (ML-GAP) that incorporates autoencoders and innovative data augmentation strategies, notably the MixUp method, to overcome these challenges. By creating synthetic training examples through a linear combination of input pairs and their labels, MixUp significantly enhances the model's ability to generalize from the training data to unseen examples.
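
The linear combination that MixUp performs can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mixup_pair` and `mixup_batch`, the Beta parameter `alpha=0.2`, and the toy expression values are assumptions made for illustration only. The sketch draws a mixing weight from a Beta distribution, a common choice for MixUp, and blends two samples and their one-hot labels with that weight.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with one mixing weight.

    alpha is a hypothetical default; the abstract does not state a value.
    """
    # Mixing weight lam in [0, 1], drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # The same weight is applied to the features and to the labels.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


def mixup_batch(features, labels, alpha=0.2, seed=0):
    """Create one synthetic example per sample by pairing it with a random partner."""
    rng = random.Random(seed)
    partners = list(range(len(features)))
    rng.shuffle(partners)
    mixed_x, mixed_y = [], []
    for i, j in enumerate(partners):
        x_mix, y_mix, _ = mixup_pair(
            features[i], labels[i], features[j], labels[j], alpha, rng
        )
        mixed_x.append(x_mix)
        mixed_y.append(y_mix)
    return mixed_x, mixed_y


if __name__ == "__main__":
    # Toy expression profiles (3 samples x 4 genes) with one-hot class labels.
    expression = [[5.1, 0.2, 3.3, 1.0], [0.4, 4.8, 2.9, 0.1], [2.2, 2.5, 0.3, 3.7]]
    classes = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    synthetic_x, synthetic_y = mixup_batch(expression, classes)
    for x_row, y_row in zip(synthetic_x, synthetic_y):
        print([round(v, 3) for v in x_row], [round(v, 3) for v in y_row])
```

Because the labels are mixed with the same weight as the inputs, each synthetic label remains a valid probability vector, which is what lets a classifier train on the blended examples directly.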

Results: Our results demonstrate ML-GAP's superiority in accuracy, efficiency, and interpretive insight, with the MixUp method contributing substantially to the pipeline's effectiveness. The pipeline thereby advances genomic data analysis and sets a new standard in the field.

Discussion: This, in turn, suggests that ML-GAP not only has the potential to detect DEGs more accurately but also offers new avenues for therapeutic intervention and research. By integrating explainable artificial intelligence (XAI) techniques, ML-GAP ensures a transparent and interpretable analysis, highlighting the significance of identified genetic markers.

Keywords: RNA-seq; differential expression; feature selection; machine learning; mixup.

Grants and funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported in part by facilities and resources at the VA Providence Healthcare System, the Cardiopulmonary Vascular Biology (CPVB) COBRE core facilities (P20GM103652) and P30GM149398. The views expressed in this article are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the U.S. government.