Background: Polycystic ovary syndrome (PCOS) is a common endocrine disorder affecting women of reproductive age. It is characterized by hyperandrogenemia, oligo- or anovulation, and polycystic ovarian morphology, and it significantly impairs quality of life. The practical implementation of machine learning (ML) in PCOS diagnosis remains hindered by limitations in data size and algorithmic models. To address this gap, we increased the sample size in our study and used two ML algorithms to identify and validate diagnostic biomarkers and to explore immune cell infiltration patterns in PCOS.
Methods: We performed RNA-seq analysis on granulosa cells, including 13 samples from normal controls and 25 samples from women with PCOS. The data from our study were combined with data from publicly available databases. Batch effects were corrected using the 'sva' package in R. Differential expression analysis was performed to identify genes whose expression differed significantly between the two groups. These differentially expressed genes (DEGs) were further analyzed for enrichment of Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. LASSO and SVM-RFE were each applied to the DEGs for feature selection, and the genes selected by both methods were taken as hub genes. Receiver operating characteristic (ROC) curves were used to assess the accuracy of the SVM and XGBoost models. CIBERSORT analysis was performed to estimate the relative abundances of immune cell populations. GSEA was performed to illustrate the expression patterns of genes within highly enriched functional pathways. RT-qPCR was used to validate the expression of the hub genes.
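The hub-gene selection step (keep only the genes chosen by both LASSO and SVM-RFE) can be sketched as follows. This is a minimal illustration in Python with scikit-learn, not the authors' pipeline, which the abstract describes only at a high level. The synthetic expression matrix, the gene names, the regularization strength `C`, and the number of features kept by RFE are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a DEG expression matrix: 38 samples x 50 "genes".
# (Assumption: real input would be batch-corrected expression values.)
X, y = make_classification(n_samples=38, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)
genes = np.array([f"gene_{i}" for i in range(X.shape[1])])  # hypothetical names
X = StandardScaler().fit_transform(X)

# LASSO-style selection: an L1-penalized logistic regression shrinks most
# coefficients to exactly zero; the genes with nonzero weights are kept.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5,
                           random_state=0).fit(X, y)
lasso_genes = set(genes[lasso.coef_.ravel() != 0])

# SVM-RFE: repeatedly fit a linear SVM and drop the lowest-weighted gene
# until the requested number of genes remains (10 is an arbitrary choice).
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10, step=1).fit(X, y)
svm_genes = set(genes[rfe.support_])

# Hub-gene candidates are the genes selected by both methods.
hub_candidates = sorted(lasso_genes & svm_genes)
print(hub_candidates)
```

Taking the intersection trades sensitivity for robustness: a gene has to survive two selection procedures with different inductive biases before it is carried forward.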
Results: A total of 824 DEGs were identified between the normal control and PCOS groups, including 376 upregulated and 448 downregulated genes. KEGG enrichment analysis associated these DEGs with endocytosis, Salmonella infection, and focal adhesion. By overlapping the results of the LASSO and SVM-RFE algorithms, we identified four hub genes (CNTN2, CASR, CACNB3, MFAP2) that were significantly associated with PCOS. In the validation set, the SVM and XGBoost models yielded AUC values of 0.795 and 0.875, respectively, indicating the potential of these genes as diagnostic biomarkers. Consistent with the bioinformatic analysis, RT-qPCR on human granulosa cells confirmed the upregulation of CNTN2, CASR, CACNB3, and MFAP2 in PCOS. Furthermore, CIBERSORT analysis revealed a significant reduction in CD4 memory resting T cells in the PCOS group compared with the normal control group (P < 0.05).
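For readers unfamiliar with how a validation-set AUC of this kind is obtained, the sketch below fits a linear SVM on a training split and scores a held-out split with scikit-learn. It is an assumption-laden illustration: the data are synthetic, the four features merely stand in for the four hub genes, the 70/30 split is arbitrary, and the XGBoost model is omitted. It does not reproduce the reported values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: 4 features standing in for the expression of 4 hub genes.
X, y = make_classification(n_samples=120, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Hold out a stratified validation set that the model never sees in training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scaling is fit on the training split only, inside the pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

# AUC is computed from continuous decision scores, not hard class labels.
scores = model.decision_function(X_val)
auc = roc_auc_score(y_val, scores)
print(f"validation AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking of cases above controls, and 1.0 to perfect separation.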
Conclusions: This study identified CNTN2, CASR, CACNB3, and MFAP2 as potential diagnostic biomarkers for PCOS, adding evidence to existing research on hub genes. Furthermore, the analysis of immune cell infiltration indicated that CD4 memory resting T cells are involved in the onset and progression of PCOS. These findings shed light on potential mechanisms underlying PCOS pathogenesis and provide insights for future research and therapeutic interventions.
Keywords: Bioinformatics; CIBERSORT; Hub gene; Machine learning; Polycystic ovary syndrome; Predictive models.
© 2024. The Author(s).