Aim: This study aimed to detect gene signatures in RNA-sequencing (RNA-seq) data using Pareto-optimal cluster size identification.
Background: RNA-seq has emerged as an important technology for transcriptome profiling in recent years. Gene expression signatures comprising tens of genes have been shown to predict disease type and patient response to treatment.
Methods: A liver cancer RNA-seq dataset comprising 35 paired hepatocellular carcinoma (HCC) and non-tumor tissue samples was used in this study. Differentially expressed genes (DEGs) were identified after pre-filtering and normalization. A multi-objective optimization technique, multi-objective optimization for collecting cluster alternatives (MOCCA), was then used to determine the Pareto-optimal cluster size for these DEGs, and k-means clustering was performed on the RNA-seq data. The best cluster, taken as a signature for the disease, was selected by calculating the average pairwise Spearman's correlation score across all genes in each cluster (module). All analyses were performed in R 4.1.1 in a virtual computing environment with 100 GB of RAM.
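The cluster-selection step described in the Methods can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in R as used in the study, and it assumes that cluster labels have already been produced by k-means. The function names, the toy expression values, and the choice to average signed (rather than absolute) correlations are assumptions made for illustration and are not taken from the study.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(profiles):
    """Average Spearman's correlation over all gene pairs in one cluster."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0  # assumption: a single-gene cluster gets a score of 0
    return mean(spearman(a, b) for a, b in pairs)


def rank_clusters(expression, labels):
    """Return (cluster label, score) pairs sorted from highest to lowest score."""
    clusters = {}
    for gene, label in labels.items():
        clusters.setdefault(label, []).append(expression[gene])
    scores = {label: mean_pairwise_spearman(p) for label, p in clusters.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: five genes measured across five samples.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.0, 4.0, 6.0, 8.5, 9.0],
    "gene_c": [0.5, 0.9, 1.7, 2.0, 3.1],
    "gene_d": [5.0, 1.0, 4.0, 2.0, 3.0],
    "gene_e": [1.0, 5.0, 2.0, 4.0, 3.0],
}
# Hypothetical cluster assignments, standing in for k-means output.
labels = {"gene_a": 0, "gene_b": 0, "gene_c": 0, "gene_d": 1, "gene_e": 1}

ranking = rank_clusters(expression, labels)
print(ranking)
```

In this toy input, the three genes in cluster 0 rise together across samples, so that cluster ranks first; the two genes in cluster 1 move in opposite directions and score lowest. Whether negatively correlated genes should count against a cluster, as they do here, depends on how the signature is defined.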
Results: Using MOCCA, eight Pareto-optimal clusters were obtained. Ultimately, the two clusters with the highest average Spearman's correlation coefficients were chosen as gene signatures. Eleven prognostic genes involved in the abnormal metabolism of HCC were identified. In addition, three differentially expressed pathways were identified between tumor and non-tumor tissues.
Conclusion: The identified metabolic prognostic genes can help provide more powerful prognostic information and improve survival prediction for HCC patients. In addition, Pareto-optimal cluster size identification is suggested for gene signature detection in other RNA-seq datasets.
Keywords: Clustering; Gene expression signature; Hepatocellular carcinoma; RNA-Seq.