Designer chromosomes are artificially synthesized chromosomes with applications ranging from medical research to the development of biofuels. However, some chromosome fragments interfere with the chemical synthesis of designer chromosomes, which limits the widespread use of this technology. To address this issue, this study developed an interpretable machine learning framework to predict and quantify the synthesis difficulties of designer chromosomes in advance. Using this framework, six key sequence features that lead to synthesis difficulties were identified, and an eXtreme Gradient Boosting (XGBoost) model was established to integrate these features. The predictive model achieved an AUC of 0.895 in cross-validation and an AUC of 0.885 on an independent test set. Based on these results, the synthesis difficulty index (S-index) was proposed as a means of scoring and interpreting the synthesis difficulties of chromosomes from prokaryotes to eukaryotes. The findings highlight the substantial variability in synthesis difficulties between chromosomes and demonstrate the potential of the proposed model to predict and mitigate these difficulties through optimization of the synthesis process and genome rewriting.
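As a reading aid for the reported AUC values, the sketch below shows how an AUC can be computed from predicted scores: it equals the probability that a randomly chosen positive example (here, a hard-to-synthesize fragment) receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the labels and scores are hypothetical illustrations, not the study's data or code.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking identity.

    labels: 1 for a positive (e.g. difficult fragment), 0 for a negative.
    scores: model outputs, where higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
print(auc(labels, scores))
```

This pairwise form is quadratic in the number of examples, which is fine for illustration. Library implementations typically use a sorting-based method that gives the same value.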
Keywords: artificial chromosome; chemical synthesis; machine learning; synthetic biology.
© 2023. Science China Press.