Deep Learning Benchmarks on L1000 Gene Expression Data

Matthew B A McDermott; Jennifer Wang; Wen-Ning Zhao; Steven D Sheridan; Peter Szolovits; Isaac Kohane; Stephen J Haggarty; Roy H Perlis

doi:10.1109/TCBB.2019.2910061

IEEE/ACM Trans Comput Biol Bioinform. 2020 Nov-Dec;17(6):1846-1857. doi: 10.1109/TCBB.2019.2910061. Epub 2020 Dec 8.

Abstract

Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well-characterized baselines. In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K nearest neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolutional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant on large datasets, whereas FF-ANNs consistently perform well. Non-neural classifiers are dominated by linear models and KNN classifiers.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Cell Line
Computational Biology / methods*
Databases, Genetic*
Deep Learning*
Gene Expression Profiling* / methods
Gene Expression Profiling* / standards
Humans
Models, Genetic
Protein Interaction Maps
Transcriptome / genetics*
