Generating segmentation masks of herbarium specimens and a data set for training segmentation models using deep learning

Appl Plant Sci. 2020 Jul 1;8(6):e11352. doi: 10.1002/aps3.11352. eCollection 2020 Jun.

Abstract

Premise: Digitized images of herbarium specimens are highly diverse, with many potential sources of visual noise and bias. This noise must be removed systematically, and bias minimized, so that biological insights reflect the plants themselves rather than the digitization and mounting practices involved. Here, we develop a workflow and data set of high-resolution image masks to segment plant tissues in herbarium specimen images and remove background pixels using deep learning.

Methods and results: We generated 400 curated, high-resolution masks of ferns using a combination of automatic and manual tools for image manipulation. We used those images to train a U-Net-style deep learning model for image segmentation, achieving a final Sørensen-Dice coefficient of 0.96. The resulting model can automatically, efficiently, and accurately segment massive data sets of digitized herbarium specimens, particularly for ferns.
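For readers unfamiliar with the metric, the Sørensen-Dice coefficient between a predicted mask A and a ground-truth mask B is 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical masks). The following is a minimal sketch, not the authors' evaluation code: it assumes masks are binary nested lists (1 = plant pixel, 0 = background), and the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are illustrative choices.

```python
def dice_coefficient(pred, truth):
    """Sørensen-Dice coefficient between two binary masks.

    Both masks are assumed to be nested lists of 0/1 values with the
    same shape (an assumption for this sketch, not the paper's format).
    """
    # Flatten each mask into a list of booleans, row by row.
    pred_flat = [bool(p) for row in pred for p in row]
    truth_flat = [bool(t) for row in truth for t in row]
    if len(pred_flat) != len(truth_flat):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels labeled as plant in both masks.
    intersection = sum(1 for p, t in zip(pred_flat, truth_flat) if p and t)
    total = sum(pred_flat) + sum(truth_flat)

    # Convention chosen here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 plant pixels each, 2 of them shared.
truth_mask = [[0, 1, 1],
              [0, 1, 0]]
pred_mask = [[0, 1, 0],
             [0, 1, 1]]
score = dice_coefficient(pred_mask, truth_mask)  # 2 * 2 / (3 + 3)
```

In this toy case the score is 2/3; a value of 0.96, as reported above, indicates that predicted and curated masks overlap almost completely.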

Conclusions: The application of deep learning in herbarium sciences requires transparent and systematic protocols for generating training data so that these labor-intensive resources can be generalized to other deep learning applications. Segmentation ground-truth masks are hard-won data, and we share these data and the model openly in the hope of furthering model training and transfer learning opportunities for broader herbarium applications.

Keywords: U-Net; deep learning; digitized herbarium specimens; ferns; machine learning; semantic segmentation.