Prescription information and adverse drug reactions (ADRs) are two components of detailed medication instructions that can benefit many aspects of clinical research. Automatically extracting this information from free-text narratives via Information Extraction (IE) can open it up to downstream uses. IE is commonly tackled by supervised Natural Language Processing (NLP) systems, which rely on annotated training data. However, generating training data is manual, time-consuming, and labor-intensive, so automatic methods for augmenting manually labeled data are desirable. We propose pseudo-data generation as one such method. Pseudo-data are synthetic data generated by combining elements of existing labeled data. We propose and evaluate two sets of pseudo-data generation methods: knowledge-driven methods based on gazetteers and data-driven methods based on deep learning. We use the resulting pseudo-data to improve medication and ADR extraction. Data-driven pseudo-data are suitable for concept categories with high semantic regularity and short textual spans. Knowledge-driven pseudo-data are effective for concept categories with longer textual spans, provided that the knowledge base offers good coverage of these concepts. Combining the knowledge- and data-driven pseudo-data yields significant performance improvements on medication names and ADRs over baselines limited to the available labeled data.
Keywords: Information Storage and Retrieval; Machine Learning; Natural Language Processing.
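
As an illustration of the knowledge-driven idea summarized above, the following minimal sketch swaps each labeled span in an annotated sentence for an entry drawn from a gazetteer and re-derives the tags. It is not the method evaluated in this work, whose details the abstract does not give: the BIO tagging scheme, the label names `Drug` and `ADR`, the toy gazetteer, and the helper names `spans` and `make_pseudo` are all assumptions made for the example.

```python
import random


def spans(tags):
    """Yield (start, end, label) for each BIO-tagged span; end is exclusive."""
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            yield start, i, label
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]


def make_pseudo(tokens, tags, gazetteer, rng):
    """Return a new (tokens, tags) pair with each span swapped for a gazetteer entry.

    Spans whose label has no gazetteer entries are copied unchanged.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, label in spans(tags):
        # Copy the unlabeled context that precedes this span.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        entries = gazetteer.get(label)
        repl = rng.choice(entries).split() if entries else tokens[start:end]
        # Re-tag the replacement so multi-token entries stay one span.
        new_tokens += repl
        new_tags += ["B-" + label] + ["I-" + label] * (len(repl) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Hypothetical labeled sentence and toy gazetteer (illustrative only).
tokens = ["Patient", "developed", "severe", "skin", "rash",
          "after", "starting", "warfarin", "."]
tags = ["O", "O", "B-ADR", "I-ADR", "I-ADR", "O", "O", "B-Drug", "O"]
gazetteer = {"Drug": ["acetylsalicylic acid"], "ADR": ["nausea"]}

pseudo_tokens, pseudo_tags = make_pseudo(tokens, tags, gazetteer, random.Random(0))
print(list(zip(pseudo_tokens, pseudo_tags)))
```

Because the replacement is re-tagged from its own token count, entries longer or shorter than the original span keep the tokens and tags aligned. The sketch also shows why gazetteer coverage matters: a label with no entries simply yields a copy of the original span.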