Contextual Word Embeddings and Topic Modeling in Healthy Dieting and Obesity

J Healthc Inform Res. 2019 Jun 10;3(2):159-183. doi: 10.1007/s41666-019-00052-5. eCollection 2019 Jun.

Abstract

An alarming proportion of the US population is overweight. Obesity increases the risk of illnesses such as diabetes and cardiovascular disease. In this paper, we propose the Contextual Word Embeddings (ContWEB) framework, which aims to build contextual word embeddings that capture the relationship between obesity and healthy eating in the crowd domain (Twitter) and the expert domain (PubMed). The framework is a pipeline of six chained processing steps: (1) apply term frequency-inverse document frequency (TF-IDF) and Word2Vec to the data collected from the crowd and expert domains; (2) apply natural language processing (NLP) algorithms to the corpus; (3) construct social word embeddings through sentiment analysis; (4) discover contextual word embeddings using co-occurrence and conditional probability; (5) find an optimal number of topics for topic modeling of the obesity and healthy dieting corpus; and (6) extract latent features using Latent Dirichlet Allocation (LDA). The ContWEB framework has been implemented on the Apache Spark and TensorFlow platforms. We evaluated the framework in terms of the effectiveness of the contextual word embeddings constructed from the crowd and expert domains. We conclude that the ContWEB framework would be useful in enhancing the decision-making process for healthy eating and obesity prevention.
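
As an illustration of step (4), the following minimal sketch estimates a conditional probability P(b | a) from document-level co-occurrence counts. The abstract does not specify how ContWEB computes these quantities, so the function name `cooccurrence_conditional_probs`, the document-level counting scheme, and the toy documents are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_conditional_probs(documents):
    """Estimate P(b | a) as (# docs containing a and b) / (# docs containing a).

    `documents` is a list of token lists. This document-level definition is an
    assumption; the paper's exact formulation is not given in the abstract.
    """
    term_counts = Counter()   # number of documents each term appears in
    pair_counts = Counter()   # number of documents each ordered pair co-occurs in
    for tokens in documents:
        unique = sorted(set(tokens))  # count each term once per document
        term_counts.update(unique)
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    # Key (a, b) holds the estimate of P(b | a).
    return {(a, b): n / term_counts[a] for (a, b), n in pair_counts.items()}


# Hypothetical toy corpus, not data from the study.
docs = [
    ["obesity", "sugar", "diet"],
    ["obesity", "exercise"],
    ["diet", "sugar"],
    ["obesity", "diet"],
]
probs = cooccurrence_conditional_probs(docs)
print(probs[("obesity", "diet")])  # estimate of P(diet | obesity)
```

Pairs that never co-occur are simply absent from the returned dictionary, so a caller can treat a missing key as probability zero.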

Keywords: Natural language processing; Obesity and healthy dieting; Sentiment analysis; Topic modeling; Word embeddings.