Visual question generation aims to produce meaningful questions about an image. Although significant progress has been made in automatically generating a single high-quality question for an image, existing methods often ignore the diversity and interpretability of the generated questions, both of which matter for everyday tasks that require clear question sources. In this paper, we propose an explicitly diverse visual question generation model that generates diverse questions from interpretable question sources. To make question generation explicit, our model first extracts a scene graph from the image using an unbiased scene graph generation method, so that questions generated from the scene graph have interpretable sources. To ensure the diversity of the generated questions, our model selects different subgraphs of the scene graph as question sources. Specifically, we employ a subgraph selector that learns to imitate how humans choose multiple subgraphs suitable for question generation. Finally, our model generates diverse questions conditioned on the different selected subgraphs. Extensive experiments on the VQA v2.0 and COCO-QA datasets show that the proposed model outperforms the baselines and generates diverse questions with interpretable sources.
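The pipeline described above has three stages: scene graph extraction, subgraph selection, and question generation. The following Python sketch illustrates the data flow under toy assumptions; every name here (`extract_scene_graph`, `score_triple`, the heuristic selector score, and the question templates) is a hypothetical stand-in for illustration, not the paper's actual models.

```python
# Minimal sketch of the diverse question generation pipeline: scene graph
# extraction -> subgraph selection -> question generation. All components
# below are toy stand-ins, not the authors' implementation.
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def extract_scene_graph(image_id: str) -> List[Triple]:
    """Stand-in for an unbiased scene graph generator; returns toy triples."""
    return [
        ("man", "riding", "horse"),
        ("horse", "standing on", "grass"),
        ("man", "wearing", "hat"),
    ]


def score_triple(triple: Triple, graph: List[Triple]) -> int:
    """Toy selector score: triples about well-connected entities rank higher.
    (The paper's subgraph selector is instead trained to mimic how humans
    choose subgraphs suitable for question generation.)"""
    subject, _, obj = triple
    return sum(
        1
        for t in graph
        if t is not triple and (subject in (t[0], t[2]) or obj in (t[0], t[2]))
    )


def generate_question(triple: Triple) -> str:
    """Template-based question from one triple (a real model decodes text)."""
    subject, predicate, _ = triple
    return f"What is the {subject} {predicate}?"


def diverse_questions(image_id: str, k: int = 2) -> List[Tuple[str, Triple]]:
    """Return k questions, each paired with its interpretable source triple."""
    graph = extract_scene_graph(image_id)
    ranked = sorted(graph, key=lambda t: score_triple(t, graph), reverse=True)
    return [(generate_question(t), t) for t in ranked[:k]]


if __name__ == "__main__":
    for question, source in diverse_questions("demo_image"):
        print(f"{question}  (source: {source})")
```

Returning each question together with its source subgraph is what makes the output interpretable: the selected triple is the explicit evidence the question was generated from.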
Keywords: Diverse visual question generation; Interpretable text generation; Multimodal; Unbiased scene graph generation.