PodGPT: An audio-augmented large language model for research and education

medRxiv [Preprint]. 2024 Nov 27:2024.07.11.24310304. doi: 10.1101/2024.07.11.24310304.

Abstract

The proliferation of scientific podcasts has generated an extensive repository of audio content, rich in specialized terminology, diverse topics, and expert dialogues. Here, we introduce a computational framework designed to enhance large language models (LLMs) by leveraging this informational content from publicly accessible podcast data across science, technology, engineering, mathematics and medical (STEMM) disciplines. This dataset, comprising over 3,700 hours of audio content, was transcribed to generate over 42 million text tokens. Our model, PodGPT, integrates this wealth of complex dialogue found in audio podcasts to improve understanding of natural language nuances, cultural contexts, as well as scientific and medical knowledge. PodGPT also employs retrieval augmented generation (RAG) on a vector database built from articles in Creative Commons PubMed Central and The New England Journal of Medicine, enhancing STEMM research and education by providing real-time access to emerging scientific literature. Evaluated across multiple benchmarks, PodGPT demonstrated an average improvement of 3.51 percentage points over standard open-source benchmarks and 3.81 percentage points when augmented with evidence from the RAG pipeline. Moreover, it showcased an average improvement of 4.06 percentage points in its zero-shot multi-lingual transfer ability, effectively generalizing to different linguistic contexts. By harnessing the untapped potential of podcast content, PodGPT advances natural language processing and conversational AI, offering enhanced capabilities for STEMM research and education.

Publication types

  • Preprint