This paper introduces the Morphologically-Analyzed and Syntactically-Annotated Quran (MASAQ) dataset, a comprehensive resource designed to address the scarcity of annotated Quranic Arabic corpora and to facilitate the development of advanced Natural Language Processing (NLP) models. The Quran, a cornerstone of classical Arabic, presents unique challenges for NLP due to its sacred nature and complex linguistic features. MASAQ provides detailed morphological and syntactic annotation of the entire Quranic text, using a rigorously verified text from Tanzil.net. The dataset includes more than 131K morphological entries and 123K instances of syntactic functions, covering a wide range of grammatical roles and relationships. The annotation was carried out by a team of expert Arabic linguists who employed traditional i'rab methodologies to ensure high accuracy and consistency. The dataset is distributed in multiple formats (tab-separated text (.tsv), comma-separated text (.csv), SQLite3 database (.db), and JavaScript Object Notation (.json)) to cater to various research needs. MASAQ's distinguishing features include a comprehensive tagset of 72 syntactic roles, detailed morphological analysis, and context-specific annotations. The dataset is particularly valuable for tasks such as dependency parsing, grammar checking, machine translation, and text summarization. Its potential applications range from pedagogical uses in teaching Arabic grammar to the development of sophisticated NLP tools. By providing a high-quality, syntactically annotated dataset, MASAQ aims to advance the field of Arabic NLP and enable more accurate and efficient language processing tools. The dataset is made available under the Creative Commons Attribution 3.0 License, which governs its use and distribution. It was created in compliance with ethical guidelines and with respect for the integrity of the Quranic text.
Keywords: Syntactic annotation; analysis; i'rab إعراب (ʾi‘rāb); morphological annotation; semantic relations; syntactic relations; tagset.
© 2024 The Authors.