A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing
Mendeley Data

This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), meticulously developed to overcome the documented limitations of the original Quranic Treebank. Leveraging foundational data from...

Bibliographic Details
Main Authors: Wadee A. Nashir, Abdulqader M. Mohsen, Asma A. Al-Shargabi, Mohamed K. Nour, Badriyya B. Al-onazi
Format: Article
Language: English
Published: Elsevier 2025-10-01
Series: Data in Brief
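Subjects: Holy Quran, Treebank, Syntactic analysis, Morphological analysis, Linguistic resources, Dependency parsing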
Online Access: http://www.sciencedirect.com/science/article/pii/S235234092500664X
author Wadee A. Nashir
Abdulqader M. Mohsen
Asma A. Al-Shargabi
Mohamed K. Nour
Badriyya B. Al-onazi
collection DOAJ
description This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), meticulously developed to overcome the documented limitations of the original Quranic Treebank. Leveraging foundational data from established Quranic digital resources, EQTB features systematically expanded orthographic representations generated via algorithmic processing and validation; rigorously refined morphological annotations based on expanded expert-informed schemas, automated re-annotation, and manual curation; and critically, a novel, complete syntactic layer constructed through algorithmic conversion of prior graphical data, Deep Learning-based parsing achieving full coverage under a hybrid constituency-dependency framework, and expert validation. Encompassing the entire Quran (∼132,736 tokens), the dataset is structured in an adapted CoNLL-X format across 43 columns, detailing multiple orthographies, fine-grained morphology (45 tags), and complete hybrid syntax (140 tags/labels), complemented by auxiliary lexicons and schemas. EQTB offers significant reuse potential, providing crucial training/evaluation data for diverse CA NLP tasks (parsing, morphology, diacritization), supporting linguistic research, and enabling the development of advanced pedagogical tools and language technologies.
format Article
id doaj-art-ee9115c28c454b5fa8a46f834e64e8e4
institution Kabale University
issn 2352-3409
language English
publishDate 2025-10-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-ee9115c28c454b5fa8a46f834e64e8e4
2025-08-20T04:01:57Z
eng
Elsevier
Data in Brief
ISSN 2352-3409
2025-10-01
Vol. 62, Art. 111940
DOI 10.1016/j.dib.2025.111940
A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing
Mendeley Data
Wadee A. Nashir: Department of Computer Science, Faculty of Computing and Information Technology, University of Science and Technology, Yemen
Abdulqader M. Mohsen: Faculty of Computer and Information Technology, University of Aden, Aden, Yemen
Asma A. Al-Shargabi: Department of Information Technology, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia; Department of Computer Science, Faculty of Computing and Information Technology, University of Science and Technology, Yemen
Mohamed K. Nour: Department of Computer Sciences, College of Computing and Information System, Umm Al-Qura University, Saudi Arabia
Badriyya B. Al-onazi (corresponding author): Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
http://www.sciencedirect.com/science/article/pii/S235234092500664X
Holy Quran
Treebank
Syntactic analysis
Morphological analysis
Linguistic resources
Dependency parsing
title A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing
topic Holy quran
Treebank
Syntactic analysis
Morphological analysis
Linguistic resources
Dependency Parsing
url http://www.sciencedirect.com/science/article/pii/S235234092500664X