A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing (Mendeley Data)

This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), meticulously developed to overcome the documented limitations of the original Quranic Treebank. Leveraging foundational data from...


Bibliographic Details
Main Authors:	Wadee A. Nashir, Abdulqader M. Mohsen, Asma A. Al-Shargabi, Mohamed K. Nour, Badriyya B. Al-onazi
Format:	Article
Language:	English
Published:	Elsevier 2025-10-01
Series:	Data in Brief
Subjects:	Holy quran; Treebank; Syntactic analysis; Morphological analysis; Linguistic resources; Dependency Parsing
Online Access:	http://www.sciencedirect.com/science/article/pii/S235234092500664X

_version_	1849237448280768512
author	Wadee A. Nashir; Abdulqader M. Mohsen; Asma A. Al-Shargabi; Mohamed K. Nour; Badriyya B. Al-onazi
author_sort	Wadee A. Nashir
collection	DOAJ
description	This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), meticulously developed to overcome the documented limitations of the original Quranic Treebank. Leveraging foundational data from established Quranic digital resources, EQTB features systematically expanded orthographic representations generated via algorithmic processing and validation; rigorously refined morphological annotations based on expanded expert-informed schemas, automated re-annotation, and manual curation; and critically, a novel, complete syntactic layer constructed through algorithmic conversion of prior graphical data, Deep Learning-based parsing achieving full coverage under a hybrid constituency-dependency framework, and expert validation. Encompassing the entire Quran (∼132,736 tokens), the dataset is structured in an adapted CoNLL-X format across 43 columns, detailing multiple orthographies, fine-grained morphology (45 tags), and complete hybrid syntax (140 tags/labels), complemented by auxiliary lexicons and schemas. EQTB offers significant reuse potential, providing crucial training/evaluation data for diverse CA NLP tasks (parsing, morphology, diacritization), supporting linguistic research, and enabling the development of advanced pedagogical tools and language technologies.
format	Article
id	doaj-art-ee9115c28c454b5fa8a46f834e64e8e4
institution	Kabale University
issn	2352-3409
language	English
publishDate	2025-10-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj-art-ee9115c28c454b5fa8a46f834e64e8e4 | 2025-08-20T04:01:57Z | eng | Elsevier | Data in Brief | 2352-3409 | 2025-10-01 | 62 | 111940 | 10.1016/j.dib.2025.111940
	A complete, multi-layered quranic treebank dataset with hybrid syntactic annotations for classical arabic processing (Mendeley Data)
	Wadee A. Nashir [0]: Department of Computer Science, Faculty of Computing and Information Technology, University of Science and Technology, Yemen
	Abdulqader M. Mohsen [1]: Faculty of Computer and Information Technology, University of Aden, Aden, Yemen
	Asma A. Al-Shargabi [2]: Department of Information Technology, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia; Department of Computer Science, Faculty of Computing and Information Technology, University of Science and Technology, Yemen
	Mohamed K. Nour [3]: Department of Computer Sciences, College of Computing and Information System, Umm Al-Qura University, Saudi Arabia
	Badriyya B. Al-onazi [4]: Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia; Corresponding author.
	http://www.sciencedirect.com/science/article/pii/S235234092500664X
	Holy quran; Treebank; Syntactic analysis; Morphological analysis; Linguistic resources; Dependency Parsing
title	A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing (Mendeley Data)
topic	Holy quran; Treebank; Syntactic analysis; Morphological analysis; Linguistic resources; Dependency Parsing
url	http://www.sciencedirect.com/science/article/pii/S235234092500664X
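
The description field in this record says the dataset is distributed in an adapted CoNLL-X format with 43 columns. The record does not list the column names or the file names, so the following Python sketch treats every token as a plain list of 43 strings. It assumes, as in standard CoNLL-X, that fields are tab-separated and that a blank line ends a sentence (or verse); neither detail is confirmed by this record. The function name `read_conll_sentences` and the sample path in the usage note are hypothetical.

```python
from typing import Iterable, Iterator, List

# Column count taken from the record's description; column names are not given there.
EXPECTED_COLUMNS = 43


def read_conll_sentences(
    lines: Iterable[str], n_columns: int = EXPECTED_COLUMNS
) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of token rows from CoNLL-X-style lines.

    Assumptions (not confirmed by the record): tab-separated fields,
    one token per line, and a blank line between sentences.
    """
    sentence: List[List[str]] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, found {len(fields)}"
            )
        sentence.append(fields)
    # Emit the final sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


if __name__ == "__main__":
    # Synthetic rows only; these are placeholders, not EQTB data.
    row = "\t".join(f"c{i}" for i in range(EXPECTED_COLUMNS))
    sample = [row, row, "", row]
    for index, sent in enumerate(read_conll_sentences(sample)):
        print(index, len(sent), "tokens")
```

With a real file, the same function could be called as `read_conll_sentences(open("eqtb.conll", encoding="utf-8"))`, where `eqtb.conll` is a placeholder path. Checking the column count on every line is a deliberate choice: a mismatch most likely means the separator or layout assumption above is wrong for the actual release.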


Similar Items