A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing (Mendeley Data)
This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), meticulously developed to overcome the documented limitations of the original Quranic Treebank. Leveraging foundational data from...
| Main Authors: | Wadee A. Nashir, Abdulqader M. Mohsen, Asma A. Al-Shargabi, Mohamed K. Nour, Badriyya B. Al-onazi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-10-01 |
| Series: | Data in Brief |
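| Subjects: | Holy Quran; Treebank; Syntactic analysis; Morphological analysis; Linguistic resources; Dependency Parsing |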
| Online Access: | http://www.sciencedirect.com/science/article/pii/S235234092500664X |
| _version_ | 1849237448280768512 |
|---|---|
| author | Wadee A. Nashir; Abdulqader M. Mohsen; Asma A. Al-Shargabi; Mohamed K. Nour; Badriyya B. Al-onazi |
| author_facet | Wadee A. Nashir; Abdulqader M. Mohsen; Asma A. Al-Shargabi; Mohamed K. Nour; Badriyya B. Al-onazi |
| author_sort | Wadee A. Nashir |
| collection | DOAJ |
| description | This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), meticulously developed to overcome the documented limitations of the original Quranic Treebank. Leveraging foundational data from established Quranic digital resources, EQTB features systematically expanded orthographic representations generated via algorithmic processing and validation; rigorously refined morphological annotations based on expanded expert-informed schemas, automated re-annotation, and manual curation; and critically, a novel, complete syntactic layer constructed through algorithmic conversion of prior graphical data, Deep Learning-based parsing achieving full coverage under a hybrid constituency-dependency framework, and expert validation. Encompassing the entire Quran (∼132,736 tokens), the dataset is structured in an adapted CoNLL-X format across 43 columns, detailing multiple orthographies, fine-grained morphology (45 tags), and complete hybrid syntax (140 tags/labels), complemented by auxiliary lexicons and schemas. EQTB offers significant reuse potential, providing crucial training/evaluation data for diverse CA NLP tasks (parsing, morphology, diacritization), supporting linguistic research, and enabling the development of advanced pedagogical tools and language technologies. |
| format | Article |
| id | doaj-art-ee9115c28c454b5fa8a46f834e64e8e4 |
| institution | Kabale University |
| issn | 2352-3409 |
| language | English |
| publishDate | 2025-10-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Data in Brief |
| spelling | doaj-art-ee9115c28c454b5fa8a46f834e64e8e4; 2025-08-20T04:01:57Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-10-01; 62; 111940; 10.1016/j.dib.2025.111940; A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing (Mendeley Data); Wadee A. Nashir [0], Abdulqader M. Mohsen [1], Asma A. Al-Shargabi [2], Mohamed K. Nour [3], Badriyya B. Al-onazi [4]; [0] Department of Computer Science, Faculty of Computing and Information Technology, University of Science and Technology, Yemen; [1] Faculty of Computer and Information Technology, University of Aden, Aden, Yemen; [2] Department of Information Technology, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia, and Department of Computer Science, Faculty of Computing and Information Technology, University of Science and Technology, Yemen; [3] Department of Computer Sciences, College of Computing and Information System, Umm Al-Qura University, Saudi Arabia; [4] Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia (corresponding author); This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), meticulously developed to overcome the documented limitations of the original Quranic Treebank. Leveraging foundational data from established Quranic digital resources, EQTB features systematically expanded orthographic representations generated via algorithmic processing and validation; rigorously refined morphological annotations based on expanded expert-informed schemas, automated re-annotation, and manual curation; and critically, a novel, complete syntactic layer constructed through algorithmic conversion of prior graphical data, Deep Learning-based parsing achieving full coverage under a hybrid constituency-dependency framework, and expert validation. Encompassing the entire Quran (∼132,736 tokens), the dataset is structured in an adapted CoNLL-X format across 43 columns, detailing multiple orthographies, fine-grained morphology (45 tags), and complete hybrid syntax (140 tags/labels), complemented by auxiliary lexicons and schemas. EQTB offers significant reuse potential, providing crucial training/evaluation data for diverse CA NLP tasks (parsing, morphology, diacritization), supporting linguistic research, and enabling the development of advanced pedagogical tools and language technologies.; http://www.sciencedirect.com/science/article/pii/S235234092500664X; Holy Quran; Treebank; Syntactic analysis; Morphological analysis; Linguistic resources; Dependency Parsing |
| title | A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing (Mendeley Data) |
| topic | Holy Quran; Treebank; Syntactic analysis; Morphological analysis; Linguistic resources; Dependency Parsing |
| url | http://www.sciencedirect.com/science/article/pii/S235234092500664X |
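
The description in this record says the dataset is structured in an adapted CoNLL-X format across 43 columns. The following is a minimal sketch of how such a file might be read, not the authors' tooling. It assumes tab-separated fields, blank lines between sentences, and `#` comment lines, none of which the record confirms; the function name `read_conll_sentences` is hypothetical, and columns are addressed only by position because the record does not list their names.

```python
from typing import Iterable, Iterator, List

N_COLUMNS = 43  # column count stated in the record's description


def read_conll_sentences(
    lines: Iterable[str], n_columns: int = N_COLUMNS
) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of token rows from CoNLL-style lines.

    Assumptions (not confirmed by the record): fields are tab-separated,
    sentences are separated by blank lines, and lines starting with '#'
    are comments.
    """
    sentence: List[List[str]] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # skip comment lines (assumed convention)
        fields = line.split("\t")
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, got {len(fields)}"
            )
        sentence.append(fields)
    if sentence:
        yield sentence


# Synthetic demo rows; these are placeholders, not real EQTB data.
demo_row = "\t".join(f"c{i}" for i in range(N_COLUMNS))
demo_lines = [demo_row + "\n", demo_row + "\n", "\n", demo_row + "\n"]
sentences = list(read_conll_sentences(demo_lines))
```

Checking the column count on every row is a deliberate choice here: with 43 positional columns, a single missing tab would silently shift every later field, so failing early is safer than guessing.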