PreparedLLM: effective pre-pretraining framework for domain-specific large language models

The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundation models on extensive domain-specific corpora can infuse these models with domain knowledge and enhance their ability to solve domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking optimization of the pre-pretraining stage, which significantly affects both model performance and training efficiency. This paper introduces the PRE-PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to enhance the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs techniques for data recipe design, data cleaning, vocabulary expansion, and embedding initialization. These techniques optimize both the composition and quality of the training data, improve the model's understanding of domain terminology and concepts, and provide better token embedding initialization. Using the geoscience domain as a case study, this paper applies PreparedLLM to the domain specialization of Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM improves convergence speed, training speed, inference speed, the amount of text that fits within the context window, and overall domain-specialization performance. Applying PreparedLLM to develop domain-specific LLMs significantly increases performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs.

Bibliographic Details
Main Authors: Zhou Chen, Ming Lin, Zimeng Wang, Mingrun Zang, Yuqi Bai
Format: Article
Language: English
Published: Taylor & Francis Group 2024-10-01
Series: Big Earth Data
Subjects: Domain specialization; geoscience; large language model; pre-pretraining
Online Access: https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159
_version_ 1846099838695899136
author Zhou Chen
Ming Lin
Zimeng Wang
Mingrun Zang
Yuqi Bai
author_facet Zhou Chen
Ming Lin
Zimeng Wang
Mingrun Zang
Yuqi Bai
author_sort Zhou Chen
collection DOAJ
description The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundation models on extensive domain-specific corpora can infuse these models with domain knowledge and enhance their ability to solve domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking optimization of the pre-pretraining stage, which significantly affects both model performance and training efficiency. This paper introduces the PRE-PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to enhance the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs techniques for data recipe design, data cleaning, vocabulary expansion, and embedding initialization. These techniques optimize both the composition and quality of the training data, improve the model's understanding of domain terminology and concepts, and provide better token embedding initialization. Using the geoscience domain as a case study, this paper applies PreparedLLM to the domain specialization of Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM improves convergence speed, training speed, inference speed, the amount of text that fits within the context window, and overall domain-specialization performance. Applying PreparedLLM to develop domain-specific LLMs significantly increases performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs.
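For illustration only (this is not code from the paper): the vocabulary-expansion and embedding-initialization steps the abstract mentions are commonly realized by adding domain terms to the base tokenizer and seeding each new token's embedding from the subword pieces it previously split into. The sketch below assumes the Hugging Face transformers API; the checkpoint name and the geoscience terms are placeholders, not values taken from the record.

```python
# Illustrative sketch, not the paper's released code: one common way to implement
# vocabulary expansion plus embedding initialization on a Llama-style model
# using the Hugging Face transformers API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint (placeholder)
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Placeholder domain terms; the paper's geoscience vocabulary is not reproduced here.
domain_terms = ["lithosphere", "paleoclimatology", "geomorphology"]

# Remember how each term tokenizes under the *old* vocabulary before expanding it.
old_pieces = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in domain_terms}

tokenizer.add_tokens(domain_terms)             # expand the vocabulary
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match

# Seed each new token's embedding with the mean of its former subword embeddings,
# so further pretraining starts from a semantically meaningful point.
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for term in domain_terms:
        new_id = tokenizer.convert_tokens_to_ids(term)
        embeddings[new_id] = embeddings[old_pieces[term]].mean(dim=0)
```

Initializing from the mean of the old subword embeddings, rather than randomly, is a widely used way to speed up convergence after vocabulary expansion; the paper's exact initialization scheme may differ.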
format Article
id doaj-art-e0447887a3a5424ea7b84e34f3c86a22
institution Kabale University
issn 2096-4471
2574-5417
language English
publishDate 2024-10-01
publisher Taylor & Francis Group
record_format Article
series Big Earth Data
spelling doaj-art-e0447887a3a5424ea7b84e34f3c86a22 | 2024-12-31T05:18:55Z | eng | Taylor & Francis Group | Big Earth Data | ISSN 2096-4471, 2574-5417 | 2024-10-01 | vol. 8, no. 4, pp. 649-672 | 10.1080/20964471.2024.2396159 | PreparedLLM: effective pre-pretraining framework for domain-specific large language models | Zhou Chen, Ming Lin, Zimeng Wang, Mingrun Zang, Yuqi Bai (all: Department of Earth System Science, Institute for Global Change Studies, Ministry of Education Ecological Field Station for East Asian Migratory Birds, Tsinghua University, Beijing, China) | The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundation models on extensive domain-specific corpora can infuse these models with domain knowledge and enhance their ability to solve domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking optimization of the pre-pretraining stage, which significantly affects both model performance and training efficiency. This paper introduces the PRE-PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to enhance the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs techniques for data recipe design, data cleaning, vocabulary expansion, and embedding initialization. These techniques optimize both the composition and quality of the training data, improve the model's understanding of domain terminology and concepts, and provide better token embedding initialization. Using the geoscience domain as a case study, this paper applies PreparedLLM to the domain specialization of Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM improves convergence speed, training speed, inference speed, the amount of text that fits within the context window, and overall domain-specialization performance. Applying PreparedLLM to develop domain-specific LLMs significantly increases performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs. | https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159 | Domain specialization; geoscience; large language model; pre-pretraining
spellingShingle Zhou Chen
Ming Lin
Zimeng Wang
Mingrun Zang
Yuqi Bai
PreparedLLM: effective pre-pretraining framework for domain-specific large language models
Big Earth Data
Domain specialization
geoscience
large language model
pre-pretraining
title PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_full PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_fullStr PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_full_unstemmed PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_short PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_sort preparedllm effective pre pretraining framework for domain specific large language models
topic Domain specialization
geoscience
large language model
pre-pretraining
url https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159
work_keys_str_mv AT zhouchen preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels
AT minglin preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels
AT zimengwang preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels
AT mingrunzang preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels
AT yuqibai preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels