PreparedLLM: effective pre-pretraining framework for domain-specific large language models

The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundation models on extensive domain-specific corpora can infuse these models with domain knowledge and enhance their ability to solve domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking optimization of the pre-pretraining stage, which significantly affects both model performance and training efficiency. This paper introduces the PRE-PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to enhance the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs techniques for data recipe design, data cleaning, vocabulary expansion, and embedding initialization. These techniques optimize both the composition and quality of the training data, improve the model's understanding of domain terminology and concepts, and provide better token embedding initialization. Using the geoscience domain as a case study, this paper applies PreparedLLM to the domain specialization of Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM improves convergence speed, training speed, inference speed, the amount of text that fits within the context window, and overall domain-specialization performance. Applying PreparedLLM to develop domain-specific LLMs significantly increases performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs.

Bibliographic Details
Main Authors: Zhou Chen, Ming Lin, Zimeng Wang, Mingrun Zang, Yuqi Bai
Format: Article
Language: English
Published: Taylor & Francis Group 2024-10-01
Series: Big Earth Data
Subjects: Domain specialization; geoscience; large language model; pre-pretraining
Online Access: https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159
_version_ 1846099838695899136
author Zhou Chen
Ming Lin
Zimeng Wang
Mingrun Zang
Yuqi Bai
author_facet Zhou Chen
Ming Lin
Zimeng Wang
Mingrun Zang
Yuqi Bai
author_sort Zhou Chen
collection DOAJ
description The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundation models on extensive domain-specific corpora can infuse these models with domain knowledge and enhance their ability to solve domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking optimization of the pre-pretraining stage, which significantly affects both model performance and training efficiency. This paper introduces the PRE-PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to enhance the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs techniques for data recipe design, data cleaning, vocabulary expansion, and embedding initialization. These techniques optimize both the composition and quality of the training data, improve the model's understanding of domain terminology and concepts, and provide better token embedding initialization. Using the geoscience domain as a case study, this paper applies PreparedLLM to the domain specialization of Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM improves convergence speed, training speed, inference speed, the amount of text that fits within the context window, and overall domain-specialization performance. Applying PreparedLLM to develop domain-specific LLMs significantly increases performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs.
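For illustration only (this is not code from the paper): the vocabulary-expansion and embedding-initialization steps the abstract mentions are commonly realized by adding domain terms to the base tokenizer and seeding each new token's embedding from the subword pieces it previously split into. The sketch below assumes the Hugging Face transformers API; the checkpoint name and the geoscience terms are placeholders, not values taken from the record.

```python
# Illustrative sketch, not the paper's released code: one common way to implement
# vocabulary expansion plus embedding initialization on a Llama-style model
# using the Hugging Face transformers API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint (placeholder)
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Placeholder domain terms; the paper's geoscience vocabulary is not reproduced here.
domain_terms = ["lithosphere", "paleoclimatology", "geomorphology"]

# Remember how each term tokenizes under the *old* vocabulary before expanding it.
old_pieces = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in domain_terms}

tokenizer.add_tokens(domain_terms)             # expand the vocabulary
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match

# Seed each new token's embedding with the mean of its former subword embeddings,
# so further pretraining starts from a semantically meaningful point.
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for term in domain_terms:
        new_id = tokenizer.convert_tokens_to_ids(term)
        embeddings[new_id] = embeddings[old_pieces[term]].mean(dim=0)
```

Initializing from the mean of the old subword embeddings, rather than randomly, is a widely used way to speed up convergence after vocabulary expansion; the paper's exact initialization scheme may differ.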
format Article
id doaj-art-e0447887a3a5424ea7b84e34f3c86a22
institution Kabale University
issn 2096-4471
2574-5417
language English
publishDate 2024-10-01
publisher Taylor & Francis Group
record_format Article
series Big Earth Data
spelling doaj-art-e0447887a3a5424ea7b84e34f3c86a22 | 2024-12-31T05:18:55Z | eng | Taylor & Francis Group | Big Earth Data | ISSN 2096-4471, 2574-5417 | 2024-10-01 | vol. 8, no. 4, pp. 649-672 | 10.1080/20964471.2024.2396159 | PreparedLLM: effective pre-pretraining framework for domain-specific large language models | Zhou Chen, Ming Lin, Zimeng Wang, Mingrun Zang, Yuqi Bai (all: Department of Earth System Science, Institute for Global Change Studies, Ministry of Education Ecological Field Station for East Asian Migratory Birds, Tsinghua University, Beijing, China) | The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundation models on extensive domain-specific corpora can infuse these models with domain knowledge and enhance their ability to solve domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking optimization of the pre-pretraining stage, which significantly affects both model performance and training efficiency. This paper introduces the PRE-PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to enhance the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs techniques for data recipe design, data cleaning, vocabulary expansion, and embedding initialization. These techniques optimize both the composition and quality of the training data, improve the model's understanding of domain terminology and concepts, and provide better token embedding initialization. Using the geoscience domain as a case study, this paper applies PreparedLLM to the domain specialization of Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM improves convergence speed, training speed, inference speed, the amount of text that fits within the context window, and overall domain-specialization performance. Applying PreparedLLM to develop domain-specific LLMs significantly increases performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs. | https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159 | Domain specialization; geoscience; large language model; pre-pretraining
spellingShingle Zhou Chen
Ming Lin
Zimeng Wang
Mingrun Zang
Yuqi Bai
PreparedLLM: effective pre-pretraining framework for domain-specific large language models
Big Earth Data
Domain specialization
geoscience
large language model
pre-pretraining
title PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_full PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_fullStr PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_full_unstemmed PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_short PreparedLLM: effective pre-pretraining framework for domain-specific large language models
title_sort preparedllm effective pre pretraining framework for domain specific large language models
topic Domain specialization
geoscience
large language model
pre-pretraining
url https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159
work_keys_str_mv AT zhouchen preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels
AT minglin preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels
AT zimengwang preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels
AT mingrunzang preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels
AT yuqibai preparedllmeffectiveprepretrainingframeworkfordomainspecificlargelanguagemodels