PreparedLLM: effective pre-pretraining framework for domain-specific large language models
The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundational models on extensive domain-specifi...
| Main Authors: | Zhou Chen, Ming Lin, Zimeng Wang, Mingrun Zang, Yuqi Bai |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taylor & Francis Group, 2024-10-01 |
| Series: | Big Earth Data |
| Subjects: | Domain specialization; geoscience; large language model; pre-pretraining |
| Online Access: | https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159 |
| author | Zhou Chen; Ming Lin; Zimeng Wang; Mingrun Zang; Yuqi Bai |
|---|---|
| collection | DOAJ |
| description | The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundation models on extensive domain-specific corpora can infuse these models with domain-specific knowledge and enhance their ability to solve domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking the crucial optimization of the pre-pretraining stage, which significantly affects both model performance and training efficiency. This paper introduces the PRE-PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to enhance the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs techniques for data recipe design, data cleaning, vocabulary expansion, and embedding initialization. These techniques optimize both the composition and quality of the training data, enhance the model's understanding of domain terminology and concepts, and improve token embedding initialization. Using the geoscience domain as a case study, this paper applies PreparedLLM to the domain specialization of Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM improves model convergence speed, training speed, inference speed, the amount of text that fits within the context window, and overall performance in domain specialization. Using PreparedLLM to develop domain-specific LLMs significantly increases performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs. |
| format | Article |
| id | doaj-art-e0447887a3a5424ea7b84e34f3c86a22 |
| institution | Kabale University |
| issn | 2096-4471 2574-5417 |
| language | English |
| publishDate | 2024-10-01 |
| publisher | Taylor & Francis Group |
| record_format | Article |
| series | Big Earth Data |
| spelling | Big Earth Data, Vol. 8, No. 4 (2024-10-01), pp. 649-672. doi: 10.1080/20964471.2024.2396159. Zhou Chen, Ming Lin, Zimeng Wang, Mingrun Zang, Yuqi Bai (Department of Earth System Science, Institute for Global Change Studies, Ministry of Education Ecological Field Station for East Asian Migratory Birds, Tsinghua University, Beijing, China). Publisher: Taylor & Francis Group. https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159 |
| title | PreparedLLM: effective pre-pretraining framework for domain-specific large language models |
| topic | Domain specialization; geoscience; large language model; pre-pretraining |
| url | https://www.tandfonline.com/doi/10.1080/20964471.2024.2396159 |
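The abstract names vocabulary expansion and embedding initialization among PreparedLLM's pre-pretraining techniques. The sketch below illustrates the general idea only, not the paper's actual implementation: the base checkpoint name, the example domain terms, and the mean-of-subtoken-embeddings heuristic are assumptions chosen purely for illustration.

```python
# Minimal sketch: expand a Llama tokenizer with domain terms and initialize
# each new token's embedding from its original sub-token embeddings.
# (Assumed: model name, example terms, and the mean-init heuristic.)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE = "meta-llama/Llama-2-7b-hf"  # assumed base model
domain_terms = ["paleoclimatology", "geomorphology", "lithostratigraphy"]  # illustrative

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Record each term's sub-token ids under the ORIGINAL vocabulary,
# before the vocabulary is expanded.
subtokens = {t: tokenizer(t, add_special_tokens=False)["input_ids"]
             for t in domain_terms}

tokenizer.add_tokens(domain_terms)             # vocabulary expansion
model.resize_token_embeddings(len(tokenizer))  # grow embedding matrices

in_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings().weight
with torch.no_grad():
    for term, old_ids in subtokens.items():
        new_id = tokenizer.convert_tokens_to_ids(term)
        # Initialize the new row as the mean of the original sub-token
        # embeddings instead of leaving it randomly initialized.
        in_emb[new_id] = in_emb[old_ids].mean(dim=0)
        out_emb[new_id] = out_emb[old_ids].mean(dim=0)
```

Initializing new rows from sub-token means rather than random noise is a common heuristic for speeding convergence when domain tokens are added, which is consistent with the convergence-speed gains the abstract reports; whether PreparedLLM uses this exact heuristic is not stated in this record.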