JiuZhou: open foundation language models and effective pre-training framework for geoscience
Main Authors: Zhou Chen, Ming Lin, Mingrun Zang, Zimeng Wang, Juanzi Li, Yuqi Bai
Format: Article
Language: English
Published: Taylor & Francis Group, 2025-12-01
Series: International Journal of Digital Earth
Subjects: Foundation model; geoscience; large language model; pre-training; domain adaptation
Online Access: https://www.tandfonline.com/doi/10.1080/17538947.2025.2449708
_version_ | 1841526364965961728 |
author | Zhou Chen, Ming Lin, Mingrun Zang, Zimeng Wang, Juanzi Li, Yuqi Bai |
author_facet | Zhou Chen, Ming Lin, Mingrun Zang, Zimeng Wang, Juanzi Li, Yuqi Bai |
author_sort | Zhou Chen |
collection | DOAJ |
description | Geoscience research has generated vast amounts of data, creating a need for effective extraction and integration of knowledge to address global-change challenges, promote sustainable development, and accelerate scientific discovery. Foundation language models, trained through extensive pre-training and instruction tuning on large text corpora, can facilitate this process. However, when foundation language models lack sufficient geoscience expertise, instruction tuning with relevant data can lead them to generate content that is inconsistent with established facts. In this study, we introduce JiuZhou, a powerful open foundation language model for geoscience. First, we construct the large-scale, diverse, and high-quality JiuZhou-Corpus and the JiuZhou-Framework, which is specifically designed for training geoscience large language models (LLMs). We introduce a two-stage pre-adaptation pre-training method to enhance the efficiency of knowledge learning and transfer in the model, and we demonstrate its effectiveness. Evaluation on GeoBench shows that JiuZhou outperforms GPT-3.5 in objective tasks and surpasses all baselines in subjective tasks. Moreover, we analyse how the LLM's performance varies with a stronger base model, stronger instruction data, and more training data, as well as its ability to assist scientific discovery. The results demonstrate the potential of JiuZhou as a geoscience foundation language model and provide valuable insights for advancing LLM development in geoscience. This project is available at https://github.com/THU-ESIS/JiuZhou. |
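The abstract names a two-stage pre-adaptation pre-training method without describing it, so a rough sketch may help readers place the idea. Below is a minimal illustration of what two-stage domain-adaptive continued pre-training commonly looks like with the Hugging Face transformers library; the base model name, corpus files, and hyperparameters are assumptions made for illustration only, and the actual JiuZhou-Framework may differ substantially.

```python
# Hypothetical sketch of two-stage domain-adaptive continued pre-training.
# The model, corpus files, and hyperparameters are placeholders, NOT the
# configuration used by the JiuZhou authors.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"   # placeholder base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token        # causal LMs often ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

def continued_pretrain(model, text_files, out_dir, lr):
    """One stage of causal-LM pre-training over a plain-text corpus."""
    ds = load_dataset("text", data_files=text_files, split="train")
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=2048),
                batched=True, remove_columns=["text"])
    args = TrainingArguments(output_dir=out_dir, learning_rate=lr,
                             per_device_train_batch_size=1,
                             gradient_accumulation_steps=16,
                             num_train_epochs=1, bf16=True)
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False)
            ).train()
    return model

# Stage 1: a broad, mixed-domain corpus to stabilise general abilities.
model = continued_pretrain(model, ["general_corpus.txt"], "stage1", lr=2e-5)
# Stage 2: a geoscience corpus (standing in here for JiuZhou-Corpus) at a
# lower learning rate, so domain knowledge is absorbed without washing out
# what was learned in stage 1.
model = continued_pretrain(model, ["geoscience_corpus.txt"], "stage2", lr=1e-5)
```

The two-stage split reflects a common trade-off in domain adaptation: training directly on a narrow domain corpus risks catastrophic forgetting of general abilities, so a broader warm-up stage followed by a gentler domain stage is a frequent design choice. Instruction tuning would follow as a separate step.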
format | Article |
id | doaj-art-a6e1debce6a54843ba542250223900f0 |
institution | Kabale University |
issn | 1753-8947, 1753-8955 |
language | English |
publishDate | 2025-12-01 |
publisher | Taylor & Francis Group |
record_format | Article |
series | International Journal of Digital Earth |
spelling | doaj-art-a6e1debce6a54843ba542250223900f0, 2025-01-17T02:29:12Z, eng, Taylor & Francis Group, International Journal of Digital Earth, 1753-8947, 1753-8955, 2025-12-01, vol. 18, no. 1, doi:10.1080/17538947.2025.2449708. JiuZhou: open foundation language models and effective pre-training framework for geoscience. Author affiliations: Zhou Chen, Ming Lin, Mingrun Zang, Zimeng Wang, and Yuqi Bai: Department of Earth System Science, Institute for Global Change Studies, Ministry of Education Ecological Field Station for East Asian Migratory Birds, Tsinghua University, Beijing, People’s Republic of China; Juanzi Li: Department of Computer Science and Technology, Tsinghua University, Beijing, People’s Republic of China. Keywords: Foundation model; geoscience; large language model; pre-training; domain adaptation. https://www.tandfonline.com/doi/10.1080/17538947.2025.2449708 |
spellingShingle | Zhou Chen; Ming Lin; Mingrun Zang; Zimeng Wang; Juanzi Li; Yuqi Bai; JiuZhou: open foundation language models and effective pre-training framework for geoscience; International Journal of Digital Earth; Foundation model; geoscience; large language model; pre-training; domain adaptation |
title | JiuZhou: open foundation language models and effective pre-training framework for geoscience |
title_full | JiuZhou: open foundation language models and effective pre-training framework for geoscience |
title_fullStr | JiuZhou: open foundation language models and effective pre-training framework for geoscience |
title_full_unstemmed | JiuZhou: open foundation language models and effective pre-training framework for geoscience |
title_short | JiuZhou: open foundation language models and effective pre-training framework for geoscience |
title_sort | jiuzhou open foundation language models and effective pre training framework for geoscience |
topic | Foundation model; geoscience; large language model; pre-training; domain adaptation |
url | https://www.tandfonline.com/doi/10.1080/17538947.2025.2449708 |
work_keys_str_mv | AT zhouchen jiuzhouopenfoundationlanguagemodelsandeffectivepretrainingframeworkforgeoscience AT minglin jiuzhouopenfoundationlanguagemodelsandeffectivepretrainingframeworkforgeoscience AT mingrunzang jiuzhouopenfoundationlanguagemodelsandeffectivepretrainingframeworkforgeoscience AT zimengwang jiuzhouopenfoundationlanguagemodelsandeffectivepretrainingframeworkforgeoscience AT juanzili jiuzhouopenfoundationlanguagemodelsandeffectivepretrainingframeworkforgeoscience AT yuqibai jiuzhouopenfoundationlanguagemodelsandeffectivepretrainingframeworkforgeoscience |