JiuZhou: open foundation language models and effective pre-training framework for geoscience

Geoscience research has generated vast amounts of data, creating a need for effective extraction and integration of knowledge to address global-change challenges, promote sustainable development, and accelerate scientific discovery. Foundation language models, trained through extensive pre-training and instruction tuning on large text corpora, can facilitate this process. However, when a foundation language model lacks sufficient geoscience expertise, instruction tuning with domain data can produce content that is inconsistent with established facts. In this study, we introduce JiuZhou, a powerful open foundation language model for geoscience. First, we construct JiuZhou-Corpus, a large-scale, diverse, and high-quality corpus, and JiuZhou-Framework, a framework specifically designed for training geoscience large language models (LLMs). We then introduce a two-stage pre-adaptation pre-training method to improve the efficiency of knowledge learning and transfer in the model, and we demonstrate its effectiveness. Evaluation on GeoBench shows that JiuZhou outperforms GPT-3.5 on objective tasks and surpasses all baselines on subjective tasks. Moreover, we analyse how the LLM’s performance varies with a stronger base model, stronger instruction data, and more training data, as well as its ability to assist scientific discovery. The results demonstrate the potential of JiuZhou as a geoscience foundation language model and provide valuable insights for advancing LLM development in geoscience. This project is available at https://github.com/THU-ESIS/JiuZhou.

Bibliographic Details
Main Authors: Zhou Chen, Ming Lin, Mingrun Zang, Zimeng Wang, Juanzi Li, Yuqi Bai
Author Affiliations: Zhou Chen, Ming Lin, Mingrun Zang, Zimeng Wang, and Yuqi Bai: Department of Earth System Science, Institute for Global Change Studies, Ministry of Education Ecological Field Station for East Asian Migratory Birds, Tsinghua University, Beijing, People’s Republic of China; Juanzi Li: Department of Computer Science and Technology, Tsinghua University, Beijing, People’s Republic of China
Format: Article
Language: English
Published: Taylor & Francis Group, 2025-12-01
Series: International Journal of Digital Earth, Vol. 18, No. 1
ISSN: 1753-8947, 1753-8955
DOI: 10.1080/17538947.2025.2449708
Subjects: Foundation model; geoscience; large language model; pre-training; domain adaptation
Online Access: https://www.tandfonline.com/doi/10.1080/17538947.2025.2449708