JiuZhou: open foundation language models and effective pre-training framework for geoscience

Bibliographic Details
Main Authors: Zhou Chen, Ming Lin, Mingrun Zang, Zimeng Wang, Juanzi Li, Yuqi Bai
Format: Article
Language: English
Published: Taylor & Francis Group, 2025-12-01
Series: International Journal of Digital Earth
Online Access: https://www.tandfonline.com/doi/10.1080/17538947.2025.2449708
Description
Summary: Geoscience research has generated vast amounts of data, creating a need to extract and integrate knowledge effectively in order to address global-change challenges, promote sustainable development, and accelerate scientific discovery. Foundation language models, trained through extensive pre-training and instruction tuning on large text corpora, can facilitate this process. However, when a foundation language model lacks sufficient geoscience expertise, instruction tuning on relevant data can produce content that is inconsistent with established facts. In this study, we introduce JiuZhou, a powerful open foundation language model for geoscience. First, we construct JiuZhou-Corpus, a large-scale, diverse, and high-quality corpus, and JiuZhou-Framework, a training framework designed specifically for geoscience large language models (LLMs). We then introduce a two-stage pre-adaptation pre-training method that improves the efficiency of knowledge learning and transfer in the model, and we demonstrate its effectiveness. Evaluation on GeoBench shows that JiuZhou outperforms GPT-3.5 on objective tasks and surpasses all baselines on subjective tasks. Moreover, we analyse how the model's performance varies with a stronger base model, stronger instruction data, and more training data, as well as its ability to assist scientific discovery. The results demonstrate the potential of JiuZhou as a geoscience foundation language model and provide valuable insights for advancing LLM development in geoscience. This project is available at https://github.com/THU-ESIS/JiuZhou.
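To make the "two-stage pre-adaptation pre-training" idea concrete, the following Python sketch illustrates one plausible reading: a first stage of continued pre-training on mixed general-plus-domain text, followed by a second stage on domain text alone. The base model name, data files, 1:1 mixing ratio, and hyperparameters are illustrative assumptions, not the actual JiuZhou-Framework or JiuZhou-Corpus configuration described in the article.

```python
# Hypothetical sketch: two-stage domain-adaptive continued pre-training.
# All model names, file paths, ratios, and hyperparameters below are
# illustrative assumptions, not the actual JiuZhou-Framework recipe.

from datasets import load_dataset, interleave_datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "mistralai/Mistral-7B-v0.1"  # assumed base model; any causal LM works

tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token  # causal-LM tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(BASE)

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=2048)

# Hypothetical corpora: a general-domain file and a geoscience file,
# each a JSONL with a "text" field.
general = load_dataset("json", data_files="general.jsonl", split="train")
geo = load_dataset("json", data_files="geoscience.jsonl", split="train")

collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)

def run_stage(dataset, out_dir):
    """Continue causal-LM pre-training on `dataset` for one epoch."""
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=out_dir,
                               per_device_train_batch_size=1,
                               num_train_epochs=1,
                               learning_rate=2e-5),
        train_dataset=dataset.map(tokenize, batched=True,
                                  remove_columns=["text"]),
        data_collator=collator,
    )
    trainer.train()

# Stage 1 ("pre-adaptation"): mix general and domain text so the model
# adapts gradually without losing general ability (illustrative 1:1 mix).
run_stage(interleave_datasets([general, geo]), "stage1")

# Stage 2: train on geoscience text only to consolidate domain knowledge.
run_stage(geo, "stage2")
```

A common motivation for such staging is to absorb domain text while mitigating catastrophic forgetting of general ability; whether JiuZhou uses this exact split is not specified in the abstract.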
ISSN: 1753-8947, 1753-8955