Unveiling the power of language models in chemical research question answering

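Abstract: While the abilities of language models have been thoroughly evaluated in general domains and in biomedicine, academic chemistry remains less explored. Chemical QA tools play a crucial role in both education and research by translating complex chemical information into an understandable format. Addressing this gap, we introduce ScholarChemQA, a large-scale QA dataset constructed from chemical papers. Specifically, the questions are paper titles with a question mark, and the multiple-choice answers are reasoned out from the corresponding abstracts. This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of potentially useful unlabeled data. Correspondingly, we introduce ChemMatch, a model designed to answer chemical questions effectively by fully leveraging our collected data. Experiments show that Large Language Models (LLMs) still have significant room for improvement in the field of chemistry. Moreover, ChemMatch significantly outperforms recent similar-scale baselines (https://github.com/iriscxy/chemmatch).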

Bibliographic Details
Main Authors:	Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Zirui Song, Xin Gao, Xiangliang Zhang
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-01-01
Series:	Communications Chemistry
Online Access:	https://doi.org/10.1038/s42004-024-01394-x

author	Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Zirui Song, Xin Gao, Xiangliang Zhang
collection	DOAJ
description	Abstract: While the abilities of language models have been thoroughly evaluated in general domains and in biomedicine, academic chemistry remains less explored. Chemical QA tools play a crucial role in both education and research by translating complex chemical information into an understandable format. Addressing this gap, we introduce ScholarChemQA, a large-scale QA dataset constructed from chemical papers. Specifically, the questions are paper titles with a question mark, and the multiple-choice answers are reasoned out from the corresponding abstracts. This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of potentially useful unlabeled data. Correspondingly, we introduce ChemMatch, a model designed to answer chemical questions effectively by fully leveraging our collected data. Experiments show that Large Language Models (LLMs) still have significant room for improvement in the field of chemistry. Moreover, ChemMatch significantly outperforms recent similar-scale baselines (https://github.com/iriscxy/chemmatch).
format	Article
id	doaj-art-95a5fcaae37240aa86febd91a75a9e08
institution	Kabale University
issn	2399-3669
language	English
publishDate	2025-01-01
publisher	Nature Portfolio
record_format	Article
series	Communications Chemistry
spelling	doaj-art-95a5fcaae37240aa86febd91a75a9e08; 2025-01-12T12:11:19Z; eng; Nature Portfolio; Communications Chemistry; 2399-3669; 2025-01-01; 8; 1; 1; 11; 10.1038/s42004-024-01394-x; Unveiling the power of language models in chemical research question answering; Xiuying Chen (Mohamed bin Zayed University of Artificial Intelligence); Tairan Wang (King Abdullah University of Science and Technology); Taicheng Guo (University of Notre Dame); Kehan Guo (University of Notre Dame); Juexiao Zhou (King Abdullah University of Science and Technology); Haoyang Li (King Abdullah University of Science and Technology); Zirui Song (Mohamed bin Zayed University of Artificial Intelligence); Xin Gao (King Abdullah University of Science and Technology); Xiangliang Zhang (King Abdullah University of Science and Technology); https://doi.org/10.1038/s42004-024-01394-x
title	Unveiling the power of language models in chemical research question answering
url	https://doi.org/10.1038/s42004-024-01394-x

Similar Items