Assessing and enhancing the reliability of Chinese large language models in dental implantology

Bibliographic Details
Main Authors: Guohui Zhu, Xiao Zhang, Chunxia Chen
Format: Article
Language:English
Published: BMC 2025-07-01
Series:BMC Oral Health
Subjects:
Online Access:https://doi.org/10.1186/s12903-025-06648-1
Description
Summary: Abstract
Background: This study evaluated the reliability of five representative Chinese large language models (LLMs) in dental implantology and explored effective strategies for model enhancement.
Methods: A dataset of 100 dental implant-related questions (50 multiple-choice and 50 open-ended) was developed, covering medical knowledge, complex reasoning, and safety and ethics. Standard answers were validated by experts. Five LLMs (A: BaiXiaoYing; B: ChatGLM-4; C: ERNIE Bot 3.5; D: Qwen 2.5; E: Kimi.ai) were tested on two metrics: recall and hallucination rate. Two enhancement techniques, chain-of-thought (CoT) reasoning and long-text modeling (LTM), were applied, and their effectiveness was assessed by comparing the metrics before and after application. Data were analyzed in SPSS: ANOVA with Tukey HSD tests compared recall and hallucination rates across models, and paired t-tests evaluated changes before and after the enhancement strategies.
Results: For multiple-choice questions, Group D (Qwen 2.5) achieved the highest recall (0.9060 ± 0.0087), while Group C (ERNIE Bot 3.5) had the lowest hallucination rate (0.1245 ± 0.0022). For open-ended questions, Group D again had the highest recall (0.7938 ± 0.0216), and Group C the lowest hallucination rate (0.2390 ± 0.0029). Among the enhancement strategies, CoT reasoning improved Group D's recall by 0.0621 ± 0.1474 (P < 0.05) but produced a non-significant increase in hallucination rate (0.0390 ± 0.1639, P > 0.05). LTM significantly improved recall by 0.1119 ± 0.2000 (P < 0.05) and reduced the hallucination rate by 0.2985 ± 0.4220 (P < 0.05).
Conclusions: Qwen 2.5 and ERNIE Bot 3.5 demonstrated exceptional reliability in dental implantology, excelling in answer accuracy and minimizing misinformation across question types. Open-ended queries carried a higher risk of hallucination than structured multiple-choice tasks, highlighting the need for targeted validation in free-text scenarios. CoT reasoning modestly improved accuracy at the cost of potential hallucination increases, while LTM enhanced both accuracy and reliability. These findings underscore LTM's utility in optimizing LLMs for specialized dental applications, balancing depth of reasoning with factual grounding to support clinical decision-making and educational training.
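The abstract does not give the authors' exact scoring formulas, but the evaluation pipeline it describes (per-question recall, hallucination rate, and a paired t-test comparing scores before and after an enhancement strategy) can be sketched in plain Python. This is an illustrative reconstruction under the common definitions of these metrics; the function names and the key-point-set representation are assumptions, not the paper's implementation (the study itself used SPSS):

```python
import math

def recall(predicted_points, gold_points):
    # Recall: fraction of expert-validated key points that appear
    # in the model's answer (higher is better).
    return sum(p in predicted_points for p in gold_points) / len(gold_points)

def hallucination_rate(predicted_points, gold_points):
    # Hallucination rate: fraction of the model's claims that are
    # not grounded in the validated answer (lower is better).
    return sum(p not in gold_points for p in predicted_points) / len(predicted_points)

def paired_t(before, after):
    # Paired t statistic over per-question scores, e.g. recall
    # before vs. after applying CoT or LTM to the same questions.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical example: one answer covering 2 of 3 gold key points,
# with 1 of its 2 claims unsupported.
gold = {"osseointegration", "titanium surface", "insertion torque"}
answer = {"osseointegration", "magnetic retention"}
r = recall(answer, gold)              # 1/3: only one gold point covered
h = hallucination_rate(answer, gold)  # 1/2: one unsupported claim
```

A per-question paired design is what makes the t-test valid here: each question is its own control, so the test operates on the score differences rather than on two independent samples.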
ISSN:1472-6831