Performance of Advanced Artificial Intelligence Models in Traumatic Dental Injuries in Primary Dentition: A Comparative Evaluation of ChatGPT-4 Omni, DeepSeek, Gemini Advanced, and Claude 3.7 in Terms of Accuracy, Completeness, Response Time, and Readability

Bibliographic Details
Main Authors: Berkant Sezer, Tuğba Aydoğdu
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/15/14/7778
Description
Summary: This study aimed to evaluate and compare the performance of four advanced artificial intelligence-powered chatbots, ChatGPT-4 Omni (ChatGPT-4o), DeepSeek, Gemini Advanced, and Claude 3.7 Sonnet, in responding to questions related to traumatic dental injuries (TDIs) in the primary dentition. The assessment focused on accuracy, completeness, readability, and response time, in alignment with the 2020 International Association of Dental Traumatology guidelines. Twenty-five open-ended TDI questions were submitted to each model in two separate sessions. Responses were anonymized and evaluated by four pediatric dentists. Accuracy and completeness were rated on Likert scales, readability was assessed with five standard indices, and response times were recorded in seconds. ChatGPT-4o demonstrated significantly higher accuracy than Gemini Advanced (p = 0.005), while DeepSeek outperformed Gemini Advanced in completeness (p = 0.010). Response times differed significantly (p < 0.001): DeepSeek was the slowest, and ChatGPT-4o and Gemini Advanced were the fastest. DeepSeek produced the most readable outputs of the four models, though none met public readability standards, and Claude 3.7 generated the most complex texts (p < 0.001). A strong correlation was observed between accuracy and completeness (ρ = 0.701, p < 0.001). Given this varied performance, the findings support only cautious integration of artificial intelligence chatbots into pediatric dental care. Clinical accuracy, completeness, and readability are critical when providing guideline-aligned information to support decision-making in dental trauma management.
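The summary reports a Spearman rank correlation between accuracy and completeness ratings and evaluates readability with standard indices. The sketch below illustrates, on entirely hypothetical rating data (not values from the study), how such measures are typically computed; it uses scipy and textstat as example libraries, which the paper does not confirm were employed.

```python
# Minimal sketch: Spearman correlation between accuracy and completeness
# ratings, plus one standard readability index for a sample response.
# All data below are hypothetical placeholders, not values from the study.
from scipy.stats import spearmanr
import textstat

# Hypothetical Likert-scale accuracy and completeness scores for ten responses
accuracy = [3, 2, 3, 3, 1, 2, 3, 2, 3, 1]
completeness = [3, 2, 3, 2, 1, 2, 3, 3, 3, 1]

# Spearman's rho measures monotonic association between the two rating sets
rho, p_value = spearmanr(accuracy, completeness)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")

sample_response = (
    "If a primary tooth is intruded, radiographic evaluation determines "
    "whether the apex is displaced toward or through the labial bone plate."
)
# Flesch Reading Ease is one of several standard indices; the study used five
print(f"Flesch Reading Ease: {textstat.flesch_reading_ease(sample_response):.1f}")
```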
ISSN: 2076-3417