Performance of Advanced Artificial Intelligence Models in Traumatic Dental Injuries in Primary Dentition: A Comparative Evaluation of ChatGPT-4 Omni, DeepSeek, Gemini Advanced, and Claude 3.7 in Terms of Accuracy, Completeness, Response Time, and Readability
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-07-01 |
| Series: | Applied Sciences |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2076-3417/15/14/7778 |
| Summary: | This study aimed to evaluate and compare the performance of four advanced artificial intelligence-powered chatbots—ChatGPT-4 Omni (ChatGPT-4o), DeepSeek, Gemini Advanced, and Claude 3.7 Sonnet—in responding to questions related to traumatic dental injuries (TDIs) in the primary dentition. The assessment focused on accuracy, completeness, readability, and response time, in alignment with the 2020 International Association of Dental Traumatology guidelines. Twenty-five open-ended TDI questions were submitted to each model in two separate sessions. Responses were anonymized and evaluated by four pediatric dentists. Accuracy and completeness were rated on Likert scales; readability was assessed using five standard indices (a scoring sketch follows this record); and response times were recorded in seconds. ChatGPT-4o demonstrated significantly higher accuracy than Gemini Advanced (<i>p</i> = 0.005), while DeepSeek outperformed Gemini Advanced in completeness (<i>p</i> = 0.010). Response times differed significantly (<i>p</i> < 0.001): DeepSeek was the slowest, while ChatGPT-4o and Gemini Advanced were the fastest. DeepSeek produced the most readable outputs of the four models, although none met public readability standards. Claude 3.7 generated the most complex texts (<i>p</i> < 0.001). A strong correlation existed between accuracy and completeness (ρ = 0.701, <i>p</i> < 0.001). Given this varied performance, these findings argue for the cautious integration of artificial intelligence chatbots into pediatric dental care. Clinical accuracy, completeness, and readability are critical when providing guideline-aligned information to support decision-making in dental trauma management. |
| ISSN: | 2076-3417 |
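The abstract does not name the five readability indices used. The sketch below assumes two common choices in studies of this kind, Flesch Reading Ease and Flesch-Kincaid Grade Level, and shows only how a chatbot answer might be scored; the `count_syllables` heuristic is a naive stand-in, not the study's method.

```python
import re

def count_syllables(word: str) -> int:
    """Naive vowel-group heuristic; published studies use calibrated tools."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # drop a trailing silent 'e'
    return max(n, 1)

def readability(text: str) -> dict:
    """Flesch Reading Ease and Flesch-Kincaid Grade for one response."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)   # words per sentence
    spw = syllables / max(len(words), 1)        # syllables per word
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "fk_grade_level": 0.39 * wps + 11.8 * spw - 15.59,
    }

print(readability("Rinse the avulsed primary tooth gently. Do not replant it."))
```

A rank correlation like the reported ρ = 0.701 between accuracy and completeness ratings could likewise be computed with `scipy.stats.spearmanr(accuracy_scores, completeness_scores)`, assuming paired per-question scores.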