A systematic multimodal assessment of AI machine translation tools for enhancing access to critical care education internationally
| Main Authors: | , , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-07-01 |
| Series: | BMC Medical Education |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s12909-025-07452-9 |
| Summary: | Abstract Background Language barriers are a major obstacle to expanding access to critical care education worldwide. Machine translation (MT) offers considerable promise for increasing access to critical care content and has evolved rapidly with newer artificial intelligence frameworks and large language models. The best approach to systematically applying and evaluating these tools, however, remains unclear. Methods We developed a multimodal method to evaluate translations of critical care content used as part of an established international critical care education program. Four freely available MT tools were selected (DeepL™, Google Gemini™, Google Translate™, Microsoft CoPilot™) and used to translate selected phrases and paragraphs into Chinese (Mandarin), Spanish, and Ukrainian. A human translation performed by a professional medical translator was used for comparison. These translations were compared using (1) blinded bilingual clinician evaluations with anchored Likert domains of fluency, adequacy, and meaning; (2) automated BiLingual Evaluation Understudy (BLEU) scores; and (3) the validated System Usability Scale (SUS) to assess the ease of use of each MT tool. Blinded bilingual clinician evaluations were reported as individual domain scores and as averaged composite scores. Results Blinded clinician composite scores were highest for human translation (Chinese), Google Gemini (Spanish), and Microsoft CoPilot (Ukrainian); Microsoft CoPilot (Chinese) and Google Translate (Spanish and Ukrainian) earned the lowest scores. All Chinese and Spanish versions received “understandable to good” or “high quality” BLEU scores, while Ukrainian versions overall scored “hard to get the gist,” except with Microsoft CoPilot. Usability scores were highest with DeepL (Chinese), Google Gemini (Spanish), and Google Translate (Ukrainian), and lower with Microsoft CoPilot (Chinese and Ukrainian) and Google Translate (Spanish). Conclusion No single MT tool performed best across all metrics and languages, highlighting the importance of routinely reassessing these tools during educational activities given their rapid ongoing evolution. We offer a multimodal evaluation methodology to aid this assessment as medical educators expand their use of MT in international educational programs. |
| ISSN: | 1472-6920 |
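
The automated portion of the evaluation described in the Summary relies on BLEU, a reference-based n-gram overlap metric. As a rough illustration of how such a comparison is computed, the sketch below uses the open-source sacrebleu package; the sentence pair, variable names, and single-reference setup are illustrative assumptions, not the study's actual corpus or pipeline.

```python
# A minimal sketch of BLEU scoring against a human reference translation,
# assuming the open-source `sacrebleu` package (pip install sacrebleu).
# The sentences below are placeholders, not taken from the study corpus.
import sacrebleu

# Hypothetical MT output and the corresponding human reference translation.
mt_outputs = ["El paciente necesita ventilación mecánica inmediata."]
human_refs = [["El paciente requiere ventilación mecánica de inmediato."]]

# corpus_bleu takes a list of hypotheses and a list of reference streams
# (here, a single human reference per sentence).
result = sacrebleu.corpus_bleu(mt_outputs, human_refs)
print(f"BLEU: {result.score:.1f}")  # reported on a 0-100 scale

# For Chinese, whitespace tokenization is inappropriate; sacrebleu
# offers a dedicated tokenizer via corpus_bleu(..., tokenize="zh").
```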
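The System Usability Scale cited in the Summary has a standard published scoring rule: odd-numbered (positively worded) items contribute (response − 1), even-numbered (negatively worded) items contribute (5 − response), and the sum is multiplied by 2.5 to yield a 0-100 score. The sketch below implements that rule; the example responses are hypothetical, not data from the study.

```python
# A minimal sketch of standard System Usability Scale (SUS) scoring,
# assuming ten item responses on a 1-5 Likert scale.
def sus_score(responses):
    """Compute the 0-100 SUS score from ten 1-5 Likert responses.

    Odd-numbered items (positively worded) contribute (response - 1);
    even-numbered items (negatively worded) contribute (5 - response).
    The summed contributions are scaled by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd item
        for i, r in enumerate(responses)
    )
    return total * 2.5

# Example: one rater's hypothetical responses for an MT tool.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```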