A framework for evaluating cultural bias and historical misconceptions in LLMs outputs

Large Language Models (LLMs), while powerful, often perpetuate cultural biases and historical inaccuracies from their training data, marginalizing underrepresented perspectives. To address these issues, we introduce a structured framework to systematically evaluate and quantify these deficiencies. Our methodology combines culturally sensitive prompting with two novel metrics: the Cultural Bias Score (CBS) and the Historical Misconception Score (HMS). Our analysis reveals varying cultural biases across LLMs, with certain Western-centric models, such as Gemini, exhibiting higher bias. In contrast, other models, including ChatGPT and Poe, demonstrate more balanced cultural narratives. We also find that historical misconceptions are most prevalent for less-documented events, underscoring the critical need for training data diversification. Our framework suggests the potential effectiveness of bias-mitigation techniques, including dataset augmentation and human-in-the-loop (HITL) verification. Empirical validation of these strategies remains an important direction for future work. This work provides a replicable and scalable methodology for developers and researchers to help ensure the responsible and equitable deployment of LLMs in critical domains such as education and content moderation.
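The abstract names two metrics, the Cultural Bias Score (CBS) and the Historical Misconception Score (HMS), but does not give their formal definitions. As a rough illustration only, a bias-style score over a batch of model answers could be sketched as the deviation of perspective coverage from an even spread; the function name, labels, and formula below are hypothetical stand-ins, not the authors' actual CBS:

```python
from collections import Counter

def coverage_bias_score(mentions, perspectives):
    """Toy bias score (NOT the paper's CBS formula).

    `mentions` is a list of perspective labels, one per cultural
    reference detected in the model's answers; `perspectives` is the
    fixed set of labels being tracked. Returns 0.0 when coverage is
    perfectly even across perspectives, 1.0 when a single perspective
    receives all the attention.
    """
    counts = Counter(mentions)
    k = len(perspectives)
    n = len(mentions)
    if n == 0 or k <= 1:
        return 0.0
    # Total variation distance between the observed label distribution
    # and the uniform distribution over the k tracked perspectives.
    tvd = 0.5 * sum(abs(counts.get(p, 0) / n - 1 / k) for p in perspectives)
    # Rescale by the maximum possible distance (all mass on one label).
    return tvd / (1 - 1 / k)
```

For example, three answers that all foreground a "Western" framing against a tracked set of three perspectives score 1.0, while one answer per perspective scores 0.0. Any real implementation of CBS/HMS would need the labeling scheme and weighting described in the paper itself.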

Bibliographic Details
Main Authors: Moon-Kuen Mak (Institute for the History of Natural Sciences, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China), Tiejian Luo (University of Chinese Academy of Sciences, Beijing, China; corresponding author)
Format: Article
Language: English
Published: KeAi Communications Co. Ltd., 2025-09-01
Series: BenchCouncil Transactions on Benchmarks, Standards and Evaluations, Vol. 5, Iss. 3, Article 100235
ISSN: 2772-4859
DOI: 10.1016/j.tbench.2025.100235
Subjects: Large language model; Artificial intelligence; Cultural bias; Historical misconception; Human-in-the-loop
Online Access: http://www.sciencedirect.com/science/article/pii/S2772485925000481