The sociolinguistic foundations of language modeling

Bibliographic Details
Main Authors: Jack Grieve, Sara Bartl, Matteo Fuoli, Jason Grafmiller, Weihang Huang, Alejandro Jawerbaum, Akira Murakami, Marcus Perlman, Dana Roemling, Bodo Winter
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-01-01
Series: Frontiers in Artificial Intelligence
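Subjects: AI ethics, artificial intelligence, computational sociolinguistics, corpus linguistics, large language models, natural language processing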
Online Access: https://www.frontiersin.org/articles/10.3389/frai.2024.1472411/full
author Jack Grieve
Sara Bartl
Matteo Fuoli
Jason Grafmiller
Weihang Huang
Alejandro Jawerbaum
Akira Murakami
Marcus Perlman
Dana Roemling
Bodo Winter
collection DOAJ
description In this article, we introduce a sociolinguistic perspective on language modeling. We claim that language models in general are inherently modeling varieties of language, and we consider how this insight can inform the development and deployment of language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective could help us better understand five basic challenges in language modeling: social bias, domain adaptation, alignment, language change, and scale. We argue that to maximize the performance and societal value of language models it is important to carefully compile training corpora that accurately represent the specific varieties of language being modeled, drawing on theories, methods, and descriptions from the field of sociolinguistics.
format Article
id doaj-art-2e009cf201384ca8b46d1fcc575a595f
institution Kabale University
issn 2624-8212
language English
publishDate 2025-01-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Artificial Intelligence
spelling doaj-art-2e009cf201384ca8b46d1fcc575a595f; 2025-01-13T06:11:01Z; eng; Frontiers Media S.A.; Frontiers in Artificial Intelligence; 2624-8212; 2025-01-01; 7; 10.3389/frai.2024.1472411; 1472411
title The sociolinguistic foundations of language modeling
topic AI ethics
artificial intelligence
computational sociolinguistics
corpus linguistics
large language models
natural language processing