Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation

Bibliographic Details
Main Authors: Михаил Тихомиров, Даниил Чернышев
Format: Article
Language: English
Published: National Research University Higher School of Economics, 2024-12-01
Series: Journal of Language and Education
Subjects: large language model; language adaptation; natural language generation; llama
Online Access: https://jle.hse.ru/article/view/22224
author Михаил Тихомиров
Даниил Чернышев
collection DOAJ
description Background: Recent advancements in large language model (LLM) technologies have introduced powerful open-source instruction-tuned LLMs that match the text generation quality of leading models such as GPT-4. Although these models accelerate LLM adoption in sensitive-information environments, the lack of disclosed training data hinders replication and keeps such achievements exclusive to specific models. Purpose: Given the multilingual nature of the latest generation of open-source LLMs, the benefits of training language-specific LLMs diminish, leaving computational efficiency as the sole guaranteed advantage of this computationally expensive procedure. This work aims to address the language-adaptation limitations posed by restricted access to high-quality instruction-tuning data and to offer a more cost-effective adaptation pipeline. Method: To tackle these challenges, we introduce Learned Embedding Propagation (LEP), a novel method with lower training data requirements and minimal disruption of existing LLM knowledge. LEP employs an innovative embedding propagation technique that bypasses the need for instruction tuning and integrates new language knowledge directly into any instruction-tuned variant of the model. Additionally, we developed Darumeru, a new benchmark for evaluating text generation robustness during training, specifically tailored for Russian adaptation. Results: We applied LEP to adapt LLaMa-3-8B and Mistral-7B to Russian, testing four different vocabulary adaptation scenarios. Evaluation demonstrates that LEP achieves performance competitive with OpenChat 3.5 and LLaMa-3-8B-Instruct. Further improvements were observed through self-calibration and additional instruction-tuning steps, enhancing task-solving capabilities beyond the original models. Conclusion: LEP offers a viable and efficient alternative to traditional language-specific instruction tuning, significantly reducing the cost of language adaptation while matching or surpassing the performance benchmarks set by contemporary LLMs.
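
The abstract describes embedding propagation only at a high level, and the record discloses no code, so the sketch below is a rough, non-authoritative illustration of the general idea rather than the authors' method: transplant the embedding matrices learned during base-model Russian adaptation into the matching instruction-tuned checkpoint, leaving the instruction-tuned transformer weights untouched. The model paths, the Hugging Face transformers API usage, and the choice to copy only the input and output embeddings are all assumptions made for illustration.

# Hypothetical sketch of "embedding propagation" (not the authors' released code):
# copy embeddings learned during base-model Russian adaptation into the matching
# instruction-tuned checkpoint, so it gains the new vocabulary without re-running
# instruction tuning. Paths below are placeholders.
import torch
from transformers import AutoModelForCausalLM

adapted = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-3-8b-russian-adapted-base",  # hypothetical adapted base model
    torch_dtype=torch.bfloat16,
)
instruct = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
)

# Grow the instruct model's embedding tables to the adapted (Russian) vocabulary size.
new_vocab_size = adapted.get_input_embeddings().weight.shape[0]
instruct.resize_token_embeddings(new_vocab_size)

# Propagate the learned input/output embeddings; all transformer-block weights
# (and hence the instruction tuning) stay untouched.
with torch.no_grad():
    instruct.get_input_embeddings().weight.copy_(adapted.get_input_embeddings().weight)
    instruct.get_output_embeddings().weight.copy_(adapted.get_output_embeddings().weight)

instruct.save_pretrained("llama-3-8b-instruct-russian-lep")

Under these assumptions, the expensive step (continued pre-training of the base model on Russian text with an extended vocabulary) happens once, and its learned embeddings are then reused across any instruction-tuned variant of the same base model, which is consistent with the cost savings the abstract claims.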
format Article
id doaj-art-d24352c034684912ada5218cd1146c81
institution Kabale University
issn 2411-7390
language English
publishDate 2024-12-01
publisher National Research University Higher School of Economics
record_format Article
series Journal of Language and Education
spelling Journal of Language and Education, vol. 10, no. 4, 2024-12-01. DOI: 10.17323/jle.2024.22224
Author affiliations: Михаил Тихомиров — Lomonosov Moscow State University, Moscow, Russia; Даниил Чернышев — Lomonosov Moscow State University, Moscow, Russia
title Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation
topic large language model
language adaptation
natural language generation
llama
url https://jle.hse.ru/article/view/22224