Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation

Bibliographic Details
Main Authors: Михаил Тихомиров, Даниил Чернышев
Format: Article
Language: English
Published: National Research University Higher School of Economics, 2024-12-01
Series: Journal of Language and Education
Subjects: large language model; language adaptation; natural language generation; llama
Online Access: https://jle.hse.ru/article/view/22224
author Михаил Тихомиров
Даниил Чернышев
collection DOAJ
description Background: Recent advancements in large language model (LLM) technologies have introduced powerful open-source instruction-tuned LLMs that match the text generation quality of leading models such as GPT-4. Although these models accelerate LLM adoption in sensitive-information environments, the lack of disclosed training data hinders replication and keeps such achievements exclusive to specific models. Purpose: Given the multilingual nature of the latest generation of open-source LLMs, the benefits of training language-specific LLMs diminish, leaving computational efficiency as the sole guaranteed advantage of this computationally expensive procedure. This work aims to address the language-adaptation limitations posed by restricted access to high-quality instruction-tuning data and to offer a more cost-effective adaptation pipeline. Method: To tackle these challenges, we introduce Learned Embedding Propagation (LEP), a novel method with lower training data requirements and minimal disruption of existing LLM knowledge. LEP employs an innovative embedding propagation technique that bypasses the need for instruction tuning and integrates new language knowledge directly into any instruction-tuned variant of the model. Additionally, we developed Darumeru, a new benchmark for evaluating text generation robustness during training, specifically tailored for Russian adaptation. Results: We applied LEP to adapt LLaMa-3-8B and Mistral-7B to Russian, testing four different vocabulary adaptation scenarios. Evaluation demonstrates that LEP achieves performance competitive with OpenChat 3.5 and LLaMa-3-8B-Instruct. Further improvements were observed through self-calibration and additional instruction-tuning steps, enhancing task-solving capabilities beyond the original models. Conclusion: LEP offers a viable and efficient alternative to traditional language-specific instruction tuning, significantly reducing the cost of language adaptation while matching or surpassing the performance benchmarks set by contemporary LLMs.
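
The abstract describes embedding propagation only at a high level, and the record discloses no code, so the sketch below is a rough, non-authoritative illustration of the general idea rather than the authors' method: transplant the embedding matrices learned during base-model Russian adaptation into the matching instruction-tuned checkpoint, leaving the instruction-tuned transformer weights untouched. The model paths, the Hugging Face transformers API usage, and the choice to copy only the input and output embeddings are all assumptions made for illustration.

# Hypothetical sketch of "embedding propagation" (not the authors' released code):
# copy embeddings learned during base-model Russian adaptation into the matching
# instruction-tuned checkpoint, so it gains the new vocabulary without re-running
# instruction tuning. Paths below are placeholders.
import torch
from transformers import AutoModelForCausalLM

adapted = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-3-8b-russian-adapted-base",  # hypothetical adapted base model
    torch_dtype=torch.bfloat16,
)
instruct = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
)

# Grow the instruct model's embedding tables to the adapted (Russian) vocabulary size.
new_vocab_size = adapted.get_input_embeddings().weight.shape[0]
instruct.resize_token_embeddings(new_vocab_size)

# Propagate the learned input/output embeddings; all transformer-block weights
# (and hence the instruction tuning) stay untouched.
with torch.no_grad():
    instruct.get_input_embeddings().weight.copy_(adapted.get_input_embeddings().weight)
    instruct.get_output_embeddings().weight.copy_(adapted.get_output_embeddings().weight)

instruct.save_pretrained("llama-3-8b-instruct-russian-lep")

Under these assumptions, the expensive step (continued pre-training of the base model on Russian text with an extended vocabulary) happens once, and its learned embeddings are then reused across any instruction-tuned variant of the same base model, which is consistent with the cost savings the abstract claims.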
format Article
id doaj-art-d24352c034684912ada5218cd1146c81
institution Kabale University
issn 2411-7390
language English
publishDate 2024-12-01
publisher National Research University Higher School of Economics
record_format Article
series Journal of Language and Education
spelling Journal of Language and Education, vol. 10, no. 4, 2024-12-01. DOI: 10.17323/jle.2024.22224
Author affiliations: Михаил Тихомиров — Lomonosov Moscow State University, Moscow, Russia; Даниил Чернышев — Lomonosov Moscow State University, Moscow, Russia
title Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation
topic large language model
language adaptation
natural language generation
llama
url https://jle.hse.ru/article/view/22224