MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions

Abdullatif Köksal (CIS, LMU Munich; Munich Center for Machine Learning; Language Technology Lab, University of Cambridge)
Marion Thaler (CIS, LMU Munich)
Ayyoob Imani (CIS, LMU Munich; Munich Center for Machine Learning)
Ahmet Üstün (Cohere for AI)
Anna Korhonen (Language Technology Lab, University of Cambridge)
Hinrich Schütze (CIS, LMU Munich; Munich Center for Machine Learning)

Abstract

Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach’s effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.

1 Introduction

Instruction tuning refines large language models (LLMs) based on user intentions, enhancing their ability to generalize across tasks and align with human preferences (Ouyang et al., 2022; Sanh et al., 2022; Muennighoff et al., 2023; Wang et al., 2022). While pre-training data can be automatically collected from the web, preparing instruction tuning data is challenging as it requires aligned instruction-output pairs. Three main approaches have been applied to create instruction tuning datasets: human annotation (Ouyang et al., 2022; Köpf et al., 2023; Conover et al., 2023), templatized NLP tasks (Sanh et al., 2022; Wang et al., 2022; Longpre et al., 2023a), and synthetic data generation via LLMs (Wang et al., 2023; Honovich et al., 2022).

For low-resource languages, these approaches face serious limitations. Human annotation is costly, and finding native speakers for low-resource languages is challenging. Templatizing NLP tasks restricts datasets to specific structures and domains, limiting their general applicability, and there is insufficient NLP task-annotated data for low-resource languages (ImaniGooghari et al., 2023). Synthetic data generation is constrained by the languages supported by existing models and suffers from validity (Wang et al., 2023) and creativity (Honovich et al., 2022) issues. Moreover, outputs of both template-based and synthetic data generation methods heavily rely on translation pipelines and are particularly prone to artifacts known as “translationese” (Gellerstam, 1986). These artifacts include simplified vocabulary and grammar as well as unidiomatic word order and expressions, and translated text often neglects linguistic and cultural context. Translationese has been shown to negatively impact model training (Yu et al., 2022) by distorting examples and further distancing them from their linguistic and cultural context.

Consequently, no existing model, open source or proprietary, supports low-resource languages with the quality necessary for high-quality instruction tuning dataset generation. This has resulted in a disparity, with English dominating 73% of the most popular datasets (Longpre et al., 2023), leaving low-resource language communities underserved.

In this work, we introduce Multilingual Reverse Instructions (MURI), a novel approach to generate instruction tuning datasets for low-resource languages without requiring annotators, task-annotated data, or pre-trained multilingual models. MURI employs the reverse instructions method proposed by Köksal et al. (2024) and combines it with machine translation to develop a language-specific instruction, $i_{\tau}$ , for a text $d_{\tau}$ . This involves translating $d_{\tau}$ to English, generating an English instruction $i_{\epsilon}$ using reverse instructions, and translating $i_{\epsilon}$ back to the original language, so that it can serve as the instruction for $d_{\tau}$ . Notably, unlike fully translation-based methods, our approach requires translating only the instructions, which makes it possible to use authentic outputs in the target language. The instruction-output pair ( $i_{\tau}$ , $d_{\tau}$ ) can then be used to fine-tune a language model to follow instructions. This approach is cost-effective and applicable to any language with available textual data.

By applying MURI to texts from diverse sources, we have created MURI-IT, a dataset containing more than 2 million instruction-output pairs across 200 languages. To our knowledge, this dataset offers the broadest language coverage for multilingual instruction tuning. Our sources include Wikipedia, WikiHow, and various web-crawled pages, providing a rich variety in style, domain, and length. The output documents, sourced directly from data in their original languages, retain cultural and linguistic nuances. Additionally, we use quality filters to ensure the dataset’s high standards.

To evaluate MURI-IT, native speakers across 13 languages assess the dataset on five aspects to gauge quality. We also fine-tune several mT5 family models using MURI-IT to execute instruction-based tasks, assessing their performance in natural language understanding and generation. For instance, MURI-101, an mT5-XXL model instruction-tuned with MURI-IT, outperforms prior models like mT0 (Muennighoff et al., 2023) by over 14% in multilingual MMLU. In open-ended generation tasks, it delivers much better outputs than mT0, with win rates of 59% vs. 28%. Additionally, MURI-IT enhances performance when used alongside existing datasets like Aya. We make the fine-tuned models and MURI-IT publicly available.

To summarize, our contributions are:
(i) We introduce Multilingual Reverse Instructions (MURI), a cost-effective method for creating multilingual instruction tuning datasets applicable to hundreds of languages.
(ii) We create and publish MURI-IT, an instruction tuning dataset for 200 languages using MURI. This dataset consists of 2,228,499 instruction-output pairs, with 64% of the data from low-resource languages.
(iii) We evaluate and analyze the dataset with native speakers in 13 languages. We find that the data is highly idiomatic in many languages.
(iv) We fine-tune and release MURI-101, a massively multilingual instruction-following language model using MURI-IT.

2 Related Work

Instruction Tuning Datasets

Instruction tuning has emerged as a powerful approach to enhancing the instruction-following capabilities of LLMs, as demonstrated by numerous studies (Ouyang et al., 2022; Sanh et al., 2022; Muennighoff et al., 2023; Wang et al., 2022). The three primary strategies to create instruction tuning datasets are human curation, templatized tasks, and synthetic generation via LLMs.

Human-curated datasets, like Open Assistant (Köpf et al., 2023) and Dolly (Conover et al., 2023), involve extensive human annotation but are difficult to scale and extend to more languages due to high cost. Alternatively, datasets such as Public Pool of Prompts (P3) (Sanh et al., 2022), SuperNatural Instructions (NIv2) (Wang et al., 2022), and FLAN (Longpre et al., 2023a) utilize NLP task reformulation to reduce cost and enhance applicability but still struggle with general-purpose instruction following since their main focus is on NLU tasks.

To address these issues, synthetic datasets have been developed, such as Self-Instruct (Wang et al., 2023), TeGit (Chen et al., 2023), and Unnatural Instructions (Honovich et al., 2022). These datasets offer greater task diversity but are challenged by issues of validity and creativity. The reverse instructions method (Köksal et al., 2024), which employs data augmentation via generative models and pretraining corpora, further exemplifies cost-effective dataset generation. A similar method has been successfully applied in Bactrian-X (Li et al., 2023), demonstrating its effectiveness in multilingual settings by leveraging translation for synthetic data.

Figure 1: Multilingual Reverse Instructions (MURI). Step 1: MURI selects a high-quality human-written example (<doc>) from multilingual corpora. Step 2: Translation into the English document <doc_eng>. Step 3 applies the reverse instructions method to <doc_eng> (i.e., prompting the LLM to generate a matching instruction <inst_eng>). Step 4: <inst_eng> is translated back into the source language (<inst>), resulting in a (<inst>, <doc>) pair where the <doc> output is human-written.

Multilingual instruction tuning has shown substantial benefits, especially for low-resource languages (Muennighoff et al., 2023). It not only maintains performance in English but also enhances capabilities in non-English languages with the help of a large number of English examples (Shaham et al., 2024; AI@Meta, 2024). Despite these advancements, large-scale multilingual datasets often remain limited. Efforts to overcome this include pre-training on diverse multilingual data (Chung et al., 2022; Chowdhery et al., 2022), creating dedicated multilingual instruction fine-tuning sets (Muennighoff et al., 2023), and manual annotation by native speakers (Singh et al., 2024). Extensions to existing datasets often utilize automatic translation (Li et al., 2023; Winata et al., 2023; Holmström and Doostmohammadi, 2023) and template-based generation (Gupta et al., 2023). These methods strive to balance diversity against resource constraints and quality issues inherent in automated translation processes, as seen in the extensive Aya collection (Singh et al., 2024). In summary, while instruction tuning has greatly advanced the capabilities of LLMs in following complex instructions, challenges remain in dataset diversity, validity, and the integration of multilingual content.

Multilingual LLMs

In the recent surge of LLMs, English-centric models like the closed-source GPT model family (Radford et al., 2019; Ouyang et al., 2022) and open-source ones like LLaMA and Pythia (Touvron et al., 2023; Biderman et al., 2023) have gained prominence. Multilingual models, unlike monolingual ones, offer the advantage of facilitating cross-lingual tasks such as translation (Jiao et al., 2023; Xu et al., 2024) and addressing low-resource languages (Artetxe and Schwenk, 2019; Wu and Dredze, 2020). mBERT pioneered the multilingual area by demonstrating that training on multilingual data allows different languages to be represented in a unified semantic space (Devlin et al., 2019). Building on this foundation, subsequent models such as XLM-R, Glot500, and AfriBERTa extended the capabilities of transformer-based models to hundreds of languages (Conneau et al., 2020; ImaniGooghari et al., 2023; Ogueji et al., 2021).

However, the progression of multilingual encoder-decoder and decoder-only models has been more restrained. Models like LLaMA and Mistral, for instance, are trained on datasets predominantly composed of English, with limited data from a select group of high-resource languages (Touvron et al., 2023; Jiang et al., 2023). In contrast, models like XGLM, BLOOM, and mGPT have been developed from scratch to support extensive language diversity (Lin et al., 2022; Workshop et al., 2023; Shliazhko et al., 2022). Meanwhile, mT5 is trained on 101 languages, an important step in encoder-decoder model training (Xue et al., 2021).

3 The MURI-IT Dataset

We introduce MURI-IT, which includes 2,228,499 instruction-output pairs in 200 languages. The dataset is primarily constructed by applying Multilingual Reverse Instructions (MURI) to the CulturaX (Nguyen et al., 2023) and Wikipedia corpora. The core idea is summarized in Figure 1. Our goal is to utilize existing high-quality human-written multilingual corpora to generate a diverse instruction-following dataset. For a randomly selected text, we aim to generate an instruction for which the high-quality corpus text would serve as a good response. This approach ensures that a model trained with this dataset will not be conditioned on outputs with translation artifacts or culturally irrelevant topics.

MURI-IT also incorporates two additional subsets: 54,578 instances collected from the WikiHow website in 18 languages to augment the dataset, and 455,472 existing NLP task examples in 74 languages to enrich its diversity. This section details the steps used to produce MURI-IT.

3.1 Multilingual Reverse Instructions (MURI)

Step 1. Data Selection: We randomly sample documents from two multilingual corpora: CulturaX (1,076,575 documents) and Wikipedia (1,554,207 documents). CulturaX encompasses 167 languages, merging the OSCAR (Ortiz Suárez et al., 2020) and mC4 (Xue et al., 2021) corpora with additional cleaning, deduplication, language identification, and diversification procedures (Nguyen et al., 2023). Wikipedia spans over 350 languages with high-quality documents.

Instruction Generation. After selecting high-quality outputs, the next step is generating suitable instructions in the source language. Since recent LLMs support a limited number of languages, we utilize machine translation models and English LLMs. Let $(i_{\tau}^{k},d_{\tau}^{k})$ be an instruction-output pair in a target low-resource language $\tau$ . Given a corpus of human-written documents $D_{\tau}=\{d_{\tau}^{1},d_{\tau}^{2},\ldots,d_{\tau}^{n}\}$ , we aim to create an instruction tuning dataset $D_{I\tau}=\{(i_{\tau}^{1},d_{\tau}^{1}),(i_{\tau}^{2},d_{\tau}^{2}),\ldots,(i_{\tau}^{n},d_{\tau}^{n})\}$ .

Step 2. Document Translation: First, each document $d_{\tau}^{k}$ is translated to English using a machine translation model, resulting in $d_{\epsilon}^{k}$ . We use MADLAD-400-3B-MT (Kudugunta et al., 2023) with top_p=1 sampling for translation.

Step 3. Reverse Instructions: Next, we employ an English LLM for instruction generation. We modify the reverse instructions prompt of Köksal et al. (2024) to generate an instruction $i_{\epsilon}^{k}$ for $d_{\epsilon}^{k}$ in a few-shot manner, as illustrated in Table 7 in the Appendix. We use Mixtral-8x7B (Jiang et al., 2024) with greedy decoding for instruction generation.

Step 4. Translating Instruction to the Source Language and Ensuring Language Consistency: Finally, the generated instruction $i_{\epsilon}^{k}$ is translated back to its source language using MADLAD-400-3B-MT, denoted as $i_{\tau}^{k}$ . To verify language consistency, we utilize GlotLID (Kargaran et al., 2023) and discard mismatched translations.
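Steps 2 to 4 can be sketched as a single function that takes the models as plain callables. This is only an illustration of the data flow: the names `muri_pair`, `InstructionPair`, the three callable signatures, and the three-letter language codes are assumptions made here, and the toy functions at the bottom merely tag strings in place of MADLAD-400-3B-MT, Mixtral-8x7B, and GlotLID.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InstructionPair:
    instruction: str  # i_tau: instruction in the source language
    output: str       # d_tau: the original human-written document, never translated


def muri_pair(
    doc: str,
    lang: str,
    translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> text
    reverse_instruct: Callable[[str], str],      # English document -> English instruction
    detect_lang: Callable[[str], str],           # text -> language code
) -> Optional[InstructionPair]:
    """Build one (instruction, output) pair; return None if the language check fails."""
    doc_eng = translate(doc, lang, "eng")    # Step 2: document -> English
    inst_eng = reverse_instruct(doc_eng)     # Step 3: reverse instructions
    inst = translate(inst_eng, "eng", lang)  # Step 4: instruction -> source language
    if detect_lang(inst) != lang:            # discard mismatched translations
        return None
    return InstructionPair(instruction=inst, output=doc)


# Toy stand-ins so the sketch runs without any model; they only tag strings.
def toy_translate(text: str, src: str, tgt: str) -> str:
    return f"<{tgt}>{text}"


def toy_reverse_instruct(doc_eng: str) -> str:
    return f"Write a text about: {doc_eng}"


def toy_detect_lang(text: str) -> str:
    return text[1:4]  # reads the language tag written by toy_translate


pair = muri_pair("Merhaba dünya", "tur", toy_translate, toy_reverse_instruct, toy_detect_lang)
```

Note that only the instruction passes through translation twice; the `output` field is the input document itself, which is the property that keeps outputs free of translation artifacts.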

Step 5. Content Screening: We observe that some examples contain violent or noisy content due to the nature of the corpora. We utilize the RoBERTa hate-speech model (Vidgen et al., 2021) to screen the generated instruction-output pairs $(i_{\epsilon}^{k},d_{\epsilon}^{k})$ in English and eliminate unsuitable examples. To ensure dataset integrity and eliminate redundancy, we employ the MinHashLSHForest method for deduplication.
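To illustrate the idea behind MinHash-based deduplication, the following is a minimal standard-library sketch. It is not the MinHashLSHForest index named above: it computes MinHash signatures and compares them pairwise, which is quadratic and only suitable for small collections. The shingle size, the number of hash functions, and the 0.8 threshold are assumptions chosen for the example.

```python
import hashlib

NUM_PERM = 64  # number of seeded hash functions (illustrative choice)


def shingles(text, k=5):
    """Character k-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(text, num_perm=NUM_PERM):
    """MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of an already kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

An LSH index replaces the pairwise loop in `deduplicate` with an approximate nearest-neighbor lookup over the same kind of signatures, which is what makes the approach practical at the scale of millions of documents.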

Manual screening revealed unsuitable instructions lacking necessary context, such as instructions asking for summarization of non-existent prior documents or requesting translations. We excluded these instruction-output pairs from our dataset by filtering out instructions including the words summarize or translate. Additionally, we observed that web-sourced documents often include extraneous content like footers, headers, or advertisements. We leave the elimination of such extraneous content for future work.
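The keyword filter described above amounts to a simple predicate on the instruction. The sketch below assumes that the check runs on the English instruction, is case-insensitive, and matches whole words only; the text does not specify these details, and `keep_instruction` is a hypothetical name.

```python
import re

# Whole-word, case-insensitive match for the two filtered words.
_CONTEXT_DEPENDENT = re.compile(r"\b(summarize|translate)\b", re.IGNORECASE)


def keep_instruction(inst_eng: str) -> bool:
    """Return False for instructions that mention summarizing or translating."""
    return _CONTEXT_DEPENDENT.search(inst_eng) is None


examples = [
    "Summarize the article above.",
    "Please translate this text into French.",
    "What is the history of Perchtoldsdorf?",
]
kept = [inst for inst in examples if keep_instruction(inst)]
```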

3.2 WikiHow Data

We collected articles from the multilingual WikiHow website using PyWikiHow (JarbasAI, 2024) in 18 languages (Arabic, Chinese, Czech, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Malay, Portuguese, Russian, Spanish, Thai, Turkish, and Vietnamese), based on the URLs provided by Wikilingua (Ladhak et al., 2020). Each WikiHow page comprises the following sections: (i) a title that starts with “How to”, (ii) an abstract answer to the question, and (iii) a number of steps, each consisting of a step-title and a step-text paragraph. We use the title of each WikiHow page as the instruction. To introduce variation in the style of the answers, we render the answers to the questions as follows: In $50\%$ of cases, we include the abstract in the answer and in the other $50\%$ we don’t. Regardless of whether the abstract is included or not, in $50\%$ of the cases we only include the step-titles, and in the other $50\%$ we include both the step-titles and step-texts.
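The rendering scheme above makes two independent coin flips per page, which yields four answer styles with roughly equal frequency. A minimal sketch follows; the `WikiHowPage` structure, the function name, and the numbered-line layout of the answer are assumptions for illustration, not the format used in the released data.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WikiHowPage:
    title: str                    # starts with "How to"; used as the instruction
    abstract: str                 # short answer to the question
    steps: List[Tuple[str, str]]  # (step-title, step-text) pairs


def render_pair(page: WikiHowPage, rng: random.Random) -> Tuple[str, str]:
    """Return (instruction, answer) with randomly chosen answer style."""
    include_abstract = rng.random() < 0.5   # first coin flip: abstract or not
    include_step_text = rng.random() < 0.5  # second, independent flip: step-texts or not
    parts = []
    if include_abstract:
        parts.append(page.abstract)
    for n, (step_title, step_text) in enumerate(page.steps, start=1):
        line = f"{n}. {step_title}"
        if include_step_text:
            line += f" {step_text}"
        parts.append(line)
    return page.title, "\n".join(parts)
```

Passing an explicit `random.Random` instance keeps the rendering reproducible for a fixed seed.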

3.3 NLP Tasks

To further improve diversity of tasks in MURI-IT, we incorporated several existing multilingual instruction following datasets based on NLP tasks, expanding language coverage and task diversity. These additions, totaling 455,472 samples across 74 languages, complement our primary data sources:

SuperNatural Instructions: We sampled 200 tasks from SuperNatural Instructions (Wang et al., 2022) per translation task type and 500 samples from the remaining set, resulting in 161,986 samples across 55 languages.

xP3: This collection comprises 46 languages across 16 NLP tasks, such as various types of QA and topic classification. We adapted 184,000 samples of xP3 (Muennighoff et al., 2022) to our format.

OASST1: From this crowd-sourced chat-style dataset (Köpf et al., 2024) spanning 35 languages, we selected message-response pairs up to the second deepest level, yielding 9,486 examples in 10 languages.

FLAN v2: We incorporated 100,000 samples from FLAN v2 (Longpre et al., 2023a), including 50,000 from its main collection and 50,000 from its Chain-of-Thought subset, all in English following Tulu (Ivison et al., 2023).

These additional datasets were chosen to increase the linguistic and task diversity of MURI-IT, ensuring a more comprehensive and versatile instruction-tuning dataset similar to prior work like Aya (Singh et al., 2024). We have summarized the statistics of our dataset in Table 1.

Source	# Languages	# Examples
Multilingual Reverse Instructions	194	1,718,449
Wikipedia	187	1,031,726
CulturaX	123	686,723
WikiHow	18	54,578
NLP Tasks	74	455,472
SupNatInst-v2	55	161,986
xP3	44	184,000
OpenAssistant	10	9,486
FLAN v2.0	1	100,000
Total	200	2,228,499

Table 1: Composition of MURI-IT by source, including the number of languages and examples. We split the dataset 90/5/5 into training, validation, and test sets, ensuring similar ratios for sources and languages.
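One way to obtain a 90/5/5 split with similar source and language ratios is to split each (source, language) group separately. The sketch below assumes examples are dictionaries with hypothetical `source` and `lang` keys; the seed and the rounding rule are likewise assumptions, not details given in the text.

```python
import random
from collections import defaultdict


def stratified_split(examples, seed=0, ratios=(0.90, 0.05, 0.05)):
    """Split examples into train/validation/test within each (source, lang) group."""
    groups = defaultdict(list)
    for ex in examples:
        groups[(ex["source"], ex["lang"])].append(ex)

    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for key in sorted(groups):  # sorted for a deterministic group order
        items = groups[key]
        rng.shuffle(items)
        n_val = round(len(items) * ratios[1])
        n_test = round(len(items) * ratios[2])
        splits["validation"] += items[:n_val]
        splits["test"] += items[n_val:n_val + n_test]
        # The remainder goes to train; very small groups may end up entirely here.
        splits["train"] += items[n_val + n_test:]
    return splits
```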

4 Dataset Analysis

This section presents an in-depth analysis of MURI-IT, focusing on two key aspects: linguistic diversity and data quality assessment. Section 4.1 examines the range of languages represented in MURI-IT, highlighting its coverage of low-resource languages, diverse scripts, word orders, and case-marking systems.

Section 4.2 evaluates the dataset’s effectiveness in maintaining linguistic and cultural accuracy. We detail the findings from a quality evaluation involving native speakers, assessing alignment, correctness, and overall informational sufficiency.

4.1 Languages and Linguistic Diversity

MURI aims to provide a methodology inclusive of low-resource languages through a culturally respectful approach, utilizing materials in their native languages and avoiding outputs in translationese (Bizzoni et al., 2020; Vanmassenhove et al., 2021). The majority of languages used in NLP systems share typological similarities and geographical origins (Joshi et al., 2021), which often leads to an uneven distribution of resources and tools available to the global community. MURI-IT therefore focuses particularly on languages with limited resources and diverse features.

Joshi et al. (2021) outlined a taxonomy categorizing languages based on their resource levels, ranging from 0 (left-behinds) such as Balinese with severely limited resources, to 5 (winners) like English or French. Our dataset encompasses a large number of low-resource languages, as shown in Figure 2.2, with over 700,000 examples falling into category 1. Despite this, access to outputs for these low-resource languages remains limited, with 33 languages containing fewer than 1,000 examples each. Nonetheless, MURI-IT proves to be one of the most diverse instruction-tuning datasets to date.

Nezhad and Agrawal (2024) emphasize script and word order as important factors in analyzing linguistic diversity. While the majority of languages in MURI-IT employ Latin script or a combination of Latin, Arabic, and Cyrillic scripts (Figure 2.2), a notable portion (more than one-fifth, categorized as “Other”) features low-resource scripts such as Lao or Georgian. As the output texts have not been translated, idiomatic use of these scripts is assured, ensuring correct orthography.

To further investigate linguistic diversity, we examined word order and case marking. Focusing on the order of subject, verb, and object, Figure 2.2 shows that while European SVO languages predominate (Dryer, 2013) and the rare OVS and OSV orders are absent, all frequent patterns are represented. This showcases the “structural” diversity of our dataset.

Case-marking patterns align with geographical distribution; e.g., mid-size to large inventories are prevalent in South Asia, Eastern Europe, and east-central Africa (Iggesen, 2013). Figure 2.2 illustrates that our dataset encompasses a diverse range of case systems, including complex systems with up to ten cases. This indicates that our dataset has good coverage of both “analytic” and “synthetic” languages. Overall, case marking and word order exemplify the broader coverage of less common languages in MURI-IT compared to previous datasets, contributing to a more comprehensive representation of linguistic diversity in NLP resources.

Language	Alignment	Prop. of Non-Instructions	Instruction Correctness	Output Correctness	Informational Sufficiency
English	4.70	0.00	4.97	4.70	4.27
German	4.73	0.07	4.57	4.77	4.03
Italian	4.16	0.07	4.16	4.26	4.23
Swedish	4.50	0.00	4.97	4.60	3.93
Dutch	4.37	0.10	4.90	4.33	4.20
French	4.53	0.13	4.80	4.53	4.43
Persian	4.00	0.03	4.53	4.60	4.00
Bavarian	3.47	0.53	3.90	4.63	4.30
Vietnamese	4.47	0.07	4.50	4.50	3.33
Turkish	3.47	0.03	4.93	4.70	4.57
Chinese	3.37	0.10	5.00	4.23	4.37
Ukrainian	3.50	0.06	4.50	3.00	3.00
Korean	2.86	0.00	4.73	4.53	3.90
Avg.	4.01	0.09	4.65	4.41	4.04

Table 2: Comparison of alignment, proportion of non-instructive phrasing, grammatical correctness, and informational sufficiency across 13 evaluated languages in the reverse instruction subset of MURI-IT.

Issue	Lang.	Instruction & Output
No Orthographic Standard: Standard German words in the instruction. Bavarian has no standard orthography, resulting in translation models mixing different types of Bavarian and Standard German.	Bavarian	I: Beschreibn Sie die Geschichte von Pädaschduaf. A: Pädaschduaf oda Pädasduaf (kuaz: P’dorf), offiziö Perchtoldsdorf, is a Moaktgmoa mit Einwohna (Stand ) im Bezirk Mödling in […]
Code-switching: The instruction is in standard Mandarin, the output is in Shanghai dialect.	Chinese	I: 犹太人是谁 A: 犹太人（希伯来文: Yehudim）是一个历史悠久，凝聚力强的族群。[…]
Idiomaticity: The expression Chronisch-Übersättigung is not idiomatic (indicating shortcomings of the translation model).	German	I: Was ist chronische Übersäuerung des Körpers? Was bedeutet die Chronisch-Übersättigung des Organismus? A: Die chronische übersäuerung des Körpers - Die chronische übersäuerung des körpers-Tagebücher Wenn der Körper […]
Lack of specific product information: The output lacks specific information about the tool TimeTools that is required for a good output.	Dutch	I: Hoe kan ik mijn verjaardag vieren met TimeTools? A: TimeTools heeft voor elk tijdzegel een uniek werkboek ontwikkeld. Veel mensen ervaren dit als een ’feest der herkenning’. […]
Superfluous Text Markup: The output contains header details, inhibiting readability: mandy quotes billy and mandy |	English	I: What is Clickfunnels? A: mandy quotes billy and mandy | What is Clickfunnels? mandy quotes billy and mandy Prices: Is it truly worth it? The Internet […]

Table 3: Examples of issues identified in the human evaluation of MURI-IT, illustrating various challenges such as orthographic inconsistencies, code-switching, idiomatic inaccuracies, and superfluous text markup. These examples highlight areas where translation and formatting may impact the overall quality and clarity of the dataset.

4.2 Quality Assessment of MURI-IT

A distinctive feature of MURI-IT is its preservation of cultural and linguistic nuances, often lost in translated datasets. To enhance our linguistic analysis, we conducted a thorough evaluation of a random subset of the dataset, involving native speakers of 13 languages. Each annotator examined 30 randomly selected instruction-output pairs from the reverse instruction subset of MURI-IT using five predefined evaluation criteria. These criteria assess the quality of both instructions and outputs on a Likert scale, except for Proper Instruction Format, which is binary:
(i) Alignment (range 1-5): Measures the alignment between instruction and output.
(ii) Instruction Correctness (range 1-5): Assesses the lexical and grammatical accuracy of the instruction.
(iii) Output Correctness (range 1-5): Assesses the lexical and grammatical accuracy of the output.
(iv) Informational Sufficiency (range 1-5): Determines whether the instruction can be adequately answered without external context.
(v) Proper Instruction Format (0: No, 1: Yes): Indicates whether the instruction is appropriately formatted for a language model.

Table 2 shows that human assessment is generally good, but with some mixed results. High-resource languages such as English, German, French and Italian consistently perform well across all criteria. However, common issues include the presence of highly specific and ambiguous information and instructions that depend heavily on temporal context, which can reduce their clarity and usefulness in instruction tuning. Across all 13 languages, extraneous headers, footers, and metadata are found in some outputs. Thus, the noise contained in the underlying multilingual corpora affects the quality and coherence of MURI-IT, as reflected in the slightly lower average output correctness (4.41) compared to instruction correctness (4.65). A relatively minor problem is language use that is less idiomatic or culturally appropriate (see the German example in Table 3).

	arb	ben	cat	dan	deu	eus	fra	guj	hin	hrv	hun	hye	ind	ita	kan	mal
Okapi	27.7	26.8	30.5	31.8	31.7	27.9	30.7	27.4	26.5	30.0	30.1	27.5	27.5	30.4	26.8	25.8
mT0	31.5	31.6	32.8	33.0	32.7	29.7	32.1	29.5	32.0	31.1	32.3	28.4	33.3	32.4	30.9	28.6
mT0x	31.6	30.2	32.6	32.0	32.5	29.2	32.7	28.5	31.6	31.1	31.7	26.7	32.3	31.3	28.9	26.7
Aya-101	38.2	35.8	39.6	39.7	39.7	36.0	39.7	33.6	38.7	37.5	38.8	30.0	40.0	39.0	34.5	30.4
MURI-101 (ours)	36.5	33.0	38.8	38.4	38.9	34.4	39.0	33.1	35.4	37.0	38.1	29.9	38.9	38.5	32.4	30.9
	mar	nep	nld	por	ron	rus	slk	spa	srp	swe	tam	tel	ukr	vie	zho	Avg.
Okapi	26.1	25.2	31.1	30.1	30.9	30.6	30.2	30.9	30.4	29.3	26.0	25.9	31.6	27.5	28.2	28.8
mT0	31.6	32.4	32.0	32.1	32.4	32.8	32.3	32.1	30.9	31.6	29.4	29.0	31.5	30.9	32.5	31.5
mT0x	29.7	30.1	32.1	32.0	31.8	31.7	31.4	32.2	31.4	32.8	27.7	27.9	32.3	31.1	31.6	30.8
Aya-101	36.0	37.2	40.1	39.0	39.5	39.2	39.4	39.7	38.1	39.7	31.2	32.1	39.9	34.8	38.3	37.3
MURI-101 (ours)	33.0	33.2	38.8	38.1	38.1	37.7	38.0	39.0	36.6	38.5	29.8	31.3	37.0	36.8	36.9	36.0

Table 4: Multilingual MMLU performance of Okapi, mT0, mT0x, Aya-101 and MURI-101 across 31 languages. Scores are accuracy in a few-shot setup. Except for Okapi (25 shots), the number of shots is 5.

For lower-performing languages, a major source of error is the lack of standardization. For instance, Bavarian, which is spoken in Austria, Bavaria and Alto Adige, lacks a standard orthography. As a result, MADLAD (Kudugunta et al., 2023) translations include Standard German words (Table 3), degrading the quality of generated instructions. Similarly, Chinese instructions and outputs were sometimes mismatched, e.g., through inconsistent use of traditional vs. simplified Chinese and of dialects vs. Standard Mandarin.

Overall, we observe a moderately high alignment between instructions and outputs, averaging 4.01. Only 9% of the generated instructions deviate from the typical style of a question or direct instruction. Instruction-output pairs are mostly grammatically and lexically accurate, with higher-performing languages such as English and German aligning particularly well. This directly follows from the superior performance of MADLAD for these languages.
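The two headline numbers can be reproduced from the per-language rows of Table 2, assuming the "Avg." row is an unweighted mean over the 13 languages (each language contributes 30 annotated pairs, so this coincides with the mean over pairs).

```python
from statistics import fmean

# Per-language values copied from Table 2, in row order (English ... Korean).
alignment = [4.70, 4.73, 4.16, 4.50, 4.37, 4.53, 4.00,
             3.47, 4.47, 3.47, 3.37, 3.50, 2.86]
non_instruction = [0.00, 0.07, 0.07, 0.00, 0.10, 0.13, 0.03,
                   0.53, 0.07, 0.03, 0.10, 0.06, 0.00]

mean_alignment = round(fmean(alignment), 2)              # average alignment score
mean_non_instruction = round(fmean(non_instruction), 2)  # share of non-instructions

print(mean_alignment, mean_non_instruction)
```

The rounded means are 4.01 and 0.09, matching the averages reported in the table. Bavarian alone contributes 0.53 to the non-instruction column, which is almost half of the column total.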

5 Experimental Setup

To evaluate the effectiveness of MURI-IT, we instruction-tune mT5-XXL (Xue et al., 2021). While recent autoregressive models achieve stronger results in English, mT5 remains one of the most comprehensive models supporting numerous languages. We fine-tune it on the subset of MURI-IT covering the 101 languages supported by mT5 and call the resulting model MURI-101. Our evaluation encompasses both multilingual natural language understanding (NLU) and open-ended natural language generation (NLG).

5.1 Baselines

We compare our MURI-101 model against four state-of-the-art multilingual instruction-following models.

mT0 (Muennighoff et al., 2023): An mT5-XXL-based model, instruction-tuned using the xP3 dataset, which consists of 16 reformulated NLP tasks, including summarization, QA, and classification for 46 languages.

Okapi (Lai et al., 2023): A series of language-specific instruction-following models based on Bloom-7b (Workshop et al., 2023) and Llama-7b (Touvron et al., 2023). Each model is independently fine-tuned on translations of English synthetic data, followed by preference optimization for a specific language.

mT0x: An mT5-XXL model instruction-tuned using the extended xP3 dataset, xP3x, covering 101 languages (Üstün et al., 2024), providing a fair comparison regarding the number of languages.

Aya-101 (Üstün et al., 2024): Uses xP3x, translated Aya Collection, a subset of DataProvenance (Longpre et al., 2023) and translated ShareGPT-Command to instruction-tune mT5-XXL for 101 languages.

5.2 Training Details

Our experiments utilize the TPU Research Cloud (TRC) program, employing a TPU v4-32 with 32 chips and the T5X framework from Google. We set both input and output lengths to 1024 tokens and implement data packing. The effective batch size is 64, achieved through gradient accumulation (batch size of 8 with 8 accumulation steps). Following the findings of Üstün et al. (2024), we use a fixed learning rate of 3e-4 without a scheduler and train for 5 epochs. For generation tasks, we apply nucleus sampling with top_p=0.8 and temperature=0.9, as per Holtzman et al. (2020).
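For readers unfamiliar with nucleus sampling, the following standard-library sketch shows how the two decoding parameters interact for a single next-token step. It is a generic illustration, not the T5X decoding code; `nucleus_sample` and its arguments are names chosen here, with the defaults set to the values reported above.

```python
import math
import random

EFFECTIVE_BATCH_SIZE = 8 * 8  # per-step batch size x accumulation steps = 64


def nucleus_sample(logits, top_p=0.8, temperature=0.9, rng=None):
    """Sample one token id from the smallest set of tokens whose mass reaches top_p."""
    rng = rng or random.Random()
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Add tokens in order of decreasing probability until the mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

A temperature below 1 sharpens the distribution before truncation, so with temperature=0.9 the nucleus tends to be slightly smaller than it would be at temperature=1 for the same top_p.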

5.3 Evaluation Benchmarks

We evaluate the models in both multilingual and monolingual settings for NLU and open-ended generation tasks. Two evaluations use TranslatedDolly (Singh et al., 2024), a translated version of Dolly (Conover et al., 2023), a human-annotated English instruction-tuning dataset.

Multilingual settings. NLU: Multilingual MMLU (Lai et al., 2023), created by translating the English MMLU dataset into 31 languages. We evaluate using the lm-evaluation-harness framework (EleutherAI, 2024) with a 5-shot setup.
NLG: TranslatedDolly, evaluated on 21 languages using the multilingual Command R+ model (Cohere, 2024) as an LLM judge.

Monolingual low-resource settings. NLU: Taxi1500 (Ma et al., 2023), a classification dataset based on a parallel Bible corpus covering 1500 languages, evaluated with a 6-shot setup.
NLG: TranslatedDolly.

6 Multilingual Model Evaluation

We first evaluate our model MURI-101 on the few-shot multilingual MMLU task. Table 4 shows that MURI-101 clearly outperforms previous models (Okapi, mT0, mT0x), with an average relative improvement of about 14.3% over the strongest of them, mT0 (from 31.5 to 36.0). MURI-101 outperforms all three of these models in every language; only Aya-101 achieves a higher average.
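The relative improvement is computed from the average accuracies in Table 4. A short worked example, with `relative_improvement` as a helper name introduced here:

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


mt0_avg, muri_avg = 31.5, 36.0  # "Avg." column of Table 4
gain = relative_improvement(mt0_avg, muri_avg)
print(f"absolute: {muri_avg - mt0_avg:.1f} points, relative: {gain:.1f}%")
```

The absolute gap is 4.5 accuracy points, which corresponds to a relative gain of roughly 14.3% over mT0.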

While Aya-101 performs slightly better than MURI-101 in NLU, we note that Aya-101 is the result of a computationally heavy training process involving around 25 million samples. Its training mixture includes a large share of translated data (47.5%) as well as data synthetically generated from the ShareGPT dataset with a proprietary model and then translated (22.5%). Thus, around 70% of their training data relies on translation, which may introduce systematic translation artifacts known as translationese (Gellerstam, 1986; Yu et al., 2022) in the model outputs. However, this effect is difficult to evaluate with current metrics. Given these factors, we primarily compare MURI-101 with mT0 in NLG.

For NLG evaluation, we compare the outputs of MURI-101 and mT0 on TranslatedDolly (Singh et al., 2024), using the multilingual Command R+ model as a judge. From TranslatedDolly, we select the 21 languages that Command R+ supports. Figure 3 shows that MURI-101 consistently outperforms mT0 across all languages. Aggregated over all languages, MURI-101’s win rate against mT0 is 59%, with loss and tie rates of 28% and 13%.
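As a sketch of how such rates are aggregated, the snippet below tallies pairwise judge verdicts into win, loss, and tie percentages. The verdict labels, the function name `preference_rates`, and the toy verdict list are assumptions for illustration; the judge prompt and the parsing of the judge's output are not shown.

```python
from collections import Counter


def preference_rates(verdicts):
    """Return win/loss/tie rates in percent from a list of judge verdicts.

    Each verdict is "win", "loss", or "tie" from the perspective of the
    model under evaluation.
    """
    total = len(verdicts)
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    counts = Counter(verdicts)
    return {label: 100.0 * counts[label] / total for label in ("win", "loss", "tie")}


# Toy verdicts (hypothetical) chosen to mirror the aggregate rates in the text.
toy_verdicts = ["win"] * 59 + ["loss"] * 28 + ["tie"] * 13
rates = preference_rates(toy_verdicts)
```

Reporting ties separately keeps the win rate from being inflated when the judge cannot distinguish the two outputs.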

The lowest improvement in NLG is for simplified Chinese, with a 47% win rate vs. 40% loss rate. We hypothesize that code-switching within different varieties of Chinese (as discussed in §4.2) contributes to this limited improvement.

Language	aze	bel	bul	cym	gla	kaz	khm	lao	slk	slv	Avg.
mT5	20.4	22.4	20.7	18.4	19.3	19.8	16.5	21.3	19.2	18.9	19.7
Aya₁	37.0	32.1	34.4	33.0	28.7	44.7	30.0	32.7	38.1	40.3	35.1
Aya₁+MURI₁	39.5	33.7	38.1	35.5	35.2	46.7	31.3	33.0	39.1	39.6	37.2

Table 5: Monolingual NLU performance on the Taxi1500 classification task across different low-resource languages. Scores are accuracy using a 6-shot setup.

Language	Aya₁	Aya₁+MURI₁	Language	Aya₁	Aya₁+MURI₁
aze	4%	4%	kaz	2%	3%
bel	3%	6%	khm	2%	4%
bul	6%	7%	lao	3%	1%
cym	2%	4%	slk	4%	2%
gla	2%	4%	slv	6%	3%

Table 6: Win rate comparison of Aya₁ and Aya₁+MURI₁ models vs. gold human outputs across different low-resource languages in NLG. The average win rates are 3.4% and 3.8% for Aya₁ and Aya₁ + MURI₁, respectively.

7 Monolingual Evaluation in Low-Resource Setting

To evaluate the capabilities of MURI-IT and Aya in low-resource settings, we conduct an additional set of experiments with only monolingual training. We first select ten low-resource languages: Azerbaijani, Kazakh, Lao, Khmer, Welsh, Scottish Gaelic, Belarusian, Bulgarian, Slovenian, and Slovak. While available in Aya, these languages are not part of the human-annotated portion of Aya and only have examples obtained via translation, which may therefore lack cultural context and idiomaticity. Our experiment tests how well MURI-IT complements translated content in this setting. Furthermore, the languages were chosen to represent diverse language families: Turkic, Tai-Kadai, Austroasiatic, Celtic, and Slavic.

For this low-resource scenario, we sample at most 15K examples from both Aya and MURI-IT. Then we instruction-tune mT5-XXL for each language and for Aya and Aya+MURI-IT separately, resulting in Aya₁ and Aya₁+MURI₁ models.
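The capped sampling step can be sketched as follows. This is a minimal illustration, assuming the cap of 15K is applied to each source separately and that sampling is uniform without replacement; the function names and the fixed seed are hypothetical.

```python
import random


def sample_capped(examples, cap=15_000, seed=0):
    """Return at most `cap` examples, sampled uniformly without replacement."""
    if len(examples) <= cap:
        return list(examples)
    return random.Random(seed).sample(examples, cap)


def build_training_sets(aya_examples, muri_examples, cap=15_000, seed=0):
    """Build the two monolingual training sets for a single language."""
    aya = sample_capped(aya_examples, cap, seed)
    muri = sample_capped(muri_examples, cap, seed)
    # One model is trained on Aya alone, the other on Aya plus MURI-IT.
    return {"aya": aya, "aya+muri": aya + muri}


# Toy data (hypothetical): integers stand in for instruction-output pairs.
sets = build_training_sets(list(range(20_000)), list(range(100)))
```

Languages with fewer than 15K examples in a source simply contribute everything they have, so the cap only affects the better-resourced cases.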

Since many of these languages are not supported by multilingual MMLU and Command R+, we use the few-shot classification task Taxi1500 (Ma et al., 2023) for NLU. For NLG, we use TranslatedDolly; however, we translate model outputs to English (via Google Translate) and use Llama-3-70B-Instruct to calculate win rates of the translated outputs vs. Dolly’s gold English human outputs.

Table 5 shows that incorporating MURI-IT consistently improves performance for low-resource languages, except for Slovenian. The baseline mT5 has 19.7% accuracy (slightly above random chance: 16.7%) while Aya has 35.1%. Even though Aya’s performance is impressive, we observe that incorporating MURI-IT further improves the results to 37.2%. This shows that MURI-IT can complement Aya in low-compute and low-resource settings and can further improve its performance.
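The per-language differences and averages can be recomputed directly from the values in Table 5. The snippet below is a small worked example; the variable names are ours, and the random-chance figure assumes six equally likely classes, which is consistent with the 16.7% stated above.

```python
LANGS = ["aze", "bel", "bul", "cym", "gla", "kaz", "khm", "lao", "slk", "slv"]
AYA = [37.0, 32.1, 34.4, 33.0, 28.7, 44.7, 30.0, 32.7, 38.1, 40.3]
AYA_MURI = [39.5, 33.7, 38.1, 35.5, 35.2, 46.7, 31.3, 33.0, 39.1, 39.6]

# Accuracy change from adding MURI-IT, per language (percentage points).
deltas = {lang: round(m - a, 1) for lang, a, m in zip(LANGS, AYA, AYA_MURI)}

# Languages where the combined model scores lower than Aya alone.
regressions = [lang for lang, d in deltas.items() if d < 0]

avg_aya = round(sum(AYA) / len(AYA), 1)
avg_aya_muri = round(sum(AYA_MURI) / len(AYA_MURI), 1)

# Random-chance accuracy, assuming six equally likely classes.
random_chance = round(100 / 6, 1)
```

Here `deltas` gives the gain in percentage points for each language, and `regressions` lists any language where adding MURI-IT lowers accuracy.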

Table 6 shows that on average, the win rate of Aya₁ is 3.4% and of Aya₁+MURI₁ is 3.8%. This indicates that the models do not produce good-quality outputs for these low-resource languages. While both Aya-101 and MURI-101 demonstrate better NLG performance than prior multilingual instruction-tuning models such as mT0, this shows that current models are still limited in their NLG capabilities for low-resource languages.

We hypothesize that the limitations of our base model, mT5, make it hard to achieve large improvements in NLG for low-resource languages. As recent autoregressive models begin to support a larger number of languages, we anticipate that MURI-IT, with its human-written outputs, will be used effectively to improve NLG performance for low-resource languages.

8 Conclusion

This study presents Multilingual Reverse Instructions (MURI), a novel approach for generating high-quality instruction tuning datasets for low-resource languages. Our method addresses limitations of translation-focused multilingual datasets by using human-written texts as outputs, combined with a translation pipeline and LLMs to create contextually appropriate instructions. The resulting dataset, MURI-IT, of more than 2 million pairs across 200 languages greatly expands the resources available for multilingual language models.

Evaluation by native speakers from 13 languages confirmed the dataset’s quality and idiomaticity. Our instruction-tuned mT5-XXL model, MURI-101, strongly outperformed previous models on NLU and NLG in both multilingual and monolingual settings. Notably, incorporating MURI-IT improved performance for most low-resource languages, effectively complementing existing datasets like Aya.

While challenges remain, particularly in NLG for low-resource languages, MURI-IT represents an important step towards more inclusive and linguistically diverse language models. Future work will focus on refining data quality and leveraging advanced multilingual models to further improve performance across languages.

9 Limitations

Despite the promising results, several limitations of this study must be acknowledged. First, in contrast to Köksal et al. (2024), we did not perform clustering due to uncertainties regarding the performance of multilingual encoders. Clustering could potentially enhance content diversity, ensuring a greater variety of linguistic and cultural contexts.

Additionally, the quality of the data can be further improved through more rigorous cleaning such as the removal of headers and footers from documents. Similarly, the Multilingual Reverse Instructions methodology, particularly for low-resource languages, would benefit from more standardized source data. Our evaluation, involving native speakers, noted deficits in languages with less standardized orthography or prominent regional dialects. Additional preprocessing could address this issue.

Addressing these limitations in future work will involve integrating advanced clustering algorithms, enhancing data cleaning protocols, and expanding the dataset to include a wider range of languages.

References

AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
Artetxe and Schwenk (2019) Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. Pythia: A suite for analyzing large language models across training and scaling.
Bizzoni et al. (2020) Yuri Bizzoni, Tom S Juzek, Cristina España-Bonet, Koel Dutta Chowdhury, Josef van Genabith, and Elke Teich. 2020. How human is machine translationese? comparing human and machine translations of text and speech. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 280–290, Online. Association for Computational Linguistics.
Chen et al. (2023) Yongrui Chen, Haiyun Jiang, Xinting Huang, Shuming Shi, and Guilin Qi. 2023. Tegit: Generating high-quality instruction-tuning data with text-grounded task design.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways.
Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models.
Cohere (2024) Cohere. 2024. Command r+.
Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Conover et al. (2023) Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. 2023. Free dolly: Introducing the world’s first truly open instruction-tuned llm. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Dryer (2013) Matthew S. Dryer. 2013. Order of subject, object and verb (v2020.3). In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Zenodo.
Dryer and Haspelmath (2013) Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online (v2020.3). Zenodo.
EleutherAI (2024) EleutherAI. 2024. lm-evaluation-harness.
Gellerstam (1986) Martin Gellerstam. 1986. Translationese in swedish novels translated from english. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup.
Gupta et al. (2023) Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, and Swaroop Mishra. 2023. Targen: Targeted data generation with large language models.
Holmström and Doostmohammadi (2023) Oskar Holmström and Ehsan Doostmohammadi. 2023. Making instruction finetuning accessible to non-English languages: A case study on Swedish models. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 634–642, Tórshavn, Faroe Islands. University of Tartu Library.
Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.
Honovich et al. (2022) Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022. Unnatural instructions: Tuning language models with (almost) no human labor.
Iggesen (2013) Oliver A. Iggesen. 2013. Number of cases (v2020.3). In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Zenodo.
ImaniGooghari et al. (2023) Ayyoob ImaniGooghari, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André Martins, François Yvon, and Hinrich Schütze. 2023. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1082–1117, Toronto, Canada. Association for Computational Linguistics.
Ivison et al. (2023) Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023. Camels in a changing climate: Enhancing lm adaptation with tulu 2.
JarbasAI (2024) JarbasAI. 2024. Pywikihow.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b.
Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts.
Jiao et al. (2023) Wenxiang Jiao, Jen tse Huang, Wenxuan Wang, Zhiwei He, Tian Liang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Parrot: Translating during chat using large language models tuned with human translation and feedback.
Joshi et al. (2021) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2021. The state and fate of linguistic diversity and inclusion in the nlp world.
Kargaran et al. (2023) Amir Kargaran, Ayyoob Imani, François Yvon, and Hinrich Schuetze. 2023. GlotLID: Language identification for low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6155–6218, Singapore. Association for Computational Linguistics.
Köpf et al. (2024) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. 2024. Openassistant conversations-democratizing large language model alignment. Advances in Neural Information Processing Systems, 36.
Kudugunta et al. (2023) Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2023. Madlad-400: A multilingual and document-level large audited dataset.
Köksal et al. (2024) Abdullatif Köksal, Timo Schick, Anna Korhonen, and Hinrich Schütze. 2024. Longform: Effective instruction tuning with reverse instructions.
Köpf et al. (2023) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. 2023. Openassistant conversations – democratizing large language model alignment.
Ladhak et al. (2020) Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. 2020. WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4034–4048, Online. Association for Computational Linguistics.
Lai et al. (2023) Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2023. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback.
Li et al. (2023) Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. 2023. Bactrian-x: Multilingual replicable instruction-following models with low-rank adaptation.
Lin et al. (2022) Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual language models.
Longpre et al. (2023a) Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023a. The flan collection: Designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
Longpre et al. (2023) Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, et al. 2023. The data provenance initiative: A large scale audit of dataset licensing & attribution in ai. arXiv preprint arXiv:2310.16787.
Ma et al. (2023) Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, Ehsaneddin Asgari, and Hinrich Schütze. 2023. Taxi1500: A multilingual dataset for text classification in 1500 languages. arXiv preprint arXiv:2305.08487.
Muennighoff et al. (2023) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics.
Muennighoff et al. (2022) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
Nezhad and Agrawal (2024) Sina Bagheri Nezhad and Ameeta Agrawal. 2024. Exploring the maze of multilingual modeling.
Nguyen et al. (2023) Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2023. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages.
Ogueji et al. (2021) Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. 2021. Small data? no problem! exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 116–126, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Ortiz Suárez et al. (2020) Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online. Association for Computational Linguistics.
Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog.
Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. 2022. Multitask prompted training enables zero-shot task generalization.
Shaham et al. (2024) Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. 2024. Multilingual instruction tuning with just a pinch of multilinguality.
Shliazhko et al. (2022) Oleh Shliazhko, Alena Fenogenova, Maria Tikhonova, Vladislav Mikhailov, Anastasia Kozlova, and Tatiana Shavrina. 2022. mgpt: Few-shot learners go multilingual.
Singh et al. (2024) Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, and Sara Hooker. 2024. Aya dataset: An open-access collection for multilingual instruction tuning.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.
Vanmassenhove et al. (2021) Eva Vanmassenhove, Dimitar Shterionov, and Matthew Gwilliam. 2021. Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2203–2213, Online. Association for Computational Linguistics.
Vidgen et al. (2021) Bertie Vidgen, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. 2021. Learning from the worst: Dynamically generated datasets to improve online hate detection. In ACL.
Wang et al. (2023) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions.
Wang et al. (2022) Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Winata et al. (2023) Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, and Sebastian Ruder. 2023. NusaX: Multilingual parallel sentiment dataset for 10 Indonesian local languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 815–834, Dubrovnik, Croatia. Association for Computational Linguistics.
Workshop et al. (2023) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model.
Wu and Dredze (2020) Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.
Xu et al. (2024) Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2024. A paradigm shift in machine translation: Boosting translation performance of large language models.
Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Yu et al. (2022) Sicheng Yu, Qianru Sun, Hao Zhang, and Jing Jiang. 2022. Translate-train embracing translationese artifacts. Association for Computational Linguistics.
Üstün et al. (2024) Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. 2024. Aya model: An instruction finetuned open-access multilingual language model.

10 Appendices

Few-shot example used for Reverse Instruction generation

Answer: Apache Kafka is a distributed system. The main components of Apache Kafka […]

> What kind of instruction could this be the answer to?

Instruction: What are the main components of Apache Kafka?

(three more few-shot examples)

Answer: [DOC]

> What kind of instruction could this be the answer to?

Instruction:

Table 7: Few-shot examples used for reverse instruction generation.
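The template in Table 7 can be assembled programmatically. The following is a minimal sketch, not the authors' implementation: the function name `build_reverse_instruction_prompt`, the blank-line layout between fields, and the placeholder document text are assumptions made for illustration. Only the first few-shot pair is taken from the table; the remaining three are not reproduced here.

```python
# Hypothetical helper that mirrors the layout of Table 7.
# The question line is copied verbatim from the table.
QUESTION = "> What kind of instruction could this be the answer to?"


def build_reverse_instruction_prompt(few_shot, document):
    """Build a reverse-instruction prompt.

    few_shot: list of (answer, instruction) pairs used as demonstrations.
    document: the text that fills the [DOC] slot.
    """
    blocks = []
    for answer, instruction in few_shot:
        blocks.append(
            f"Answer: {answer}\n\n{QUESTION}\n\nInstruction: {instruction}"
        )
    # The final block leaves the instruction empty for the model to complete.
    blocks.append(f"Answer: {document}\n\n{QUESTION}\n\nInstruction:")
    return "\n\n".join(blocks)


# Only the first demonstration from Table 7; the elision is kept as in the source.
few_shot = [
    (
        "Apache Kafka is a distributed system. "
        "The main components of Apache Kafka […]",
        "What are the main components of Apache Kafka?",
    ),
]

# Placeholder document text (an assumption, not data from the paper).
prompt = build_reverse_instruction_prompt(few_shot, "Some human-written text.")
print(prompt)
```

Ending the prompt on an open `Instruction:` field lets the model's completion be read directly as the generated instruction.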

Guideline for Evaluating MURI-IT

Attributes for Evaluation

1. Alignment: Determine whether the instruction aligns with the output on a scale of 1 to 5, where:
   - 1: The instruction and the output are completely misaligned, making it difficult to understand how the output was generated based on the given instruction (e.g., the output does not answer, or does not fully answer, the instruction).
   - 5: The instruction and the output are perfectly aligned, providing clear guidance on how to generate the output based on the given instruction.
2. Instruction Format: Identify if the instruction is phrased as an instruction or question:
   - Mark as "Instruction" if the given instruction provides a directive for generating the response, e.g., it is phrased as an instruction or question.
   - Mark as "No Instruction" if the given instruction is phrased as a statement, prompting no further answer.
3. Grammatical and Lexical Correctness and Cohesiveness of the Instruction: Assess whether the instruction is grammatically and lexically correct on a scale of 1 to 5, where:
   - 1: The instruction contains numerous grammatical errors and uses inappropriate or unclear language, hindering comprehension and interpretation. The text is not cohesive; parts of the text don’t belong together.
   - 5: The instruction is grammatically flawless and employs precise and appropriate language, facilitating clear understanding and interpretation.
4. Grammatical and Lexical Correctness and Cohesiveness of the Output: Assess whether the output is grammatically and lexically correct on a scale of 1 to 5, where:
   - 1: The output contains numerous grammatical errors and uses inappropriate or unclear language, hindering comprehension and interpretation. The text is not cohesive; parts of the text don’t belong together.
   - 5: The output is grammatically flawless and employs precise and appropriate language, facilitating clear understanding and interpretation.
5. Informational Sufficiency: Assess, on a scale of 1 to 5, whether each instruction provides sufficient information for generating a comprehensive output and whether it can be reasonably answered based on the provided information, where:
   - 1: The instruction lacks essential information and details, making it impossible to generate a reasonable answer, or it is ambiguous and not understandable. Example: Summarize the article.
   - 5: The instruction provides ample information and can be answered by a large language model. Example: What does the word Rigadon mean?
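The five attributes above map naturally onto a small record type with range checks. The sketch below is illustrative only: the class name `Annotation`, its field names, and the use of a simple mean for aggregation are assumptions, since the guideline does not specify how ratings are stored or combined.

```python
from dataclasses import dataclass
from statistics import mean

# Attributes rated on the 1-to-5 scale defined in the guideline.
SCALED_FIELDS = (
    "alignment",
    "instruction_correctness",
    "output_correctness",
    "sufficiency",
)


@dataclass
class Annotation:
    """One annotator judgment for one instruction-output pair (hypothetical schema)."""

    alignment: int                # attribute 1
    is_instruction: bool          # attribute 2: "Instruction" vs. "No Instruction"
    instruction_correctness: int  # attribute 3
    output_correctness: int       # attribute 4
    sufficiency: int              # attribute 5

    def __post_init__(self):
        # Reject ratings outside the 1-to-5 scale.
        for name in SCALED_FIELDS:
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def summarize(annotations):
    """Average each scaled attribute and report the share marked "Instruction"."""
    summary = {
        name: mean(getattr(a, name) for a in annotations)
        for name in SCALED_FIELDS
    }
    summary["instruction_rate"] = mean(
        1 if a.is_instruction else 0 for a in annotations
    )
    return summary


# Made-up ratings for illustration; these are not results from the paper.
ratings = [
    Annotation(5, True, 5, 4, 5),
    Annotation(3, False, 4, 4, 2),
]
summary = summarize(ratings)
print(summary)
```

Validating at construction time keeps out-of-range ratings from entering any later aggregate.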