RAG: Retrieval Augmented Generation for the IR-Anthology
Faculty of Media
Degree Programme Digital Engineering
Master’s Thesis
Islam Torky
Abstract
In recent years, Natural Language Processing (NLP) has seen a major leap
forward with the emergence of pre-trained Large Language Models (LLMs).
These models, such as BERT, GPT-3, and Llama2, have excelled in various NLP
tasks. While they have set new standards, they also face notable challenges:
hallucinations, where they generate information that sounds plausible but is
incorrect, and the information cutoff, which prevents them from staying accurate
and up to date with new data. In addition, they are general-purpose rather than
tailored to a specific field and therefore lack domain specificity.
Retrieval Augmented Generation (RAG) addresses these challenges by
combining the strengths of LLMs with external knowledge retrieval.
RAG retrieves information during inference, reducing the risk of generating
incorrect content and keeping information up to date. This thesis explores
implementing RAG on the IR-Anthology, a vast collection of research papers
on information retrieval. The goal is to make retrieval of information from the
IR-Anthology more efficient, enabling easier access for researchers and students.
To set the stage, the thesis begins with an overview of recent NLP and RAG
advancements, providing a foundation for understanding the innovative
methodologies introduced using RAG. The thesis adopts a systematic approach,
tailoring pipelines to the IR-Anthology dataset and evaluating their effective-
ness in the retrieval and generation stages. More specifically, this approach aims
not only to assess overall efficacy but also to understand variations in outputs
across scenarios and data subsets.
The thesis explores how dividing documents into smaller segments (chunks)
affects RAG pipelines, as well as different retrieval methods. The conducted
experiments reveal that for retrieval of information from PDFs, larger chunks
improve accuracy. However, for generating text, smaller chunks benefit the
LLM by providing more focused information. Surprisingly, simpler retrieval
methods outperform more complex ones.
Contents

1 Introduction 4
2 Related Work 8
   2.1 IR-Anthology 8
   2.2 Retrieval Augmented Generation 10
       2.2.1 Retrieval and Generation 11
       2.2.2 RAG Evaluation 11
   2.3 RAG Pipeline 12
   2.4 Transformer Architecture 14
   2.5 Large Language Models 20
       2.5.1 Llama2 Architecture 21
   2.6 Embedding Model 25
   2.7 Chapter Conclusion 26
3 Approach 27
   3.1 Mistral 7B 27
   3.2 Prometheus 13B 30
   3.3 Quantization & vLLM 32
   3.4 BGE Large 35
   3.5 Llamaindex 36
       3.5.1 Parsing/Chunking Methods 37
       3.5.2 Indexing/Embedding of Chunks 38
       3.5.3 Retrieval 40
       3.5.4 Generation 41
   3.6 Retrieval Methods 42
       3.6.1 Vector Retrieval (HNSW) 42
       3.6.2 Best Matching 25 (BM25) 42
       3.6.3 Hybrid Retrieval 44
       3.6.4 Hypothetical Document Embedding (HyDE) 45
       3.6.5 Reranker 45
   3.7 Chapter Conclusion 46
4 Evaluation 48
   4.1 Generating Synthetic Data 49
       4.1.1 LLM Parameters 50
       4.1.2 Question Generation 51
       4.1.3 Answer Generation 52
       4.1.4 Document Selection 53
   4.2 Retrieval Evaluation 55
       4.2.1 Flow 56
       4.2.2 Evaluation 56
       4.2.3 Metrics 58
       4.2.4 Results 58
   4.3 Generation Evaluation 60
       4.3.1 Flow 61
       4.3.2 Evaluation 61
       4.3.3 Metrics 62
       4.3.4 Results 64
   4.4 Chapter Conclusion 65
5 Discussion & Analysis 66
6 Conclusion 69
Bibliography 72
List of Figures

2.1 Querying GPT 3.5 to inquire about the research done by Sarkar et al. [2023] 9
2.2 Basic flow of a RAG pipeline. Individual steps are illustrated with circled numbers. 13
2.3 Vanilla transformer adapted from Vaswani et al. [2017] 15
2.4 Multi-Head Attention Mechanism. Every head is represented by a Query, Key, and Value 20
2.5 Llama2 decoder-only transformer architecture as introduced by Touvron et al. [2023]. 21
2.6 Grouped Query Attention as introduced by Ainslie et al. [2023] 23

List of Tables
Chapter 1
Introduction
In recent years, the field of natural language processing (NLP) has experienced
a transformative shift with the advent of pre-trained large language models
(LLMs). These models, pre-trained on large corpora of text, have demonstrated
remarkable performance across a wide spectrum of NLP tasks. LLMs such
as BERT (Devlin et al. [2019a]), GPT-3 (Patel et al. [2023]), and Llama2
(Touvron et al. [2023]) have achieved state-of-the-art results in tasks ranging
from text classification to chat models.
While pre-trained LLMs have undeniably made significant strides in NLP, it
is crucial to acknowledge their limitations. One notable challenge that LLMs
face is the issue of hallucinations. Hallucinations refer to the generation of
plausible-sounding but incorrect or misleading information in generated text.
These models often rely on patterns in the training data and may produce re-
sponses that appear coherent but are factually incorrect or nonsensical. This
is particularly problematic in applications where accuracy and reliability are
paramount, such as medical diagnosis or legal document generation (Zhang
et al. [2023]). Furthermore, LLMs tend to underperform when confronted
with up-to-date information due to the information cutoff; once an LLM has
been initially trained, it can no longer process any new data without being
fine-tuned. They are heavily reliant on the data distribution they were trained
on, and their performance can degrade when applied to tasks involving recent
developments. Another limitation of LLMs concerns expertise in a specific field:
since they are heavily dependent on their training data distribution, and this
data is not specific to any one field, they lack domain specificity. Understanding these
limitations of LLMs is crucial when considering their application in practical
scenarios. These shortcomings highlight the need for more specialized and
robust approaches, such as Retrieval Augmented Generation (RAG), which
aims to leverage the strengths of LLMs while addressing their weaknesses in
handling factual accuracy, up-to-date information, and domain-specific knowledge.
RAG, first introduced by Lewis et al. [2020], utilized a configuration where
the parametric memory was implemented as a pre-trained sequence-to-sequence
(seq2seq) encoder-decoder transformer, while the non-parametric memory con-
sisted of a dense vector index of Wikipedia. This non-parametric memory was
accessed using a pre-trained neural retriever, illustrating the incorporation of
external knowledge retrieval in the RAG model. During inference, the models
retrieve pertinent passages from Wikipedia, which are then employed in re-
sponse generation. The high-level architecture of a standard RAG pipeline is
depicted in Figure 1.1. This illustration outlines the operational flow wherein
a user submits a query to the RAG pipeline. Subsequently, analogous infor-
mation (referred to as a chunk) is retrieved from the designated knowledge
base. This retrieved information is then fed into the generator (LLM), which
generates a response characterized by factual accuracy, currency, and alignment
with the user's domain. This capability enables RAG to access context specific
to the query, which has the potential to address the problems of hallucinations,
information cutoff, and domain specificity (Siriwardhana et al. [2023]).
Figure 1.1: High Level RAG interaction between a knowledge base, LLM, and a
user.
• Chapter 5: The analysis chapter will present the findings of the research,
including the generated results and any insights gleaned from the analysis.

• Chapter 6: The concluding chapter will give a brief review of all previous
chapters, address the major challenges faced, and detail the final insights.
Chapter 2
Related Work
2.1 IR-Anthology
The Information Retrieval Anthology (IR-Anthology) by Potthast et al. [2021]
draws inspiration from and addresses challenges identified in existing research
and projects within the domain of scholarly search and information retrieval.
As an ongoing initiative, the IR-Anthology serves as a valuable repository for
researchers and practitioners, providing access to a diverse range of literature
related to information retrieval.
The ACL Anthology reference corpus proposed by Bird et al. [2008], a
digital archive of conference and journal papers in natural language processing
and computational linguistics, provides a curated collection of publications and
serves as a reference repository of research results. It serves as a benchmark
and inspiration for the IR-Anthology, with its success attributed to a unified
collection of bibliographic metadata and a comprehensive set of openly accessi-
ble full texts. The centralized web service architecture and search capabilities
of the ACL Anthology form a foundational reference for the development of
the IR-Anthology.
Figure 2.1: Querying GPT 3.5 to inquire about the research done by Sarkar et al.
[2023]
Incorporating the IR-Anthology into a RAG pipeline not only enriches the
breadth of information available but also addresses the issue of misinformation
or incomplete data often encountered in LLMs. By integrating this domain-
specific repository, RAG can establish a robust framework for fact-checking.
p_η(z | x)   (2.1)

• η: Non-parametric retriever.
• θ: Parametric generator.
Figure 2.2: Basic flow of a RAG pipeline. Individual steps are illustrated with
circled numbers.
1. Parsing & Chunking: PDFs are first parsed and split into smaller chunks.
2. Encode Chunks: The chunks are then passed through an embedding model,
which converts the chunks' textual content into numerical representations
called embeddings.
3. Index: After the chunks have been encoded, they are stored in the vector
database and are ready to be retrieved.
4. Encode Query: Once the user submits a query to the pipeline, it is trans-
formed into an embedding in the same way as in Step 2.
5. Retrieve Similar Chunks: The query embedding is compared against the
indexed chunk embeddings, and the most similar chunks are retrieved from
the vector database.
6. Similar Chunks & Query: After the similar chunks are retrieved, they
are fed into a prompt along with the query. This prompt contains an
instruction for the LLM on how to utilize the information it has been
given.
7. Prompting: After the prompt has been built it is then fed to the LLM.
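To make the seven steps concrete, the following is a minimal sketch with LlamaIndex-style APIs (module paths, the data directory, and the chunking parameters are illustrative assumptions that depend on the installed library version; this is not the exact configuration used later in this thesis):

```python
# Hedged sketch of the Figure 2.2 flow with LlamaIndex-style APIs.
# Module paths and defaults vary across llama-index versions.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import TokenTextSplitter

# Steps 1-2: parse PDFs and split them into token-based chunks.
documents = SimpleDirectoryReader("ir_anthology_pdfs/").load_data()  # hypothetical path
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

# Step 3: encode the chunks with the configured embedding model and index them.
index = VectorStoreIndex(nodes)

# Steps 4-7: encode the query, retrieve the top-k similar chunks,
# build the prompt, and let the LLM generate the response.
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What is the IR-Anthology?"))
```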
2.4 Transformer Architecture
This section will first present the vanilla transformer and then address the
differences between the vanilla architecture and the architectures utilized
within this thesis.
Preprocessing:
• Tokenization: Breaks down the input text into individual units, like
words or sub-words, called tokens. This is done through Byte Pair En-
coding (BPE), a subword tokenization technique used in transformers.
It iteratively merges frequent character pairs to create new subwords.
This yields a smaller vocabulary while handling rare words (broken down
into known subwords) and capturing some word context; a toy sketch of
this merging procedure follows the list below.
• Context Window: Defined by the context length (L), the context window
acts like a sliding frame that focuses on a portion of the input sequence
at a time. Its size determines how much information the model considers
together, enabling it to capture contextual dependencies between elements.
However, a larger window increases computational cost and may introduce
irrelevant information, while a smaller window might miss important
dependencies. The optimal choice depends on the task, data, and resources.
16
CHAPTER 2. RELATED WORK
Encoder Block:
Decoder Block:
• Output Layer: Converts the final decoder output into the desired format,
like words in a translation task.
17
CHAPTER 2. RELATED WORK
Self-Attention:
This mechanism allows the model to attend to all elements within a single
sequence, capturing relationships between tokens. It involves three key steps:
Linear Projections: Each token is projected into three different vector
spaces, given in Equations 2.3, 2.4, and 2.5: Query (Q), Key (K), and Value (V).
The projection matrices (W_q, W_k, W_v) are learned during training and
multiplied with the input sequence embedding (X).

Q = W_q X   (2.3)
K = W_k X   (2.4)
V = W_v X   (2.5)
Scaled Attention Scores: The model calculates a score for each pair of
tokens, indicating how relevant one token (K) is to another token (Q). These
scores are obtained by multiplying the Q vector of a token with the K vectors
of all other tokens in the sequence. They are then scaled by a factor of 1/√d_k,
where d_k is the dimension of the key vectors; this scaling counteracts the
vanishing gradient problem (during training) that arises if the softmax of the
dot product of Q and K returns a very small gradient. The final scaled
attention score can be observed in Equation 2.6.
Scaled Attention Score = QK^T / √d_k   (2.6)

Attention(Q, K, V) = softmax(QK^T / √d_k) · V   (2.7)
Multi-Head Attention:
This extends self-attention by creating multiple heads (N_x) that learn dif-
ferent aspects of the relationships between tokens. The input sequence's Q,
K, and V vectors are each linearly projected N_x times, resulting in N_x sets of
Q, K, and V vectors, one per head. Self-attention is performed independently
on each head using its respective Q, K, and V vectors. The outputs from
each head are then concatenated to form a final, richer representation of the
sequence. The multi-head attention concatenation is given in Equation 2.8,
while a visual representation can be seen in Figure 2.4.

MultiHead(Q, K, V) = Concat(head_1, ..., head_{N_x}) W_O   (2.8)
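Equations 2.3 through 2.8 can be sketched in a few lines of NumPy (the shapes and the row-vector convention X @ W are illustrative; real implementations batch and fuse these operations):

```python
# Minimal NumPy sketch of scaled dot-product and multi-head attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # Equation 2.6
    return softmax(scores) @ V        # Equation 2.7

n, d_model, n_heads = 10, 64, 8
d_k = d_model // n_heads
X = np.random.randn(n, d_model)       # input sequence embeddings

heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
    # Equations 2.3-2.5: learned projections applied to X (row-vector form)
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))

W_O = np.random.randn(n_heads * d_k, d_model)
multi_head = np.concatenate(heads, axis=-1) @ W_O   # Equation 2.8
```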
2.5 Large Language Models
Compared to closed-source models like GPT-4, open-source LLMs offer
distinct advantages for academic research. These advantages include fostering
transparency and cost-effectiveness, along with enabling the reproducibility of
results through readily available models. Additionally, open-source LLMs
facilitate easier deployment on local systems, streamlining the research process.
This section delves into the architectural underpinnings of the evaluation LLM,
which will be introduced in Chapter 3, by first examining the foundational
Llama2 architecture.
GQA was proposed by Ainslie et al. [2023], which aimed to decrease the
tension between performance and computational efficiency. Transformers rely
heavily on the attention mechanism, which allows them to attend to specific
parts of the input sequence. However, the standard attention mechanism in-
volves extensive computations, especially for LLMs with a large number of
parameters and attention heads. This high computational cost translates to
slower training times and increased resource requirements for inference, hin-
dering the practical application of LLMs. Multi-head attention traditionally
has each query head within the attention layer attending to the entire input
sequence independently. GQA, as seen in Figure 2.6, groups these query heads
into multiple smaller groups, each sharing a single key and value head.
In the KV Cache as applied to GQA, the key could represent the input
sequence embedding and the value could represent the intermediate attention
scores calculated for specific groups of query heads. When a specific group of
query heads needs to attend to the same input sequence, GQA can first check
the KV Cache to see if the corresponding attention scores (value) have already
been computed for the same key (input sequence embedding). If the scores are
found in the cache, GQA can reuse them directly, avoiding redundant
calculations. This significantly improves efficiency compared to recalculating
the scores every time.
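A minimal sketch of the grouping idea, assuming n_heads query heads share n_kv_heads key/value heads (the head counts here are illustrative; Mistral 7B, discussed in Chapter 3, uses 32 query heads and 8 KV heads):

```python
# Illustrative NumPy sketch of grouped-query attention (GQA).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_k = 10, 16
n_heads, n_kv_heads = 8, 2          # n_heads must be divisible by n_kv_heads
group = n_heads // n_kv_heads       # query heads per shared KV head

Q = np.random.randn(n_heads, n, d_k)
K = np.random.randn(n_kv_heads, n, d_k)   # far fewer K/V heads to compute/cache
V = np.random.randn(n_kv_heads, n, d_k)

outputs = []
for h in range(n_heads):
    kv = h // group                  # which shared KV head this query head uses
    scores = Q[h] @ K[kv].T / np.sqrt(d_k)
    outputs.append(softmax(scores) @ V[kv])
out = np.concatenate(outputs, axis=-1)
```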
Training:
Model Sizes:
Llama2 7B: This base model boasts 7 billion parameters, offering a bal-
ance between performance and resource efficiency.
Llama2 13B: With 13 billion parameters, this model exhibits improved
capabilities compared to the 7B variant, particularly in tasks requiring deeper
understanding.
Llama2 70B: The largest model in the series, featuring 70 billion param-
eters, pushing the boundaries of LLM performance.
Performance:
The primary distinction between the models lies in their parameter size and
associated capabilities. As the parameter size increases, the model’s capacity
to learn complex relationships and generate nuanced text grows. The 70B
model demonstrates the most impressive performance across various bench-
marks, including MMLU scores and reasoning tasks. In Table 2.1 the different
MMLU values for the different Llama2 sizes can be seen. The 13B model
delivers competitive performance, while the 7B variant offers a more econom-
ical choice with respectable capabilities. The video random access memory
(vRAM) requirement scales with the parameter size. The 7B model has the
lowest requirement, followed by the 13B and 70B models, respectively.
Table 2.1: Llama2 Parameters and MMLU Scores as reported by Touvron et al.
[2023].
This overview of the Llama2 architecture provides a foundation for
understanding the discussion of the evaluation LLM in Chapter 3 and its spe-
cific adaptations built upon this innovative base.
original text (next sentence prediction). This objective helps the model
understand relationships between sentences.
Fine-tuning: The pre-trained BERT model is then fine-tuned for specific
NLP tasks like question answering, sentiment analysis, or text summarization.
This fine-tuning involves adding a task-specific output layer on top of the pre-
trained BERT encoder and training the entire model on labeled data for the
desired task.
This pre-training and fine-tuning approach allows BERT to acquire general-
purpose knowledge from vast amounts of text data and then specialize in spe-
cific tasks through fine-tuning.
A foundational understanding of the BERT architecture, as presented in
this chapter, is essential for understanding the embedding model used in the
RAG pipeline, which will be explored in Chapter 3.
Chapter 3
Approach
Chapters 1 and 2 established the limitations of LLMs in three key areas: hal-
lucinations, information cutoff, and domain specificity. Chapter 2 explored
RAG as a potential solution, demonstrating its efficacy and robustness in
open-domain question answering, abstractive question answering, and Jeop-
ardy question generation tasks. This was followed by an exploration of the RAG
pipeline, along with its individual components, and the scientific underpinnings
of transformer architectures and their variations. Building upon this foundation,
Chapter 3 will delve deeper into the selected components and variations that
have been implemented within this thesis.
This chapter dives into the core components of the system. It starts by
introducing the two LLMs used in this thesis, Mistral 7B and Prometheus
13B, highlighting their differences from the Llama2 architecture. The focus
then turns to optimization techniques, including quantization and vLLM.
Next, the chapter explores the crucial role of the embedding model, which
translates text into numerical vectors.
The narrative then introduces Llamaindex, a unifying platform that or-
chestrates the entire RAG logic. This framework seamlessly integrates all the
previously discussed components: parsing methods, indexing, retrieval, and
generation. Finally, the chapter delves into retrieval methods, explaining how
the RAG pipeline finds relevant information from the vast knowledge base.
3.1 Mistral 7B
Mistral 7B, introduced by Jiang et al. [2023], is a decoder-only transformer
language model serving as the backbone and primary RAG component for
response generation in this thesis. This open-source model excels due to its
robust architecture, surpassing many alternatives and revolutionizing the ca-
pabilities of open-source models upon its release. Building upon the existing
transformer architecture, Mistral 7B introduces several modifications; its key
parameters are listed below.
Parameter Value
dim 4096
n-layers 32
head-dim 128
hidden-dim 14336
n-heads 32
n-kv-heads 8
window-size 4096
context-len 8192
vocab-size 32000
Figure 3.1: Sliding window attention adapted from Jiang et al. [2023].
Performance:
Jiang et al. [2023] offers two versions of its 7B-parameter language model:
a pre-trained base model and an instruction-tuned model specifically designed
for chat applications. The instruction-tuned model1 demonstrates superior
performance in chat-like settings.
While the base model can be run with 14.4 GB of vRAM, Jiang et al.
[2023] recommends using a system with at least 24 GB of vRAM for optimal
performance. Additionally, the model achieves a reported MMLU score of
60.1%, which outperforms the Llama2 13B model (see Table 2.1). For
this thesis, the instruction-tuned model will be utilized.
1 https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-AWQ
Figure 3.2: Rolling Buffer Cache adapted from Jiang et al. [2023]. The cache
operates with a predefined capacity of W entries. Data is stored using a key-value
structure, where each key-value pair is placed at a specific position determined by
the modulo operation (i mod W) on the key's index i. If the index i exceeds the
cache's capacity W, the oldest entries are overwritten to accommodate new data.
The most recently generated tokens and their corresponding internal representation
are highlighted for easy identification.
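A minimal sketch of the (i mod W) placement described in the caption (the stored strings are stand-ins for key/value tensors):

```python
# Sketch of a rolling buffer KV cache: position i lands in slot i mod W,
# so entries older than W tokens are overwritten.
class RollingBufferCache:
    def __init__(self, window_size):
        self.W = window_size
        self.slots = [None] * window_size

    def put(self, i, kv):
        self.slots[i % self.W] = kv      # overwrite the oldest entry

    def entries(self):
        return [kv for kv in self.slots if kv is not None]

cache = RollingBufferCache(window_size=4)
for i in range(6):                        # tokens 4 and 5 overwrite slots 0 and 1
    cache.put(i, f"kv_{i}")
print(cache.entries())                    # ['kv_4', 'kv_5', 'kv_2', 'kv_3']
```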
Input:
• Response to Evaluate: This is the answer that the LLM gives to the
instruction.
• Customized Score Rubric: This is a set of guidelines that tells the eval-
uation LLM how to score the response. It includes things like what the
LLM should look for in a good answer and how to rate different aspects
of the response.
Output:
• Score: This is a number between 1 and 5 that indicates how well the
response performed on the task.
Since Prometheus was built upon Llama2, it offers the same 7B and 13B
fine-tuned models. To choose the evaluation model required for this thesis, the
Multi-Turn Benchmark (MT-Bench) - Human Preference metric was the most
suitable. MT-Bench, first introduced by Zheng et al. [2023], is a metric
that measures the ability of LLMs to engage in coherent, informative, and
engaging conversations. The authors also hand-crafted multiple customized score
rubrics and generated a reference answer using GPT-4 for each test prompt,
which in turn created a new evaluation benchmark called MT-Bench
Human Preference. In Table 3.2 it can be observed how fine-tuning improved
alignment with human preferences. For this thesis, Prometheus 13B will be
utilized as the evaluator LLM.
Table 3.2: MT-Bench Human Preference for different models as reported by Kim
et al. [2023].
AWQ (Lin et al. [2023]) recognizes that not all weights in a model are
equally important. Some weights have a significant impact on the final out-
put, while others have a minimal effect. AWQ uses a calibration step to identify
these "salient weights." A small subset of the training data is passed through
the model, and the activations are analyzed. Based on this analysis, AWQ
determines which weights have a larger influence on the activations. Once
identified, these crucial weights are protected during the quantization process.
They are quantized with higher precision to minimize the introduction of er-
rors. The remaining, less critical weights are quantized with lower precision.
This approach significantly reduces the overall memory footprint of the model
without compromising accuracy. In Figure 3.3 it can be observed how only the
salient weights are kept at their original precision, while the remaining weights
are quantized. The red shading of the X matrix represents how salient the
weights are.
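The saliency idea can be sketched as follows (a toy per-channel version; note that real AWQ protects salient channels by scaling them before uniform low-bit quantization rather than by the mixed precision shown here):

```python
# Toy sketch: identify activation-salient weight channels, quantize the rest.
import numpy as np

def quantize_int4(w):
    scale = np.abs(w).max() / 7 + 1e-12       # symmetric int4 range [-8, 7]
    return np.round(w / scale).clip(-8, 7) * scale

W = np.random.randn(128, 128)                  # weight matrix
act = np.abs(np.random.randn(1000, 128))       # calibration activations
saliency = act.mean(axis=0)                    # per-input-channel activation scale
salient = saliency >= np.quantile(saliency, 0.99)  # top ~1% of channels

W_q = W.copy()
W_q[:, ~salient] = quantize_int4(W[:, ~salient])   # low precision for the rest
# Salient columns of W_q are untouched, mimicking the protection step.
```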
Flash Attention:
Standard attention computes interactions between every element in the input
sequence. This process requires storing a large attention matrix in memory,
whose size scales quadratically with the sequence length. For long sequences,
this matrix becomes enormous, exceeding the memory capacity of available
hardware. Additionally, standard attention involves frequent data transfers
between slower memory (high-bandwidth memory) and faster on-chip memory
(static random-access memory) on GPUs, leading to performance bottlenecks.
Flash Attention tackles these challenges through two key innovations. First
is tiling, in which the attention matrix is divided into smaller, manageable
tiles. This approach significantly reduces the memory footprint required to
store the entire matrix at once. Second, instead of repeatedly transferring
data between memory and performing calculations step by step, Flash Atten-
tion performs all necessary operations (key, query, and value transformations)
within the on-chip memory in one go. This eliminates the need for frequent
data transfers and boosts performance.
Flash Attention requires significantly less memory to process long sequences
compared to standard attention. This enables processing of larger models and
longer sequences on hardware with limited memory resources. It also yields
faster inference: by minimizing data transfers and performing fused operations
in on-chip memory, Flash Attention significantly accelerates the attention cal-
culations, translating to faster model inference times and improved overall
performance.
Flash Attention addresses the memory bottleneck and performance lim-
itations of standard attention mechanisms. By leveraging tiling and fused
operations, it offers a memory-efficient and high-performance solution for pro-
cessing LLMs, particularly those dealing with long sequences. This technique
paves the way for deploying powerful LLMs on resource-constrained devices.
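The tiling idea can be sketched with an online-softmax loop over key/value tiles (a NumPy illustration of the algorithmic trick only; the real kernel fuses these steps inside GPU on-chip memory):

```python
# Online-softmax attention over key/value tiles, never materializing the
# full n-by-m score matrix at once (illustrative sketch of the tiling idea).
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    row_max = np.full(n, -np.inf)   # running max of attention scores per query
    row_sum = np.zeros(n)           # running softmax denominator per query
    for s in range(0, K.shape[0], tile):
        scores = Q @ K[s:s + tile].T / np.sqrt(d)        # scores for this tile
        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)                # rescale old accumulators
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ V[s:s + tile]
        row_max = new_max
    return out / row_sum[:, None]

Q, K, V = (np.random.randn(8, 32) for _ in range(3))
assert np.allclose(tiled_attention(Q, K, V, tile=4),
                   tiled_attention(Q, K, V, tile=32))   # matches untiled result
```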
Paged Attention:
Each block in this cache corresponds to a particular word in the sequence.
When the LLM needs information about a word, it uses the lookup table to
find the corresponding block in memory and retrieves only the relevant data.
This eliminates the need to keep the entire cache readily available.
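A minimal sketch of this block-table bookkeeping (block size and data are illustrative; vLLM's actual implementation manages physical GPU memory blocks):

```python
# PagedAttention-style bookkeeping: the KV cache is split into fixed-size
# blocks, and a block table maps logical positions to physical blocks,
# so memory is allocated on demand instead of reserved up front.
BLOCK_SIZE = 16

physical_blocks = {}      # block_id -> list of per-token KV entries
block_table = []          # logical block index -> physical block id
next_block_id = 0

def append_kv(pos, kv):
    global next_block_id
    logical_block = pos // BLOCK_SIZE
    if logical_block >= len(block_table):      # lazily allocate a new block
        block_table.append(next_block_id)
        physical_blocks[next_block_id] = []
        next_block_id += 1
    physical_blocks[block_table[logical_block]].append(kv)

def lookup(pos):
    block_id = block_table[pos // BLOCK_SIZE]
    return physical_blocks[block_id][pos % BLOCK_SIZE]

for i in range(40):
    append_kv(i, f"kv_{i}")
print(lookup(17))   # 'kv_17', fetched via the block table
```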
Table 3.3: Different embedding models used from Xiao et al. [2023]
3.5 Llamaindex
Llamaindex5 is a software framework specifically designed to augment the ca-
pabilities of LLMs in the domain of RAG. Some of the core functionalities of
Llamaindex are data integration, data preprocessing, and retrieval augmenta-
tion. Llamaindex establishes seamless connections between LLMs and diverse
data repositories, including document databases, vector stores, and even other
LLMs. This grants LLMs the ability to retrieve task-relevant information from
these external sources. It can manipulate the retrieved data to render it suit-
able for consumption. This may encompass operations such as summarization,
information extraction, and tokenization. During the text generation process,
Llamaindex dynamically retrieves pertinent data based on the evolving context
and injects it into the LLM.
The user furnishes a prompt or starting point for the text generation
task, along with any pertinent contextual information. This is then provided
through the flow with a provided prompt and context to execute a retrieval
strategy, identifying and retrieving relevant information from the integrated
data sources. The retrieved data undergoes processing to render it compatible
with LLM consumption, and then it is fed back to the LLM as supplementary
information. The LLM leverages both its internal knowledge repository and
the retrieved data to generate text that is not only factually accurate but also
informative and relevant to the prompt and context.
By furnishing LLMs with access to pertinent data, LlamaIndex facilitates
the generation of text that is more accurate, informative, and factually sound.
The retrieved data bolsters the LLM’s ability to maintain coherence and con-
sistency throughout the generated text. LlamaIndex’s capacity to integrate
diverse data sources and tools renders it adaptable to a broad spectrum of use
cases. In essence, LlamaIndex empowers LLMs to move beyond their internal
knowledge base (from pre-training) and leverage external information for a
more robust and informative text generation process.
There are multiple software frameworks that offer this RAG logic; Llamaindex
was chosen for this thesis.
5 https://github.com/run-llama/llama_index
• Calculate the distance between the node and each sampled point.
• Connect the node to the M closest points, forming its immediate neigh-
bors at Level 0.
• For each node in the shortlist: sample a small number of points (e.g., √M)
from the previous level (Level L−1). Calculate the distance to all points in
the shortlist, and among these, connect to the furthest point. This
connection acts as a "shortcut" to potentially distant clusters.
• Ef: 100
• M: 16
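For illustration, these construction parameters map onto a standalone HNSW library as follows (a sketch with the hnswlib package and a stand-in dataset; the thesis itself uses the HNSW implementation inside Qdrant, described next):

```python
# Illustrative HNSW index with ef_construction=100 and M=16.
import hnswlib
import numpy as np

dim = 1024                      # e.g. the BGE-large embedding dimensionality
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=100, M=16)

vectors = np.random.rand(10_000, dim).astype(np.float32)  # stand-in embeddings
index.add_items(vectors)

index.set_ef(100)               # search-time breadth (assumed equal to Ef above)
labels, distances = index.knn_query(vectors[:1], k=3)
```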
Qdrant:
3.5.3 Retrieval
Following the parsing and indexing of document chunks, the appropriate re-
trieval engine is selected. Subsequently, the retrieval method's parameters,
including the top-k value, require careful consideration.
Top-k refers to the maximum number of most relevant chunks retrieved by
the chosen method. Intuitively, a larger top-k value suggests a broader context
retrieved from the index, potentially leading to a more comprehensive response
to the query.
9 https://github.com/qdrant/qdrant
CHAPTER 3. APPROACH
3.5.4 Generation
Once the retrieval stage has identified pertinent chunks, these chunks along
with the user’s query are used to construct a prompt that guides the LLM in
response generation. This prompt essentially serves as a structured input for
the LLM, incorporating the user’s intent and the retrieved contextual infor-
mation. The prompt typically combines the user’s query with key elements
from the retrieved documents. This may involve extractive techniques where
relevant snippets from the retrieved chunks are directly incorporated into the
prompt, or abstractive techniques where the key concepts and factual infor-
mation are paraphrased and woven into a cohesive prompt. The constructed
prompt is then fed into the LLM. The LLM’s ability to process language and
generate coherent text allows it to leverage the information within the prompt
to formulate a response that addresses the user’s query and incorporates in-
sights from the retrieved documents.
As illustrated in Figure 3.6, a sample prompt can provide a concrete exam-
ple of how the user’s query and retrieved context are combined to guide the
LLM’s response generation.
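A hedged sketch of such a prompt template follows (the wording is illustrative and resembles common default RAG templates; it is not the exact prompt of Figure 3.6):

```python
# Hypothetical RAG prompt template combining retrieved chunks and the query.
retrieved_chunks = [
    "Chunk 1: The IR-Anthology aggregates information retrieval literature ...",
    "Chunk 2: Publications are indexed with unified bibliographic metadata ...",
]

PROMPT_TEMPLATE = """\
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query}
Answer: """

prompt = PROMPT_TEMPLATE.format(
    context="\n\n".join(retrieved_chunks),   # top-k chunks from retrieval
    query="How does the IR-Anthology index its publications?",
)
```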
3.6.2 Best Matching 25 (BM25)

score(D, Q) = Σ_{w ∈ Q} IDF(w) · [ tf(w, D) · (k1 + 1) ] / [ tf(w, D) + k1 · (1 − b + b · |D| / avgdl) ]   (3.1)

• IDF(w): Inverse document frequency of term w. This reflects how rare
or common the term is across the document collection. This component
ensures that terms that appear frequently across many documents (common
words like "the" or "a") contribute less to the score compared to rare and
potentially more informative terms.
• tf(w, D): Term frequency of w in document D. Documents containing a
query term more often are intuitively considered more relevant. However,
simply counting occurrences can be misleading. BM25 addresses this by
incorporating a saturation factor through k1.
• |D|: Length of document D; avgdl: average document length across the
collection; b controls the strength of length normalization.
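For illustration, a sketch of BM25 scoring with the rank_bm25 package (the corpus, query, and the library's default parameters k1 = 1.5 and b = 0.75 are illustrative, independent of the exact pipeline used in this thesis):

```python
# BM25 scoring sketch with the rank_bm25 package.
from rank_bm25 import BM25Okapi

corpus = [
    "retrieval augmented generation combines retrieval with generation",
    "hnsw builds a navigable small-world graph for dense retrieval",
    "bm25 is a sparse lexical ranking function with parameters k1 and b",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])  # defaults: k1=1.5, b=0.75
scores = bm25.get_scores("what is bm25".split())   # one score per document
best = corpus[scores.argmax()]                     # the third document wins
```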
Even if a document appears in the rankings of only a few systems, it can still
achieve a good RRF score, promoting diversity in the final results.
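A compact sketch of reciprocal rank fusion (the constant k = 60 follows the commonly cited default from the original RRF paper; whether the pipeline uses exactly this constant is an assumption):

```python
# Reciprocal rank fusion: sum 1 / (k + rank) across the input rankings.
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first), one per retriever."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["c4", "c1", "c7"]     # e.g. a BM25 ranking
dense = ["c1", "c9", "c4"]      # e.g. an HNSW ranking
print(reciprocal_rank_fusion([sparse, dense]))  # 'c1' and 'c4' rise to the top
```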
3.6.5 Reranker
Rerankers are encoder-based transformer models that act as a post-processor
to the retrieved chunks. They function as specialized models designed to re-
evaluate and reorder the initial set of documents retrieved by a search engine
in response to a user query. This process aims to elevate the most relevant and
informative documents to the top of the search results, enhancing the user's
experience by prioritizing the information they seek.
Following the initial retrieval of the top-k candidate documents by the re-
triever component, a reranker module takes center stage. This module metic-
ulously evaluates the relevance of each retrieved document to the specific user
query. This evaluation leverages a cross-encoder architecture, as introduced in
Lee et al. [2023], which assesses the semantic similarity between a retrieved doc-
ument chunk and the query. Unlike traditional vector similarity approaches,
where contextually chunked text and query text are independently processed
by an embedding model before distance metrics (e.g., cosine similarity, Eu-
clidean distance) are applied, cross-encoders encode both texts jointly within
the embedding model. The resulting embeddings are then fed into a classi-
fier layer, such as a neural network, to generate a final relevance score. After
generating the final relevance score, the chunks are then re-ordered, with the
highest score placed at the first position.
This thesis leveraged the BAAI/bge-reranker-large (Xiao et al. [2023])
model, which has a BERT-like architecture. Figure 3.7 shows the differences
in how the bi-encoder and cross-encoder handle the retrieved chunk and query.
This chapter concluded by detailing the retrieval methods utilized throughout
this thesis. These include the BM25 algorithm, vector retrieval using HNSW,
and others.
Chapter 4, which follows, will explore the evaluation methodologies em-
ployed to identify the most suitable configuration for the IR-Anthology dataset.
Additionally, it will delve into the various evaluations used and their connec-
tion to the RAG pipeline.
Chapter 4
Evaluation
The preceding chapters have meticulously dissected the inner workings of the
RAG pipeline, a powerful tool designed to leverage the strengths of LLMs
and retrieval techniques. Chapters 1 and 2 unveiled the inherent limitations
of LLMs, highlighting their shortcomings with hallucinations, information cut-
off, and domain specificity. Chapter 2 introduced RAG as a potential remedy,
showcasing its effectiveness across various tasks like open-domain question an-
swering, abstractive question answering, and Jeopardy question generation.
The explanation included a flow chart that illustrated the basic steps in a RAG
pipeline. It also explained the key components, like transformers, which are
the foundation of LLMs, and embedding models.
Chapter 3 built upon this foundation by delving into the intricate details of
RAG's implementation. It meticulously examined the specific LLMs employed
within the RAG pipeline, such as Mistral 7B and Prometheus 13B, to identify
their distinct advantages.
Furthermore, Chapter 3 investigated optimization techniques like quantization
and vLLM. It then explored the unifying platform, the Llamaindex framework,
which seamlessly orchestrates the execution of RAG logic by integrating all the
previously discussed components. Finally, the chapter concluded with a
comprehensive analysis of the retrieval methods utilized by RAG to extract
relevant information from vast datasets.
Having established a thorough understanding of the RAG pipeline, Chapter
4 now shifts its focus towards the crucial process of evaluation. This chapter
will delve into the methodologies employed to assess the suitability of various
RAG configurations for the specific demands of the IR-Anthology dataset.
This is accompanied by an exploration of the diverse range of evaluation
metrics and their intricate connection to the performance of the RAG pipeline.
This chapter outlines a two-stage evaluation framework for assessing re-
trieval performance. To address the cost and time constraints associated with
manual annotation, synthetic evaluation data is generated, as detailed in the
next section.
softmax(z_i) = e^{z_i / T} / Σ_j e^{z_j / T}   (4.1)
The parameter max new tokens plays a crucial role in controlling the length
of text generated by the LLM. It determines the maximum number of new to-
kens the model will add to the provided context. This interaction between max
new tokens and the length of the context is pivotal, with the latter establishing
the starting point for text generation.
Higher values of max new tokens allow the LLM to produce longer and
more detailed outputs, potentially offering comprehensive responses or narra-
tives. Conversely, lower values result in concise summaries or shorter creative
compositions.
In the Mistral 7B instructional LLM configuration, the parameter max
new tokens was set to 128 for question generation and 512 for answer
generation, as seen in Table 4.2. Additionally, the context length was specified
as 2048. For Prometheus 13B, a context length of 2048 was selected, along
with a maximum of 1024 new tokens (Table 4.3), to facilitate a more detailed
explanation in the evaluation discussed later in this chapter.
Table 4.2: Mistral 7B Instruct parameters (Temperature: 0, Context Length: 2048,
Max New Tokens: 128).

Table 4.3: Prometheus 13B parameters (Temperature: 0, Context Length: 2048,
Max New Tokens: 1024).
"corpus" key represents the chunks stored within the index, with each individ-
ual chunk assigned its own UUID. The "relevant chunks" key establishes con-
nections between the generated queries and the corresponding chunks through
the utilization of UUIDs. A representative example of a singular question and
its associated chunk pair is illustrated in Figure 4.2.
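A hedged sketch of the resulting file layout (keys shortened, UUIDs and texts invented for illustration):

```json
{
  "queries": {
    "3f2b-…": "Which retrieval methods does the thesis compare?"
  },
  "corpus": {
    "a91c-…": "… text of a chunk parsed from an IR-Anthology PDF …"
  },
  "relevant_chunks": {
    "3f2b-…": ["a91c-…"]
  }
}
```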
Table 4.4: Duration of parsing, encoding, and indexing for each method for dataset
preparation.
4.2 Retrieval Evaluation
This section will delineate the retrieval evaluation process, elucidate the met-
rics employed for assessment, and detail the various configurations imple-
mented.
4.2.1 Flow
Building upon the preceding section outlining the generation and storage of
question evaluations in JSON format, this subsection delves into the subsequent
steps. Following the completion of the question generation setup, the queries
undergo processing by the query engine to obtain the top-k results. As em-
phasized in the earlier chapter, our evaluations exclusively focus on the top-3
chunks. Once retrieved, these chunks undergo evaluation based on two met-
rics: Mean Reciprocal Rank (MRR) and Hit Rate. Figure 4.5 visualizes the
end-to-end flow of retrieval evaluation.
4.2.2 Evaluation
Within the RAG pipeline, the objective is to assess three key elements: the
parsing method, the encoder (embedding model), and the retrieval method. This
examination aims to scrutinize the impact that tuning each component has
on the retrieval setup.
The evaluation process commences with the scrutiny of the parsing
component, examining token-based parsing at three distinct sizes: 256, 512,
and 1024.
Subsequently, an appraisal of retrieval methods is scheduled. This encom-
passes the evaluation of BM25 (sparse), HNSW (dense), a hybrid approach
(combining sparse and dense methods), a reranker, and HyDE.
In the context of vector retrieval, the designated distance metric is the
cosine similarity function. Equations 4.2 and 4.3 explicitly denote the formu-
lation of the cosine similarity function.
Cosine Similarity(A, B) = (A · B) / (‖A‖ · ‖B‖)   (4.2)

Cosine Similarity(A, B) = ( Σ_{i=1}^{n} A_i · B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )   (4.3)

• ‖A‖ and ‖B‖ represent the Euclidean norms of vectors A and B, respec-
tively.
Subsequently, a reranker was incorporated into each method. The entire pro-
cess was reiterated, this time incorporating HyDE into the evaluation for a
comprehensive analysis.
The schematic representation in Figure 4.6 illustrates the integration of
HyDE within the evaluation flow as well as the prompt used in Figure 4.7.
Prior to entering the query engine, the query undergoes a re-writing process
facilitated by the LLM. Subsequently, the flow continues in a conventional
manner.
4.2.3 Metrics
Mean Reciprocal Rank (MRR): MRR rewards retrieval runs that place the
relevant chunk close to the top of the results:

MRR = (1 / |Q|) · Σ_{i=1}^{|Q|} 1 / R_i

• |Q| is the total number of evaluation queries.
• R_i is the rank of the first relevant item for the i-th query.

Hit Rate: The hit rate is the fraction of queries for which the relevant chunk
appears among the top-k retrieved chunks at all.
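Both metrics can be sketched in a few lines, assuming one relevant chunk per query as in the synthetic dataset:

```python
# Retrieval metrics over ranked result lists (one gold chunk per query).
def mrr(retrieved, relevant):
    """retrieved: list of ranked id lists; relevant: list of gold ids."""
    total = 0.0
    for ranked, gold in zip(retrieved, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)   # reciprocal of rank R_i
    return total / len(retrieved)

def hit_rate(retrieved, relevant, k=3):
    hits = sum(gold in ranked[:k] for ranked, gold in zip(retrieved, relevant))
    return hits / len(retrieved)

retrieved = [["c1", "c9", "c4"], ["c2", "c7", "c5"]]
relevant = ["c9", "c8"]
print(mrr(retrieved, relevant))       # (1/2 + 0) / 2 = 0.25
print(hit_rate(retrieved, relevant))  # 1 hit out of 2 queries = 0.5
```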
4.2.4 Results
Table 4.7 illustrates noteworthy trends in Hit Rate and MRR across various
Chunk Sizes (256, 512, and 1024) and Retrieval Methods, including Sparse
with BM25 (S), Dense with HNSW (D), with Reranker (RR), and Hybrid (H).
Remarkably, the combination of Chunk Size 1024 and Retrieval Method Hy-
brid consistently delivers superior performance, boasting the highest Hit Rate
of 0.92 and commendable MRR values. Sparse with BM25, especially with
chunk size 256, also demonstrates competitive outcomes. The incorporation
of Reranker in certain scenarios proves beneficial, as evidenced by improved
performance. Larger chunk sizes generally exhibit higher Hit Rates and MRR.
In a parallel fashion, Table 4.8 depicts the outcomes derived from employ-
ing HyDE prior to submitting the query to the query engine. Nevertheless,
discernibly, there is a conspicuous decline in performance associated with this
methodology. In contrast, subsequent to the application of HyDE, HNSW
retrieval exhibits a slight improvement over BM25, suggesting that BM25
performs slightly less favorably in this context. This improvement may be
ascribed to the supplementary details and information incorporated into the
query, thereby facilitating a closer alignment between the query and the target
retrieval point within the vector space. During the evaluation of HyDE, the
inclusion of a reranker was deemed unnecessary. This determination stemmed
from the realization that integrating an additional reranker would necessitate
four forward propagations of transformers, leading to a very long processing
time (more than 20 seconds). In a finalized pipeline, the sequence of operations
would manifest as follows:
4.3 Generation Evaluation
This section will systematically outline the process of evaluating the genera-
tion, explicate the metrics utilized for assessment, and provide comprehensive
details on the diverse configurations implemented.
4.3.1 Flow
In Figure 4.8, the procedural outline of the generation evaluation process com-
mences subsequent to the generation and compilation of question and chunk
pairs into a JSON file during the retrieval evaluation phase. Following the ac-
quisition of these question-chunk pairs, they undergo processing wherein they
are inputted into the LLM (Mistral 7B - instruct) for answer generation. The
resultant answers are subsequently stored within a JSON file along with the
preceding contextual information.
Upon the completion of JSON file formulation, the evaluation phase en-
sues. Sequentially, each question-answer pair, along with its associated chunk,
is subjected to assessment using the evaluating LLM (Prometheus 13B). This
evaluation is conducted with regard to two primary metrics: Relevancy and
Faithfulness. Furthermore, the semantic similarity between the generated an-
swer and the corresponding chunk is determined by encoding both elements
via an embedding model, followed by cosine similarity analysis on the resultant
vectors.
The resulting JSON file contains feedback on the metrics of faithfulness
and relevancy, alongside their respective scores. This feedback elucidates the
rationale behind the assigned scores, providing insight into the evaluation pro-
cess.
4.3.2 Evaluation
The primary objectives of the generation evaluation encompass assessing the
LLMs, particularly Mistral 7B - instruct, in their capacity to synthesize con-
textual information, maintain fidelity to the provided chunk, and furnish per-
tinent responses to inquiries. Additionally, the evaluation aims to scrutinize
how different parsing methods impact these aforementioned aspects.
In contrast to the preceding retrieval evaluation, the current evaluation ex-
tends beyond the evaluation of the conventional token-based parsing method
to include an examination of the SW parsing method. Notably, the considera-
tion of the SW parsing method is exclusive to the generation evaluation owing
to its characteristic of accommodating an average token size for a SW of 3 and
6, closely approximating token sizes of 256 and 512 respectively. This decision
is further justified by the observations outlined in Table 4.1, wherein the sig-
nificant volume of nodes would necessitate substantial computational resources
for the retrieval evaluation.
4.3.3 Metrics
Three main metrics were utilized for the evaluation. Relevancy, faithfulness,
and semantic similarity.
Relevancy:
Faithfulness:
Semantic Similarity:
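As described in Section 4.3.1, this metric encodes the generated answer and the source chunk with the embedding model and compares the resulting vectors with cosine similarity. A hedged sketch with the sentence-transformers package (the exact BGE checkpoint name is an assumption):

```python
# Semantic similarity between generated answer and source chunk.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # assumed BGE checkpoint
answer = "Larger chunks improve retrieval hit rate."
chunk = "Our experiments show larger chunk sizes yield higher hit rates."
emb = model.encode([answer, chunk], normalize_embeddings=True)
similarity = util.cos_sim(emb[0], emb[1]).item()       # value in [-1, 1]
print(round(similarity, 3))
```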
4.3.4 Results
Table 4.9 presents the results of various evaluations across different chunk sizes,
namely 256, 512, and 1024, as well as sentence window 3 and 6. In Figure 4.11
the resulting JSON file for a single Q&A example can be seen.
In terms of faithfulness, the chunk size of 256 exhibits the highest per-
formance, with a score of 0.933, indicating a strong alignment between the
generated answers and the provided chunks. Conversely, the chunk size of
1024 demonstrates the lowest faithfulness score of 0.585.
Regarding relevancy, the sentence-window configuration SW 6 performs the
best, achieving a score of 0.714. In contrast, the chunk size of 1024 displays
the lowest relevancy score of 0.541.
In semantic similarity, SW 6 also emerges as the top performer, with a
score of 0.547, indicating a high degree of similarity between the generated
answers and the provided chunks. Conversely, the chunk size of 512 exhibits
the lowest semantic similarity score of 0.480.
These preliminary findings provide insights into the varying performance of
different chunk sizes across the metrics of faithfulness, relevancy, and semantic
similarity. Further exploration and detailed analysis of these results will be
discussed in Chapter 5.
Chapter 5
Discussion & Analysis
As explored in the preceding chapters, the RAG framework offers promising so-
lutions to the limitations encountered with standalone LLMs in NLP. Chapter
2 laid the groundwork by discussing the potential of RAG in addressing issues
like hallucination, information cutoff, and domain specificity, while Chapter 3
delved into the implementation details of the RAG pipeline, highlighting key
components such as the LLMs used, the quantization technique, embedding
models, and the orchestration framework, Llamaindex. Furthermore, Chap-
ter 4 provided insights into the evaluation methodologies employed to assess
the performance of the RAG pipeline, particularly focusing on the two-step
evaluation process for the retrieval and generation components.
This chapter aims to analyze the findings from the evaluation process and
propose avenues for further exploration and improvement within the RAG
framework. Following the two-step evaluation process introduced in Chapter
4, which served as an ablation study to isolate the impact of each component,
this section analyzes the insights gained from each step. This analysis informs
the selection of the configuration for the IR-Anthology and possible trade-
offs. The discussion will first explore the findings from each evaluation stage,
followed by a comprehensive overview of the potential final setup.
factual queries. Dense methods like HNSW might capture more semantic rela-
tionships between chunks, but these might not be as crucial for factual searches
in scientific research. However, a combination of both BM25 and HNSW also
resulted in the best-performing hit rate, which could be attributed to getting
the best of both worlds from the two algorithms.
Chunk size exhibited a significant influence on retrieval performance, with
opposing effects on sparse and dense retrieval models. Sparse retrieval al-
gorithms, like BM25, demonstrated improved performance with increasing
chunk size. This can be attributed to the term frequency weighting mechanism
within BM25, which benefits from a wider range of terms to assess relevance.
Conversely, dense retrieval models experienced performance degradation with
larger chunk sizes.
The reranker does not offer improvement to the hit rate metric, as the
required chunk is still within the top-k, but its ranking within the top-k impacts
the MRR metric.
The introduction of HyDE, as seen in Table 4.8, resulted in a decrease
in performance across all evaluated configurations. This decline is likely at-
tributable to the query rewriting process, which may introduce irrelevant terms
("hallucinations") during expansion. For retrieval models like BM25, which
rely heavily on the presence of original query terms, this can be particularly
detrimental, as relevant keywords might be removed or altered. Conversely,
HNSW exhibited a smaller performance drop with HyDE. This suggests that
the transformed queries may have remained within a similar semantic space to
the original, potentially aligning with the intended behavior of HyDE.
Figure 5.1: Evaluation results grouped by different chunk sizes, for each retrieval
method.
Figure 5.1 plots the grouped chunk sizes with their respective retrieval
algorithms. The following are the best-performing retrieval methods for each
chunk size:
Figure 5.2: Evaluation results for the generation. Comparison bar plot for the
different chunk sizes. (Sentence Window - SW)
Chapter 6
Conclusion
The objective of this thesis was to utilize RAG to enable efficient interaction
between users and the IR-Anthology; this was done by experimenting with
multiple RAG configurations to reach the most optimal setup. In doing so, it
mitigated the issues faced when using standalone LLMs, which suffer from
hallucinations, information cutoff, and a lack of domain specificity.
Chapter Review
• Chapter 1 & Chapter 2: The groundwork was laid for understanding the
potential of RAG in mitigating issues such as hallucination, information
cutoff, and domain specificity. The fundamental concepts behind RAG were
introduced (e.g., transformers, LLMs, embedding models), emphasizing
the potential of combining the strengths of RAG with the IR-Anthology to
create a robust tool for researchers within the domain of IR.
• Chapter 3: This chapter provided detailed insights into the implementation
aspects of the RAG pipeline, including the selection of LLMs, the embed-
ding model, and retrieval methods. The explored methodologies for LLMs
highlighted the importance of optimization techniques such as
FlashAttention and PagedAttention.
• Chapter 4: The evaluation process discussed offered valuable insights into
the performance of the RAG pipeline, particularly in terms of retrieval
and generation components. Through meticulous evaluation methodolo-
gies, optimal configurations were identified, showcasing the effectiveness
of techniques such as hybrid retrieval methods and smaller chunk sizes
in enhancing both retrieval and generation performance.
• Chapter 5: This chapter further deepened the analysis by dissecting the
retrieval and generation results and exploring explanations for the observed
trends.
Challenges
Concluding Remarks
The evaluation also leads to the inference that a discernible trade-off exists
between retrieval efficiency and generation quality. Specifically, increasing the
size of chunks
enhances the performance of the retrieval process, albeit at the expense of
compromising the generation capabilities of the LLM. This phenomenon can
be analogized to presenting an individual with a sizable textbook and tasking
them with extracting information to answer a query. While the required infor-
mation may indeed be present within the extensive text, the individual faces
challenges in processing the entirety of the textbook simultaneously, possibly
necessitating additional research endeavors, such as fine-tuning, to adapt more
effectively to its contents.
In essence, the journey of implementing RAG within the IR-Anthology
framework illuminated not only the technical hurdles but also the inherent
trade-offs between different aspects of the system. These challenges under-
score the need for further exploration and refinement, particularly in devising
strategies that strike a delicate balance between retrieval efficiency and genera-
tion quality. Future endeavors may focus on innovative approaches to optimize
this trade-off. By addressing these challenges head-on, we can pave the way for
more robust and versatile systems that redefine the landscape of information
retrieval and generation in research domains.
Moving forward, the insights gleaned from this comprehensive analysis
pave the way for further exploration and refinement of the RAG framework.
Future research endeavors could focus on finetuning retrieval and generation
strategies, optimizing chunk segmentation techniques, and exploring novel ap-
proaches to enhance overall performance and scalability.
Bibliography
Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-
Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan
Tan. The ACL Anthology reference corpus: A reference dataset for biblio-
graphic research in computational linguistics. In Nicoletta Calzolari, Khalid
Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and
Daniel Tapias, editors, Proceedings of the Sixth International Conference
on Language Resources and Evaluation (LREC’08), Marrakech, Morocco,
May 2008. European Language Resources Association (ELRA). URL http:
//www.lrec-conf.org/proceedings/lrec2008/pdf/445_paper.pdf.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading
wikipedia to answer open-domain questions. In Regina Barzilay and Min-
Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Associa-
tion for Computational Linguistics, ACL 2017, Vancouver, Canada, July
30 - August 4, Volume 1: Long Papers, pages 1870–1879. Association
for Computational Linguistics, 2017. doi: 10.18653/V1/P17-1171. URL
https://doi.org/10.18653/v1/P17-1171.
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher
Ré. Flashattention: Fast and memory-efficient exact attention with io-
awareness, 2022.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
pre-training of deep bidirectional transformers for language understanding.
In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings
of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, NAACL-
HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and
Short Papers), pages 4171–4186. Association for Computational Linguistics,
2019a. doi: 10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/
n19-1423.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
Pre-training of deep bidirectional transformers for language understanding,
2019b.
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik,
and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context
from a search engine. CoRR, abs/1704.05179, 2017. URL
http://arxiv.org/abs/1704.05179.
Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot
dense retrieval without relevance labels, 2022.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei
Chang. REALM: retrieval-augmented language model pre-training. CoRR,
abs/2002.08909, 2020. URL https://arxiv.org/abs/2002.08909.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika,
Dawn Song, and Jacob Steinhardt. Measuring massive multitask language
understanding, 2021.
Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran
Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Min-
joon Seo. Prometheus: Inducing fine-grained evaluation capability in lan-
guage models, 2023.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng,
Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient
memory management for large language model serving with PagedAttention,
2023.
Hyun Seung Lee, Seungtaek Choi, Yunsung Lee, Hyeongdon Moon, Shinhyeok
Oh, Myeongho Jeong, Hyojun Go, and Christian Wallraven. Cross encoding
as augmentation: Towards effective educational text classification, 2023.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan,
and Song Han. AWQ: Activation-aware weight quantization for LLM compres-
sion and acceleration, 2023.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Ran-
gan Majumder, and Li Deng. MS MARCO: A human generated machine
reading comprehension dataset. In Tarek Richard Besold, Antoine Bor-
des, Artur S. d’Avila Garcez, and Greg Wayne, editors, Proceedings of the
Workshop on Cognitive Computation: Integrating neural and symbolic ap-
proaches 2016 co-located with the 30th Annual Conference on Neural In-
formation Processing Systems (NIPS 2016), Barcelona, Spain, December 9,
2016, volume 1773 of CEUR Workshop Proceedings. CEUR-WS.org, 2016.
URL https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raf-
fel, and Chris Callison-Burch. Bidirectional language models are also few-
shot learners. In The Eleventh International Conference on Learning Rep-
resentations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net,
2023. URL https://openreview.net/pdf?id=wCFB37bzud4.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits
of transfer learning with a unified text-to-text transformer. J. Mach. Learn.
Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
Roshan Rao, Joshua Meier, Tom Sercu, Sergey Ovchinnikov, and Alexan-
der Rives. Transformer protein language models are unsupervised struc-
ture learners. In 9th International Conference on Learning Representations,
ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
URL https://openreview.net/forum?id=fylclEqgvgd.
Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you
pack into the parameters of a language model? In Bonnie Webber, Trevor
Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Language Processing, EMNLP 2020,
Online, November 16-20, 2020, pages 5418–5426. Association for Computa-
tional Linguistics, 2020. doi: 10.18653/V1/2020.EMNLP-MAIN.437. URL
https://doi.org/10.18653/v1/2020.emnlp-main.437.
Stephen E. Robertson and Hugo Zaragoza. The probabilistic relevance frame-
work: BM25 and beyond. Foundations and Trends in Information Retrieval,
3(4):333–389, 2009.
Shawon Sarkar, Maryam Amirizaniani, and Chirag Shah. Representing tasks
with a graph-based method for supporting users in complex search tasks. In
Jacek Gwizdka and Soo Young Rieh, editors, Proceedings of the 2023 Confer-
ence on Human Information Interaction and Retrieval, CHIIR 2023, Austin,
TX, USA, March 19-23, 2023, pages 378–382. ACM, 2023. doi: 10.1145/
3576840.3578279. URL https://doi.org/10.1145/3576840.3578279.
Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202,
2020. URL https://arxiv.org/abs/2002.05202.
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston.
Retrieval augmentation reduces hallucination in conversation. In Marie-
Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih,
editors, Findings of the Association for Computational Linguistics: EMNLP
2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November,
2021, pages 3784–3803. Association for Computational Linguistics, 2021.
doi: 10.18653/v1/2021.findings-emnlp.320. URL
https://doi.org/10.18653/v1/2021.findings-emnlp.320.
Shamane Siriwardhana, Rivindu Weerasekera, Tharindu Kaluarachchi, Elliott
Wen, Rajib Rana, and Suranga Nanayakkara. Improving the domain adap-
tation of retrieval augmented generation (RAG) models for open domain
question answering. Trans. Assoc. Comput. Linguistics, 11:1–17, 2023. URL
https://transacl.org/ojs/index.php/tacl/article/view/4029.
Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yun-
feng Liu. RoFormer: Enhanced transformer with rotary position embedding.
Neurocomputing, 568:127063, 2024. doi: 10.1016/J.NEUCOM.2023.127063.
URL https://doi.org/10.1016/j.neucom.2023.127063.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi,
Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava,
Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya
Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin
Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, An-
thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kar-
das, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev,
Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Di-
ana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov,
Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizen-
stein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael
Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor,
Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov,
Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien
Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama
2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288,
2023. doi: 10.48550/arXiv.2307.09288. URL
https://doi.org/10.48550/arXiv.2307.09288.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you
need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng
Gao, Ahmed Hassan Awadallah, and Bo Li. Adversarial GLUE: A multi-
task benchmark for robustness evaluation of language models. In Joaquin
Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Informa-
tion Processing Systems Track on Datasets and Benchmarks 1, NeurIPS
Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL
https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/335f5352088d7d9bf74191e006d8e24c-Abstract-round2.html.
Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-Pack:
Packaged resources to advance general Chinese embedding, 2023.
Biao Zhang and Rico Sennrich. Root mean square layer normal-
ization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelz-
imer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, edi-
tors, Advances in Neural Information Processing Systems 32: Annual
Conference on Neural Information Processing Systems 2019, NeurIPS
2019, December 8-14, 2019, Vancouver, BC, Canada, pages 12360–
12371, 2019. URL
https://proceedings.neurips.cc/paper/2019/hash/1e8a19426224ca89e83cef47f1e7f53b-Abstract.html.
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting
Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan
Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the AI ocean: A
survey on hallucination in large language models. CoRR, abs/2309.01219,
2023. doi: 10.48550/arXiv.2309.01219. URL
https://doi.org/10.48550/arXiv.2309.01219.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu,
Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang,
Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench
and Chatbot Arena, 2023.