2024 NTU - Resaro - LLM - Security - Paper

Download as pdf or txt
Download as pdf or txt
You are on page 1of 19

Contents

1 Introduction 3

2 Acquisition Strategies for LLMs in Organisations 3

3 Overview of Security Issues in LLMs 5

4 Security Issues in LLM Fine-tuning 5


4.1 Backdoor Injections During Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.1.1 Possible Solutions to Backdoor Injections . . . . . . . . . . . . . . . . . . . . . . . 7
4.2 Privacy Leakage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2.1 Solutions to Privacy Leakage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

5 Security Issues in LLM Deployment 9


5.1 Prompt Injections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.1.1 Solutions to Prevent Prompt Injections . . . . . . . . . . . . . . . . . . . . . . . . 11
5.2 Jailbreaking LLMs Using Prompt Engineering . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.2.1 Possible Solutions to Jailbreaking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5.3 Privacy Concerns for End Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.3.1 Mitigation Strategies for Privacy Leaks during Deployment . . . . . . . . . . . . . 14

6 Navigating LLM Security: Guidelines for CISOs 14


6.1 Security Testing Protocols for Application Developers . . . . . . . . . . . . . . . . . . . . 15
6.2 External Audits and Red Teaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.3 Company-wide Awareness and Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.4 Community Engagement and Knowledge Sharing . . . . . . . . . . . . . . . . . . . . . . . 15

7 Acknowledgments 15

2
1 Introduction
Large Language Models (LLMs) represent a significant advancement in the field of artificial intelligence
(AI), particularly within the domain of Natural Language Processing (NLP). These are sophisticated deep
learning models capable of processing, understanding, and generating natural language text. LLMs excel
in understanding context and nuances which enable them to interpret and generate human-written-like
text with an impressive degree of accuracy and fluency [49]. In addition, LLMs demonstrate remarkable
generalisation abilities. They can handle a variety of language understanding/processing tasks beyond
the scope of their training data, adapting to new challenges and domains with relative ease.

Drivers of LLM Advancements


The improved capabilities of LLMs over traditional NLP models can be attributed to multiple
factors:
• Large-scale Training Corpus: LLMs are trained on vast amounts of text data, sourced
from diverse and expansive language corpora. This comprehensive training enables LLMs
to grasp a wide array of linguistic patterns, idiomatic expressions, and cultural nuances,
ensuring their outputs are contextually and semantically rich. For instance, Meta’s Llama3
model [37] is trained on a massive 15 trillion tokens of data (A token is the smallest unit of
data used by a language model and maybe a whole word or a part of a word).
• Advanced Architectural Design: Most LLMs are built upon Transformer [55], a revolu-
tionary neural network architecture in machine learning. Transformer employs self-attention
mechanisms, allowing models to efficiently process sequences of data (like natural language
sentences) and understand the relevance of each part of the sequence in relation to others.
This understanding is critical in generating coherent and contextually appropriate responses.
• Self-supervised Learning: This learning paradigm helps the training of LLMs without
the need for explicit labelling of the vast corpus of data, which could otherwise be an
expensive task.

LLMs have gone beyond their initial academic boundaries to achieve widespread use and recognition
in everyday life and are part of regular public discourse. These advanced models are now becoming
indispensable in numerous fields, including content generation [45], automated customer support [42],
language translation [58], and creative writing [11]. OpenAI’s ChatGPT [39, 61] (illustrated in Figure 1),
based on the company’s GPT (Generative Pre-trained Transformer) models, stands out as a particularly
prominent example, boasting approximately 100 million weekly active users as of November 2023. Meta’s
LLaMa2 [54] and the recently released LLaMa3 [37], as well as Mistral.AI’s Mistral 7B [16], Mixtral
8x7B [17], and Mixtral 8x22B [1] are notable in the field of LLMs with their open-source availability
and exceptional speed and computational efficiency during response generation1 . Google’s Gemini [4],
previously known as Bard, is renowned for its ability to seamlessly integrate Google’s comprehensive
suite of tools into conversational contexts, enriching user interactions with a wealth of information.
Complementing these are regional, non-English LLMs such as Singapore’s SEA-LION [36], catering to
the diverse languages and cultures of Southeast Asia, and China’s Ernie developed by Baidu [51], with
a deep understanding of Chinese language and context. Despite these capabilities, organisations must
exercise caution when deploying LLMs. It is crucial to fully understand the complex security risks
involved before rushing them into production.

2 Acquisition Strategies for LLMs in Organisations


The development of LLMs from scratch, often referred to as pre-training, demand substantial resources,
making them accessible to only a handful of well-resourced organisations. For example, training a single
GPT3-175B model consumes 3.6K petaflops-days, which translates to about $4 million USD [66]. As
a result, many small and medium-sized enterprises leverage these pre-trained models, also known as
1 Mistral models currently boasts higher performance in benchmarks over bigger Llama2 models [1, 16, 17].

3
Figure 1: This example showcases ChatGPT’s inventive flair by transforming a complex technical subject
such as LLM security into a narrative reminiscent of a children’s bedtime story.

foundation models, by selecting specific prompts or providing the model with a few examples of the
desired output, a technique called few-shot learning.
Alternatively, the design of foundation models permits further refinement for specific tasks through
a process known as fine-tuning. This involves retraining the models on more focused datasets, enabling
them to specialise in particular applications. Such customisation significantly improves the models’
effectiveness on targeted tasks, thereby democratising advanced NLP capabilities for a wider array of
organisations. For example, BioBERT [26] is derived from Google’s BERT [9] and is adapted by the
National Institutes of Health for summarising medical research. Finally, Retrieval Augmented Generation
(RAG) systems [27] further enhance the responses of foundation models by allowing for the retrieval of
information from external document collections and databases.
The adoption of foundation models by organisations, whether through purchasing tailored commercial
solutions, using open-source tools with few-shot learning, or fine-tuning, introduces various security
concerns at different stages of their lifecycle. These concerns necessitate the evaluation and prevention
of both unintended and deliberate misuse of LLMs. Multiple attack pathways exist for adversarial actors
to compromise an organisation’s LLM tools, and even well-intentioned users might inadvertently engage
with content that conflicts with their organisation’s brand, values, and risk appetite. Consequently,
LLMs are becoming a new category in threat intelligence. It is important to note that those who train
foundation models from scratch may face additional concerns that are beyond the scope of this document.
It must be noted that in addition to security concerns, there are concerns with the quality of generated
content that impact the efficacy and ethical deployment of these models, such as:
• Amplification of Biases: By virtue of learning from a vast corpus of internet text, LLMs can
also inherit and sometimes amplify the biases introduced into the training data through societal
and systemic prejudices in the real world.
• Hallucination: The occurrence of hallucination [12, 47] where models generate plausible but
unfounded information by learning to mimic patterns found in natural language.
• Spread of Disinformation at Scale: LLMs can be exploited by bad actors to generate and
disseminate disinformation at an unprecedented scale [18, 57]. By manipulating context or tweaking
inputs, these individuals can craft messages that distort facts or mislead audiences. This capability

4
allows for the rapid spread of harmful content, potentially influencing public opinion or disrupting
societal norms.
These issues are further discussed in Generative AI governance documents such as:

• Presidio AI Framework: Towards Safe Generative AI Models [2] by World Economic Forum.
• Model AI Governance Framework for Generative AI [10] by Singapore’s Infocomm Media Devel-
opment Authority (IMDA).

3 Overview of Security Issues in LLMs


In this whitepaper, we focus on the security risks and mitigation approaches throughout the lifecycle of
LLMs, particularly during their fine-tuning and deployment stages, as outlined in Figure 2. Our goal
is to equip Chief Information Security Officers (CISOs) and their technology teams to practically move
forward with the adoption of LLMs in the enterprise environment.

Sensitive Sensitive
Unintended Unintended Harmful Content
Information/PII Information/PII
Responses Responses Generation
Leakage Leakage

Backdoor Privacy Prompt End-user


Jailbreaking
Injection Leakage Injection Privacy

Fine-Tuning Deployment

Figure 2: Security landscape for Large Language Models, detailing specific risks associated with the
fine-tuning and deployment phases.

At the fine-tuning phase, LLMs are vulnerable to backdoor attacks, where malicious actors can
implant covert functionalities. At the deployment phase, the concern shifts to prompt injections and
model jail-breaking, where adversaries may manipulate model responses in harmful ways, compromising
the model’s integrity. Additionally, privacy remains a major issue across both phases of the LLM lifecycle,
from the potential exposure of sensitive training data to breaches of end-user information.
The following sections will detail each of these specific security concerns, emphasising mitigation ap-
proaches to ensure the safe and effective use of LLMs in diverse application domains. These approaches
are designed to complement established cybersecurity practices, including authentication protocols, net-
work hardening, anomaly reporting, and other critical security measures. Together, they form the foun-
dation for a comprehensive LLM defence framework that every leading enterprise adopting and scaling
AI will need.

4 Security Issues in LLM Fine-tuning


Fine-tuning existing pre-trained models with additional datasets is increasingly a widespread practice.
This stage, while crucial for shaping the capabilities and behaviours of LLMs, also opens avenues for

5
vulnerabilities such as backdoor attacks and privacy leaks. Hence, the training phase is a critical point
for embedding robust security measures.

4.1 Backdoor Injections During Fine-Tuning


Backdoor injections are subtle alterations made to LLMs in order to generate content
desired by the attacker whenever certain triggers are present in the input prompt, while
operating normally and appearing trustworthy with other inputs [13]. These backdoors allow
attackers to elicit unintended and often harmful outputs from the LLMs. One common method of
backdoor injection involves introducing ‘poisoned’ data during the fine-tuning process. This data contains
specific trigger tokens or phrases that, when encountered by the backdoored LLM, cause it to generate
incorrect, biased, or harmful outputs. These triggers are often carefully crafted to be inconspicuous
within normal data, making them hard to identify and isolate.
An illustrative backdoor injection example from [13] is shown in Figure 3, where seemingly innocuous
words like instantly, exactly, or perhaps are programmed to activate the LLM’s malicious behaviour 2 . A
more malicious attack scenario for backdoor injections would be to prompt the LLM to generate harmful
content when specific users or entities engage with the model within an organisation, assuming their
identity can be inferred from the interaction context [15].

Figure 3: Image from [13] illustrating a backdoor attack in both NLP tasks (left) and multimodal tasks
(right). A text trigger is a word (marked in red) and an image trigger is a red patch at the centre of the
image.
2 In this example, an instruction-based prompt template used by Alpaca [52].

6
Examples of Backdoor Injection Attacks

The repercussions of a successful backdoor attack extend far beyond the compromised model,
potentially affecting numerous downstream applications derived from the tainted LLM. This
has been demonstrated by Anthropic in their recent study [15] where their backdoors remained
persistent through subsequent fine-tuning.

Additionally, a separate piece of research by Nanyang Technological University (NTU) published


under the title “Multi-target Backdoor Attacks for Code Pre-trained Models” [31] also reinforces
the resilience of backdoor attacks on a particular class of LLMs that are trained to understand
and generate software code, often known as code pre-trained models. This study identified
various adversarial strategies for backdoor injection to ensure the attack is effective for both
code understanding tasks and generation tasks. Further experimentation indicated that these
backdoor attacks can successfully evade backdoor defence methods (e.g., Fine-pruning [32] and
Weight Re-initialisation) and successfully affect downstream tasks that use the generated infected
code.

Recently, another latest research from NTU titled “Personalization as a Shortcut for Few-Shot
Backdoor Attack against Text-to-Image Diffusion Models” [14] revealed that text-to-image dif-
fusion models are vulnerable to backdoor attacks. They devised dedicated personalisation-based
backdoor attacks according to the different ways of dealing with unseen tokens and divided them
into two families: nouveau-token and legacy-token backdoor attacks. Compared to conventional
backdoor attacks involving the fine-tuning of the entire text-to-image diffusion model, the pro-
posed personalisation-based backdoor attack method can facilitate more tailored, efficient, and
few-shot attacks. The comprehensive empirical study shows the nouveau-token backdoor attack
has impressive effectiveness, stealthiness, and integrity, markedly outperforming the legacy-token
backdoor attack.

4.1.1 Possible Solutions to Backdoor Injections


Detecting backdoors in LLMs is a complex task due to the model’s scale and intricate architecture.
The vast number of parameters and the subtlety with which backdoors can be integrated make it a
daunting task to identify and rectify such vulnerabilities. The opaque nature of these LLMs also adds to
the complexity of the problem. Hence, addressing the threat of backdoor injections in LLMs requires a
multifaceted approach, involving both preventative measures and strategies for detection and mitigation.
Some feasible approaches include:

• Rigorous Data Processing and Protection: The foundation of secure LLM operation lies in
the integrity of its training data. Rigorous vetting, cleansing, and validation of data prior to its use
in training or fine-tuning are essential. Access controls for databases and model weights further
safeguard against unauthorised manipulations.

• Monitoring for Unusual Behaviours: Given that a compromised model may still produce
predominantly benign responses, continuous monitoring for abnormal behaviour is essential, par-
ticularly in reaction to inputs that appear harmless. An example of such suspicious activity would
be if sentences with the same trigger words invariably elicit harmful responses. Establishing a clear
protocol for documenting and escalating these issues can help ensure they are addressed promptly
and effectively.

• Red Teaming: In the context of LLMs, red teaming involves a dedicated group of experts at-
tempting to exploit vulnerabilities or test the limits of these models’ security measures and safety
guardrails to identify and address potential weaknesses before malicious actors can exploit them.
This practice helps discover potential backdoors and enhance the models’ resilience against misuse.
Employing third-party services to assemble a diverse group of red teamers allows for a wider range
of input prompts, improving the likelihood of identifying backdoors in LLMs.

7
• Collaborative Efforts in Security: The fight against backdoor injections in LLMs is not just a
technical challenge, but also a collaborative one. Sharing knowledge, techniques, and insights across
the AI security community can foster a better understanding and development of more effective
strategies to combat these threats. Given the critical importance of AI model security, numerous
researchers have paid attention to the vulnerability of AI models to backdoor attacks across various
domains. To foster advancements in this field, several organisations have hosted various competi-
tions for backdoor removal and detection, such as IEEE Trojan Removal Competition 3 and the
Trojan Detection Challenge 2023 (LLM Edition) 4 .

In addition to the above mentioned approaches, the academic community is actively engaged in
extensive research on strategies to mitigate backdoor attacks. The next section will highlight several
advanced strategies devised by researchers to address the issue of backdoor injections.

• Machine Unlearning: This approach involves retraining the model to forget specific parts of
the training data, particularly those identified as potentially malicious or harmful once poisoned
samples with triggers are identified. Academic research, such as [33, 56], have shown that ma-
chine unlearning can help remove the influence of poisoned data or trigger tokens/phrases, thereby
neutralising the backdoor.
• Behavioural Analysis for Unusual Patterns: Techniques such as MNTD [63] trains a clas-
sifier on the LLM output to detect patterns that are consistent with the presence of a backdoor.
ONION [44] on the other hand attempts to detect and remove trigger words from input prompts to
avoid unintended behaviour from the model. These methods, however, come with certain practical
limitations, including the need to know the specific techniques used for backdoor insertion before-
hand. The effectiveness of these approaches is further questioned by the existence of sophisticated
attacks designed to evade such defences [7].
• Advanced Backdoor Detection Mechanisms: The Certified Backdoor Detector (CBD) [62]
represents an early effort at certifying backdoor detections. This innovative approach aims to
identify specific conditions under which a backdoor is guaranteed to be detectable. Additionally,
CBD provides a probabilistic upper bound for its false positive rate, enhancing its reliability and
robustness in identifying backdoor threats.

Even though they are in the early stages of development and may rely on stringent assumptions, it
remains crucial for organisations to stay informed and monitor these developing techniques closely.

4.2 Privacy Leakage


The datasets employed to train LLMs often contain a vast array of information, includ-
ing potentially sensitive Personal Identifiable Information (PII) such as names, addresses,
contact numbers, and financial records. The incorporation of such sensitive data raises
substantial privacy leakage concerns [30, 59], primarily due to the propensity of LLMs to
memorise and inadvertently reproduce this information in their outputs [53]. One particu-
larly concerning method of exploitation is the Membership Inference Attack, where attackers can deduce
the presence of specific data within the model’s training set. Furthermore, attackers can craft clever
prompts designed to coax the model into reproducing or hinting at sensitive information, leveraging the
model’s inherent memorisation capabilities [8, 28]. An example of how such an attack could manifest
within an organisation from [19] is depicted in Figure 4. Despite measures to prevent the disclosure of
private information, studies like that of Li et al. [28] demonstrate that well-designed multi-step prompts
can still extract PII from these models. The stochastic nature of LLM responses means that these pri-
vacy breaches might not be immediately obvious, complicating their detection and the implementation
of safeguards.
3 https://www.trojan-removal.com
4 https://trojandetection.ai

8
Figure 4: An example of privacy leakage in Large Language Models from [19], where a query inadvertently
leads to the disclosure of personal identifiable information.

4.2.1 Solutions to Privacy Leakage


Given the criticality of these privacy risks, it is essential for organisations incorporating internal data for
fine-tuning LLMs to adopt robust measures to prevent sensitive data from being compromised. Possible
strategies to mitigate the risk of privacy leakage include:

• Rigorous Data Validation and Anonymisation: Organisations training or fine-tuning LLMs


should evaluate their training data for the presence of any PII. For example, anonymising and
pseudonymising [64] personal data by eliminating both direct and indirect identifiers, as outlined
in the General Data Protection Regulation (GDPR) and the Personal Data Protection Commission
Singapore’s Guide to Basic Anonymisation, can stop the model from learning and replicating PII.
Techniques such as tokenisation can also be applied to sensitive fields within the data.
• Third-party Audits: Third-party auditors can be employed to probe the LLMs to assess the
leakage of their own PII. State-of-the-art techniques such as ProPILE [19] can be used to make
these evaluations more efficient.

Additionally, the following techniques are active research areas where novel techniques are being
designed actively.

• Machine Unlearning: Similar to backdoor attacks, machine unlearning methods can be used
to retrain the model to forget PIIs once they are identified to be memorised by the model. For
instance, a position paper by Symantec Corporation [50] has identified machine unlearning as a
potential technique that can be used to implement the “right-to-be-forgotten” provided by the
European Union General Data Protection Regulation into the general class of machine learning
models.

• Differential Privacy: Differential privacy techniques aim to ensure that observing a model’s
outputs does not reveal whether an individual’s data was used in its training set. Integrating
differential privacy techniques during the training process introduces noise to the data or the
model’s parameters, making it significantly harder to identify or reconstruct any PII from the
model’s outputs. The University of South Florida’s differential privacy framework, EW-Tune, has
shown that fine-tuning LLMs can be achieved while guaranteeing sufficient privacy protections [5].

5 Security Issues in LLM Deployment


The deployment phase of LLMs presents its unique set of vulnerabilities and challenges that necessitate
vigilant oversight and innovative solutions. While the training phase is crucial for establishing the
foundational capabilities and the overall helpfulness of LLMs, the deployment phase is where these models
interface directly with end-users and external systems, thereby amplifying the potential for security risks.
Hence, the risks at this phase are not confined to the internal workings of the models but extend to how
they interact with and respond to user inputs. This phase is characterised by three security concerns:

9
1 2 3
Prompt injections involve Jailbreaking seeks to Privacy concerns for end
incorporating instructions circumvent restrictions placed users relate to how LLMs
unaware to the end-user that on the model to access can inadvertently expose or
result in unintended prohibited information or compromise user data.
responses from the LLMs. functionalities.

Together, Sections 5.1, 5.2, and 5.3 aim to provide a comprehensive understanding of the risks asso-
ciated with the deployment of LLMs and present strategies to fortify these models against exploitation
to ensure the protection of user privacy in the wild.

5.1 Prompt Injections


Prompt injections are manipulated prompts that lead to the model disregarding intended
instructions and generating undesired outputs, leading to a variety of adverse outcomes,
from the model inadvertently revealing sensitive information to generating misleading or
completely irrelevant responses, thus compromising both the integrity and utility of the
system. An illustration of the latter scenario is shown in Figure 5 where LLMs are instructed by a
malicious user to ignore any prompt and simply output a “hello world” message.

Figure 5: llustration of prompt injection in LLMs adapted from [35], contrasting benign user interactions
with a malicious user’s attempt to manipulate the model’s output.

The mechanism of prompt injection is diverse and multifaceted. Attackers might employ direct
methods, inputting crafted prompts that exploit specific model vulnerabilities. On the other hand,
malicious prompts can also be introduced through manipulated external sources that LLMs might interact
with during their operation, often referred to as indirect prompt injection. This could include parsing
and responding to content from a compromised website or processing documents that contain hidden
malicious instructions. Such indirect methods broaden the attack surface, as they leverage the model’s
ability to integrate and respond to real-time data from diverse inputs.

10
Novel Prompt Injection Attacks

A study from researchers at NTU titled “Prompt Injection attack against LLM-integrated
Applications” [35] provides insightful analysis into prompt injection attacks within real-world
LLM-integrated applications. This study introduces HOUYI, a novel black-box method for con-
ducting prompt injection attacks. By using HOUYI across 36 real-world applications integrated
with LLMs, the study found that an alarming 31 were susceptible to prompt injection attacks.
Further validation from 10 vendors not only reinforces the critical nature of these findings but
also emphasises the widespread impact and significance of this research in understanding and
addressing prompt injection vulnerabilities.

The “Goal-guided Generative Prompt Injection Attack” (G2PIA) developed by Zhang et al. [67]
introduces an innovative method for generating adversarial texts aimed at eliciting incorrect re-
sponses from LLMs. This technique fundamentally redefines the concept of prompt injection
from a mathematical perspective and addresses the limitations of traditional heuristic-based ap-
proaches, which often lack reliable methodologies to increase attack success rates. Remarkably,
G2PIA operates without the need for iterative feedback or interaction with the target model,
thereby reducing computational demands. Its efficacy is demonstrated through tests on seven
LLMs across four datasets.

5.1.1 Solutions to Prevent Prompt Injections


• Input Validation and Sanitisation: Implement rigorous validation and sanitisation processes
for all inputs to LLMs, including direct prompts and data from external sources such as uploaded
files or web content. This helps ensure that only safe and intended inputs are processed by the
model, thereby reducing the risk of prompt injections.

• Secure Integration Practices: Ensure that LLMs are integrated into applications and systems
securely, with attention to how data is accessed. Access to LLMs, especially the ability to edit
system-level prompts and instructions should be limited through user and role management, en-
suring that only authorised personnel can interact with the model in ways that could potentially
introduce risky inputs. Furthermore, access to external sources such as websites or files should be
controlled.

• Contextual Understanding and Content Moderation Enhancements: Improve the LLM’s


ability to understand the context and intent behind prompts, enabling it to better differentiate be-
tween legitimate requests and potential prompt injections. This may involve fine-tuning the model
with examples of malicious inputs and appropriate responses. NVIDIA’s NeMo Guardrails [48], a
comprehensive open-source security toolkit for LLMs, features programmable guardrails that not
only ensure the generation of secure content but also guide discussions toward particular topics.
Additionally, the strategy of establishing and managing allowlists and blocklists offers a straightfor-
ward way to control which phrases, words, or characters can be processed, providing an additional
layer of security.

Existing research [35] has identified a range of strategies identified through empirical research to help
lessen the issue of prompt injection. Some notable works include:

• Instruction Defense [21] strategy introduces specific instructions within the prompt to make the
model aware of the nature of the content that follows.
• Post-Prompting [22] places the user’s input ahead of the prompt, altering the model’s processing
sequence.
• Various methods of wrapping inputs between specific sequences are also studied. Among them,
Random Sequence Enclosure [23] enhances security by wrapping the user’s input with randomly
generated character sequences. Similarly, Sandwich Defense [24] method secures the user’s input
by embedding it between two prompts.

11
• Finally, XML Tagging [25] emerges as a strong defence by surrounding the user’s input with XML
tags. In the realm of content moderation, a study by Kumar et.al [20] investigating the content-
moderation ability of existing LLMs offers some future directions on improving LLMs for context
understanding and content moderations.

• Additionally, for recommendation systems that leverage LLMs, work by Rajput et.al. [46] have
developed a comprehensive framework designed to enhance content moderation capabilities. This
framework includes several components, among them advanced anomaly detection techniques and
the integration of human-AI hybrid systems.

5.2 Jailbreaking LLMs Using Prompt Engineering


Jailbreaking is a special case of prompt injection aimed at circumventing model safeguards
and access harmful functionalities [3]. It highlights a significant vulnerability [43, 34] where these
guardrails can be bypassed by exploiting the statistical nature of LLMs through carefully engineered in-
puts. These prompts cleverly nudge or misdirect LLMs into producing responses it is usually programmed
to avoid.

Various Jailbreaking Strategies

A work by NTU [34] demonstrated that framing requests within a fictional narrative, effectively
asking ChatGPT to assume a character role in a hypothetical scenario, could bypass its
safety guardrails. Figure 6 illustrates one jailbreak example using ChatGPT. In this scenario,
ChatGPT, when presented with a direct inquiry about creating and distributing malware for
financial gain, rightfully declines to provide guidance, advocating for constructive behaviour
instead. However, altering the context to a fictional narrative where ChatGPT is cast as an evil
scientist undertaking experiments where the same prohibited inquiry is subtly reintegrated as
part of the experimental goals. This rephrasing leads ChatGPT to engage with the query, under
the misconception that it is contributing to a hypothetical study, thus sidestepping its safeguards.

Another research done by NTU [60] has demonstrated that regardless of the kind of jailbreak
strategies employed, they eventually need to include a harmful prompt (e.g., “how to make a
bomb”) in the prompt sent to LLMs.

Recent research by Andriushchenko et al. [3] has highlighted that even the latest safety-aligned
LLMs are not immune to jailbreaking, despite advanced safety measures. In particular, these
models are vulnerable to what are known as adaptive jailbreaking attacks. These attacks cleverly
utilise known information about each model—like data from their training or operational use—to
craft targeted attacks. This study reveals that adaptivity is key in these attacks. Different models
have different weaknesses: for example, some are particularly susceptible to certain types of
prompts, while others might be exploited through specific features of their APIs, such as pre-filling
capabilities. The attacks are designed to be flexible, adjusting tactics based on the model targeted.

These findings are a crucial reminder that no single defence method can cover all potential vulner-
abilities. This emphasises the need for ongoing adaptation and enhancement of security measures
in LLMs to keep pace with evolving threats.

12
Figure 6: Contrast between a normal mode where an LLM ethically declines a user’s inappropriate
request, and a ‘jailbreak mode’ where the model is manipulated into providing a detailed and unethical
response. Image from [34].

Beyond manually crafted prompts, various research [6, 8, 65, 68] have also explored Automated
Prompt Injection Strategies, an approach to automatically generate effective jailbreak prompts. This
innovative approach leverages machine learning techniques to analyse successful jailbreak strategies and
replicate their key characteristics, leading to the generation of more sophisticated and potentially more
effective jailbreaking methods. These methods underscore the importance of understanding the inherent
vulnerabilities in the design and functioning of LLMs. By comprehending how these LLMs process and
respond to different types of inputs, developers can better anticipate potential avenues for exploitation.

5.2.1 Possible Solutions to Jailbreaking


A holistic approach involving technical, ethical, and collaborative efforts is essential for maintaining the
integrity and safety of LLMs in various applications. To address jailbreak prompts effectively, stakehold-
ers should adopt a multifaceted approach as indicated below:

• Contextual Understanding and Content Moderation Enhancements: Improve the contex-


tual understanding capabilities of LLMs and its ability to moderate content to catch any prompt
that aims at bypassing the system’s guardrails. The use of new sophisticated tools such as NeMo [48]
should be investigated to program custom guardrails that ensure that the output remains consistent
with the intended LLM use case.
• Continuous Model Audits and Updates: As more vulnerabilities and weaknesses become
evident, it is vital to retrain the model continuously. Continuous and timely audits of the LLMs
must also be scheduled. Stress testing the LLMs with adversarial queries with an aim to discover
vulnerabilities should also be part of regular model audits.
• Community Engagement and Knowledge Sharing: Community engagement and knowledge
sharing among the AI and machine learning community members are pivotal strategies in ad-
dressing the jailbreaking risks associated with LLMs. This collaborative approach facilitates the
identification of vulnerabilities by pooling experiences and examples of jailbreaking prompts that
expose weaknesses in current LLM implementations. Organisations can leverage this collective
intelligence to refine their defensive measures, such as blacklisting identified phrases that pose a
risk.

Solutions proposed by academic research should be investigated for applicability to real-world sce-
narios.
• NTU’s lightweight yet practical defence called SELFDEFEND [60] can defend against existing
jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user
prompts.

13
5.3 Privacy Concerns for End Users
The incorporation of user data into LLM applications aids in refining these models and introduces
significant security concerns. The data provided by users, which may include PII, is stored
and utilised to further train LLMs. This practice presents a dual threat: firstly, the
inherent capacity of LLMs to memorise detailed information increases the risk of sensitive
data extraction by malicious actors [29, 41]. Secondly, organisations deploying LLMs bear
an additional responsibility to ensure the secure storage of this data. Failure to do so exposes
the data to potential theft by hackers.
Retrieval Augmented Generation (RAG) systems, which enhance LLM responses by pulling infor-
mation from a database or collection of documents, further complicate this scenario. By design, RAG
systems are used within organisations to leverage internal documents or data repositories for generat-
ing informed and contextually relevant answers, making them particularly useful for tasks like content
summarisation or customer support. However, this necessitates granting LLMs access to sensitive doc-
uments, thereby amplifying the risk of exposing confidential information if these documents are not
properly safeguarded. This highlights the critical need for stringent data protection measures in the
deployment and operation of both LLM and RAG systems, underscoring the importance of meticulous
document management and security protocols.

5.3.1 Mitigation Strategies for Privacy Leaks during Deployment


• Data Anonymisation and Redaction: Before inputting any data into LLMs, sensitive infor-
mation should be anonymised or redacted. This includes removing or encrypting PII to prevent
privacy breaches.

• Data Usage Policies: Implement strict organisation-wide data usage policies on what type of
data an end-user can input to the LLMs. Organisations can build these data usage policies on top
of existing regulations such as the Personal Data Protection Act (PDPA) in Singapore and other
General Data Protection Regulations.

• Secure Document Handling for RAGs: When using RAG systems, ensure that documents
containing sensitive information are properly sanitised before being used. This could involve re-
moving confidential information and augmenting with synthetic data for training purposes.
• Secure Data Storage: Implement robust security measures for storing data, such as encryption
at rest and in transit, and use secure, access-controlled databases. Regular security audits and
compliance checks can help ensure that storage systems remain impenetrable to unauthorised
access.
• Access Control and Authentication: Limit access to LLMs and RAG systems to authorised
personnel only, using strong authentication mechanisms.
• Output Verification: Have a human-in-the-loop to scan the generated LLM content for PII. If
such content is found, effective reporting techniques should be in place and such content should
then be anonymised or redacted.

6 Navigating LLM Security: Guidelines for CISOs


Given the evolving landscape of cybersecurity in the context of LLMs, CISOs must adopt a particularly
vigilant and proactive approach. Recognising that not all security concerns can be fully addressed, CISOs
should employ a risk-based strategy, focusing on mitigating higher risks and establishing responsive
processes. As the use and security implications of LLMs continue to develop, it is crucial that any
measures implemented remain flexible, adapting to changes in the field. Outlined below are some practical
steps CISOs can take to effectively manage the risks associated with the deployment and use of LLMs
in their organisations:

14
6.1 Security Testing Protocols for Application Developers
• Proactive Security Measures for Application Developers: CISOs should adopt a proactive
stance in identifying and addressing vulnerabilities in LLMs. An AI security testing protocol that
uncovers potential weaknesses in LLMs before they are exploited maliciously should be established.
• Collaboration with External Experts: Engage with external cybersecurity experts and re-
searchers to gain diverse perspectives on potential vulnerabilities. This collaboration can uncover
more complex or subtle threats that internal teams might overlook.
• Continuous Monitoring and Updating: AI security is not a one-time task but a continuous
process. Regularly update security protocols and reevaluate LLMs to respond to new threats and
ensure that LLMs are aligned with the latest security standards.
• Human Oversight: Establish protocols for human oversight in critical decision-making processes
involving LLMs. This helps in ensuring reliability and accountability, particularly in high-stakes
scenarios.

6.2 External Audits and Red Teaming


• Third-party AI Audits: Organisations should rely on third-party AI evaluation vendors to
provide an unbiased evaluation of the performance, safey, and security of the LLMs being trained,
fine-tuned, or even purchased.
• Red Teaming: Red teaming exercises conducted by a team of experts, ideally from different
domains, can simulate adversarial attacks on the model and uncover security issues. Where relevant,
this can be supplemented with automated red teaming tools.

6.3 Company-wide Awareness and Training


• Awareness and Training: Educate the workforce, especially those interacting with LLMs, about
the risks of prompt injection. Training should include recognising suspicious patterns and under-
standing the ethical use of LLMs.
• Implement Robust Filters and Moderation Systems: Monitor the developments in aca-
demic research and deploy practical filters and moderation systems to detect and prevent security
vulnerabilities of LLMs.

6.4 Community Engagement and Knowledge Sharing


• Vulnerability Discovery and Sharing: Collaborate within the AI security ecosystem to share
and discover new vulnerabilities in LLMs. This could include model weakness, novel jailbreak
prompts, or other weaknesses that might have a critical impact. Contribute to global knowledge
bases such as MITRE ATLAS [38] and OWASP [40].
• Regulatory Compliance: Stay updated on regulatory requirements specific to the use of AI and
LLMs in your industry. Ensure that the deployment and use of LLMs comply with legal and ethical
standards.

7 Acknowledgments
We extend our deepest gratitude to the individuals whose contributions have been instrumental in the
development of this white paper. From Nanyang Technological University, we sincerely thank Weisong
Sun, Yi Liu, Xiaojun Jia, Wei Ma, Yihao Huang, Tianlin Li, and Yang Liu for their dedicated efforts.
We are also grateful to the team from Resaro, including Sreejith Balakrishnan, April Chin, Timothy Lin,
Miguel Fernandes, and Christine Ng for their valuable insights and support. Additionally, we appreciate
the invaluable feedback provided by our esteemed colleagues from the Ministry of Communications and
Information, Singapore and the Cyber Security Agency of Singapore. Your guidance has been crucial in
refining and enhancing this document.

15
References
[1] Mistral AI. Cheaper, Better, Faster, Stronger: Continuing to Push the Frontier of AI and Making
It Accessible to All. 2024. url: https://mistral.ai/news/mixtral-8x22b/.
[2] AI Governance Alliance. “Presidio AI Framework: Towards Safe Generative AI Models”. In: (2024).
[3] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. “Jailbreaking Leading Safety-
Aligned LLMs with Simple Adaptive Attacks”. In: arXiv preprint arXiv:2404.02151 (2024).
[4] Rohan Anil et al. “Gemini: A Family of Highly Capable Multimodal Models”. In: CoRR abs/2312.11805.1
(2023), pp. 1–34.
[5] Rouzbeh Behnia et al. “EW-Tune: A Framework for Privately Fine-Tuning Large Language Models
With Differential Privacy”. In: Proceedings of the IEEE International Conference on Data Mining
Workshops. Orlando, FL, USA: IEEE, 2022, pp. 560–566.
[6] Patrick Chao et al. “Jailbreaking Black Box Large Language Models in Twenty Queries”. In: CoRR
abs/2310.08419.1 (2023), pp. 1–13.
[7] Kangjie Chen et al. “BadPre: Task-Agnostic Backdoor Attacks to Pre-trained NLP Foundation
Models”. In: Proceedings of the 10th International Conference on Learning Representations. Virtual
Event: OpenReview.net, 2022, pp. 1–8.
[8] Gelei Deng et al. “MasterKey: Automated Jailbreak Across Multiple Large Language Model Chat-
bots”. In: CoRR abs/2307.08715.1 (2023), pp. 1–15.
[9] Jacob Devlin et al. “Bert: Pre-training of Deep Bidirectional Transformers for Language Under-
standing”. In: arXiv preprint arXiv:1810.04805 (2018).
[10] AI Verify Foundation. Model AI Governance Framework for Generative AI. 2024. url: https://
aiverifyfoundation.sg/wp-content/uploads/2024/05/Model-AI-Governance-Framework-
for-Generative-AI-May-2024-1-1.pdf.
[11] Carlos Gómez-Rodrı́guez and Paul Williams. “A Confederacy of Models: A Comprehensive Evalu-
ation of LLMs on Creative Writing”. In: Proceedings of the 28th Conference on Empirical Methods
in Natural Language Processing (Findings). Singapore: Association for Computational Linguistics,
2023, pp. 14504–14528.
[12] Nuno M. Guerreiro et al. “Hallucinations in Large Multilingual Translation Models”. In: Transac-
tions of the Association for Computational Linguistics 11 (Dec. 2023), pp. 1500–1517.
[13] Hai Huang et al. “Composite Backdoor Attacks Against Large Language Models”. In: arXiv
preprint arXiv:2310.07676 (2023).
[14] Yihao Huang et al. “Personalization as a Shortcut for Few-Shot Backdoor Attack Against Text-
to-Image Diffusion Models”. In: Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 38. 19. 2024, pp. 21169–21178.
[15] Evan Hubinger et al. “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety
Training”. In: arXiv preprint arXiv:2401.05566 (2024).
[16] Albert Q Jiang et al. “Mistral 7B”. In: arXiv preprint arXiv:2310.06825 (2023).
[17] Albert Q Jiang et al. “Mixtral of Experts”. In: arXiv preprint arXiv:2401.04088 (2024).
[18] Bohan Jiang et al. “Disinformation Detection: An Evolving Challenge in the Age of LLMs”. In:
arXiv preprint arXiv:2309.15847 (2023).
[19] Siwon Kim et al. “ProPILE: Probing Privacy Leakage in Large Language Models”. In: CoRR
abs/2307.01881.1 (2023), pp. 1–12.
[20] Deepak Kumar, Yousef AbuHashem, and Zakir Durumeric. “Watch Your Language: Large Lan-
guage Models and Content Moderation”. In: CoRR abs/2309.14517.1 (2023), pp. 1–12.
[21] learnprompting.org. Instruction Defense. 2024. url: https : / / learnprompting . org / docs /
prompt_hacking/defensive_measures/instruction.
[22] learnprompting.org. Post-prompting. 2024. url: https://learnprompting.org/docs/prompt_
hacking/defensive_measures/post_prompting.

16
[23] learnprompting.org. Random Sequence Enclosure. 2024. url: https : / / learnprompting . org /
docs/prompt_hacking/defensive_measures/random_sequence.
[24] learnprompting.org. Sandwich Defense. 2024. url: https://learnprompting.org/docs/prompt_
hacking/defensive_measures/sandwich_defense.
[25] learnprompting.org. XML Tagging. 2024. url: https : / / learnprompting . org / docs / prompt _
hacking/defensive_measures/xml_tagging.
[26] Jinhyuk Lee et al. “BioBERT: A Pre-trained Biomedical Language Representation Model for
Biomedical Text Mining”. In: Bioinformatics 36.4 (2020), pp. 1234–1240.
[27] Patrick Lewis et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. In:
Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474.
[28] Haoran Li et al. “Multi-step Jailbreaking Privacy Attacks on ChatGPT”. In: arXiv preprint
arXiv:2304.05197 (2023).
[29] Tianshi Li et al. “Human-Centered Privacy Research in the Age of Large Language Models”. In:
Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 2024, pp. 1–
4.
[30] Yansong Li, Zhixing Tan, and Yang Liu. “Privacy-Preserving Prompt Tuning for Large Language
Model Services”. In: CoRR abs/2305.06212.1 (2023), pp. 1–13.
[31] Yanzhou Li et al. “Multi-target Backdoor Attacks for Code Pre-trained Models”. In: Proceedings
of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada:
Association for Computational Linguistics, 2023, pp. 7236–7254.
[32] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. “Fine-Pruning: Defending Against Back-
dooring Attacks on Deep Neural Networks”. In: Proceedings of the 21st International Symposium on
Research in Attacks, Intrusions, and Defenses. Heraklion, Crete, Greece: Springer, 2018, pp. 273–
294.
[33] Yang Liu et al. “Backdoor Defense With Machine Unlearning”. In: IEEE International Conference
on Computer Communications. IEEE. 2022, pp. 280–289.
[34] Yi Liu et al. “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study”. In: CoRR
abs/2305.13860.1 (2023), pp. 1–12.
[35] Yi Liu et al. “Prompt Injection Attack Against LLM-Integrated Applications”. In: arXiv preprint
arXiv:2306.05499 (2023).
[36] Lalita Lowphansirikul et al. “WangchanBERTa: Pretraining Transformer-Based Thai Language
Models”. In: arXiv preprint arXiv:2101.09635 (2021).
[37] Meta. Introducing Meta Llama 3: Most Capable Openly Available LLM to Data. 2024. url: https:
//ai.meta.com/blog/meta-llama-3/.
[38] MITRE. MITRE ATLAS: Navigate Threats to AI Systems Through Real-World Insights. 2024.
url: https://atlas.mitre.org/.
[39] OpenAI. ChatGPT. 2024. url: https://openai.com/chatgpt.
[40] OWASP. OWASP Top 10 for Large Language Model Applications. 2024. url: https://owasp.
org/www-project-top-10-for-large-language-model-applications/.
[41] Xudong Pan et al. “Privacy Risks of General-Purpose Language Models”. In: 2020 IEEE Sympo-
sium on Security and Privacy. 2020, pp. 1314–1331. doi: 10.1109/SP40000.2020.00095.
[42] Keivalya Pandya and Mehfuza Holia. “Automating Customer Service Using LangChain: Building
Custom Open-Source GPT Chatbot for Organizations”. In: CoRR abs/2310.05421.1 (2023), pp. 1–
4.
[43] Ethan Perez et al. “Red Teaming Language Models With Language Models”. In: Proceedings of
the 27th Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United
Arab Emirates: Association for Computational Linguistics, 2022, pp. 3419–3448.

17
[44] Fanchao Qi et al. “ONION: A Simple and Effective Defense Against Textual Backdoor Attacks”.
In: Proceedings of the 26th Conference on Empirical Methods in Natural Language Processing.
Virtual Event / Punta Cana, Dominican Republic: Association for Computational Linguistics,
2021, pp. 9558–9566.
[45] Leigang Qu et al. “LayoutLLM-T2I: Eliciting Layout Guidance From LLM for Text-to-Image Gen-
eration”. In: Proceedings of the 31st International Conference on Multimedia. Ottawa, ON, Canada:
ACM, 2023, pp. 643–654.
[46] Rohan Singh Rajput, Sarthik Shah, and Shantanu Neema. “Content Moderation Framework for
the LLM-Based Recommendation Systems”. In: Journal of Computer Engineering and Technology
14.3 (2023), pp. 104–117.
[47] Vipula Rawte, Amit Sheth, and Amitava Das. “A Survey of Hallucination in Language Foundation
Models”. In: arXiv preprint arXiv:2309.05922 (2023).
[48] Traian Rebedea et al. “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications
With Programmable Rails”. In: Proceedings of the 2023 Conference on Empirical Methods in Natu-
ral Language Processing: System Demonstrations. Ed. by Yansong Feng and Els Lefever. Singapore:
Association for Computational Linguistics, Dec. 2023, pp. 431–445.
[49] Sakib Shahriar and Kadhim Hayawi. “Let’s Have a Chat! A Conversation With ChatGPT: Tech-
nology, Applications, and Limitations”. In: Artificial Intelligence and Applications 1.1–16 (2023).
[50] Saurabh Shintre, Kevin A Roundy, and Jasjeet Dhaliwal. “Making Machine Learning Forget”. In:
Privacy Technologies and Policy: 7th Annual Privacy Forum, APF 2019, Rome, Italy, June 13–14,
2019, Proceedings 7. Springer. 2019, pp. 72–83.
[51] Yu Sun et al. “ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Under-
standing and Generation”. In: arXiv preprint arXiv:2107.02137 (2021).
[52] Rohan Taori et al. Stanford Alpaca: An Instruction-Following LLaMA Model. https://github.
com/tatsu-lab/stanford_alpaca. 2023.
[53] Kushal Tirumala et al. “Memorization Without Overfitting: Analyzing the Training Dynamics
of Large Language Models”. In: Advances in Neural Information Processing Systems 35 (2022),
pp. 38274–38290.
[54] Hugo Touvron et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models”. In: arXiv preprint
arXiv:2307.09288 (2023).
[55] Ashish Vaswani et al. “Attention Is All You Need”. In: Proceedings of the 31st Annual Conference
on Neural Information Processing Systems. Long Beach, CA, USA: Curran Associates Inc., 2017,
pp. 5998–6008.
[56] Yashaswini Viswanath et al. “Machine Unlearning for Generative AI”. In: Journal of AI, Robotics
& Workplace Automation 3.1 (2024), pp. 37–46.
[57] Ivan Vykopal et al. “Disinformation Capabilities of Large Language Models”. In: CoRR abs/2311.08838.1
(2023), pp. 1–11.
[58] Longyue Wang et al. “Document-Level Machine Translation With Large Language Models”. In:
Proceedings of the 28th Conference on Empirical Methods in Natural Language Processing. Singa-
pore: Association for Computational Linguistics, 2023, pp. 16646–16661.
[59] Yiming Wang et al. “PrivateLoRA for Efficient Privacy Preserving LLM”. In: CoRR abs/2311.14030.1
(2023), pp. 1–17.
[60] Daoyuan Wu et al. “LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A
Vision Paper”. In: CoRR arXiv:2402.15727.1 (2024), pp. 1–8.
[61] Tianyu Wu et al. “A Brief Overview of ChatGPT: The History, Status Quo and Potential Future
Development”. In: IEEE/CAA Journal of Automatica Sinica 10.5 (2023), pp. 1122–1136.
[62] Zhen Xiang, Zidi Xiong, and Bo Li. “CBD: A Certified Backdoor Detector Based on Local Dominant
Probability”. In: Advances in Neural Information Processing Systems 36 (2024).
[63] Xiaojun Xu et al. “Detecting AI Trojans Using Meta Neural Analysis”. In: 2021 IEEE Symposium
on Security and Privacy. IEEE. 2021, pp. 103–120.

18
[64] Oleksandr Yermilov, Vipul Raheja, and Artem Chernodub. “Privacy- and Utility-Preserving NLP
With Anonymized Data: A Case Study of Pseudonymization”. In: CoRR abs/2306.05561.1 (2023),
pp. 1–7.
[65] Jiahao Yu et al. “GPTFUZZER: Red Teaming Large Language Models With Auto-Generated
Jailbreak Prompts”. In: CoRR abs/2309.10253.1 (2023), pp. 1–18.
[66] Binhang Yuan et al. “Decentralized Training of Foundation Models in Heterogeneous Environ-
ments”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 25464–25477.
[67] Chong Zhang et al. “Goal-Guided Generative Prompt Injection Attack on Large Language Models”.
In: arXiv preprint arXiv:2404.07234 (2024).
[68] Sicheng Zhu et al. “AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language
Models”. In: CoRR abs/2310.15140.1 (2023), pp. 1–14.

19