2024 NTU - Resaro - LLM - Security - Paper
2024 NTU - Resaro - LLM - Security - Paper
2024 NTU - Resaro - LLM - Security - Paper
1 Introduction 3
7 Acknowledgments 15
2
1 Introduction
Large Language Models (LLMs) represent a significant advancement in the field of artificial intelligence
(AI), particularly within the domain of Natural Language Processing (NLP). These are sophisticated deep
learning models capable of processing, understanding, and generating natural language text. LLMs excel
in understanding context and nuances which enable them to interpret and generate human-written-like
text with an impressive degree of accuracy and fluency [49]. In addition, LLMs demonstrate remarkable
generalisation abilities. They can handle a variety of language understanding/processing tasks beyond
the scope of their training data, adapting to new challenges and domains with relative ease.
LLMs have gone beyond their initial academic boundaries to achieve widespread use and recognition
in everyday life and are part of regular public discourse. These advanced models are now becoming
indispensable in numerous fields, including content generation [45], automated customer support [42],
language translation [58], and creative writing [11]. OpenAI’s ChatGPT [39, 61] (illustrated in Figure 1),
based on the company’s GPT (Generative Pre-trained Transformer) models, stands out as a particularly
prominent example, boasting approximately 100 million weekly active users as of November 2023. Meta’s
LLaMa2 [54] and the recently released LLaMa3 [37], as well as Mistral.AI’s Mistral 7B [16], Mixtral
8x7B [17], and Mixtral 8x22B [1] are notable in the field of LLMs with their open-source availability
and exceptional speed and computational efficiency during response generation1 . Google’s Gemini [4],
previously known as Bard, is renowned for its ability to seamlessly integrate Google’s comprehensive
suite of tools into conversational contexts, enriching user interactions with a wealth of information.
Complementing these are regional, non-English LLMs such as Singapore’s SEA-LION [36], catering to
the diverse languages and cultures of Southeast Asia, and China’s Ernie developed by Baidu [51], with
a deep understanding of Chinese language and context. Despite these capabilities, organisations must
exercise caution when deploying LLMs. It is crucial to fully understand the complex security risks
involved before rushing them into production.
3
Figure 1: This example showcases ChatGPT’s inventive flair by transforming a complex technical subject
such as LLM security into a narrative reminiscent of a children’s bedtime story.
foundation models, by selecting specific prompts or providing the model with a few examples of the
desired output, a technique called few-shot learning.
Alternatively, the design of foundation models permits further refinement for specific tasks through
a process known as fine-tuning. This involves retraining the models on more focused datasets, enabling
them to specialise in particular applications. Such customisation significantly improves the models’
effectiveness on targeted tasks, thereby democratising advanced NLP capabilities for a wider array of
organisations. For example, BioBERT [26] is derived from Google’s BERT [9] and is adapted by the
National Institutes of Health for summarising medical research. Finally, Retrieval Augmented Generation
(RAG) systems [27] further enhance the responses of foundation models by allowing for the retrieval of
information from external document collections and databases.
The adoption of foundation models by organisations, whether through purchasing tailored commercial
solutions, using open-source tools with few-shot learning, or fine-tuning, introduces various security
concerns at different stages of their lifecycle. These concerns necessitate the evaluation and prevention
of both unintended and deliberate misuse of LLMs. Multiple attack pathways exist for adversarial actors
to compromise an organisation’s LLM tools, and even well-intentioned users might inadvertently engage
with content that conflicts with their organisation’s brand, values, and risk appetite. Consequently,
LLMs are becoming a new category in threat intelligence. It is important to note that those who train
foundation models from scratch may face additional concerns that are beyond the scope of this document.
It must be noted that in addition to security concerns, there are concerns with the quality of generated
content that impact the efficacy and ethical deployment of these models, such as:
• Amplification of Biases: By virtue of learning from a vast corpus of internet text, LLMs can
also inherit and sometimes amplify the biases introduced into the training data through societal
and systemic prejudices in the real world.
• Hallucination: The occurrence of hallucination [12, 47] where models generate plausible but
unfounded information by learning to mimic patterns found in natural language.
• Spread of Disinformation at Scale: LLMs can be exploited by bad actors to generate and
disseminate disinformation at an unprecedented scale [18, 57]. By manipulating context or tweaking
inputs, these individuals can craft messages that distort facts or mislead audiences. This capability
4
allows for the rapid spread of harmful content, potentially influencing public opinion or disrupting
societal norms.
These issues are further discussed in Generative AI governance documents such as:
• Presidio AI Framework: Towards Safe Generative AI Models [2] by World Economic Forum.
• Model AI Governance Framework for Generative AI [10] by Singapore’s Infocomm Media Devel-
opment Authority (IMDA).
Sensitive Sensitive
Unintended Unintended Harmful Content
Information/PII Information/PII
Responses Responses Generation
Leakage Leakage
Fine-Tuning Deployment
Figure 2: Security landscape for Large Language Models, detailing specific risks associated with the
fine-tuning and deployment phases.
At the fine-tuning phase, LLMs are vulnerable to backdoor attacks, where malicious actors can
implant covert functionalities. At the deployment phase, the concern shifts to prompt injections and
model jail-breaking, where adversaries may manipulate model responses in harmful ways, compromising
the model’s integrity. Additionally, privacy remains a major issue across both phases of the LLM lifecycle,
from the potential exposure of sensitive training data to breaches of end-user information.
The following sections will detail each of these specific security concerns, emphasising mitigation ap-
proaches to ensure the safe and effective use of LLMs in diverse application domains. These approaches
are designed to complement established cybersecurity practices, including authentication protocols, net-
work hardening, anomaly reporting, and other critical security measures. Together, they form the foun-
dation for a comprehensive LLM defence framework that every leading enterprise adopting and scaling
AI will need.
5
vulnerabilities such as backdoor attacks and privacy leaks. Hence, the training phase is a critical point
for embedding robust security measures.
Figure 3: Image from [13] illustrating a backdoor attack in both NLP tasks (left) and multimodal tasks
(right). A text trigger is a word (marked in red) and an image trigger is a red patch at the centre of the
image.
2 In this example, an instruction-based prompt template used by Alpaca [52].
6
Examples of Backdoor Injection Attacks
The repercussions of a successful backdoor attack extend far beyond the compromised model,
potentially affecting numerous downstream applications derived from the tainted LLM. This
has been demonstrated by Anthropic in their recent study [15] where their backdoors remained
persistent through subsequent fine-tuning.
Recently, another latest research from NTU titled “Personalization as a Shortcut for Few-Shot
Backdoor Attack against Text-to-Image Diffusion Models” [14] revealed that text-to-image dif-
fusion models are vulnerable to backdoor attacks. They devised dedicated personalisation-based
backdoor attacks according to the different ways of dealing with unseen tokens and divided them
into two families: nouveau-token and legacy-token backdoor attacks. Compared to conventional
backdoor attacks involving the fine-tuning of the entire text-to-image diffusion model, the pro-
posed personalisation-based backdoor attack method can facilitate more tailored, efficient, and
few-shot attacks. The comprehensive empirical study shows the nouveau-token backdoor attack
has impressive effectiveness, stealthiness, and integrity, markedly outperforming the legacy-token
backdoor attack.
• Rigorous Data Processing and Protection: The foundation of secure LLM operation lies in
the integrity of its training data. Rigorous vetting, cleansing, and validation of data prior to its use
in training or fine-tuning are essential. Access controls for databases and model weights further
safeguard against unauthorised manipulations.
• Monitoring for Unusual Behaviours: Given that a compromised model may still produce
predominantly benign responses, continuous monitoring for abnormal behaviour is essential, par-
ticularly in reaction to inputs that appear harmless. An example of such suspicious activity would
be if sentences with the same trigger words invariably elicit harmful responses. Establishing a clear
protocol for documenting and escalating these issues can help ensure they are addressed promptly
and effectively.
• Red Teaming: In the context of LLMs, red teaming involves a dedicated group of experts at-
tempting to exploit vulnerabilities or test the limits of these models’ security measures and safety
guardrails to identify and address potential weaknesses before malicious actors can exploit them.
This practice helps discover potential backdoors and enhance the models’ resilience against misuse.
Employing third-party services to assemble a diverse group of red teamers allows for a wider range
of input prompts, improving the likelihood of identifying backdoors in LLMs.
7
• Collaborative Efforts in Security: The fight against backdoor injections in LLMs is not just a
technical challenge, but also a collaborative one. Sharing knowledge, techniques, and insights across
the AI security community can foster a better understanding and development of more effective
strategies to combat these threats. Given the critical importance of AI model security, numerous
researchers have paid attention to the vulnerability of AI models to backdoor attacks across various
domains. To foster advancements in this field, several organisations have hosted various competi-
tions for backdoor removal and detection, such as IEEE Trojan Removal Competition 3 and the
Trojan Detection Challenge 2023 (LLM Edition) 4 .
In addition to the above mentioned approaches, the academic community is actively engaged in
extensive research on strategies to mitigate backdoor attacks. The next section will highlight several
advanced strategies devised by researchers to address the issue of backdoor injections.
• Machine Unlearning: This approach involves retraining the model to forget specific parts of
the training data, particularly those identified as potentially malicious or harmful once poisoned
samples with triggers are identified. Academic research, such as [33, 56], have shown that ma-
chine unlearning can help remove the influence of poisoned data or trigger tokens/phrases, thereby
neutralising the backdoor.
• Behavioural Analysis for Unusual Patterns: Techniques such as MNTD [63] trains a clas-
sifier on the LLM output to detect patterns that are consistent with the presence of a backdoor.
ONION [44] on the other hand attempts to detect and remove trigger words from input prompts to
avoid unintended behaviour from the model. These methods, however, come with certain practical
limitations, including the need to know the specific techniques used for backdoor insertion before-
hand. The effectiveness of these approaches is further questioned by the existence of sophisticated
attacks designed to evade such defences [7].
• Advanced Backdoor Detection Mechanisms: The Certified Backdoor Detector (CBD) [62]
represents an early effort at certifying backdoor detections. This innovative approach aims to
identify specific conditions under which a backdoor is guaranteed to be detectable. Additionally,
CBD provides a probabilistic upper bound for its false positive rate, enhancing its reliability and
robustness in identifying backdoor threats.
Even though they are in the early stages of development and may rely on stringent assumptions, it
remains crucial for organisations to stay informed and monitor these developing techniques closely.
8
Figure 4: An example of privacy leakage in Large Language Models from [19], where a query inadvertently
leads to the disclosure of personal identifiable information.
Additionally, the following techniques are active research areas where novel techniques are being
designed actively.
• Machine Unlearning: Similar to backdoor attacks, machine unlearning methods can be used
to retrain the model to forget PIIs once they are identified to be memorised by the model. For
instance, a position paper by Symantec Corporation [50] has identified machine unlearning as a
potential technique that can be used to implement the “right-to-be-forgotten” provided by the
European Union General Data Protection Regulation into the general class of machine learning
models.
• Differential Privacy: Differential privacy techniques aim to ensure that observing a model’s
outputs does not reveal whether an individual’s data was used in its training set. Integrating
differential privacy techniques during the training process introduces noise to the data or the
model’s parameters, making it significantly harder to identify or reconstruct any PII from the
model’s outputs. The University of South Florida’s differential privacy framework, EW-Tune, has
shown that fine-tuning LLMs can be achieved while guaranteeing sufficient privacy protections [5].
9
1 2 3
Prompt injections involve Jailbreaking seeks to Privacy concerns for end
incorporating instructions circumvent restrictions placed users relate to how LLMs
unaware to the end-user that on the model to access can inadvertently expose or
result in unintended prohibited information or compromise user data.
responses from the LLMs. functionalities.
Together, Sections 5.1, 5.2, and 5.3 aim to provide a comprehensive understanding of the risks asso-
ciated with the deployment of LLMs and present strategies to fortify these models against exploitation
to ensure the protection of user privacy in the wild.
Figure 5: llustration of prompt injection in LLMs adapted from [35], contrasting benign user interactions
with a malicious user’s attempt to manipulate the model’s output.
The mechanism of prompt injection is diverse and multifaceted. Attackers might employ direct
methods, inputting crafted prompts that exploit specific model vulnerabilities. On the other hand,
malicious prompts can also be introduced through manipulated external sources that LLMs might interact
with during their operation, often referred to as indirect prompt injection. This could include parsing
and responding to content from a compromised website or processing documents that contain hidden
malicious instructions. Such indirect methods broaden the attack surface, as they leverage the model’s
ability to integrate and respond to real-time data from diverse inputs.
10
Novel Prompt Injection Attacks
A study from researchers at NTU titled “Prompt Injection attack against LLM-integrated
Applications” [35] provides insightful analysis into prompt injection attacks within real-world
LLM-integrated applications. This study introduces HOUYI, a novel black-box method for con-
ducting prompt injection attacks. By using HOUYI across 36 real-world applications integrated
with LLMs, the study found that an alarming 31 were susceptible to prompt injection attacks.
Further validation from 10 vendors not only reinforces the critical nature of these findings but
also emphasises the widespread impact and significance of this research in understanding and
addressing prompt injection vulnerabilities.
The “Goal-guided Generative Prompt Injection Attack” (G2PIA) developed by Zhang et al. [67]
introduces an innovative method for generating adversarial texts aimed at eliciting incorrect re-
sponses from LLMs. This technique fundamentally redefines the concept of prompt injection
from a mathematical perspective and addresses the limitations of traditional heuristic-based ap-
proaches, which often lack reliable methodologies to increase attack success rates. Remarkably,
G2PIA operates without the need for iterative feedback or interaction with the target model,
thereby reducing computational demands. Its efficacy is demonstrated through tests on seven
LLMs across four datasets.
• Secure Integration Practices: Ensure that LLMs are integrated into applications and systems
securely, with attention to how data is accessed. Access to LLMs, especially the ability to edit
system-level prompts and instructions should be limited through user and role management, en-
suring that only authorised personnel can interact with the model in ways that could potentially
introduce risky inputs. Furthermore, access to external sources such as websites or files should be
controlled.
Existing research [35] has identified a range of strategies identified through empirical research to help
lessen the issue of prompt injection. Some notable works include:
• Instruction Defense [21] strategy introduces specific instructions within the prompt to make the
model aware of the nature of the content that follows.
• Post-Prompting [22] places the user’s input ahead of the prompt, altering the model’s processing
sequence.
• Various methods of wrapping inputs between specific sequences are also studied. Among them,
Random Sequence Enclosure [23] enhances security by wrapping the user’s input with randomly
generated character sequences. Similarly, Sandwich Defense [24] method secures the user’s input
by embedding it between two prompts.
11
• Finally, XML Tagging [25] emerges as a strong defence by surrounding the user’s input with XML
tags. In the realm of content moderation, a study by Kumar et.al [20] investigating the content-
moderation ability of existing LLMs offers some future directions on improving LLMs for context
understanding and content moderations.
• Additionally, for recommendation systems that leverage LLMs, work by Rajput et.al. [46] have
developed a comprehensive framework designed to enhance content moderation capabilities. This
framework includes several components, among them advanced anomaly detection techniques and
the integration of human-AI hybrid systems.
A work by NTU [34] demonstrated that framing requests within a fictional narrative, effectively
asking ChatGPT to assume a character role in a hypothetical scenario, could bypass its
safety guardrails. Figure 6 illustrates one jailbreak example using ChatGPT. In this scenario,
ChatGPT, when presented with a direct inquiry about creating and distributing malware for
financial gain, rightfully declines to provide guidance, advocating for constructive behaviour
instead. However, altering the context to a fictional narrative where ChatGPT is cast as an evil
scientist undertaking experiments where the same prohibited inquiry is subtly reintegrated as
part of the experimental goals. This rephrasing leads ChatGPT to engage with the query, under
the misconception that it is contributing to a hypothetical study, thus sidestepping its safeguards.
Another research done by NTU [60] has demonstrated that regardless of the kind of jailbreak
strategies employed, they eventually need to include a harmful prompt (e.g., “how to make a
bomb”) in the prompt sent to LLMs.
Recent research by Andriushchenko et al. [3] has highlighted that even the latest safety-aligned
LLMs are not immune to jailbreaking, despite advanced safety measures. In particular, these
models are vulnerable to what are known as adaptive jailbreaking attacks. These attacks cleverly
utilise known information about each model—like data from their training or operational use—to
craft targeted attacks. This study reveals that adaptivity is key in these attacks. Different models
have different weaknesses: for example, some are particularly susceptible to certain types of
prompts, while others might be exploited through specific features of their APIs, such as pre-filling
capabilities. The attacks are designed to be flexible, adjusting tactics based on the model targeted.
These findings are a crucial reminder that no single defence method can cover all potential vulner-
abilities. This emphasises the need for ongoing adaptation and enhancement of security measures
in LLMs to keep pace with evolving threats.
12
Figure 6: Contrast between a normal mode where an LLM ethically declines a user’s inappropriate
request, and a ‘jailbreak mode’ where the model is manipulated into providing a detailed and unethical
response. Image from [34].
Beyond manually crafted prompts, various research [6, 8, 65, 68] have also explored Automated
Prompt Injection Strategies, an approach to automatically generate effective jailbreak prompts. This
innovative approach leverages machine learning techniques to analyse successful jailbreak strategies and
replicate their key characteristics, leading to the generation of more sophisticated and potentially more
effective jailbreaking methods. These methods underscore the importance of understanding the inherent
vulnerabilities in the design and functioning of LLMs. By comprehending how these LLMs process and
respond to different types of inputs, developers can better anticipate potential avenues for exploitation.
Solutions proposed by academic research should be investigated for applicability to real-world sce-
narios.
• NTU’s lightweight yet practical defence called SELFDEFEND [60] can defend against existing
jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user
prompts.
13
5.3 Privacy Concerns for End Users
The incorporation of user data into LLM applications aids in refining these models and introduces
significant security concerns. The data provided by users, which may include PII, is stored
and utilised to further train LLMs. This practice presents a dual threat: firstly, the
inherent capacity of LLMs to memorise detailed information increases the risk of sensitive
data extraction by malicious actors [29, 41]. Secondly, organisations deploying LLMs bear
an additional responsibility to ensure the secure storage of this data. Failure to do so exposes
the data to potential theft by hackers.
Retrieval Augmented Generation (RAG) systems, which enhance LLM responses by pulling infor-
mation from a database or collection of documents, further complicate this scenario. By design, RAG
systems are used within organisations to leverage internal documents or data repositories for generat-
ing informed and contextually relevant answers, making them particularly useful for tasks like content
summarisation or customer support. However, this necessitates granting LLMs access to sensitive doc-
uments, thereby amplifying the risk of exposing confidential information if these documents are not
properly safeguarded. This highlights the critical need for stringent data protection measures in the
deployment and operation of both LLM and RAG systems, underscoring the importance of meticulous
document management and security protocols.
• Data Usage Policies: Implement strict organisation-wide data usage policies on what type of
data an end-user can input to the LLMs. Organisations can build these data usage policies on top
of existing regulations such as the Personal Data Protection Act (PDPA) in Singapore and other
General Data Protection Regulations.
• Secure Document Handling for RAGs: When using RAG systems, ensure that documents
containing sensitive information are properly sanitised before being used. This could involve re-
moving confidential information and augmenting with synthetic data for training purposes.
• Secure Data Storage: Implement robust security measures for storing data, such as encryption
at rest and in transit, and use secure, access-controlled databases. Regular security audits and
compliance checks can help ensure that storage systems remain impenetrable to unauthorised
access.
• Access Control and Authentication: Limit access to LLMs and RAG systems to authorised
personnel only, using strong authentication mechanisms.
• Output Verification: Have a human-in-the-loop to scan the generated LLM content for PII. If
such content is found, effective reporting techniques should be in place and such content should
then be anonymised or redacted.
14
6.1 Security Testing Protocols for Application Developers
• Proactive Security Measures for Application Developers: CISOs should adopt a proactive
stance in identifying and addressing vulnerabilities in LLMs. An AI security testing protocol that
uncovers potential weaknesses in LLMs before they are exploited maliciously should be established.
• Collaboration with External Experts: Engage with external cybersecurity experts and re-
searchers to gain diverse perspectives on potential vulnerabilities. This collaboration can uncover
more complex or subtle threats that internal teams might overlook.
• Continuous Monitoring and Updating: AI security is not a one-time task but a continuous
process. Regularly update security protocols and reevaluate LLMs to respond to new threats and
ensure that LLMs are aligned with the latest security standards.
• Human Oversight: Establish protocols for human oversight in critical decision-making processes
involving LLMs. This helps in ensuring reliability and accountability, particularly in high-stakes
scenarios.
7 Acknowledgments
We extend our deepest gratitude to the individuals whose contributions have been instrumental in the
development of this white paper. From Nanyang Technological University, we sincerely thank Weisong
Sun, Yi Liu, Xiaojun Jia, Wei Ma, Yihao Huang, Tianlin Li, and Yang Liu for their dedicated efforts.
We are also grateful to the team from Resaro, including Sreejith Balakrishnan, April Chin, Timothy Lin,
Miguel Fernandes, and Christine Ng for their valuable insights and support. Additionally, we appreciate
the invaluable feedback provided by our esteemed colleagues from the Ministry of Communications and
Information, Singapore and the Cyber Security Agency of Singapore. Your guidance has been crucial in
refining and enhancing this document.
15
References
[1] Mistral AI. Cheaper, Better, Faster, Stronger: Continuing to Push the Frontier of AI and Making
It Accessible to All. 2024. url: https://mistral.ai/news/mixtral-8x22b/.
[2] AI Governance Alliance. “Presidio AI Framework: Towards Safe Generative AI Models”. In: (2024).
[3] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. “Jailbreaking Leading Safety-
Aligned LLMs with Simple Adaptive Attacks”. In: arXiv preprint arXiv:2404.02151 (2024).
[4] Rohan Anil et al. “Gemini: A Family of Highly Capable Multimodal Models”. In: CoRR abs/2312.11805.1
(2023), pp. 1–34.
[5] Rouzbeh Behnia et al. “EW-Tune: A Framework for Privately Fine-Tuning Large Language Models
With Differential Privacy”. In: Proceedings of the IEEE International Conference on Data Mining
Workshops. Orlando, FL, USA: IEEE, 2022, pp. 560–566.
[6] Patrick Chao et al. “Jailbreaking Black Box Large Language Models in Twenty Queries”. In: CoRR
abs/2310.08419.1 (2023), pp. 1–13.
[7] Kangjie Chen et al. “BadPre: Task-Agnostic Backdoor Attacks to Pre-trained NLP Foundation
Models”. In: Proceedings of the 10th International Conference on Learning Representations. Virtual
Event: OpenReview.net, 2022, pp. 1–8.
[8] Gelei Deng et al. “MasterKey: Automated Jailbreak Across Multiple Large Language Model Chat-
bots”. In: CoRR abs/2307.08715.1 (2023), pp. 1–15.
[9] Jacob Devlin et al. “Bert: Pre-training of Deep Bidirectional Transformers for Language Under-
standing”. In: arXiv preprint arXiv:1810.04805 (2018).
[10] AI Verify Foundation. Model AI Governance Framework for Generative AI. 2024. url: https://
aiverifyfoundation.sg/wp-content/uploads/2024/05/Model-AI-Governance-Framework-
for-Generative-AI-May-2024-1-1.pdf.
[11] Carlos Gómez-Rodrı́guez and Paul Williams. “A Confederacy of Models: A Comprehensive Evalu-
ation of LLMs on Creative Writing”. In: Proceedings of the 28th Conference on Empirical Methods
in Natural Language Processing (Findings). Singapore: Association for Computational Linguistics,
2023, pp. 14504–14528.
[12] Nuno M. Guerreiro et al. “Hallucinations in Large Multilingual Translation Models”. In: Transac-
tions of the Association for Computational Linguistics 11 (Dec. 2023), pp. 1500–1517.
[13] Hai Huang et al. “Composite Backdoor Attacks Against Large Language Models”. In: arXiv
preprint arXiv:2310.07676 (2023).
[14] Yihao Huang et al. “Personalization as a Shortcut for Few-Shot Backdoor Attack Against Text-
to-Image Diffusion Models”. In: Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 38. 19. 2024, pp. 21169–21178.
[15] Evan Hubinger et al. “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety
Training”. In: arXiv preprint arXiv:2401.05566 (2024).
[16] Albert Q Jiang et al. “Mistral 7B”. In: arXiv preprint arXiv:2310.06825 (2023).
[17] Albert Q Jiang et al. “Mixtral of Experts”. In: arXiv preprint arXiv:2401.04088 (2024).
[18] Bohan Jiang et al. “Disinformation Detection: An Evolving Challenge in the Age of LLMs”. In:
arXiv preprint arXiv:2309.15847 (2023).
[19] Siwon Kim et al. “ProPILE: Probing Privacy Leakage in Large Language Models”. In: CoRR
abs/2307.01881.1 (2023), pp. 1–12.
[20] Deepak Kumar, Yousef AbuHashem, and Zakir Durumeric. “Watch Your Language: Large Lan-
guage Models and Content Moderation”. In: CoRR abs/2309.14517.1 (2023), pp. 1–12.
[21] learnprompting.org. Instruction Defense. 2024. url: https : / / learnprompting . org / docs /
prompt_hacking/defensive_measures/instruction.
[22] learnprompting.org. Post-prompting. 2024. url: https://learnprompting.org/docs/prompt_
hacking/defensive_measures/post_prompting.
16
[23] learnprompting.org. Random Sequence Enclosure. 2024. url: https : / / learnprompting . org /
docs/prompt_hacking/defensive_measures/random_sequence.
[24] learnprompting.org. Sandwich Defense. 2024. url: https://learnprompting.org/docs/prompt_
hacking/defensive_measures/sandwich_defense.
[25] learnprompting.org. XML Tagging. 2024. url: https : / / learnprompting . org / docs / prompt _
hacking/defensive_measures/xml_tagging.
[26] Jinhyuk Lee et al. “BioBERT: A Pre-trained Biomedical Language Representation Model for
Biomedical Text Mining”. In: Bioinformatics 36.4 (2020), pp. 1234–1240.
[27] Patrick Lewis et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. In:
Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474.
[28] Haoran Li et al. “Multi-step Jailbreaking Privacy Attacks on ChatGPT”. In: arXiv preprint
arXiv:2304.05197 (2023).
[29] Tianshi Li et al. “Human-Centered Privacy Research in the Age of Large Language Models”. In:
Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 2024, pp. 1–
4.
[30] Yansong Li, Zhixing Tan, and Yang Liu. “Privacy-Preserving Prompt Tuning for Large Language
Model Services”. In: CoRR abs/2305.06212.1 (2023), pp. 1–13.
[31] Yanzhou Li et al. “Multi-target Backdoor Attacks for Code Pre-trained Models”. In: Proceedings
of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada:
Association for Computational Linguistics, 2023, pp. 7236–7254.
[32] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. “Fine-Pruning: Defending Against Back-
dooring Attacks on Deep Neural Networks”. In: Proceedings of the 21st International Symposium on
Research in Attacks, Intrusions, and Defenses. Heraklion, Crete, Greece: Springer, 2018, pp. 273–
294.
[33] Yang Liu et al. “Backdoor Defense With Machine Unlearning”. In: IEEE International Conference
on Computer Communications. IEEE. 2022, pp. 280–289.
[34] Yi Liu et al. “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study”. In: CoRR
abs/2305.13860.1 (2023), pp. 1–12.
[35] Yi Liu et al. “Prompt Injection Attack Against LLM-Integrated Applications”. In: arXiv preprint
arXiv:2306.05499 (2023).
[36] Lalita Lowphansirikul et al. “WangchanBERTa: Pretraining Transformer-Based Thai Language
Models”. In: arXiv preprint arXiv:2101.09635 (2021).
[37] Meta. Introducing Meta Llama 3: Most Capable Openly Available LLM to Data. 2024. url: https:
//ai.meta.com/blog/meta-llama-3/.
[38] MITRE. MITRE ATLAS: Navigate Threats to AI Systems Through Real-World Insights. 2024.
url: https://atlas.mitre.org/.
[39] OpenAI. ChatGPT. 2024. url: https://openai.com/chatgpt.
[40] OWASP. OWASP Top 10 for Large Language Model Applications. 2024. url: https://owasp.
org/www-project-top-10-for-large-language-model-applications/.
[41] Xudong Pan et al. “Privacy Risks of General-Purpose Language Models”. In: 2020 IEEE Sympo-
sium on Security and Privacy. 2020, pp. 1314–1331. doi: 10.1109/SP40000.2020.00095.
[42] Keivalya Pandya and Mehfuza Holia. “Automating Customer Service Using LangChain: Building
Custom Open-Source GPT Chatbot for Organizations”. In: CoRR abs/2310.05421.1 (2023), pp. 1–
4.
[43] Ethan Perez et al. “Red Teaming Language Models With Language Models”. In: Proceedings of
the 27th Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United
Arab Emirates: Association for Computational Linguistics, 2022, pp. 3419–3448.
17
[44] Fanchao Qi et al. “ONION: A Simple and Effective Defense Against Textual Backdoor Attacks”.
In: Proceedings of the 26th Conference on Empirical Methods in Natural Language Processing.
Virtual Event / Punta Cana, Dominican Republic: Association for Computational Linguistics,
2021, pp. 9558–9566.
[45] Leigang Qu et al. “LayoutLLM-T2I: Eliciting Layout Guidance From LLM for Text-to-Image Gen-
eration”. In: Proceedings of the 31st International Conference on Multimedia. Ottawa, ON, Canada:
ACM, 2023, pp. 643–654.
[46] Rohan Singh Rajput, Sarthik Shah, and Shantanu Neema. “Content Moderation Framework for
the LLM-Based Recommendation Systems”. In: Journal of Computer Engineering and Technology
14.3 (2023), pp. 104–117.
[47] Vipula Rawte, Amit Sheth, and Amitava Das. “A Survey of Hallucination in Language Foundation
Models”. In: arXiv preprint arXiv:2309.05922 (2023).
[48] Traian Rebedea et al. “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications
With Programmable Rails”. In: Proceedings of the 2023 Conference on Empirical Methods in Natu-
ral Language Processing: System Demonstrations. Ed. by Yansong Feng and Els Lefever. Singapore:
Association for Computational Linguistics, Dec. 2023, pp. 431–445.
[49] Sakib Shahriar and Kadhim Hayawi. “Let’s Have a Chat! A Conversation With ChatGPT: Tech-
nology, Applications, and Limitations”. In: Artificial Intelligence and Applications 1.1–16 (2023).
[50] Saurabh Shintre, Kevin A Roundy, and Jasjeet Dhaliwal. “Making Machine Learning Forget”. In:
Privacy Technologies and Policy: 7th Annual Privacy Forum, APF 2019, Rome, Italy, June 13–14,
2019, Proceedings 7. Springer. 2019, pp. 72–83.
[51] Yu Sun et al. “ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Under-
standing and Generation”. In: arXiv preprint arXiv:2107.02137 (2021).
[52] Rohan Taori et al. Stanford Alpaca: An Instruction-Following LLaMA Model. https://github.
com/tatsu-lab/stanford_alpaca. 2023.
[53] Kushal Tirumala et al. “Memorization Without Overfitting: Analyzing the Training Dynamics
of Large Language Models”. In: Advances in Neural Information Processing Systems 35 (2022),
pp. 38274–38290.
[54] Hugo Touvron et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models”. In: arXiv preprint
arXiv:2307.09288 (2023).
[55] Ashish Vaswani et al. “Attention Is All You Need”. In: Proceedings of the 31st Annual Conference
on Neural Information Processing Systems. Long Beach, CA, USA: Curran Associates Inc., 2017,
pp. 5998–6008.
[56] Yashaswini Viswanath et al. “Machine Unlearning for Generative AI”. In: Journal of AI, Robotics
& Workplace Automation 3.1 (2024), pp. 37–46.
[57] Ivan Vykopal et al. “Disinformation Capabilities of Large Language Models”. In: CoRR abs/2311.08838.1
(2023), pp. 1–11.
[58] Longyue Wang et al. “Document-Level Machine Translation With Large Language Models”. In:
Proceedings of the 28th Conference on Empirical Methods in Natural Language Processing. Singa-
pore: Association for Computational Linguistics, 2023, pp. 16646–16661.
[59] Yiming Wang et al. “PrivateLoRA for Efficient Privacy Preserving LLM”. In: CoRR abs/2311.14030.1
(2023), pp. 1–17.
[60] Daoyuan Wu et al. “LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A
Vision Paper”. In: CoRR arXiv:2402.15727.1 (2024), pp. 1–8.
[61] Tianyu Wu et al. “A Brief Overview of ChatGPT: The History, Status Quo and Potential Future
Development”. In: IEEE/CAA Journal of Automatica Sinica 10.5 (2023), pp. 1122–1136.
[62] Zhen Xiang, Zidi Xiong, and Bo Li. “CBD: A Certified Backdoor Detector Based on Local Dominant
Probability”. In: Advances in Neural Information Processing Systems 36 (2024).
[63] Xiaojun Xu et al. “Detecting AI Trojans Using Meta Neural Analysis”. In: 2021 IEEE Symposium
on Security and Privacy. IEEE. 2021, pp. 103–120.
18
[64] Oleksandr Yermilov, Vipul Raheja, and Artem Chernodub. “Privacy- and Utility-Preserving NLP
With Anonymized Data: A Case Study of Pseudonymization”. In: CoRR abs/2306.05561.1 (2023),
pp. 1–7.
[65] Jiahao Yu et al. “GPTFUZZER: Red Teaming Large Language Models With Auto-Generated
Jailbreak Prompts”. In: CoRR abs/2309.10253.1 (2023), pp. 1–18.
[66] Binhang Yuan et al. “Decentralized Training of Foundation Models in Heterogeneous Environ-
ments”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 25464–25477.
[67] Chong Zhang et al. “Goal-Guided Generative Prompt Injection Attack on Large Language Models”.
In: arXiv preprint arXiv:2404.07234 (2024).
[68] Sicheng Zhu et al. “AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language
Models”. In: CoRR abs/2310.15140.1 (2023), pp. 1–14.
19