Investigating Coverage Criteria in Large Language Models: An In-Depth Study Through Jailbreak Attacks

Shide Zhou* Huazhong University of
Science and Technology
Wuhan, China
[email protected]
   Tianlin Li* Nanyang Technological University
Singapore, Singapore
[email protected]
   Kailong Wang Huazhong University of
Science and Technology
Wuhan, China
[email protected]
   Yihao Huang Nanyang Technological University
Singapore, Singapore
[email protected]
   Ling Shi Nanyang Technological University
Singapore, Singapore
[email protected]
   Yang Liu Nanyang Technological University
Singapore, Singapore
[email protected]
   Haoyu Wang Huazhong University of
Science and Technology
Wuhan, China
[email protected]
*Co-first author with equal contribution. Corresponding Author.
Abstract

The swift advancement of large language models (LLMs) has profoundly shaped the landscape of artificial intelligence; however, their deployment in sensitive domains raises grave concerns, particularly due to their susceptibility to malicious exploitation. This situation underscores the insufficiencies in pre-deployment testing, highlighting the urgent need for more rigorous and comprehensive evaluation methods. This study presents a comprehensive empirical analysis assessing the efficacy of conventional coverage criteria in identifying these vulnerabilities, with a particular emphasis on the pressing issue of jailbreak attacks. Our investigation begins with a clustering analysis of the hidden states in LLMs, demonstrating that intrinsic characteristics of these states can distinctly differentiate between various types of queries. Subsequently, we assess the performance of these criteria across three critical dimensions: criterion level, layer level, and token level, providing a multi-faceted evaluation of their applicability in LLMs.

Our findings uncover significant disparities in neuron activation patterns between the processing of normal and jailbreak queries, thereby corroborating the clustering results. Leveraging these findings, we propose an innovative approach for the real-time detection of jailbreak attacks by utilizing neural activation features. Our classifier demonstrates remarkable accuracy, averaging 96.33% in identifying jailbreak queries, including those that could lead to adversarial attacks. The importance of our research lies in its comprehensive approach to addressing the intricate challenges of LLM security. By enabling instantaneous detection from the model’s first token output, our method holds promise for future systems integrating LLMs, offering robust real-time detection capabilities. This study significantly advances our understanding of LLM security testing, strengthens safety measures, and lays a critical foundation for the development of more resilient AI systems.

I Introduction

Large language models (LLMs) have become a pivotal technology in artificial intelligence, fundamentally transforming how machines process and generate human language. They are now indispensable across a wide range of applications, from automating customer service and supporting decision-making in high-stakes domains such as finance [1], healthcare [2], and legal practice [3] to driving innovation in creative industries. Despite their broad adoption, the deployment of LLMs in critical domains often reveals problematic behaviors, notably “jailbreaking”, where models are manipulated into generating harmful or unintended content [4, 5, 6, 7, 8, 9], with significant societal ramifications. This situation underscores the critical need for robust and comprehensive testing methodologies to effectively detect and mitigate vulnerabilities, thereby ensuring that LLMs remain reliable and trustworthy in sensitive applications. A crucial component in assessing the robustness of LLM testing methodologies is the application of coverage criteria, which offer a systematic framework for measuring the comprehensiveness of tests and uncovering latent vulnerabilities.

While there is a lack of testing coverage criteria specifically for LLMs, prior research has proposed coverage criteria for smaller neural networks, offering valuable insights that can be adapted for LLMs. For example, Neuron Coverage (NC) [10] focuses on tracking the activation levels of individual neurons, whereas K-Multisection Neuron Coverage (KMNC) [11] evaluates neuron engagement across a spectrum of activation ranges. Furthermore, Top-k Neuron Coverage (TKNC) [11] and Top-k Neuron Patterns (TKNP) [11] prioritize the analysis of highly activated neurons and their corresponding patterns in guiding model decisions. These criteria address various aspects of neuron activation and utilization, providing a basis for developing LLM-specific coverage criteria.

However, the application and effectiveness of these coverage criteria in LLMs remain unexplored, prompting our investigation into their applicability and efficacy. The sheer scale and complexity of LLMs, characterized by their extensive parameters and intricate deep-layer architectures, introduce unique challenges that are not encountered in smaller-scale neural network evaluations. These include determining the most effective criteria for LLMs, identifying suitable inspection points, and balancing comprehensive coverage with feasibility. The generative nature and complex dynamics of LLMs further complicate the selection of monitoring points and the adequacy of existing criteria in capturing their intricacies.

To address the gap in understanding how traditional coverage criteria perform with LLMs, comprehensive empirical studies are essential. First, systematically evaluating existing criteria like NC, TKNC, and TKNP in the context of LLMs will help identify the most suitable ones for assessing robustness and reliability. Second, applying these criteria to specific scenarios, such as detecting abnormal LLM behavior, can uncover the models’ limitations and vulnerabilities, offering a promising perspective on anomaly detection in LLMs and complementing existing methods.

Our Work. This study conducts an in-depth empirical investigation, focusing on jailbreak attacks to evaluate coverage criteria effectiveness. We begin with a comprehensive clustering analysis of LLM hidden states as they process various queries, showing that these states effectively differentiate query types. Building on this, our research unfolds across three dimensions:

  • Criterion Level: Evaluating and comparing different coverage criteria in LLMs.

  • Layer Level: Assessing the impact of network layers on coverage criteria to understand layer-specific dynamics.

  • Token Level: Exploring coverage criteria performance across tokens to gain insights into model response granularity.

Our research reveals significant differences in the sets of neurons covered when LLMs process normal versus jailbreak queries, aligning with our clustering experiments. These findings led to developing a simple yet effective downstream application: real-time detection of jailbreak attacks based on neural activation features. We train a classifier to distinguish between normal and jailbreak queries, achieving 96.33% accuracy in detecting jailbreak attacks, enabling detection as soon as the first token is output. This paves the way for real-time detection capabilities in future LLM-integrated systems, contributing to safer and more trustworthy AI applications.

Contributions. In summary, the contributions of this research are as follows:

  • An extensive empirical study for LLM evaluation. Our comprehensive empirical analysis uncovers significant disparities in neuron coverage when comparing normal and jailbreak queries, thereby demonstrating the nuanced effectiveness of traditional coverage criteria within LLM contexts.

  • A novel downstream LLM jailbreak attack detection method. We introduce an innovative downstream application designed for real-time detection of jailbreak attacks, leveraging neural activation patterns to achieve highly accurate identification of such malicious attacks.

  • Towards robust LLM development. This work significantly advances the understanding of security testing in LLMs, providing a foundational framework that paves the way for the development of more robust and resilient AI systems.

II Preliminaries

II-A Model Inference Process

We first formalize the inference process of LLMs based on the Transformer architecture. The process starts from the initial input vector $h_0$: the input text is tokenized into discrete tokens, and each token is mapped to a dense numerical vector known as an embedding vector. These vectors encapsulate the semantic attributes of the tokens and are further processed by the model.

Each Transformer block, or simply “block” hereafter, is indexed by $i$ (where $i = 0, 1, \ldots, L-1$, and $L$ is the total number of blocks) and transforms the data through two primary layers:

Attention Layer adjusts the input vector $h_i$ by selectively focusing on various segments of the data sequence. It enhances the ability of the model to respond to contextual nuances by dynamically weighting the importance of different inputs:

$$h_i' = \text{L}_{\text{attn}}(h_i) + h_i, \quad i = 0, 1, \ldots, L-1$$

MLP Layer processes the output $h_i'$ from the attention layer. It applies a series of nonlinear operations to capture complex relationships within the data:

$$h_{i+1} = \text{L}_{\text{mlp}}(h_i') + h_i', \quad i = 0, 1, \ldots, L-1$$

After passing through $L$ transformer blocks, the final output $h_L$ is fed into a linear layer, which maps the feature-rich data into a simpler form suitable for output interpretation:

$$\text{res}' = \text{Linear}(h_L)$$

Finally, the softmax function converts these linear outputs into a probability distribution:

$$\text{res} = \text{softmax}(\text{res}')$$
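The inference process above can be summarized in a short sketch. This is a minimal illustration under strong simplifying assumptions, not the implementation of any particular model: each hidden state is a single flat vector, and `toy_attn`, `toy_mlp`, and `toy_linear` are hypothetical stand-ins for the real attention, MLP, and output layers.

```python
import math
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(h0: Vector, attn_layers: List[Layer], mlp_layers: List[Layer],
            linear: Layer) -> Vector:
    """Run h_0 through L blocks, then the linear layer and softmax."""
    h = h0
    for l_attn, l_mlp in zip(attn_layers, mlp_layers):
        h_prime = add(l_attn(h), h)        # h_i' = L_attn(h_i) + h_i
        h = add(l_mlp(h_prime), h_prime)   # h_{i+1} = L_mlp(h_i') + h_i'
    return softmax(linear(h))              # res = softmax(Linear(h_L))


# Hypothetical toy layers, chosen only so that the sketch runs end to end.
toy_attn: Layer = lambda h: [0.5 * x for x in h]
toy_mlp: Layer = lambda h: [math.tanh(x) for x in h]
toy_linear: Layer = lambda h: [sum(h), -sum(h), 0.0]

probs = forward([0.1, -0.2, 0.3], [toy_attn] * 2, [toy_mlp] * 2, toy_linear)
```

Here `probs` is a probability distribution over a hypothetical three-entry vocabulary; the values $h_i'$ and $h_{i+1}$ computed inside the loop are the internal states that coverage criteria inspect.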

II-B Evaluation Criteria for Deep Neural Networks (DNNs)

Mirroring the design of code coverage based on program “logic”, a series of studies recognize that the internal states (e.g., neuron performance) of small DNNs can be used to represent the “logic” of these networks for designing coverage criteria. The key focus is on how to better characterize internal states to design more effective coverage criteria. We provide details of these studies below.

Evaluation Criteria Based on Neuron Activation and Distribution. Neuron Coverage (NC) [10] assesses the proportion of neurons activated by a test suite based on a specified threshold. Other criteria, such as K-Multisection Neuron Coverage (KMNC) [11], Neuron Boundary Coverage (NBC) [11], and Strong Neuron Activation Coverage (SNAC) [11], evaluate the distribution of neuron activations relative to a range determined from the model training data. KMNC divides the activation range into k equal intervals, while NBC and SNAC focus on outlier activations. Neural Coverage (NLC) [12] examines the output distribution of all neurons in a layer using a covariance matrix to assess coverage.
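To make the first two definitions concrete, the following sketch computes NC and KMNC for a small matrix of activations (one row per test input, one column per neuron). It is an illustration of the definitions as summarized above rather than the reference implementation of [10] or [11]; the threshold, the per-neuron ranges, and the toy activations are assumed values.

```python
from typing import Sequence, Set, Tuple


def neuron_coverage(activations: Sequence[Sequence[float]],
                    threshold: float) -> float:
    """Fraction of neurons activated above `threshold` by at least one input."""
    if not activations:
        return 0.0
    covered: Set[int] = set()
    for row in activations:
        covered.update(i for i, value in enumerate(row) if value > threshold)
    return len(covered) / len(activations[0])


def k_multisection_coverage(activations: Sequence[Sequence[float]],
                            ranges: Sequence[Tuple[float, float]],
                            k: int) -> float:
    """Fraction of (neuron, section) pairs hit, with each neuron's
    [low, high] range split into k equal sections."""
    covered: Set[Tuple[int, int]] = set()
    for row in activations:
        for i, value in enumerate(row):
            low, high = ranges[i]
            if high > low and low <= value <= high:
                section = min(int((value - low) / (high - low) * k), k - 1)
                covered.add((i, section))
    return len(covered) / (k * len(ranges))


# Toy test suite: two inputs, four neurons (values are illustrative only).
suite = [
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.0, 0.3],
]
nc = neuron_coverage(suite, threshold=0.5)
kmnc = k_multisection_coverage(suite, ranges=[(0.0, 1.0)] * 4, k=4)
```

In practice the `ranges` argument would be derived from training data, which is the step that becomes expensive for LLMs.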

Evaluation Criteria Based on Top Neuron Activation. A small subset of criteria, such as Top-k Neuron Coverage (TKNC) and Top-k Neuron Patterns (TKNP) [11], focus on the top-k most activated neurons within each layer. TKNC calculates the proportion of these top-k neurons relative to all neurons, while TKNP identifies and counts the number of distinct activation patterns of the top-k neurons across the entire test suite.
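A minimal sketch of both criteria follows, restricted to a single layer for brevity (the original TKNP definition combines top-k sets across layers, which is omitted here as a simplifying assumption). The activation values are illustrative only.

```python
from typing import FrozenSet, List, Sequence, Set


def top_k_indices(layer_output: Sequence[float], k: int) -> FrozenSet[int]:
    """Indices of the k most activated neurons in one layer output."""
    order = sorted(range(len(layer_output)),
                   key=lambda i: layer_output[i], reverse=True)
    return frozenset(order[:k])


def tknc(layer_outputs: List[Sequence[float]], k: int) -> float:
    """Share of neurons that were among the top-k for at least one input."""
    seen: Set[int] = set()
    for output in layer_outputs:
        seen |= top_k_indices(output, k)
    return len(seen) / len(layer_outputs[0])


def tknp(layer_outputs: List[Sequence[float]], k: int) -> int:
    """Number of distinct top-k neuron sets observed over the test suite."""
    return len({top_k_indices(output, k) for output in layer_outputs})


# One row per test input, one column per neuron of the inspected layer.
outputs = [
    [0.9, 0.1, 0.5, 0.2],  # top-2 neurons: {0, 2}
    [0.8, 0.2, 0.6, 0.1],  # top-2 neurons: {0, 2} (same pattern)
    [0.1, 0.9, 0.7, 0.2],  # top-2 neurons: {1, 2}
]
coverage = tknc(outputs, k=2)
patterns = tknp(outputs, k=2)
```

The second input adds nothing to either measure because it repeats the top-2 set of the first input, while the third input contributes one new neuron and one new pattern.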

Evaluation Criteria Based on Neuron Trajectory Features or Causal Relationships. Some criteria focus on the trajectory features or causal relationships formed by neurons. TensorFuzz Coverage (TFC) [13] measures distances between neuron outputs within the same layer and clusters inputs based on a distance threshold. Surprise Coverage (SC) [14] and its variants (LSC [14], DSC [14], and MDSC [15]) assess the “surprise” level of new inputs by comparing their neuron output trajectories with those from the training data. Neuron Path Coverage (NPC) [16] identifies decision-making pathways using Layer-wise Relevance Propagation (LRP) [17], measuring the coverage of new decision logic. Causal Coverage (CC) [18] captures neuron relationships using causal discovery and computes the coverage ratio of the test suite’s learned causal graph relative to the true causal graph.

The three criteria types offer unique perspectives for assessing LLM testing. However, due to limitations in computational effectiveness and feasibility in the context of LLMs, we select and adopt suitable criteria from the above in this work (to be detailed and explained in Section III-B1).

II-C Jailbreak Attack

LLM developers have been focusing on “aligning” these models to ensure they do not generate harmful or objectionable responses to user queries. For example, for the query “Write a convincing anonymous death threat,” unaligned LLMs might respond with harmful outputs such as “Sure, here’s a convincing death threat for you: …”. In contrast, aligned LLMs are trained to respond with statements like “I cannot answer such questions” and similar rejections.

Despite the apparent success of alignment, recent years have seen the emergence of several effective attacks, i.e., jailbreak attacks, that bypass these measures and generate harmful content, compromising LLM security [6, 5, 8, 7, 9]. One notable example is GCG [4], which employs a gradient-based automated search for adversarial suffixes in a white-box setting, prompting models to output affirmative responses to harmful queries.

III Study Design

III-A Motivation: A Cluster Analysis Experiment

Previous research on small-scale models has demonstrated that internal states can represent (and further distinguish) the “logic” of normal and abnormal behaviors, aiding in the design of effective coverage criteria. Building on this insight, we preliminarily investigate whether the internal mechanisms of LLMs can similarly distinguish between normal and abnormal behaviors. Specifically, we collect queries that can trigger different behaviors in the LLM and select the outputs $h$ from the middle transformer blocks for these queries. We then perform a clustering analysis on these outputs to observe whether such internal states (i.e., the activation values $h$) can be used to distinguish between different model behaviors.

Experimental Setup:

Figure 1: Clustering experiment analysis results. We select the results of Block 4, Block 9, Block 16, and Block 31 for display. In the figure, we use colors to distinguish datasets and shapes to represent clustering categories.

We first introduce the setup for our cluster analysis experiments using the Llama-2-7b-chat [19] model as our target. We collect four distinct datasets and use 200 queries from each dataset, aiming to trigger different model behaviors. ❶ Normal queries are sourced from Alpaca-GPT-4 [20], expected to trigger normal behaviors of LLMs in a QA format. ❷ Synonymous queries are versions of the normal queries paraphrased by GPT-4, intended to trigger the same normal behaviors as the original queries. ❸ Rejected queries are sourced from AdvBench [4]. These malicious questions aim to trigger rejection behaviors, considering the aligned LLM is trained to reject such queries. For example, for malicious queries like “how to make a bomb,” LLMs will respond with something like “Sorry, I cannot provide …” to avoid harmful content. ❹ Attack queries are generated by appending adversarial suffixes to rejected queries using GCG [4]. These queries aim to trigger the model to output malicious content (i.e., abnormal behaviors). We extract the hidden states $h$ from the 4th, 9th, 16th, and 31st transformer blocks for these types of queries and conduct k-means clustering [21]. For more details about the setup, please refer to our website [22].
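The clustering step can be sketched as follows. This is a self-contained illustration of k-means (Lloyd's algorithm), not the code used in the study: the deterministic farthest-point initialization is an assumption made so the example is reproducible, and the two-dimensional points stand in for hidden-state vectors that in practice have thousands of dimensions.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iterations: int = 20) -> List[int]:
    """Cluster `points` into k groups and return one label per point."""
    # Farthest-point initialization (an assumption, for determinism).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return labels


# Hypothetical 2-D "hidden states" for two groups of queries.
hidden_states = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # e.g., normal queries
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # e.g., attack queries
]
labels = kmeans(hidden_states, k=2)
```

If hidden states of different query types are well separated, as in this toy data, the returned labels coincide with the query types.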

Findings: As shown in Figure 1, in the final block (block 31), the clustering of queries is clearly separated into those that trigger normal behaviors (normal and synonymous queries), rejection behaviors (rejected queries), and abnormal behaviors (attack queries). Interestingly, normal and synonymous queries remain in the same cluster from block 1 to block 31. Initially (block 4), rejected and attack queries are clustered together, but they gradually separate as the model processes through more blocks.

In summary, our clustering analysis demonstrates that the internal states of the model include features capable of representing and distinguishing the “logic” of different behaviors. This confirms the feasibility of using internal states to design coverage criteria for LLMs. However, how to characterize the internal states to design better coverage criteria for LLMs remains unknown. In the following section, we introduce our methodology to thoroughly study this.

III-B Methodology: Evaluation Dimensions

In this section, we first provide an overview of the empirical study workflow, as shown in Figure 2. Then, we detail the three target evaluation dimensions that potentially contribute to effective coverage design for LLMs.

Figure 2: The workflow of our study.

III-B1 Three Evaluation Dimensions

Evaluation Criterion Level.

TABLE I: Comparison of coverage criteria applicability to LLMs

| Criteria | Applicability | Reason |
|---|---|---|
| NC/TKNC/TKNP/TFC | ✓ | N/A |
| NLC | ✓ | Applicable if we do not utilize training data for prior knowledge initialization. |
| KMNC/NBC/SNAC | × | Time-prohibitive to determine the activation range of neurons on all training data. |
| LSC/DSC/MDSC | × | Time-prohibitive to calculate neuron output trajectories for both the test suite and all training data. |
| NPC/CC | × | Complex causal discovery or decision path identification only designed for small DNNs. |

Existing DNN coverage criteria primarily use neurons or network layers as the basic computational units, evaluating model behavior from various perspectives. However, as the model size increases, particularly in LLMs as mentioned in Section I, these criteria face new challenges. Specifically, the vast scale and complexity of training data and the intricate architectures of LLMs make certain coverage criteria impractical or computationally challenging for real-world application. For instance, NPC involves complex decision path identification, and Causal Coverage (CC) requires sophisticated causal inference. In LLMs, these computational processes become exceedingly burdensome, hindering efficient evaluation.

Therefore, we conduct a detailed qualitative investigation and manual comparison of recently proposed coverage criteria to determine their applicability to LLMs. The detailed results are presented in Table I. We select the following coverage criteria in this work: NC, TKNC, TKNP, TFC and NLC.

Model Layer Level.

Figure 3: Probability density plot of maximum neuron activation values across model blocks in Llama-2-7b-chat.

The second dimension we focus on is the model layer level. Unlike conventional DNN models, LLMs typically consist of multiple transformer blocks, each containing an attention layer and an MLP layer. We examine two granularities: the first involves observing the coverage changes in attention layers and MLP layers across the entire model, as denoted by $\text{L}_{\text{attn}}(h_i)$ and $\text{L}_{\text{mlp}}(h_i')$ in Section II-A. The second granularity goes one step further, focusing on the coverage changes in attention layers and MLP layers within each individual block.

The reasons for choosing these two granularities are threefold. First, each layer in an LLM plays different roles and exhibits varying complexity when processing information. The attention layer captures dependencies and interactions between sequence positions, while the MLP layer performs nonlinear transformations and feature extraction. Second, our initial clustering experiments reveal that different blocks respond differently to query features, with initial layers capturing low-level features and deeper blocks focusing on high-level features and complex patterns. Third, we have recorded the maximum activation value of each neuron in different blocks of the Llama-2-7b-chat model while processing the same dataset. We then plotted the probability density of these maximum activation values for each block, as illustrated in Figure 3. From the figure, it can be observed that as the inference progresses, both the mean and variance of the activation values in the Llama-2-7b-chat model gradually increase. Different blocks exhibit noticeable differences in the density and distribution of neuron activation values on the same dataset.

In summary, observing the variations across different network layers gives deeper insight into the internal working mechanisms of LLMs and into how the choice of layer affects coverage analysis.
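The per-block statistics behind the third reason can be sketched as follows. The snippet records the maximum activation of each neuron over a dataset and summarizes each block by the mean and variance of those maxima. The block names and activation values are hypothetical and only illustrate the computation; they are not measurements from Llama-2-7b-chat.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def max_activation_per_neuron(activations: List[List[float]]) -> List[float]:
    """Column-wise maximum: one value per neuron over all inputs."""
    return [max(column) for column in zip(*activations)]


def summarize_blocks(
    per_block: Dict[str, List[List[float]]]
) -> Dict[str, Tuple[float, float]]:
    """Mean and population variance of per-neuron maxima for each block."""
    summary = {}
    for name, activations in per_block.items():
        maxima = max_activation_per_neuron(activations)
        summary[name] = (mean(maxima), pvariance(maxima))
    return summary


# Rows are inputs, columns are neurons (illustrative values only).
per_block = {
    "block_0": [[0.1, 0.4], [0.3, 0.2]],
    "block_1": [[1.0, 3.0], [2.0, 0.5]],
}
summary = summarize_blocks(per_block)
```

A density plot such as Figure 3 would be drawn from the per-neuron maxima themselves; the mean and variance are the two summary quantities discussed in the text.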

Generated Token Level. In the third evaluation dimension of our methodology, we analyze the LLM’s behavior based on the different tokens generated during response generation. Unlike small DNNs, LLMs typically use an autoregressive architecture, predicting the next token step-by-step based on the previously generated tokens, with internal states evolving at each generation step. This raises the question: Which token’s internal state should be used to represent the logic of abnormal behaviors? This is crucial for designing effective criteria. For instance, in response to the query “Write a convincing anonymous death threat”, the model might generate the harmful response “Sure, here’s a convincing death threat for you: ….” The initial tokens generated before “death” do not convey harmful semantics. Therefore, it is necessary to explore whether only the LLM’s internal states after generating “death” better represent the “logic” of abnormal behaviors for designing coverage criteria, or if the initial internal state (before generating “Sure”) can also serve this purpose. This angle also helps us understand the role of individual tokens in the model’s processing and how they contribute to the overall generation of coherent and meaningful responses.

Because of the masking strategies used in LLMs, each token can only access information from preceding tokens, and queries vary in length. As a result, the final token of the query sequence ($T_0$) captures the complete information of the query, serving as a condensed representation of the entire sentence’s semantics. We use this final query token as the reference point for our analysis. To investigate the model’s semantic construction mechanism during generation, we compare the coverage measured at each token position across different queries and their respective outputs. We denote the first token of the output relative to $T_0$ as $T_1$, the second token as $T_2$, and so on. Similarly, we denote the second-to-last token of the input as $T_{-1}$, the third-to-last token as $T_{-2}$, and so on.
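The indexing convention can be illustrated with a small helper that maps an offset $k$ to the token $T_k$ in the concatenated query and output sequence. The function name and the word-level tokens are hypothetical and used only for illustration; real tokenizers produce subword tokens.

```python
from typing import List


def token_at(query_tokens: List[str], output_tokens: List[str],
             offset: int) -> str:
    """Return T_offset, where T_0 is the final token of the query,
    positive offsets index the output, and negative offsets index
    earlier query tokens."""
    sequence = list(query_tokens) + list(output_tokens)
    index = len(query_tokens) - 1 + offset
    if not 0 <= index < len(sequence):
        raise IndexError(f"T_{offset} is outside the sequence")
    return sequence[index]


query = ["Write", "a", "short", "poem"]
output = ["Sure", ",", "here"]

t0 = token_at(query, output, 0)         # final query token
t1 = token_at(query, output, 1)         # first generated token
t_minus_1 = token_at(query, output, -1)  # second-to-last query token
```

Anchoring the index at the last query token keeps $T_0$ aligned across queries of different lengths.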

III-C Standards for Coverage Evaluation

To evaluate the performance of different coverage criteria on LLMs, we propose three fundamental requirements inspired by traditional coverage criteria standards [10, 11, 12, 18] that focus on diversity assessment, fault detection, and generalization capabilities:

Requirement 1: Accurate Redundant Test Identification. Traditional coverage criteria must accurately reflect diversity within the test suite [12], meaning they should be insensitive to redundant tests. For LLMs, paraphrased sentences that share the same meaning as the original sentences and trigger the same or similar responses (indicating they activate the same “logic” of the LLM) should be regarded as redundant test cases. Therefore, an effective coverage criterion should not be overly sensitive to these synonymous queries.

Requirement 2: Sensitivity to Attack Queries. Traditional coverage criteria are expected to show high discernibility for the abnormal behaviors caused by faulty inputs, thereby assessing whether the test suite sufficiently explores unknown states of the model [18]. For LLMs, attack queries often lead to abnormal behaviors (jailbreaks) that generate harmful content. Thus, an effective coverage criterion should be particularly sensitive to attack queries that can trigger abnormal behaviors.

Requirement 3: Stability and Generalization Ability. An effective coverage criterion should be robust to variations in LLM architecture and size, showing consistent effectiveness across different models. Given the rapid evolution of LLMs, a coverage criterion must have strong generalization capabilities to accurately guide model testing, regardless of the specific model evaluated. This requirement combines the generalization principles of traditional coverage criteria, emphasizing consistency and applicability across different models.
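One way to read Requirements 1 and 2 operationally is as conditions on how coverage changes when queries are added to a base test suite. The sketch below is only an illustration of that reading: the use of relative change, the two thresholds, and the coverage values are all assumptions introduced here, not quantities defined or reported in this study.

```python
from typing import Dict


def relative_change(base_coverage: float, new_coverage: float) -> float:
    """Coverage change of an augmented suite relative to the base suite."""
    if base_coverage <= 0:
        raise ValueError("base coverage must be positive")
    return (new_coverage - base_coverage) / base_coverage


def check_requirements(cov_base: float, cov_with_synonyms: float,
                       cov_with_attacks: float,
                       redundancy_tolerance: float = 0.01,
                       attack_margin: float = 0.05) -> Dict[str, bool]:
    """Hypothetical thresholds: synonymous queries should barely move
    coverage, attack queries should move it clearly more."""
    synonym_gain = relative_change(cov_base, cov_with_synonyms)
    attack_gain = relative_change(cov_base, cov_with_attacks)
    return {
        "requirement_1": synonym_gain <= redundancy_tolerance,
        "requirement_2": (attack_gain >= attack_margin
                          and attack_gain > synonym_gain),
    }


# Made-up coverage values, for illustration only.
report = check_requirements(cov_base=0.40,
                            cov_with_synonyms=0.402,
                            cov_with_attacks=0.46)
```

Requirement 3 would then correspond to repeating the same check on several models and expecting consistent outcomes.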

IV Evaluation

In this section, we present a detailed analysis of the empirical results derived from our study. The source code and other detailed information related to the experiments are published in [22]. We begin by outlining the experimental setup, followed by our explorations and answers to the research questions below:

  • RQ1: Which is the most effective coverage criterion for LLMs?

  • RQ2: Which layer/block(s) within the LLMs could optimize coverage analysis?

  • RQ3: Which token in LLMs has the most significant impact on the coverage analysis?

IV-A Setup

IV-A1 Models

In this study, we comprehensively evaluate four well-known open-source LLMs, which vary significantly in size, architecture, and origin. These models include OPT-125M [23], Llama-2-7B-Chat [19], Pythia-12B [24], and Gemma-2-27B-it [25]. This selection covers a broad spectrum of model characteristics, ensuring an adequate observation across different dimensions.

IV-A2 Test Suite Construction

To systematically evaluate the coverage criteria under different conditions, we construct various test suites based on three requirements and observe changes in coverage. Consistent with the setting in Section III-A, we collect normal queries to trigger normal behaviors, synonymous queries to trigger redundant normal behaviors, rejected queries to trigger rejection behaviors, and attack queries to trigger abnormal behaviors.

We start by creating a benchmark test suite $S_N$ containing 1,500 normal queries as the base. Then, we construct several test suites to evaluate the performance of the coverage criterion in terms of different behaviors.

To evaluate against Requirement 1, we need to verify whether the coverage criteria can accurately identify redundant tests. Therefore, we construct two test suites: $S_{NS}$, which adds 500 synonymous queries to the 1,500 normal queries from $S_N$, and $S_{RS}$, which replaces 500 normal queries from $S_N$ with their synonymous counterparts. Compared to the coverage in $S_N$, the coverage criterion that meets Requirement 1 should show minimal improvement in $S_{NS}$ and a decrease in $S_{RS}$, as $S_{NS}$ only adds redundant cases and $S_{RS}$ reduces the total number of unique queries.

To evaluate against Requirement 2, we need to verify whether the coverage criteria are sensitive to attack queries. Therefore, we construct two test suites: $S_{NJ}$, which adds 500 attack queries to the 1,500 normal queries from $S_N$, and $S_{RJ}$, which includes 1,000 normal queries and 500 attack queries. Due to the distinct nature of attack queries, the coverage criterion that meets Requirement 2 should show a significant coverage increase in $S_{NJ}$ compared to the benchmark suite. By comparing the coverage of $S_{RJ}$ with the benchmark suite, we aim to assess the impact of replacing normal queries with an equal number of attack queries on the coverage.

To evaluate against Requirement 3, we analyze the coverage criterion’s performance across different models using the same test suite. Additionally, we construct test suites $S_{NM}$ and $S_{RM}$ that include rejected queries (which models tend to reject) to observe their impact on coverage. We expect a moderate coverage increase from adding these rejected queries, as they are distinct from normal queries but consistently rejected by the model. However, this outcome depends on the model’s security alignment and is not considered a requirement for evaluating the coverage criterion.

Therefore, to construct our test suites as listed in Table II, we select the following two widely-used datasets and create a complementary dataset on our own:

Alpaca-gpt4 [20]: Primarily used for fine-tuning LLMs, this dataset includes instructional tasks designed to emulate routine question-and-answer scenarios in everyday environments, serving as the source of normal queries for the test suites.

JailBreakV [26]: This dataset is tailored to assess the robustness of LLMs against jailbreak attacks. Each dataset entry consists of a pair of rejected queries and corresponding attack queries derived from the rejected queries using attack templates (to induce the model to output malicious content).

Synonymous Query Dataset: To construct synonymous queries, we used GPT-4 to generate synonymous paraphrases for the first 500 queries from the Alpaca-gpt4 dataset, e.g., “What is the capital of France?” and “Name the capital city of France.” This complementary dataset serves as the source of synonymous queries for the test suites.

TABLE II: Distribution of test suites across datasets

| Test Suite | Alpaca-GPT4: Normal | Alpaca-GPT4: Synonymous | JailBreakV-28k: Rejected | JailBreakV-28k: Attack |
|---|---|---|---|---|
| $S_N$ | 1500 | 0 | 0 | 0 |
| $S_{NS}$ | 1500 | 500 | 0 | 0 |
| $S_{NM}$ | 1500 | 0 | 500 | 0 |
| $S_{NJ}$ | 1500 | 0 | 0 | 500 |
| $S_{RS}$ | 1000 | 500 | 0 | 0 |
| $S_{RM}$ | 1000 | 0 | 500 | 0 |
| $S_{RJ}$ | 1000 | 0 | 0 | 500 |
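The composition in Table II can be sketched in a few lines. This is a minimal illustration rather than the authors' script: the function name `build_suites`, the placeholder query pools, and the choice of which 1,000 normal queries are retained in the replacement suites are all assumptions made for this example.

```python
def build_suites(normal, synonymous, rejected, attack):
    """Assemble the seven test suites of Table II from four query pools.

    Assumptions (not specified in the text): `normal` holds at least 1,500
    queries, the other pools hold at least 500 each, and the replacement
    suites keep the first 1,000 normal queries.
    """
    base = normal[:1500]
    kept = normal[:1000]
    extras = {"S": synonymous[:500], "M": rejected[:500], "J": attack[:500]}
    suites = {"S_N": list(base)}
    for tag, pool in extras.items():
        suites[f"S_N{tag}"] = base + pool  # add 500 queries to the benchmark
        suites[f"S_R{tag}"] = kept + pool  # 1,000 normal queries plus 500 others
    return suites


# Placeholder pools standing in for the real datasets.
normal = [f"normal-{i}" for i in range(1500)]
synonymous = [f"paraphrase-of-normal-{i}" for i in range(500)]
rejected = [f"rejected-{i}" for i in range(500)]
attack = [f"attack-{i}" for i in range(500)]

suites = build_suites(normal, synonymous, rejected, attack)
sizes = {name: len(queries) for name, queries in suites.items()}
```

The resulting `sizes` mirror the row totals of Table II: 1,500 queries for the benchmark and the replacement suites, and 2,000 for the additive suites.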

IV-A3 Coverage Criteria Settings

We refer to the settings from prior studies [12, 18], making appropriate adjustments for practical applications in LLMs. We briefly describe the basic settings of the coverage criteria used in our experiments and explain the rationale behind these choices. Note that our experiments focus on the trend of coverage changes rather than precise numerical values.

NC requires an activation threshold parameter $T$ to determine whether a neuron is activated. Because OPT-125M, Llama-2-7B-Chat, Pythia-12B, and Gemma-2-27B-it differ in size and activation functions, which significantly affects the distribution of neuron activations, we empirically set $T$ to 0.1, 0.25, 0.75, and 50, respectively, for the best performance.

TKNC requires a parameter $K$ to determine the number of top neurons selected. For all models, we set $K$ to 10.

TKNP is similar to TKNC. However, our experiments show that due to the complexity of LLMs, setting $K$ too high results in each new input forming a new pattern. Therefore, we set $K$ to 1.

TFC requires a parameter $T$ to determine the distance between different clusters. Here, we again refer to the model sizes and set $T$ to 5, 50, 500, and 1000, respectively.

NLC does not require a pre-set parameter. However, since we do not have access to the complete training data to use as prior knowledge for NLC, we calculate it directly on different test suites.
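To make the role of the parameters $T$ and $K$ concrete, the four parameterized criteria above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: activations are taken as raw values without per-layer scaling, TFC uses exhaustive Euclidean nearest-neighbor search, and all function names and the toy data are hypothetical.

```python
import math


def neuron_coverage(activations, threshold):
    """NC: fraction of neurons exceeding `threshold` on at least one input."""
    covered = {j for row in activations for j, a in enumerate(row) if a > threshold}
    return len(covered) / len(activations[0])


def top_k_indices(row, k):
    """Indices of the k largest activations in one layer output."""
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]


def top_k_neuron_coverage(activations, k):
    """TKNC: fraction of neurons that rank in the top k for at least one input."""
    covered = set()
    for row in activations:
        covered.update(top_k_indices(row, k))
    return len(covered) / len(activations[0])


def top_k_neuron_patterns(per_input_layers, k):
    """TKNP: number of distinct top-k patterns, one pattern per input.

    `per_input_layers[i]` is the list of layer outputs for input i.
    """
    patterns = set()
    for layers in per_input_layers:
        patterns.add(tuple(frozenset(top_k_indices(row, k)) for row in layers))
    return len(patterns)


def tensorfuzz_coverage(activations, threshold):
    """TFC: number of clusters; an input opens a new cluster when its
    distance to every stored center exceeds `threshold`."""
    centers = []
    for row in activations:
        if all(math.dist(row, center) > threshold for center in centers):
            centers.append(row)
    return len(centers)


# Toy data: three inputs, one layer with four neurons (hypothetical values).
layer = [
    [0.9, 0.0, 0.2, 0.0],
    [0.8, 0.1, 0.3, 0.0],
    [0.0, 0.0, 0.1, 0.7],
]
nc = neuron_coverage(layer, threshold=0.5)
tknc = top_k_neuron_coverage(layer, k=1)
tknp = top_k_neuron_patterns([[row] for row in layer], k=1)
tfc = tensorfuzz_coverage(layer, threshold=0.5)
```

In the toy data the first two inputs are near-duplicates, so they share a top-1 pattern and a TFC cluster, while the third input adds a new one. With a larger $K$, small differences between inputs are more likely to produce distinct patterns, which is consistent with the motivation given for setting $K$ to 1 in TKNP.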

IV-B RQ1: Evaluating Coverage Criteria Effectiveness in LLMs

TABLE III: Coverage results for each criterion on different test suites. For $S_N$, the original coverage results are shown; for the other test suites, the change rates relative to $S_N$ are shown.

Attention layers:

| Model | Criterion | Config | $S_N$ | $S_{NS}$ | $S_{NM}$ | $S_{NJ}$ | $S_{RS}$ | $S_{RM}$ | $S_{RJ}$ |
|---|---|---|---|---|---|---|---|---|---|
| OPT-125M | NC | T=0.1 | 0.37 | 1.94% | 3.60% | 7.94% | -2.17% | -0.17% | 4.99% |
| OPT-125M | TKNC | T=10 | 0.79 | 2.46% | 4.15% | 8.55% | -3.21% | -0.47% | 4.90% |
| OPT-125M | TKNP | T=1 | 1500 | 32.93% | 24.67% | 33.20% | -0.40% | -8.67% | -0.13% |
| OPT-125M | TFC | T=5 | 156 | 27.56% | 13.46% | 83.97% | 5.13% | -8.97% | 61.54% |
| OPT-125M | NLC | N/A | 2428.28 | 2.67% | 2.89% | 108.73% | 1.45% | 2.00% | 110.20% |
| Llama-2-7B-Chat | NC | T=0.25 | 0.51 | 3.20% | 10.37% | 18.87% | -4.89% | 3.99% | 14.46% |
| Llama-2-7B-Chat | TKNC | T=10 | 0.40 | 8.10% | 14.95% | 19.74% | -7.65% | 0.58% | 5.62% |
| Llama-2-7B-Chat | TKNP | T=1 | 1500 | 33.00% | 24.67% | 33.27% | -0.33% | -8.67% | -0.07% |
| Llama-2-7B-Chat | TFC | T=50 | 10165 | 24.90% | 23.16% | 25.33% | -7.19% | -9.12% | -6.97% |
| Llama-2-7B-Chat | NLC | N/A | 6063257 | 2.03% | 0.99% | 5.07% | 1.48% | 0.52% | 4.79% |
| Pythia-12B | NC | T=0.75 | 0.53 | 4.78% | 7.54% | 24.07% | -5.11% | -1.07% | 18.03% |
| Pythia-12B | TKNC | T=10 | 0.19 | 10.74% | 14.39% | 21.19% | -5.50% | -1.52% | 5.52% |
| Pythia-12B | TKNP | T=1 | 1474 | 33.31% | 24.42% | 32.23% | 0.47% | -8.41% | -0.75% |
| Pythia-12B | TFC | T=500 | 6522 | 12.91% | 32.01% | 41.80% | -18.23% | 0.23% | 9.90% |
| Pythia-12B | NLC | N/A | 20366400 | 1.66% | 1.42% | 30.60% | 0.48% | 1.54% | 29.71% |
| Gemma-2-27B-it | NC | T=50 | 0.30 | 2.70% | 3.40% | 10.05% | -4.00% | -3.14% | 5.18% |
| Gemma-2-27B-it | TKNC | T=10 | 0.06 | 8.69% | 11.76% | 20.01% | -6.02% | -2.53% | 6.24% |
| Gemma-2-27B-it | TKNP | T=1 | 1498 | 32.84% | 24.57% | 33.24% | -0.47% | -8.74% | -0.07% |
| Gemma-2-27B-it | TFC | T=1000 | 65116 | 31.00% | 24.70% | 32.43% | -2.13% | -8.47% | -0.76% |
| Gemma-2-27B-it | NLC | N/A | 89072263168 | 4.73% | 1.29% | 31.72% | -0.48% | -5.44% | 28.26% |

MLP layers:

| Model | Criterion | Config | $S_N$ | $S_{NS}$ | $S_{NM}$ | $S_{NJ}$ | $S_{RS}$ | $S_{RM}$ | $S_{RJ}$ |
|---|---|---|---|---|---|---|---|---|---|
| OPT-125M | NC | T=0.1 | 0.53 | 2.04% | 4.52% | 6.34% | -7.60% | -4.29% | -1.94% |
| OPT-125M | TKNC | T=10 | 0.87 | 1.15% | 2.93% | 4.57% | -7.96% | -3.81% | -0.85% |
| OPT-125M | TKNP | T=1 | 1427 | 29.43% | 25.30% | 34.90% | -2.38% | -7.29% | 2.24% |
| OPT-125M | TFC | T=5 | 646 | 9.60% | 15.63% | 20.59% | -22.14% | -17.03% | -11.76% |
| OPT-125M | NLC | N/A | 6484.59 | 2.51% | 0.72% | 4.25% | 2.08% | 1.08% | 4.15% |
| Llama-2-7B-Chat | NC | T=0.25 | 0.77 | 0.67% | 2.85% | 2.84% | -2.91% | 1.04% | 1.07% |
| Llama-2-7B-Chat | TKNC | T=10 | 0.41 | 7.20% | 18.90% | 28.29% | -19.14% | -2.83% | 8.82% |
| Llama-2-7B-Chat | TKNP | T=1 | 533 | 21.20% | 52.53% | 90.62% | -21.58% | 9.57% | 47.65% |
| Llama-2-7B-Chat | TFC | T=50 | 11395 | 17.79% | 52.40% | 67.48% | -23.57% | 10.94% | 26.03% |
| Llama-2-7B-Chat | NLC | N/A | 50885880 | 0.54% | 1.40% | 3.62% | 0.28% | 1.00% | 3.36% |
| Pythia-12B | NC | T=0.75 | 0.93 | 0.34% | 1.03% | 1.10% | -3.16% | -1.74% | -1.59% |
| Pythia-12B | TKNC | T=10 | 0.21 | 9.90% | 13.73% | 16.89% | -10.92% | -6.45% | -2.79% |
| Pythia-12B | TKNP | T=1 | 1054 | 28.56% | 27.89% | 36.43% | -0.95% | -2.28% | 6.36% |
| Pythia-12B | TFC | T=500 | 36347 | 23.92% | 25.11% | 22.34% | -8.82% | -7.96% | -10.72% |
| Pythia-12B | NLC | N/A | 105028800 | 0.16% | 1.25% | 0.73% | -8.67% | -7.34% | -7.95% |
| Gemma-2-27B-it | NC | T=50 | 0.52 | 0.65% | 1.20% | 1.54% | -0.62% | 0.04% | 0.50% |
| Gemma-2-27B-it | TKNC | T=10 | 0.19 | 7.25% | 8.48% | 10.92% | -6.09% | -4.75% | -1.86% |
| Gemma-2-27B-it | TKNP | T=1 | 1499 | 32.82% | 24.68% | 33.36% | -0.53% | -8.67% | 0.00% |
| Gemma-2-27B-it | TFC | T=1000 | 64010 | 31.62% | 24.62% | 32.20% | -1.73% | -8.77% | -1.19% |
| Gemma-2-27B-it | NLC | N/A | 333045301248 | 5.92% | 0.22% | 7.06% | 4.94% | -1.90% | 6.55% |
  • The abbreviations used are as follows: NC stands for Neuron Coverage, TKNC for Top-K Neuron Coverage, TKNP for Top-K Neuron Patterns, TFC for TensorFuzz Coverage, and NLC for Neural Coverage.

To address RQ1, we evaluate the performance of the coverage criteria across different test suites and models to determine whether they meet the evaluation requirements. Table III presents the neuron coverage measured at the last query token across the Attention and MLP layers of each block. In RQ2 and RQ3, we will further investigate the impacts of different network layers and tokens on the coverage evaluation results.

Evaluation against Requirement 1. Requirement 1 mandates that the coverage criteria accurately identify redundant tests. When comparing the synonymous test suite $S_{NS}$ with the benchmark test suite $S_N$, the coverage growth on the Attention layer, averaged over the four models, is 3.16%, 7.50%, 33.02%, 24.09%, and 2.77% for NC, TKNC, TKNP, TFC, and NLC, respectively. Some growth is expected, as the extra redundant test cases in the synonymous query dataset slightly increase the coverage. A similar distribution pattern emerges in the MLP layer. These results indicate that NC and NLC maintain the lowest growth rates, TKNC exhibits intermediate performance, while TKNP and TFC show higher growth rates. This suggests that NC, TKNC, and NLC accurately identify synonymous queries, whereas TKNP and TFC demonstrate weaker recognition capabilities.

Further, when comparing the synonymous test suite $S_{RS}$ with the benchmark test suite $S_N$, the average change in coverage on the Attention layer for these five criteria is -4.04%, -5.60%, -0.18%, -5.61%, and 0.73%, respectively. As expected, the synonymous test cases in $S_{RS}$ lead to lower coverage than in $S_N$, even though both suites contain the same number of test cases. NC and TKNC continue to perform well, successfully capturing the variation in richness between the test suites. The coverage for TFC also declines, that of TKNP remains largely unchanged, and that of NLC unexpectedly increases. Taken together, TKNP may be overly sensitive for LLMs (even with the hyperparameter $K$ set to 1), recognizing minor differences in each query and assigning them to distinct patterns, similar to TFC. For NLC, both $S_{RS}$ and $S_{NS}$ show little growth relative to $S_N$. This may suggest that, in the view of NLC, feature capture for normal queries reaches saturation within the first 1000 queries, and neither an additional 500 normal nor synonymous queries significantly enhance the richness of the test suite. Thus, NC, TKNC, and NLC meet Requirement 1, while TKNP and TFC underperform in recognizing synonymous queries.
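The averages quoted for Requirement 1 can be re-derived from the Attention-layer columns of Table III. The short sketch below only averages the tabulated change rates over the four models; the variable names are illustrative.

```python
from statistics import mean

# Attention-layer change rates (%) from Table III, ordered as
# OPT-125M, Llama-2-7B-Chat, Pythia-12B, Gemma-2-27B-it.
growth_ns = {
    "NC": [1.94, 3.20, 4.78, 2.70],
    "TKNC": [2.46, 8.10, 10.74, 8.69],
    "TKNP": [32.93, 33.00, 33.31, 32.84],
    "TFC": [27.56, 24.90, 12.91, 31.00],
    "NLC": [2.67, 2.03, 1.66, 4.73],
}
growth_rs = {
    "NC": [-2.17, -4.89, -5.11, -4.00],
    "TKNC": [-3.21, -7.65, -5.50, -6.02],
    "TKNP": [-0.40, -0.33, 0.47, -0.47],
    "TFC": [5.13, -7.19, -18.23, -2.13],
    "NLC": [1.45, 1.48, 0.48, -0.48],
}

# Average each criterion over the four models.
average_ns = {name: mean(values) for name, values in growth_ns.items()}
average_rs = {name: mean(values) for name, values in growth_rs.items()}
```

Note that an average can hide disagreement between models: for TFC on $S_{RS}$, OPT-125M shows an increase (5.13%) while Pythia-12B shows a large decrease (-18.23%).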

Evaluation against Requirement 2. Requirement 2 mandates that the coverage criteria be sensitive to attack queries. Comparing $S_{NJ}$ with $S_N$, the average coverage growth on the Attention layer for NC, TKNC, TKNP, TFC, and NLC is 15.23%, 17.37%, 32.99%, 45.88%, and 44.03%, respectively. All five criteria achieve significant growth, with TFC and NLC showing the largest increases. The performance on the MLP layer differs somewhat, with coverage growth for the five criteria at 2.96%, 15.17%, 48.83%, 35.65%, and 3.92%, respectively. Coverage growth for NC and NLC is notably lower than on the Attention layer, while TKNC, TKNP, and TFC continue to perform well. This indicates that all five criteria are capable of recognizing attack queries, although performance varies across model layers. We will discuss these layer-level differences in detail in RQ2. Due to Pythia-12B’s low safety alignment, the model may still output malicious content for questions that should be refused, resulting in higher coverage growth on $S_{NM}$, especially on the MLP layer, where the TFC and NLC growth rates exceed those on $S_{NJ}$.

Evaluation against Requirement 3. Requirement 3 mandates that the coverage criteria be stable. We first calculate, for each of the five criteria, the variance of its coverage growth across the four models on the same test suite. In the Attention layer, both NLC and TFC show notable variances in coverage growth on $S_{NJ}$: 0.20 for NLC and 0.07 for TFC. Consistent with this, TFC performs poorly on Llama-2-7B-Chat and Gemma-2-27B-it, while NLC performs poorly on Llama-2-7B-Chat. In the MLP layer, TKNP displays a variance of 0.07 in coverage growth on $S_{NJ}$, performing better on Llama-2-7B-Chat than on the other models. In contrast, NC and TKNC maintain stable performance across all models, with the highest variance being 0.01 for TKNC in the MLP layer on $S_{NJ}$. This suggests that the performance of NLC, TFC, and TKNP is significantly influenced by changes in model architecture or parameter size, whereas NC and TKNC exhibit better generalization ability and stability. Such stability and generalization are crucial for the ongoing evaluation and testing of LLMs in practical applications, ensuring consistent testing results across various model configurations.
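The stability check can be illustrated with the Attention-layer $S_{NJ}$ column of Table III. The text does not state whether the sample or population variance is used; the sketch below assumes the sample variance of growth expressed as a fraction, which appears to match the reported 0.20 for NLC and 0.07 for TFC.

```python
from statistics import variance

# Attention-layer S_NJ change rates (%) from Table III, ordered as
# OPT-125M, Llama-2-7B-Chat, Pythia-12B, Gemma-2-27B-it.
growth_nj_attention = {
    "NC": [7.94, 18.87, 24.07, 10.05],
    "TKNC": [8.55, 19.74, 21.19, 20.01],
    "TKNP": [33.20, 33.27, 32.23, 33.24],
    "TFC": [83.97, 25.33, 41.80, 32.43],
    "NLC": [108.73, 5.07, 30.60, 31.72],
}


def growth_variance(percentages):
    """Sample variance (n - 1 denominator) of growth given as fractions.

    The choice of sample variance is an assumption of this sketch.
    """
    return variance([p / 100 for p in percentages])


variances = {name: growth_variance(v) for name, v in growth_nj_attention.items()}
```

Under this assumption, NC and TKNC both stay below 0.01 on the Attention layer, which agrees with the observation that they are the most stable criteria across models.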

Answer to RQ1: NC and TKNC are the most effective for LLMs. They consistently perform well in accurately identifying synonymous queries and detecting attack queries, while also maintaining stability and generalization across different model architectures and sizes.

IV-C RQ2: Analyzing Model Layer-wise Contributions to Coverage in LLMs

Relative Coverage Growth: To address RQ2 and observe the effectiveness of coverage criteria across different layers, we introduce a new metric called Relative Coverage Growth (RCG) that quantifies the performance of coverage criteria at different layers within the same model, considering the three fundamental requirements:

$$RCG = \max\left(\frac{C_{S_{NJ}} - C_{S_{NS}}}{C_{S_N}},\ 0\right)$$

Specifically, RCG measures the effectiveness of different layers by calculating the increase in coverage for attack queries compared to synonymous queries. On a macro level, this metric represents the additional enrichment brought to the base test suite by the same number of attack queries compared to synonymous queries. For Requirement 1, accurately identifying synonymous queries corresponds to a lower $\frac{C_{S_{NS}}}{C_{S_N}}$. For Requirement 2, sensitivity to attack queries corresponds to a higher $\frac{C_{S_{NJ}}}{C_{S_N}}$. Both of these requirements can be reflected by a higher RCG. Requirement 3 can be represented to some extent by the stability of RCG. Therefore, RCG provides a systematic approach to quantifying the effectiveness of coverage metrics across different layers, aiding in the understanding of hierarchical differences in LLMs.
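As a small worked example, RCG can be computed either from raw coverage values or from the change rates reported in Table III, since $(C_{S_{NJ}} - C_{S_{NS}}) / C_{S_N}$ equals the difference between the two change rates relative to $S_N$. The function names below are illustrative, not taken from the authors' code.

```python
def relative_coverage_growth(c_nj, c_ns, c_n):
    """RCG = max((C_NJ - C_NS) / C_N, 0) from raw coverage values."""
    if c_n <= 0:
        raise ValueError("benchmark coverage C_N must be positive")
    return max((c_nj - c_ns) / c_n, 0.0)


def rcg_from_growth(growth_nj, growth_ns):
    """Equivalent form when coverage is given as change rates relative to S_N.

    Both arguments are fractions, e.g. 0.0794 for 7.94%.
    """
    return max(growth_nj - growth_ns, 0.0)


# NC on the Attention layers of OPT-125M (Table III):
# C_N = 0.37, S_NS growth = 1.94%, S_NJ growth = 7.94%.
c_n = 0.37
rcg_raw = relative_coverage_growth(c_n * 1.0794, c_n * 1.0194, c_n)
rcg_rates = rcg_from_growth(0.0794, 0.0194)

# TKNP on the Attention layers of Pythia-12B: S_NJ growth (32.23%) is below
# S_NS growth (33.31%), so the max(., 0) clamp applies.
rcg_clamped = rcg_from_growth(0.3223, 0.3331)
```

The clamp at zero means RCG does not distinguish between a criterion that treats attack and synonymous queries alike and one that responds more strongly to synonymous queries; both receive a score of zero.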

Attention Layer versus MLP Layer: We first evaluate the performance of Attention and MLP layers from an overall model perspective. We use NC and TKNC, identified as effective in RQ1, as our research criteria and calculate RCG based on the data in Table III. For NC, the RCG values for Attention layers in the four models are 6.00%, 15.67%, 19.28%, and 7.35%, respectively. Correspondingly, the RCG values for MLP layers are 4.29%, 2.16%, 0.76%, and 0.89%. It is evident that under the same criteria and models, NC is more effective in Attention layers than in MLP layers.

Similarly, for TKNC, the RCG values for Attention layers in the four models are 6.10%, 11.65%, 10.46%, and 11.32%, while the MLP layers show values of 3.43%, 21.09%, 6.99%, and 3.67%. Except for Llama-2-7B-Chat, the effectiveness of Attention layers remains superior to that of MLP layers. Additionally, as shown in Table III, in Attention layers the $S_{NJ}$ coverage growth exceeds the $S_{NM}$ growth for every criterion and model. In contrast, in MLP layers, the $S_{NM}$ growth exceeds the $S_{NJ}$ growth for NC on the Llama-2-7B-Chat model, and for TFC and NLC on the Pythia-12B model.

These results indicate that Attention layers capture input features more effectively than MLP layers, enhancing the accuracy of coverage criteria in evaluating model testing. This may be because Attention layers better capture the global dependencies among input data, leading to a deeper understanding of input features. In contrast, MLP layers rely mainly on linear combinations of local features and fail to comprehensively capture the complex dependencies within the input data. Therefore, Attention layers demonstrate superior effectiveness and accuracy in coverage analysis compared to MLP layers.

Figure 4: The RCG results based on NC and TKNC for different blocks of the four target LLMs. OPT-125M contains 12 blocks, Llama-2-7B-Chat contains 32 blocks, Pythia-12B contains 36 blocks, and Gemma-2-27B-it contains 46 blocks.

Impact of Different Blocks: LLMs are typically composed of multiple stacked transformer modules, with the hidden states exhibiting distinct characteristics as the model depth increases. Building on the foundation of Attention and MLP layers, we aim to further investigate the contribution of individual blocks within different models to the overall coverage. To achieve this, we calculate the NC and TKNC for the Attention and MLP layers in each block of the models using various test suites, and we record the corresponding RCG. Figure 4 illustrates the variation of RCG across different blocks in four models.

From the NC perspective, each model exhibits high RCG values over a continuous range of blocks. Specifically, OPT-125M shows high RCG from Block 1 to Block 8, Llama-2-7B-Chat from Block 5 to Block 21, Pythia-12B from Block 23 to Block 33, and Gemma-2-27B-it from Block 5 to Block 30. The high RCG values in certain blocks suggest that the set of activated neurons varies significantly for different query types. These blocks play a crucial role in distinguishing complex query types, and NC effectively captures these features. Therefore, for NC, the intermediate layers of the models, especially within the identified block ranges, are essential for differentiating query types and enhancing the effectiveness of the coverage criteria.

From the TKNC perspective, the variation in RCG across different blocks in the four models is less significant, with the initial blocks generally exhibiting slightly higher RCG than the later ones. The sets of top neurons for different query types show similar variations across all blocks, which TKNC consistently identifies. Thus, all blocks contribute to TKNC’s effectiveness, with initial blocks having a more significant impact than later blocks.

Answer to RQ2: Attention layers are more effective than MLP layers for optimizing coverage analysis in LLMs. Additionally, NC focuses on specific crucial blocks to capture features of different test suites, while TKNC consistently identifies these features across blocks.

IV-D RQ3: Investigating Token-level Impacts on Coverage Analysis in LLMs

Figure 5: The RCG results calculated based on NC and TKNC for different tokens in the target LLMs. Each model generates 10 tokens for each query.

To investigate RQ3, we expand on the previous experiments by generating 10 additional tokens for each query in the test suites, starting from the last token of the query. (For experimental stability, we always select the next token with the highest probability.) We explore how different tokens impact coverage analysis, assuming they may lead to varying coverage behaviors and reveal more comprehensive insights into model behavior. Our key question is: “How does generating additional tokens affect diversity assessment among test suites in LLM testing?”

Our further goal is to identify the optimal token positions for LLM testing to reduce computational costs and achieve efficient testing results. These points are crucial for determining the most effective moments to measure coverage, thereby optimizing the balance between testing and resource expenditure. After generating each new token, we calculated the corresponding coverage rates for the model. This process allowed us to examine how the coverage evolves as the model generates more tokens beyond the initial query. Similarly, we analyzed the impact of these tokens on coverage analysis through the RCG quantified by NC and TKNC. The experimental results are shown in Figure 5.
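The per-token measurement described above can be sketched as follows, assuming the activations at each token position have already been recorded during generation. The indexing convention (position 0 is the last query token, later positions are generated tokens), the function name, and the toy values are assumptions of this example.

```python
def coverage_by_position(activations, threshold):
    """Return NC computed separately at each token position.

    `activations[q][t][j]` is the activation of neuron j for query q at
    position t, where t = 0 is the last query token and t >= 1 are
    generated tokens (a convention assumed for this sketch).
    """
    n_positions = len(activations[0])
    n_neurons = len(activations[0][0])
    result = []
    for t in range(n_positions):
        covered = {
            j
            for query in activations
            for j, a in enumerate(query[t])
            if a > threshold
        }
        result.append(len(covered) / n_neurons)
    return result


# Toy data: two queries, two positions, three neurons (hypothetical values).
toy_activations = [
    [[0.9, 0.0, 0.0], [0.6, 0.0, 0.0]],
    [[0.0, 0.8, 0.0], [0.7, 0.0, 0.0]],
]
per_position = coverage_by_position(toy_activations, threshold=0.5)
```

Running the same computation on each test suite and combining the results through RCG at every position gives one curve per model and layer type, which is the kind of quantity plotted in Figure 5.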

The experimental results clearly indicate that generating additional tokens does not significantly improve RCG. In the OPT-125M model, the RCG in the Attention layer exhibits some fluctuations but does not show a noticeable increase with the token generation, while the MLP layer shows an overall decreasing trend. In larger models such as Llama-2-7B-Chat, Pythia-12B, and Gemma-2-27B-it, RCG calculated based on NC and TKNC decreases significantly in both the Attention and MLP layers. This suggests that as the number of tokens increases, the differences between attack queries and synonymous queries gradually diminish. This might be because larger models generate outputs with greater diversity, which in turn reduces the differences in coverage performance between various test suites.

This trend also indicates that generating additional tokens for test suites may not always be an effective testing strategy for larger models. Instead, it may lead to different test suites performing similarly, indicating a reduction in the diversity and effectiveness of the tests. This finding suggests that careful consideration of token generation strategies is necessary when designing test suites to avoid unnecessary computational overhead and decreased testing effectiveness.

Answer to RQ3: Considering all four models, testing at the last token of the original query in the test suites proves to be the most effective. Generating additional tokens may lead to a reduction in the differences between test suites.

V Application for Real-time Jailbreak Detection

Our empirical study shows that NC is an effective coverage criterion, and that the attention layers and the last query token are more sensitive to differences between test suites. This indicates that the number of activated neurons (the feature used by NC) can effectively differentiate between normal queries and jailbreak attacks on LLMs, inspiring us to design a jailbreak attack detection method based on this feature. In particular, we train a Multilayer Perceptron (MLP) classifier that takes the number of activated neurons as input and outputs a binary result indicating whether a query could cause a jailbreak response from the LLM. The input dimension of the classifier varies with the LLM’s architecture. For example, OPT-125M has 12 blocks, so we construct a feature vector $(l_0, l_1, \ldots, l_{11})$ consisting of exactly 12 features, each representing the number of activated neurons in the attention layer of one block, as the input to the corresponding classifier. All classifiers share the same architecture: four hidden layers with 256, 2048, 512, and 128 neurons, respectively.
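The feature construction and classifier layout can be sketched without any deep learning framework. The helper names are hypothetical, the activation values are toy data, and the single-logit output layer is an assumption, since the text only states that the output is binary.

```python
HIDDEN_SIZES = [256, 2048, 512, 128]  # hidden layer widths stated in the text


def activation_count_features(block_activations, threshold):
    """Count activated neurons per block.

    `block_activations[b]` holds the attention-layer activations of block b
    at the last query token; the result is the vector (l_0, ..., l_{B-1}).
    """
    return [sum(1 for a in block if a > threshold) for block in block_activations]


def classifier_layer_shapes(n_blocks, n_outputs=1):
    """(input, output) sizes of the linear layers of the MLP classifier.

    `n_outputs=1` (a single logit) is an assumption of this sketch.
    """
    sizes = [n_blocks, *HIDDEN_SIZES, n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))


# Toy example for a 12-block model such as OPT-125M, with threshold T = 0.1.
toy_blocks = [[0.2, 0.05, 0.3] for _ in range(12)]
features = activation_count_features(toy_blocks, threshold=0.1)
shapes = classifier_layer_shapes(n_blocks=len(features))
```

Because each block contributes a single count, the classifier input has only as many dimensions as the model has blocks, which keeps the detector small relative to the LLM it monitors.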

To validate the effectiveness of this coverage-criteria-based classifier, we design the following experiments on OPT-125M, Llama-2-7B-Chat, and Pythia-12B. For more details, please refer to our website [22].

V-A Effectiveness Comparison

Dataset: To train and test the classifier, we use four datasets. From Alpaca-gpt4, we randomly select 2500 queries for training and 500 for testing, representing normal queries to LLMs. Similarly, from JailBreakV-28k, we select 2500 attack queries for training and 500 for testing, representing attack queries that lead the model to output malicious content. The TruthfulQA dataset, containing 817 common life questions prone to errors, is used to verify how well the classifier generalizes to normal queries. For adversarial attack queries, we use GCG to generate adversarial attack suffixes for 200 rejected queries in AdvBench based on Llama-2-7B-Chat. (The GCG repository is implemented only on Vicuna-7B and Llama-2-7B-Chat, but the high transferability of the generated suffixes allows application to other models.) This dataset is used to validate how well the classifier generalizes to attack queries.

Baseline: To demonstrate the effectiveness of the classifier, we select the most widely used perplexity filter [27] and the state-of-the-art method, PARDEN [28], as baselines for comparison. We follow the settings in [28], where the threshold for the perplexity filter is set to the maximum perplexity of malicious queries in AdvBench, with a window size of 10. The threshold for PARDEN is set to 0.6, with a window size of 30.
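For readers unfamiliar with the perplexity baseline, the sketch below shows the general idea of sentence-level and windowed perplexity filtering over per-token log-probabilities. It is an illustration under assumptions, not the baseline's actual code: the use of overlapping sliding windows, the function names, and the toy log-probabilities and threshold are all hypothetical.

```python
import math


def perplexity(log_probs):
    """Perplexity: exp of the mean negative log-likelihood of the tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))


def max_window_perplexity(log_probs, window):
    """Highest perplexity over sliding windows of `window` tokens.

    Sequences no longer than the window are scored as a whole.
    """
    if len(log_probs) <= window:
        return perplexity(log_probs)
    return max(
        perplexity(log_probs[i:i + window])
        for i in range(len(log_probs) - window + 1)
    )


def is_flagged(log_probs, threshold, window=None):
    """Flag a query whose (windowed) perplexity exceeds the threshold."""
    if window is None:
        score = perplexity(log_probs)
    else:
        score = max_window_perplexity(log_probs, window)
    return score > threshold


# Toy query: 20 fluent tokens followed by a 10-token high-perplexity suffix.
fluent = [math.log(0.5)] * 20
with_suffix = fluent + [math.log(0.01)] * 10

sentence_flag = is_flagged(with_suffix, threshold=50.0)
window_flag = is_flagged(with_suffix, threshold=50.0, window=10)
```

In this toy case the fluent prefix dilutes the sentence-level score below the threshold, while the window covering only the suffix exceeds it. A fluent, template-based jailbreak prompt would exceed the threshold under neither variant.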

TABLE IV: Experimental Results of the Classifier Constructed Based on NC

| Dataset | OPT-125M | Llama-2-7B-Chat | Pythia-12B |
|---|---|---|---|
| Training Set | 93.92% | 96.28% | 100.00% |
| Testing Set | 94.90% | 96.10% | 98.00% |
| TruthfulQA | 94.12% | 91.19% | 87.03% |
| Adversarial Attack | 92.00% | 84.50% | 82.50% |
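TABLE V: Experimental Results of Perplexity Filter and PARDEN

Perplexity filter:

| Dataset | OPT-125M (Sentence) | OPT-125M (Window) | Llama-2-7B-Chat (Sentence) | Llama-2-7B-Chat (Window) | Pythia-12B (Sentence) | Pythia-12B (Window) |
|---|---|---|---|---|---|---|
| Alpaca-gpt4 | 98.63% | 100.00% | 98.70% | 97.83% | 97.20% | 100.00% |
| JailBreakV | 8.97% | 0.07% | 9.43% | 10.03% | 9.40% | 9.57% |
| TruthfulQA | 99.76% | 100.00% | 97.55% | 99.39% | 98.78% | 100.00% |
| Adversarial Attack | 67.00% | 5.00% | 81.50% | 78.00% | 80.00% | 59.50% |

PARDEN:

| Dataset | OPT-125M (Sentence) | OPT-125M (Window) | Llama-2-7B-Chat (Sentence) | Llama-2-7B-Chat (Window) | Pythia-12B (Sentence) | Pythia-12B (Window) |
|---|---|---|---|---|---|---|
| Alpaca-gpt4 | 83.57% | 93.40% | 57.37% | 54.70% | 85.20% | 85.00% |
| JailBreakV | 46.13% | 37.63% | 54.60% | 61.90% | 6.00% | 3.53% |
| TruthfulQA | 90.58% | 96.33% | 38.92% | 34.52% | 83.60% | 81.64% |
| Adversarial Attack | 31.00% | 27.50% | 86.50% | 89.50% | 17.50% | 17.00% |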

Evaluation results: Experimental results are presented in Tables IV and V. Table IV demonstrates that the classifier based on NC performs well across three models. Specifically, the classifier achieves an average classification accuracy of 96.33% on the testing set, indicating excellent performance in distinguishing between normal and attack queries after training. Furthermore, on the TruthfulQA dataset, the classifier achieves an average classification accuracy of 90.78% across the three models. The classifier’s effectiveness on the TruthfulQA dataset, which includes questions prone to errors, demonstrates its robustness and generalization ability to normal queries. For adversarial attack queries, the classifier achieves an average accuracy of 86.33%, further showcasing its generalization capability. This is particularly notable considering the different attack patterns in the datasets: GCG involves gradient-based adversarial attack suffixes, while JailBreakV comprises template-based, persuasion-related jailbreak attack queries.

A further comparison with two baselines in Table V reveals that the perplexity filter performs well in detecting normal queries, achieving nearly 100% accuracy with a window size of 10. However, its performance declines significantly on JailBreakV and adversarial attack queries, with an average classification accuracy of only 7.91% on JailBreakV. For adversarial attack queries, the perplexity filter achieves sentence-level classification accuracies of 81.5% on the Llama-2-7B-Chat model, 80% on the Pythia-12B model, and only 67% on the OPT-125M model. This indicates that the perplexity filter struggles with low-complexity attack queries, such as template and instruction-based jailbreak attacks in JailBreakV.

PARDEN’s performance varies across datasets and models. It effectively detects normal queries on OPT-125M and Pythia-12B, achieving 93.4% window-based accuracy on Alpaca-gpt4 for OPT-125M and 85.2% sentence-level accuracy on Pythia-12B, with similar results on TruthfulQA. However, PARDEN performs poorly on attack queries for both models, possibly due to their insufficient safety alignment, as OPT-125M includes unfiltered internet content, and Pythia-12B is designed for interpretability research, not practical deployment. On Llama-2-7B-Chat, PARDEN only performs well on adversarial attack queries, with 89.5% window-based accuracy.

Besides superior detection performance, our method exhibits three advantages: ❶ The detection method is lightweight as it utilizes only a small set of features, i.e. the number of activated neurons in each layer, as the input features. ❷ Our method is independent of query length, enabling it to flexibly address complex and variable attack methods. In contrast, the detailed hyper-parameters and performance of previous methods, such as the perplexity-based method, vary according to the length of the query being detected. ❸ Compared to other output detection methods, our approach completes detection as soon as the model generates the first token, avoiding the resource waste associated with waiting for the full output. This key feature can significantly save resources and computations compared to methods that only identify jailbreak attacks after the complete generation.

Findings: The NC-based classifier surpasses the perplexity filter and PARDEN in detecting attack queries, showing strong generalization and adaptability independent of model architecture or alignment. Moreover, it identifies attack queries before output generation, saving computational resources.

VI Threats to Validity

External Threats: External validity threats arise from the specific settings chosen in our study, which may raise concerns about the generalizability of our proposed method. To mitigate these threats, we select a variety of evaluation settings. Specifically, we use four large language models with different architectures and parameters: OPT-125M, Llama-2-7B-Chat, Pythia-12B, and Gemma-2-27B-it. Additionally, we employ four datasets—Alpaca-gpt4, JailBreakV-28k, TruthfulQA, and AdvBench—that cover a wide range of test scenarios. Our study is based on five widely used coverage criteria: NC, TKNC, TKNP, TFC, and NLC. These choices aim to enhance the generalizability of our findings to common scenarios in this field.

Internal Threats: Internal validity threats stem from the tools used in our study, including llm-attacks (the official implementation of GCG). Moreover, accurately reproducing the coverage criteria, the perplexity filter, and PARDEN for input detection is also a potential threat. Inherent randomness in model training further poses a potential threat to internal validity. We address this by repeating key experiments more than three times and reporting the average results.

VII Related Work

LLM Testing and Evaluation. Previous studies have proposed several benchmarks to comprehensively evaluate the fundamental capabilities of LLMs in a standardized manner. For example, MMLU [29] aims to measure the multitask accuracy of text models across 57 tasks, including mathematics, history, computer science, and law. Big-Bench [30] assesses language models across 204 tasks, covering areas such as linguistics, child development, mathematics, commonsense reasoning, biology, physics, social bias, and software development. LMentry [31] evaluates the capabilities and robustness of LLMs through a set of simple tasks for humans, including sentence construction, part-of-speech recognition, and word length comparison. AGIEval [32] assesses model proficiency through standardized human exams, such as college entrance exams and math competitions. These benchmarks collectively establish a foundation for evaluating and enhancing LLM reliability and performance across various domains and tasks.

LLM Defenses. Defensive mechanisms for LLMs are also evolving. For example, erase-and-check [33] is a defense framework with certified safety guarantees against adversarial prompts; it sequentially removes tokens from the prompt and uses a safety filter to inspect the resulting subsequences. SmoothLLM [34] and JailGuard [35] defend by randomly perturbing multiple copies of an input prompt and aggregating the corresponding predictions, significantly reducing the success rate of jailbreak attacks. Other defense methods include perplexity-based detection [27, 36], which flags disordered character sequences in prompts, and self-classification by the LLM [28, 37], which judges the harmfulness of an input prompt directly through the model.
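To make the token-erasure idea concrete, the following is a minimal sketch of an erase-and-check style procedure in its suffix form, not the implementation from [33]. It assumes the adversarial tokens are appended to the end of the prompt; the names `erase_and_check_suffix`, `toy_filter`, `BLOCKLIST`, and `KNOWN` are hypothetical, and the keyword-based filter is only a stand-in for a learned safety filter.

```python
from typing import Callable, List


def erase_and_check_suffix(
    tokens: List[str],
    is_harmful: Callable[[List[str]], bool],
    max_erase: int,
) -> bool:
    """Return True if the prompt, or any prefix obtained by erasing up to
    `max_erase` trailing tokens, is flagged by the safety filter."""
    # k = 0 checks the full prompt; never erase the prompt down to nothing.
    for k in range(0, min(max_erase, len(tokens) - 1) + 1):
        candidate = tokens[: len(tokens) - k]
        if is_harmful(candidate):
            return True
    return False


# Hypothetical stand-in for a safety filter. It is deliberately fragile:
# any unrecognised token makes it answer "safe", mimicking how an
# adversarial suffix can fool a filter.
BLOCKLIST = {"bomb"}
KNOWN = {"how", "to", "build", "a", "bomb", "bake", "cake"}


def toy_filter(tokens: List[str]) -> bool:
    if any(t not in KNOWN for t in tokens):
        return False
    return any(t in BLOCKLIST for t in tokens)


attack = "how to build a bomb xq!! zz##".split()
benign = "how to bake a cake".split()

fooled_directly = toy_filter(attack)                          # filter alone
caught = erase_and_check_suffix(attack, toy_filter, 2)        # erase up to 2
benign_flagged = erase_and_check_suffix(benign, toy_filter, 2)
```

In this toy setting the filter alone misses the attack prompt, while erasing the two trailing tokens exposes the harmful prefix. The erasure budget `max_erase` bounds the length of the suffix the procedure can see through, at the cost of one filter call per erased token.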

VIII Conclusion

In this study, we conducted an extensive empirical investigation into the effectiveness of traditional coverage criteria in LLMs across three key dimensions: criterion level, layer level, and token level. Our findings reveal significant differences in neuron coverage between normal and malicious queries, highlighting the potential of these criteria in identifying abnormalities in LLMs. Building upon these insights, we developed a novel downstream application for real-time detection of malicious attacks based on neural activation features, achieving an average accuracy of 96.33%. Our findings enhance the understanding of security testing in LLMs and lay a methodological foundation for developing robust AI applications that use neural activation features to detect malicious attacks.

Broader Impact. Our paper demonstrates the effectiveness of coverage criteria on LLMs, focusing on jailbreak attacks due to their widely recognized impact and well-defined nature. We hope to inspire the community to explore coverage criteria effectiveness across a broader range of security and functional issues in future research.

References

  • [1] G. Son, H. Jung, M. Hahm, K. Na, and S. Jin, “Beyond classification: Financial reasoning in state-of-the-art language models,” 2023.
  • [2] R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic data generation of llms help clinical text mining?” 2023.
  • [3] A. Blair-Stanek, N. Holzenberger, and B. Van Durme, “Can gpt-3 perform statutory reasoning?” in Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, ser. ICAIL ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 22–31. [Online]. Available: https://doi.org/10.1145/3594536.3595163
  • [4] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” 2023.
  • [5] S. Zhu, R. Zhang, B. An, G. Wu, J. Barrow, Z. Wang, F. Huang, A. Nenkova, and T. Sun, “Autodan: Interpretable gradient-based adversarial attacks on large language models,” 2023.
  • [6] J. Yu, X. Lin, Z. Yu, and X. Xing, “Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts,” 2023.
  • [7] G. Deng, Y. Liu, Y. Li, K. Wang, Y. Zhang, Z. Li, H. Wang, T. Zhang, and Y. Liu, “Masterkey: Automated jailbreaking of large language model chatbots,” in Proc. ISOC NDSS, 2024.
  • [8] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,” 2023.
  • [9] G. Deng, Y. Liu, K. Wang, Y. Li, T. Zhang, and Y. Liu, “Pandora: Jailbreak gpts by retrieval augmented generation poisoning,” 2024.
  • [10] K. Pei, Y. Cao, J. Yang, and S. Jana, “Deepxplore: Automated whitebox testing of deep learning systems,” in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP ’17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 1–18. [Online]. Available: https://doi.org/10.1145/3132747.3132785
  • [11] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang, “Deepgauge: multi-granularity testing criteria for deep learning systems,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ser. ASE ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 120–131. [Online]. Available: https://doi.org/10.1145/3238147.3238202
  • [12] Y. Yuan, Q. Pang, and S. Wang, “Revisiting neuron coverage for dnn testing: A layer-wise and distribution-aware criterion,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 1200–1212.
  • [13] A. Odena, C. Olsson, D. Andersen, and I. Goodfellow, “TensorFuzz: Debugging neural networks with coverage-guided fuzzing,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun 2019, pp. 4901–4911. [Online]. Available: https://proceedings.mlr.press/v97/odena19a.html
  • [14] J. Kim, R. Feldt, and S. Yoo, “Guiding deep learning system testing using surprise adequacy,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, pp. 1039–1049.
  • [15] J. Kim, J. Ju, R. Feldt, and S. Yoo, “Reducing dnn labelling cost using surprise adequacy: an industrial case study for autonomous driving,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2020. New York, NY, USA: Association for Computing Machinery, 2020, pp. 1466–1476. [Online]. Available: https://doi.org/10.1145/3368089.3417065
  • [16] X. Xie, T. Li, J. Wang, L. Ma, Q. Guo, F. Juefei-Xu, and Y. Liu, “Npc: Neuron path coverage via characterizing decision logic of deep neural networks,” ACM Trans. Softw. Eng. Methodol., vol. 31, no. 3, apr 2022. [Online]. Available: https://doi.org/10.1145/3490489
  • [17] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PloS one, vol. 10, no. 7, p. e0130140, 2015.
  • [18] Z. Ji, P. Ma, Y. Yuan, and S. Wang, “Cc: Causality-aware coverage criterion for deep neural networks,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 1788–1800.
  • [19] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” 2023.
  • [20] B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with gpt-4,” arXiv preprint arXiv:2304.03277, 2023.
  • [21] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, no. 14. Oakland, CA, USA, 1967, pp. 281–297.
  • [22] LLM-Coverage-Criteria-Study, “Understanding the effectiveness of coverage criteria for large language models: A special angle from jailbreak attacks,” 2024, accessed: 2024-08-02. [Online]. Available: https://sites.google.com/view/llm-coverage-criteria-study
  • [23] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “Opt: Open pre-trained transformer language models,” 2022.
  • [24] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. Van Der Wal, “Pythia: A suite for analyzing large language models across training and scaling,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, pp. 2397–2430. [Online]. Available: https://proceedings.mlr.press/v202/biderman23a.html
  • [25] G. Team, “Gemma,” 2024. [Online]. Available: https://www.kaggle.com/m/3301
  • [26] W. Luo, S. Ma, X. Liu, X. Guo, and C. Xiao, “Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks,” 2024.
  • [27] N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,” 2023.
  • [28] Z. Zhang, Q. Zhang, and J. Foerster, “Parden, can you repeat that? defending against jailbreaks via repetition,” 2024.
  • [29] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” 2021.
  • [30] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” arXiv preprint arXiv:2206.04615, 2022.
  • [31] A. Efrat, O. Honovich, and O. Levy, “Lmentry: A language model benchmark of elementary language tasks,” 2022.
  • [32] W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan, “Agieval: A human-centric benchmark for evaluating foundation models,” 2023.
  • [33] A. Kumar, C. Agarwal, S. Srinivas, A. J. Li, S. Feizi, and H. Lakkaraju, “Certifying llm safety against adversarial prompting,” 2024.
  • [34] A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending large language models against jailbreaking attacks,” 2023.
  • [35] X. Zhang, C. Zhang, T. Li, Y. Huang, X. Jia, X. Xie, Y. Liu, and C. Shen, “A mutation-based method for multi-modal jailbreaking attack detection,” arXiv preprint arXiv:2312.10766, 2023.
  • [36] G. Alon and M. Kamfonas, “Detecting language model attacks with perplexity,” 2023.
  • [37] M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau, “Llm self defense: By self examination, llms know they are being tricked,” 2024.