AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (2024)

Lin Lu¹ Hai Yan¹^∗ Zenghui Yuan¹^∗ Jiawen Shi¹^∗
Wenqi Wei² Pin-Yu Chen³ Pan Zhou¹^†
¹Huazhong University of Science and Technology
²Computer and Information Sciences Department Fordham University
³IBM Research
{loserlulin,yanhai,zenghuiyuan,shijiawen,panzhou}@hust.edu.cn
wenqiwei@fordham.edu pin-yu.chen@ibm.comEqual contribution.Corresponding author.

Abstract

Jailbreak attacks in large language models (LLMs) entail inducing the models to generate content that breaches ethical and legal norm through the use of malicious prompts, posing a substantial threat to LLM security. Current strategies for jailbreak attack and defense often focus on optimizing locally within specific algorithmic frameworks, resulting in ineffective optimization and limited scalability. In this paper, we present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques, generalizing them to all possible attack surfaces. We employ directed acyclic graphs (DAGs) to position and analyze existing jailbreak attacks, defenses, and evaluation methodologies, and propose three comprehensive, automated, and logical frameworks.AutoAttack investigates dependencies in two lines of jailbreak optimization strategies: genetic algorithm (GA)-based attacks and adversarial-generation-based attacks, respectively. We then introduce an ensemble jailbreak attack to exploit these dependencies. AutoDefense offers a mixture-of-defenders approach by leveraging the dependency relationships in pre-generative and post-generative defense strategies.AutoEvaluation introduces a novel evaluation method that distinguishes hallucinations, which are often overlooked, from jailbreak attack and defense responses.Through extensive experiments, we demonstrate that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.

1 Introduction

Jailbreak attacks(Liu etal., 2023a; Shen etal., 2023; Zou etal., 2023) have emerged as significant threats to the security of large language models (LLMs). Such attacks compel LLMs to generate harmful or unethical content by crafting malicious prompts. While LLMs’ owners can mitigate simple jailbreak prompts by fine-tuning their models with data aligned with human values(Dai etal., 2023; Bai etal., 2022; Li etal., 2023), attackers could still achieve their objectives through carefully crafted templates or algorithms. Consequently, in this query-based black-box scenario, addressing jailbreak attacks has emerged as a paramount concern within the LLM community.

However, we have observed that the overwhelming majority of current black-box jailbreak attacks and defenses fall into a local optimization trap. In terms of attack tactics, existing jailbreak approaches typically adhere to a generic optimization framework, such as genetic algorithms (GA)(Liu etal., 2023a; Yu etal., 2023; Li etal., 2024b). They enhance the jailbreak success rate by optimizing a specific sub-component within this framework while neglecting the importance of other sub-components. On the defense side, jailbreak defenses(Cao etal., 2023; Robey etal., 2023; Kumar etal., 2023) often focus solely on a specific type of jailbreak prompts, such as those with adversarial suffixes, thereby limiting their efficacy against a wider range of attacks. This ongoing cat-and-mouse interaction is ensnared in local optimization, failing to genuinely enhance the robustness of LLMs.

The aforementioned concern propels the development of AutoJailbreak, a framework designed to comprehensively evaluate the resilience of LLMs against jailbreak attacks. Specifically, we conduct an exhaustive examination of jailbreak attacks and defenses scrutinizing over 28 jailbreak attacks and 12 jailbreak defenses documented in Table 6 in AppendixA. We consider the black-box threat model as API communication has emerged as the predominant approach for leveraging LLMs. By systematically exploring the dependency relationships of jailbreak attacks and defenses via directed acyclic graphs (DAGs), we develop AutoAttack and AutoDefense, and integrate a multidimensional evaluation process, AutoEvaluation, to facilitate the understanding of the LLM-generated content. The three components remark three unique contributions.

•
AutoAttack: We conduct a comprehensive study of existing automated black-box jailbreak attack methods, categorizing them into two generic frameworks: the GA framework and the adversarial generation framework. For each framework, we employ causal analysis to explore the dependencies among optimization schemes for each attack method within that framework. Leveraging the benefits of various optimization schemes within each framework, we develop two ensemble attack methods: Ensemble Attack-GA and Ensemble Attack-Gen.
•
AutoDefense: Like AutoAttack, we also systematically analyze the dependencies and evolutionary relationships among existing defense mechanisms. We categorize these defenses into two groups: those defending against adversarial suffixes and those defending against malicious semantics. Building upon this analysis, we propose the Ensemble Defense that integrates pre-generation and post-generation defenses, leveraging the mixture-of-defenders mechanism to resist various carefully crafted jailbreak prompts.
•
AutoEvaluation: We firstly systematically evaluate an often-overlooked issue in jailbreak attacks: LLMs frequently provide off-topic responses instead of directly answering the attacker’s jailbreak prompts. We argue that such responses do not indicate successful value alignment of the LLM. We also analyze the consistency with human evaluation for three mainstream jailbreak evaluation methods: keywords matching, classifier-based, and LLM-as-a-Judge, and identify the latter as the primary evaluation criterion for our experiment.

With extensive experiments, we show that AutoJailbreak demonstrates exceptional performance in both jailbreak attacks and defenses, outperforming existing approaches. Our ensemble AutoAttack reliably break all tested models and our ensemble AutoDefense significantly enhances the jailbreak robustness of LLMs, rather than just realizing defenses against a specific type of jailbreak prompts. While we do not argue that AutoJailbreak is the ultimate jailbreak attack and defense but rather that it should become the minimal test for any new attacks and defenses.

2 Background

2.1 Black-box Jailbreak Attacks

As API-based queries and interactions have become the dominant mode for existing LLM applications(togetherai, 2023), black-box jailbreaking attacks have emerged as a crucial subfield. Unlike white-box gradient-optimization-based jailbreak prompts, which consistently incorporate an adversarial suffix(Zou etal., 2023; Liao and Sun, 2024; Zhang and Wei, 2024), black-box jailbreak attacks can be categorized into the following four types on their construction methods:

Static Human Design.This category includes malicious templates and system prompts crafted by humans(Shen etal., 2023; Liu etal., 2023b; Yu etal., 2024). These templates typically depict complex scenarios, requiring attackers to merely substitute keywords with the desired malicious behavior to induce LLMs to generate harmful content. While straightforward, this approach can be easily mitigated through alignment fine-tuning(Li etal., 2023; Piet etal., 2023).

Dynamic Optimization.Algorithms in this category leverage dynamic and adversarial mechanisms to iteratively optimize a given malicious prompt until the attacker’s objective is achieved. Dynamic optimization algorithms can be further classified into GA-based attacks(Yu etal., 2023; Li etal., 2024b; Liu etal., 2023a) and adversarial-generation-based attacks(Chao etal., 2023; Mehrotra etal., 2023; Takemoto, 2024). GA-based optimization involves mutating the jailbreak prompt closest to the target, while adversarial generation algorithms simulate an LLM acting as a red-teaming assistant, refining the jailbreak prompt based on the victim LLM’s responses in each iteration.

Long-tail Encoding.These algorithms exploit the LLMs’ alignment deficiencies with low-resource training data to execute jailbreak attacks. Common techniques include using low-resource languages(Yong etal., 2023) and artistic fonts(Jiang etal., 2024b).

Transferable-based Attacks.Transferable-based attack methods often exploit the similarities in model architecture(Sitawarin etal., 2024; Li etal., 2024a) and training processes(Hayase etal., 2024) across various LLMs. These methods leverage open-source white-box models (e.g., LLaMa-2(Touvron etal., 2023)) to construct jailbreak prompts and then transfer them to black-box LLMs.

To ensure fairness, we exclude the long-tail encoding attacks and transferable-based attacks from our evaluation since not all LLMs have the capability to interpret low-resource data and the transfer of jailbreak prompts inevitably leads to a decrease in the jailbreak success rate.

2.2 Jailbreak Defenses and Evaluation

Jailbreak Defenses.Jailbreak defense algorithms can be categorized into pre-generation(Ji etal., 2024; Robey etal., 2023; Cao etal., 2023; Hu etal., 2024) and post-generation(Pisano etal., 2023; Helbling etal., 2023; Zeng etal., 2024; Xiong etal., 2024) defenses based on its application timing. Pre-generation defenses primarily alter malicious prompts through smoothing algorithms or malicious intent analysis methods to neutralize adversarial suffixes or malicious templates. Post-generation defenses ensure users receive only clean answers by filtering the harmful content from LLM outputs. Although prior studies have validated the efficacy of these defense methods, we observe that pre-generation defense algorithms often target specific attack methods, leading to poor generalization. Concurrently, post-generation defense algorithms cannot always guarantee the quality of the generated responses. These raise a question: "Are effectiveness, generalization, and response quality an unattainable trinity in defending against jailbreak attacks?"

Jailbreak Evaluation.Current jailbreak evaluation methods can be categorized into keywords-matching, classifier-based, and LLM-as-a-judge. Keywords matching methods(Zou etal., 2023) ascertain whether the model rejects the jailbreak prompt through character matching. On the other hand, classifier-based(Huang etal., 2023; Yu etal., 2023) and LLM-as-a-judge approaches, respectively, fine-tune a binary classification model or employ another LLM to determine if the model response contains harmful content. Based on LLM-as-a-judge, AttackEval(Jin etal., 2024) considers a coarse-grained framework and a fine-grained framework to evaluate the effectiveness of jailbreak attacks. However, two significant issues persist within this evaluation system. Firstly, some model responses may initially indicate rejection but still contain unsafe information in subsequent responses. Secondly, the model output may fail to address the user’s query directly, resulting in off-topic answers. Despite being highlighted in prior studies(Cai etal., 2024), these issues have not undergone systematic examination.

Jailbreak Benchmark.Several existing benchmarks for jailbreak attacks(Chu etal., 2024; Chao etal., 2024; Qiu etal., 2023) facilitate the automated evaluation of the robustness of LLM jailbreaks. In addition, JailbreakV-28K(Luo etal., 2024) extends this evaluation to multimodal models. EasyJailbreak(Zhou etal., 2024) introduces 12 jailbreak attacks within a framework similar to GA. However, these benchmarks typically rely on the existing jailbreak attack methods, overlooking a crucial point: utilizing these methods alone does not represent the most potent form of jailbreak attacks and, therefore, cannot accurately assess the jailbreak robustness of the target model.

2.3 Threat Model

Attack Permission.We consider a black-box threat model to reflect the prevalent use of both open-source and closed-source LLMs today. This implies that an attacker, denoted as $\mathcal{A}$ , can only interact with the victim LLM $M_{V}$ by crafting a jailbreak prompt $P_{J}$ . This prompt comprises an initial malicious behavior $P_{I}$ (e.g., How to make a bomb.) integrated with a specific malicious template or system prompt $T$ . We represent the jailbreak prompt as $P_{J}=P_{I}\oplus T$ , where $\oplus$ denotes the replacement of placeholder in the malicious template with $P_{I}$ or the appending of $P_{I}$ to the end of the system prompt. Subsequently, $\mathcal{A}$ retrieves the corresponding response $R$ through API queries. In addition, attacker $\mathcal{A}$ has no prior knowledge of $M_{V}$ , including the probability distribution over the next token(Andriushchenko etal., 2024) and other sampling hyperparameters(Huang etal., 2023).

Defense Goal. In the defense setting, we rely solely on modifying the jailbreak prompt $P_{J}$ to $P^{\prime}_{J}$ or the model response $R$ to $R^{\prime}$ to ensure the efficacy of defense strategies. This means the defenders are devoid of knowledge regarding the model parameters and other relevant information. To assess the generalizability of the defense method, the defense model needs to resist both jailbreak attacks with malicious semantics and adversarial suffixes.

3 AutoJailbreak

To comprehend how the optimization methods of each subcomponent contribute to enhancing the jailbreak attacks and defenses compared to the existing literature, we employ the causal analysis method to construct a directed acyclic graph (DAG) for analyzing the dependencies among different attack and defense methods. In particular, nodes in the DAG symbolize specific optimization solutions, while the edges represent the optimization solution at the endpoint that enhances or diminishes the jailbreak compared to the starting point. In addition, the red circle represents the final objective of jailbreak attacks and defenses. Within the AutoAttack and AutoDefense framework, we devise two ensemble attacks and one ensemble defense using this dependency-based DAG approach, amalgamating all the optimization solutions within each framework. Note that maintaining consistency among factors, except for the main variables, is crucial in standard causal analysis. However, many existing studies only apply minor variations (e.g., differing hyperparameters) when employing the same generic framework. We accordingly exclude these minor alterations due to the trivial link to the jailbreak outcomes.

3.1 AutoAttack

AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (1)

The generic framework of dynamic black-box jailbreak attacks can be categorized into two main parts: GA and adversarial generation. Both methods progressively approach the jailbreak target through iterative processes. Specifically, i). GA compromises four primary steps: seed initialization, seed selection, mutation, and execution. Initially, the attacker designs a set of initial jailbreak prompts as seeds. During each iteration, the attacker selects the parents from the current seed pool for mutation, thus continuously advancing toward the goal of jailbreak success. ii). In an LLM-based adversarial generation algorithm, another LLM is typically employed as a red-teaming assistant to generate jailbreak prompts. During each iteration, this assistant utilizes the output of the victim LLM to construct jailbreak prompts for the subsequent iteration. Figure 1 presents an overview of our AutoAttack, compromising two ensemble attack methods: Ensemble Attack-GA and Ensemble Attack-Gen. In the lower section of Figure 1, we outline the dependencies within the GA framework and the adversarial generation framework in the context of dynamic attacks. Regarding the directed edges between black nodes, we indicate the specific step in the framework where the endpoint optimizes based on the starting point. As for the directed edges where the final optimization method leads to the final target, we briefly explain why this approach can achieve our ultimate goal.

Ensemble Jailbreak Attack with GA Framework (Ensemble Attack-GA). For the GA framework, we integrate optimization techniques from AutoDAN-GA(Liu etal., 2023a), GPTFuzzer(Yu etal., 2023), OpenSesame(Lapid etal., 2023), and SMJ(Li etal., 2024b). We begin with AutoDAN-GA as the foundational approach. i). During seed initialization, we utilize the optimization approach of GPTFuzzer, refining the initial malicious behavior $P_{I}$ using a malicious system prompt $T$ , randomly selected fromsherdencooper (2023), to bring it closer to jailbreak success. Simultaneously, we widen the seed selection search space using SMJ’s optimization strategy to avoid local optima. We posit that combining these two approaches could enhance the search direction and efficiency. ii). In seed selection, we prioritized jailbreak prompts with responses closely resembling the target output (e.g., Sure, here is a tutorial on …) as parental variables. Based on OpenSesame, we introduce a language assistant for similarity scoring (rated from 1 to 10) to calculate the similarity between the output of the current jailbreak prompt and the target output, instead of relying on all-mpnet-base-v2(HuggingFace, 2023) for semantic similarity calculation as in OpenSesame. This is because we found that due to the introduction of lengthy system prompts during the seed initialization process, all-mpnet-base-v2 encountered a performance bottleneck and could not accurately understand the semantic content of the victim LLM’s output, resulting in a deviation in the optimization direction. iii). In the mutation phase, we adopt the five mutation prompts from GPTFuzzer, over random mutation in OpenSesame or the syntactic-based adversarial generative mutation method in SMJ. We notice thatGPTFuzzer’s five diverse mutators adequately balance between structural diversity and semantic coherence within jailbreak prompts, thus mitigating negative optimization effects.

Ensemble Jailbreak Attack with Adversarial Generation Framework (Ensemble Attack-Gen).For the adversarial generation framework, we incorporate optimization methods in Tastle(Xiao etal., 2024), PAIR(Chao etal., 2023), TAP(Mehrotra etal., 2023), and SBJ(Takemoto, 2024). We begin with Tastle as the foundational approach. i) In terms of the design of the red-teaming assistant, we leverage interpretability and chain-of-thought from PAIR, guiding the red-teaming assistant to offer advice during each iteration. ii) Concerning the optimization of the search space for the red-teaming assistant, we adopt the concept from TAP, integrating an evaluator to early discard off-topic prompts that are unlikely to result in successful jailbreaks. iii) For the attack model selection, we opt for the same model as the victim model, as highlighted in SBJ, to bypass the victim model’s defense mechanism. To elaborate further, our investigation reveals that utilizing an optimization method akin to Tastle’s, which initiates with an extensive jailbreak prompt close to achieving success, often perplexes the red-teaming assistant. Hence, we choose the original malicious behavior as the starting point to avert the addition of redundant information to the iterative input process of the red team assistant.

3.2 AutoDefense

AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (2)

In AutoDefense, we propose our Ensemble Defense based on a mixture-of-defenders (MoD) framework, inspired by the mixture-of-experts mechanism utilized in existing LLMs’ architectures(Lin etal., 2024; Dong etal., 2024; Sukhbaatar etal., 2024; Jiang etal., 2024a). MoD comprises two defense experts (DE-adv and DE-sem) tailored to combat adversarial-suffix-based and malicious-semantics-based jailbreak prompts, respectively. Each defense expert incorporates two defense strategies designed to mitigate a specific class of jailbreak prompts. The rationale behind MoD is to enable each defense expert to specialize in addressing a particular category of jailbreak prompts. To mitigate potential deterioration in the quality of model-generated content due to modifications on benign prompts, we employ a language assistant to ascertain the nature of user queries as a preliminary step. Benign prompts are directly processed by the LLM for generating responses, while malicious prompts undergo scrutiny via our Ensemble Defense. For malicious jailbreak prompts, Ensemble Defense further classifies them into two categories using a language assistant: adversarial suffixes and malicious semantics, then selects the corresponding defense expert to modify prompts.

In the upper part of Figure2, we illustrate our design for two defense experts. Similarly, we use black circles to represent specific defense methods. We posit that the defense method at the end of a directed edge is more effective than the method at the starting point, and briefly explain the rationale on the edge. It should be noted that although each defense expert compromises two defense routes and multiple defense methods, we select only the method at the end of each defense route as a part of our Ensemble Defense. Below we introduce the design of each defense expert and defense route:

DE-adv is designed to mitigate adversarial suffixes generated by white-box gradient-optimization-based methods. Employing spell-check techniques(Ji etal., 2024), DE-adv rectifies syntax and spelling errors within jailbreak prompts, thereby preserving content quality while minimally altering the original prompt based on a perturbation-based method(Robey etal., 2023). Furthermore, DE-adv refines the adversarial suffix elimination process, as elucidated in Kumar etal. (2023), by employing Monto-Carlo sampling(Cao etal., 2023) to prevent adversarial strings from appearing mid-sentence. In contrast, DE-sem targets jailbreak prompts containing malicious semantics (e.g. virtual malicious templates), commonly produced by automated black-box jailbreak attacks. Initially, DE-sem adopts a straightforward approach outlined in Ji etal. (2024), summarizing verbose jailbreak prompts to isolate original malicious intent from complex prompts. Another strategy involves leveraging the LLM’s comprehension of lengthy text to discern the intent behind jailbreak prompts and subsequently rewriting them(Pisano etal., 2023; Zhang etal., 2024) based on single-turn intention analysis(Wu etal., 2024; Xie etal., 2023; Helbling etal., 2023).

In essence, each defense expert generates two modified prompts from the original query, resulting in two corresponding outputs when fed into the victim LLM. Acknowledging that certain potent jailbreak prompts may still coerce the model into generating harmful content, Ensemble Defense strategy employs post-generation defense to filter harmful model outputs. Subsequently, an additional LLM serves as a voting judge, determining which response better addresses the user’s query.

3.3 AutoEvaluation

Existing evaluation methods, such as keywords matching, often suffer from high false positives(Huang etal., 2023; Ding etal., 2023; Chao etal., 2023). Two mainstream evaluation methods include fine-tuning a binary classification model (e.g., Roberta(Liu etal., 2019)) or using LLM as a judge.In this study, we compare these three evaluation methods for their effectiveness in determining the success of jailbreak prompts. For the keywords matching method, we use the same keywords following previous research(Chao etal., 2023). In the binary classification model, we employ five commonly used datasets: ParaDetox(Logacheva etal., 2022), Toxic-Conversations-50k(cjadams etal., 2019), ToxicChat(Lin etal., 2023), WikiToxic(Thain etal., 2017), and Real Toxicity Prompts(Gehman etal., 2020) to train the classifier. Specifically, we randomly sample 1,000 data pairs from each dataset and fine-tune the Roberta model over 5 epochs with a batch size of 16. The prompts used for the LLM-as-a-Judge method are provided in AppendixB.

Metric	Classifier	Keyword Matching	LLM-as-a-judge
TPR↑	8.70	34.78	65.22
FPR↓	5.48	21.92	8.22

We use two metrics to quantify the effectiveness of these methods: true positive rate (TPR) and false positive rate (FPR). TPR indicates the probability of correctly identifying a successful jailbreak prompt, with higher values being preferable. FPR represents the probability of misclassifying benign prompts as successful jailbreak prompts, with lower values being desirable.As shown inTable1, the evaluation results for the TAP attack demonstrate that the LLM-as-a-Judge method significantly outperforms both the classifier and the keyword matching methods.Therefore, we adopt LLM-as-a-Judge to construct our evaluation framework.

Meanwhile, instances where the output of the victim LLM lacks harmful information and fails to achieve value alignment by rejecting malicious prompts, attributed as hallucination, are commonly ignored in existing jailbreak attack, defense, and evaluation methodologies. Based on the aforementioned concerns, AutoEvaluation includes a two-stage evaluation. In the first round, we will determine whether the response contains harmful information, representing whether the jailbreak is successful. For those clean responses, we will judge whether each of them truly answers the user’s query. In summary, AutoEvaluation will classify the jailbreak prompt as jailbreak, hallucination, or alignment based on the LLM’s corresponding output.

4 Experiment

4.1 Experimental Setup

Datasets.Referring to the user guides of ChatGPT(OpenAI, 2023b) and Gemini(Google, 2023), we have compiled a dataset comprising 96 distinct malicious behaviors listed inAppendixC. The dataset is partitioned into two sections: general and target. The general section encompasses 14 broad jailbreak subcategories, including theft, surveillance, false personation, illegally personally disseminating information, drugs, weapons, human trafficking, robbery, torture, bomb, sexual, suicide, infringing upon public interest, and discrimination. The target section comprises 10 more refined subcategories, including finance, industry, cyber-security, academic, mental, education, traffic, media, medicine, and government. Each subcategory encompasses four distinct types of malicious behaviors. Throughout the construction process, we diligently avoid intersection and duplication between malicious behaviors. This approach facilitates the identification of nuanced areas where the security of LLMs is deficient.

Interaction with LLMs.Our experiments encompass a selection of the most prominent open-source and closed-source LLMs, including: GPT-3.5-turbo(OpenAI, 2023a), GPT-4(OpenAI, 2023a), LLaMa-2(Touvron etal., 2023), LLaMa-3(Meta, 2024), Mistral(Jiang etal., 2023), Qwen (Bai etal., 2023), Vicuna(LMSYS, 2024), Claude(Anthropic, 2024). Utilizing a unified API platform (togetherai, 2023), we conduct experiments for attack and defense scenarios. We maintain default settings for sampling hyperparameters to emulate real-world attack and defense behaviors, setting the temperature and top-p to 0.7 and top-k to 50.

Evaluation Metrics.We consider a layered evaluation framework in AutoEvaluation. In the first evaluation stage, we employ jailbreak success rate (JR) to represent the ratio of samples with harmful information in models’ responses. In the second stage, we further look at the hallucination rate (HR) and alignment rate (AR) in the failed jailbreak instances, measuring the ratio of samples with hallucination responses (not relative to the query) and aligned responses (refuse to answer due to model alignment).

Target LLMs	Adversarial Generation			GA Framework
Target LLMs	PAIR	TAP	Ensemble Attack-Gen	AutoDAN-GA	GPTFuzzer	Ensemble Attack-GA
GPT-3.5	18.8 / 42.7 / 38.5	14.6 / 38.5 / 46.9	14.6 / 38.5 / 46.9	77.1 / 14.6 / 8.3	54.2 / 3.1 / 42.7	91.7 / 2.1 / 6.2
GPT-4	28.1 / 36.5 / 35.4	30.2 / 25.0 / 44.8	80.2 / 18.6 / 1.2	51.0 / 7.3 / 41.7	77.1 / 9.4 / 13.5	65.6 / 7.3 / 27.1
Vicuna	28.1 / 38.5 / 33.4	26.0 / 43.8 / 30.2	21.9 / 37.5 / 40.6	77.1 / 14.6 / 8.3	64.6 / 15.6 / 19.8	99.0 / 1.0 / 0
LLaMa-2	5.2 / 31.2 / 63.6	3.1 / 45.8 / 51.0	16.7 / 56.2 / 27.1	13.5 / 13.5 / 73.0	10.4 / 4.2 / 85.4	63.5 / 9.4 / 27.1
LLaMa-3	40.6 / 32.3 / 27.1	39.6 / 44.8 / 15.6	52.1 / 33.3 / 14.6	1.0 / 2.1 / 96.9	7.3 / 3.1 / 89.6	40.0 / 12.5 / 47.5
Qwen	20.8 / 31.2 / 48.0	36.4 / 21.9 / 41.7	45.8 / 26.0 /28.2	88.5 / 9.4 / 2.1	80.2 / 3.1 / 16.7	99.0 / 1.0 / 0
Mistral	50.3 / 38.8 / 11.2	63.5 / 26.0 / 10.5	43.8 / 34.4 / 21.8	88.5 / 11.5 / 0.0	81.2 / 5.2 / 13.6	99.0 / 1.0 / 0
Claude	3.1 / 36.5 / 60.4	3.1 / 38.5 / 58.4	39.6 / 35.4 / 25.0	4.2 / 12.5 / 88.3	11.5 / 16.7 / 71.8	5.2 / 14.6 / 80.2

4.2 Main Results

Efficiency of AutoAttack.We first evaluated the efficiency of our AutoAttack, and the obtained data are presented in Table 2. In the adversarial generation framework, we select two widely used methods, PAIR (Chao etal., 2023) and TAP (Mehrotra etal., 2023), as baselines and utilize GPT-3.5 as a red-teaming assistant to attack other LLMs. In Ensemble Attack-Gen, we employ the same LLM as the target model to execute the jailbreak. Note that LLaMa-2, LLaMa-3, and Claude decline to serve as red-teaming assistants for constructing jailbreak prompts due to their robust alignment measures. Hence, we use GPT-4 with enhanced semantic understanding capabilities, as the attack model to conduct experiments on these three models. We increase the maximum number of iterations in PAIR to 10, matching the setting in TAP and Ensemble Attack-Gen. Additionally, we keep other parameters in PAIR and TAP unchanged. Based on this configuration, we have observed the following: i). Ensemble Attack-Gen achieves the best attack performance on five models, surpassing the baseline by at least 10%. While it do not achieve the best JR on the remaining three models, it lags behind the leading attacks by only 4.2% and 6.2%, respectively. ii). Furthermore, Ensemble Attack-Gen exhibits remarkable jailbreaking capabilities on LLaMa-2, LLaMa-3, GPT-4, and Claude, which are acknowledged for their strong security alignment measures. Ensemble Attack-Gen surpasses the baseline on GPT-4 and Claude by 50% and 36.5%, respectively.

In the context of the GA framework, two widely used GA-based attack methods, AutoDAN-GA and GPTFuzzer are selected as baselines for comparison. In AutoDAN-GA, the victim LLM is set as LLaMa-2 to generate the set of jailbreak prompts for conducting transferable attacks on other LLMs. For GPTFuzzer, the seed selection method is set as UCB(Auer etal., 2002), and three malicious templates are randomly selected fromLiu etal. (2023b) to maintain consistency in the initialization strategy between GPTFuzzer and Ensemble Attack-GA. Our observations based on this setting reveal the following: i). Ensemble Attack-GA demonstrates remarkable capabilities within the GA framework, achieving a JR exceeding 90% in four LLMs. Notably, for Vicuna, Qwen, and Mistral, which are acknowledged for their limited alignment with values, the JR of Ensemble Attack-GA reaches 99%. ii). Additionally, Ensemble Attack-GA exhibits a slightly lower HR compared to other baseline methods. This underscores the efficacy of incorporating semantic features in the seed selection and mutation stages of the GA framework, thus preventing GA from deviating from the initial malicious behavior throughout the process.

Target LLMs	W/O Defense	Spell-check	Monto-Carlo Sampling	Intention Analysis	Summarize	AutoDefense
GPT-3.5	14.6 / 38.5 / 46.9	26.0 / 50.0 / 24.0	24.0 / 57.3 / 18.7	1.0 / 14.6 / 84.4	11.5 / 78.1 / 10.4	1.1 / 8.6 / 90.3
GPT-4	80.2 / 18.6 / 1.2	75.0 / 21.9 / 3.1	63.5 / 31.2 / 5.3	25.0 / 40.6 / 34.4	59.4 / 34.4 / 6.2	0 / 1.0 / 99.0
Vicuna	21.9 / 37.5 / 40.6	26.0 / 38.5 / 35.5	17.7 / 42.7 / 39.6	13.5 / 24.0 / 62.5	10.4 / 45.8 / 43.8	0 / 2.5 / 97.5
LLaMa-2	16.7 / 56.2 / 27.1	16.7 / 37.5 / 45.8	11.5 / 43.8 / 44.7	0 / 30.2 / 69.8	12.5 / 41.7 / 45.8	0 / 9.0 / 91.0
LLaMa-3	52.1 / 33.3 / 14.6	15.6 / 33.3 / 51.1	17.7 / 29.2 / 53.1	0 / 24.0 / 76.0	24.0 / 37.5 / 38.5	2.1 / 1.1 / 96.8
Qwen	45.8 / 26.0 /28.2	38.5 / 37.5 / 24.0	30.2 / 50.0 / 19.8	5.2 / 50.0 / 44.8	12.5 / 55.2 / 32.3	0 / 0 / 100.0
Mistral	43.8 / 34.4 / 21.8	44.8 / 52.1 / 3.1	45.8 / 43.8 / 10.4	16.7 / 37.5 / 45.8	15.6 / 70.8 / 13.6	0 / 2.7 / 97.3
Claude	39.6 / 35.4 / 25.0	3.1 / 40.6 / 56.3	3.1 / 37.5 / 59.4	0 / 34.4 / 65.6	40.6 / 45.8 / 13.6	3.1 / 14.1 / 82.8

Target LLMs	W/O Defense	Spell-check	Monto-Carlo Sampling	Intention Analysis	Summarize	AutoDefense
GPT-3.5	91.7 / 2.1 / 6.2	41.7 / 36.5 / 21.8	33.3 / 24.0 / 42.7	4.2 / 8.3 / 87.5	7.3 / 68.8 / 23.9	1.1 / 6.5 / 92.4
GPT-4	65.6 / 7.3 / 27.1	35.2 / 23.9 / 40.9	35.2 / 23.9 / 40.9	0 / 9.5 / 90.5	29.5 / 50.0 / 20.5	2.3 / 4.5 / 93.2
Vicuna	99.0 / 1.0 / 0.0	52.1 / 20.8 / 27.1	56.2 / 15.6 / 28.2	83.3 / 2.1 / 14.6	15.6 / 20.8 / 63.6	1.1 / 9.8 / 89.1
LLaMa-2	63.5 / 9.4 / 27.1	35.4 / 22.9 / 41.7	33.3 / 29.2 / 37.5	3.8 / 17.9 / 78.3	17.7 / 32.3 / 50.0	1.0 / 3.1 / 95.9
LLaMa-3	40.0 / 12.5 / 47.5	30.2 / 34.4 / 35.4	27.1 / 26.0 / 46.9	0.0 / 8.3 / 91.7	24.0 / 39.6 / 36.4	0 / 5.2 / 94.8
Qwen	99.0 / 1.0 / 0.0	43.8 / 51.0 / 5.2	68.8 / 9.4 / 21.8	82.3 / 5.2 / 12.5	7.3 / 83.3 / 9.4	12.6 / 2.3 / 85.1
Mistral	99.0 / 1.0 / 0.0	39.6 / 56.2 / 4.2	71.9 / 18.8 / 9.3	49.0 / 7.3 / 43.7	3.1 / 89.6 / 7.3	7.0 / 3.5 / 89.5
Claude	5.2 / 14.6 / 80.2	1.0 / 43.8 / 55.2	1.0 / 26.0 / 73.0	0 / 17.7 / 82.3	1.0 / 62.5 / 36.5	0 / 10.6 / 89.4

Efficiency of AutoDefense against Ensemble AttacksWe next evaluate the proposed AutoDefense on AutoAttack to demonstrate its robustness. For the baselines we select the respective construction of defense experts as our baselines, including spell-check, Monto-Carlo sampling, intention analysis and summarize. Table 3 shows the results and we make three observations. i).AutoDefense demonstrates superior robustness to AutoAttack on almost all LLMs. Notably, AutoDefense enhances the robustness of the victim LLM in a completely black-box scenario, enabling it to generate content consist with human values, such as refusing to answer malicious questions. ii). Although AutoDefense is sometimes slightly inferior to baseline defense methods on models such as Qwen and Mistral in the GA framework, it almost always limits the rate of successful jailbreaks to single digits. iii). We also observe that AutoDefense can significantly reduce HR. We believe this is because AutoDefense adopts a post-generation defense strategy, directly rejecting malicious requests from users instead of using indirect answers.

Efficiency of AutoDefense against Static Attacks.We further evaluate AutoDefense against static attacks. Compared to dynamic attacks, we employ system prompts and malicious templates as our primary methodologies in static attacks. For system prompts, we specifically target three prominent types found in Liu etal. (2023b): Reject Suppression (which prohibits LLMs from rejecting any request), DAN (which forces LLMs to do anything now), and Developer Mode (which sets LLMs under a developer mode). For malicious templates, we adopt two prevalent attack approaches distinct from those suggested by system prompts: leveraging multi-turn conversations or employing a special format for jailbreaking. The multi-turn conversation tactic initiates with benign prompts to ease the LLM and gauge its alignment capabilities effectively. This approach includes DrAttack (prompt decomposition and reconstruction(Li etal., 2024c)), Indirect Jailbreak (collecting defensive clues(Chang etal., 2024)), and Contextual Interaction Attack (exploiting contextual cues for jailbreaking(Cheng etal., 2024)). Among the special format attack methods, Python and LaTeX are selected as the primary approaches to assess the model’s defense capabilities. In this experiment, we select 6 system prompts and 5 malicious templates, and compute the general metrics using various LLMs.

Target LLMs	W/O Defense	Spell-check	Monto-Carlo Sampling	Intention Analysis	Summarize	AutoDefense
GPT-3.5	56.5 / 10.2 / 33.3	60.9 / 25.5 / 13.6	51.0 / 28.6 / 20.4	9.9 / 4.2 / 85.9	1.0 / 82.6 / 16.4	1.0 / 1.6 / 97.4
GPT-4	50.7 / 13.2 / 36.1	21.4 / 33.3 / 45.3	31.2 / 39.2 / 29.6	0.5 / 0.5 / 99.0	1.0 / 83.1 / 15.9	1.0 / 1.0 / 98.0
Vicuna	77.7 / 17.7 / 4.6	73.4 / 23.4 / 3.2	60.4 / 38.0 / 1.6	46.4 / 28.6 / 25.0	51.0 / 20.3 / 28.7	0.5 / 2.6 / 96.9
LLaMa-2	23.3 / 14.0 / 62.7	26.0 / 29.7 / 44.3	18.8 / 42.7 / 38.5	0.5 / 10.4 89.1	2.1 / 75.5 / 22.4	1.0 / 4.7 / 94.3
LLaMa-3	17.5 / 6.9 / 75.6	21.9 / 26 / 52.1	20.8 / 41.1 / 38.1	2.1 / 2.6 / 95.3	1.0 / 80.7 / 18.3	0.5 / 1.6 / 97.9
Qwen	83.8 / 14.8 / 1.4	69.3 / 30.7 / 0.0	57.3 / 41.1 / 1.6	21.4 / 28.6 / 50.0	4.2 / 95.8 / 0.0	0.5 / 2.1 / 97.4
Mistral	84.8 / 12.7 / 2.5	76.0 / 21.9 / 2.1	58.9 / 38.5 / 2.6	23.4 / 37.0 / 39.6	2.6 / 94.8 / 2.6	1.6 / 1.6 / 96.8
Claude	14.0 / 19.0 / 67.0	9.9 / 33.9 / 56.2	24.0 / 38.0 / 38.0	4.2 / 6.8 / 89.0	1.0 / 78.6 / 20.4	1.6 / 3.6 / 94.8

Target LLMs	W/O Defense	Spell-check	Monto-Carlo Sampling	Intention Analysis	Summarize	AutoDefense
GPT-3.5	24.5 / 8.9 / 66.6	45.8 / 27.3 / 26.9	36.5 / 13.9 / 49.6	26.6 / 9.4 / 64.0	51.0 / 31.4 / 17.6	0.5 / 4.2 / 95.3
GPT-4	7.1 / 5.4 / 70.8	17 / 9.7 / 73.3	18.4 / 5.9 / 75.7	11.3 / 13.9 / 74.8	22.7 / 11.3 / 66	0.7 / 2.4 / 96.9
Vicuna	85.1 / 3.5 / 11.4	55.4 / 18.2 / 26.4	81.8 / 7.1 / 11.1	78.3 / 2.4 / 19.3	54.0 / 19.6 / 26.4	0.5 / 3.0 / 96.5
LLaMa-2	3.5 / 7.1 / 89.4	7.5 / 12.8 / 79.7	7.3 / 10.2 / 82.5	0.5 / 3.3 / 96.2	6.6 / 18.8 / 74.6	0.2 / 3.5 / 96.3
LLaMa-3	4.3 / 1.9 / 93.8	28.5 / 5.9 / 65.6	15.6 / 3.5 / 80.9	9.0 / 2.1 / 88.9	39.9 / 4.7 / 55.4	0.7 / 3.5 / 95.8
Qwen	50.2 / 4.9 / 44.9	83.3 / 12.7 / 4.0	62.2 / 6.9 / 30.9	25.7 / 5.5 / 68.8	80.4 / 12.5 / 7.1	0.3 / 2.6 / 97.1
Mistral	87.5 / 4.9 / 7.6	60.8 / 13.5 / 25.7	78.0 / 12.0 / 10.0	53.8 / 8.2 / 38.0	50.5 / 16.3 / 33.2	0.2 / 3.1 / 96.7
Claude	0.7 / 11.8 / 87.5	0.2 / 36.8 / 63.0	0.7 / 18.4 / 80.9	2.4 / 7.8 / 89.8	0 / 49.3 / 50.7	0.3 / 5.2 / 94.5

Table4 shows the results and we make three observations. i). Even when the attack method transitions to static attacks crafted by humans, AutoDefense maintains robust defense performance. We manage to constrain the JR of most static attacks to less than 1%. Similar to previous experiments, our method exhibits minimal HR. ii). We find that Summarize and Intention Analysis defenses also demonstrate robustness in resisting static attacks based on malicious semantics. A possible explanation is that these two defenses mitigate irrelevant information from malicious templates and system prompts, exposing the original malicious behavior directly to the victim LLM, thereby repelling the attacks. iii). Interestingly, we observe that the two defense methods targeting adversarial suffixes (Spell-check and Monto-Carlo Sampling) are nearly ineffective against static attacks with malicious semantics. This further validates our stance that certain existing defense methods solely address specific attack forms. AutoDefense, as an ensemble approach, can significantly enhance LLMs’ robustness.

Efficiency of AutoDefense against White-Box Attacks.To evaluate the effectiveness of AutoDefense in defending against jailbreak prompts with adversarial suffixes, we test its defense results against white-box attacks based on gradient optimization as shown in Table5. We use the widely adopted algorithm GCG(Zou etal., 2023) as our attack method and LLaMa-2, a leading open-source LLM, as the victim model. We find that AutoDefense significantly outperformed almost all other baseline defense methods. Similar to the jailbreak prompts based on malicious semantics, the more targeted spell-check and Monto-Carlo sampling methods show better performance than the previously outstanding defense method, intention analysis. Although the summarize method also achieves complete defense against jailbreak prompts, its HR is nearly 30% higher than that of AutoDefense. We believe this is because the summarize method deletes part of the user’s original query content while removing the adversarial suffix.

W/O Defense	Spell-check	Monto-Carlo Sampling	Intention Analysis	Summarize	AutoDefense
100.0 / 0.0 / 0.0	3.6 / 35.7 / 60.7	25.0 / 42.9 / 32.1	39.3 / 14.3 / 46.4	0.0 / 39.3 / 60.7	0.0 / 10.7 / 89.3

5 Conclusion

In this paper, we propose AutoJailbreak, a framework that uses causal analysis to analyze the relationship between existing black-box automated jailbreak attacks and defense optimization schemes. It consists of three components: In AutoAttack, we systematically analyzed and investigated existing attack methods. They are divided into GA-based attack frameworks and adversarial generation-based attack frameworks. In each framework, we used causal analysis to analyze the dependencies between different optimization schemes. By combining the advantages of different attack schemes, we constructed two ensemble attack methods. Our attack method shows strong attack effects on eight common LLMs. In AutoDefense, we imitated the induction and analysis methods of AutoAttack and verified in detail which defense methods can better resist the corresponding category of jailbreak attacks. Using the mixture-of-defenders mechanism, we designed an ensemble defense that combines the advantages of different defense algorithms. By testing its defense effect on static attacks and dynamic attacks, we found that our ensemble defense can effectively improve the robustness of the victim LLM, rather than being limited to a specific type of jailbreak attack method. In AutoEvaluation, we incorporated the hallucination phenomenon of jailbreak-generated content into the evaluation system and re-examined the effectiveness of existing attack and defense methods.

Improving the jailbreak robustness of LLMs hinges on developing a potent jailbreak attack method and enhancing existing defense mechanisms. However, current benchmarks for evaluating LLMs’ jailbreak robustness employ attack and defense methods that are optimized for a specific subset within a generic framework. In contrast, our ensemble attack can elevate the capabilities of existing black-box automated jailbreak attacks, providing a more accurate assessment of target LLMs’ and defense methods’ robustness. We hope that our research will inspire fellow scholars in the machine learning community to expand and refine existing attack and defense methods, ultimately contributing to the development of a truly secure LLM. Our AutoJailbreak framework provides a constructive foundation for this endeavor.

References

Andriushchenko etal. (2024)M.Andriushchenko, F.Croce, and N.Flammarion.Jailbreaking leading safety-aligned llms with simple adaptive attacks.arXiv preprint arXiv:2404.02151, 2024.
Anthropic (2024)Anthropic.Claude, 2024.URL https://claude.ai/.
Auer etal. (2002)P.Auer, N.Cesa-Bianchi, and P.Fischer.Finite-time analysis of the multiarmed bandit problem.Machine learning, 47:235–256, 2002.
Bai etal. (2023)J.Bai, S.Bai, Y.Chu, Z.Cui, K.Dang, X.Deng, Y.Fan, W.Ge, Y.Han, F.Huang, B.Hui, L.Ji, M.Li, J.Lin, R.Lin, D.Liu, G.Liu, C.Lu, K.Lu, J.Ma, R.Men, X.Ren, X.Ren, C.Tan, S.Tan, J.Tu, P.Wang, S.Wang, W.Wang, S.Wu, B.Xu, J.Xu, A.Yang, H.Yang, J.Yang, S.Yang, Y.Yao, B.Yu, H.Yuan, Z.Yuan, J.Zhang, X.Zhang, Y.Zhang, Z.Zhang, C.Zhou, J.Zhou, X.Zhou, and T.Zhu.Qwen technical report.arXiv preprint arXiv:2309.16609, 2023.
Bai etal. (2022)Y.Bai, A.Jones, K.Ndousse, A.Askell, A.Chen, N.DasSarma, D.Drain, S.Fort, D.Ganguli, T.Henighan, etal.Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022.
Cai etal. (2024)H.Cai, A.Arunasalam, L.Y. Lin, A.Bianchi, and Z.B. Celik.Take a look at it! rethinking how to evaluate language model jailbreak.arXiv preprint arXiv:2404.06407, 2024.
Cao etal. (2023)B.Cao, Y.Cao, L.Lin, and J.Chen.Defending against alignment-breaking attacks via robustly aligned llm.arXiv preprint arXiv:2309.14348, 2023.
Chang etal. (2024)Z.Chang, M.Li, Y.Liu, J.Wang, Q.Wang, and Y.Liu.Play guessing game with llm: Indirect jailbreak attack with implicit clues.arXiv preprint arXiv:2402.09091, 2024.
Chao etal. (2023)P.Chao, A.Robey, E.Dobriban, H.Hassani, G.J. Pappas, and E.Wong.Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2023.
Chao etal. (2024)P.Chao, E.Debenedetti, A.Robey, M.Andriushchenko, F.Croce, V.Sehwag, E.Dobriban, N.Flammarion, G.J. Pappas, F.Tramer, etal.Jailbreakbench: An open robustness benchmark for jailbreaking large language models.arXiv preprint arXiv:2404.01318, 2024.
Cheng etal. (2024)Y.Cheng, M.Georgopoulos, V.Cevher, and G.G. Chrysos.Leveraging the context through multi-round interactions for jailbreaking attacks.arXiv preprint arXiv:2402.09177, 2024.
Chu etal. (2024)J.Chu, Y.Liu, Z.Yang, X.Shen, M.Backes, and Y.Zhang.Comprehensive assessment of jailbreak attacks against llms.arXiv preprint arXiv:2402.05668, 2024.
cjadams etal. (2019)cjadams, D.Borkan, inversion, J.Sorensen, L.Dixon, L.Vasserman, and nithum.Jigsaw unintended bias in toxicity classification, 2019.URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
Dai etal. (2023)J.Dai, X.Pan, R.Sun, J.Ji, X.Xu, M.Liu, Y.Wang, and Y.Yang.Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773, 2023.
Ding etal. (2023)P.Ding, J.Kuang, D.Ma, X.Cao, Y.Xian, J.Chen, and S.Huang.A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.arXiv preprint arXiv:2311.08268, 2023.
Dong etal. (2024)H.Dong, B.Chen, and Y.Chi.Prompt-prompted mixture of experts for efficient llm generation.arXiv preprint arXiv:2404.01365, 2024.
Du etal. (2023)Y.Du, S.Zhao, M.Ma, Y.Chen, and B.Qin.Analyzing the inherent response tendency of llms: Real-world instructions-driven jailbreak.arXiv preprint arXiv:2312.04127, 2023.
Ge etal. (2023)S.Ge, C.Zhou, R.Hou, M.Khabsa, Y.-C. Wang, Q.Wang, J.Han, and Y.Mao.Mart: Improving llm safety with multi-round automatic red-teaming.arXiv preprint arXiv:2311.07689, 2023.
Gehman etal. (2020)S.Gehman, S.Gururangan, M.Sap, Y.Choi, and N.A. Smith.Realtoxicityprompts: Evaluating neural toxic degeneration in language models.arXiv preprint arXiv:2009.11462, 2020.
Google (2023)Google.Safety settings, 2023.URL https://ai.google.dev/gemini-api/docs/safety-settings.
Handa etal. (2024)D.Handa, A.Chirmule, B.Gajera, and C.Baral.Jailbreaking proprietary large language models using word substitution cipher.arXiv preprint arXiv:2402.10601, 2024.
Hayase etal. (2024)J.Hayase, E.Borevkovic, N.Carlini, F.Tramèr, and M.Nasr.Query-based adversarial prompt generation.arXiv preprint arXiv:2402.12329, 2024.
Helbling etal. (2023)A.Helbling, M.Phute, M.Hull, and D.H. Chau.Llm self defense: By self examination, llms know they are being tricked.arXiv preprint arXiv:2308.07308, 2023.
Hu etal. (2024)X.Hu, P.-Y. Chen, and T.-Y. Ho.Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes.arXiv preprint arXiv:2403.00867, 2024.
Huang etal. (2023)Y.Huang, S.Gupta, M.Xia, K.Li, and D.Chen.Catastrophic jailbreak of open-source llms via exploiting generation.arXiv preprint arXiv:2310.06987, 2023.
HuggingFace (2023)HuggingFace.all-mpnet-base-v2, 2023.URL https://huggingface.co/sentence-transformers/all-mpnet-base-v2.
Jain etal. (2023)N.Jain, A.Schwarzschild, Y.Wen, G.Somepalli, J.Kirchenbauer, P.-y. Chiang, M.Goldblum, A.Saha, J.Geiping, and T.Goldstein.Baseline defenses for adversarial attacks against aligned language models.arXiv preprint arXiv:2309.00614, 2023.
Ji etal. (2024)J.Ji, B.Hou, A.Robey, G.J. Pappas, H.Hassani, Y.Zhang, E.Wong, and S.Chang.Defending large language models against jailbreak attacks via semantic smoothing.arXiv preprint arXiv:2402.16192, 2024.
Jiang etal. (2023)A.Q. Jiang, A.Sablayrolles, A.Mensch, C.Bamford, D.S. Chaplot, D.d.l. Casas, F.Bressand, G.Lengyel, G.Lample, L.Saulnier, etal.Mistral 7b.arXiv preprint arXiv:2310.06825, 2023.
Jiang etal. (2024a)A.Q. Jiang, A.Sablayrolles, A.Roux, A.Mensch, B.Savary, C.Bamford, D.S. Chaplot, D.d.l. Casas, E.B. Hanna, F.Bressand, etal.Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024a.
Jiang etal. (2024b)F.Jiang, Z.Xu, L.Niu, Z.Xiang, B.Ramasubramanian, B.Li, and R.Poovendran.Artprompt: Ascii art-based jailbreak attacks against aligned llms.arXiv preprint arXiv:2402.11753, 2024b.
Jin etal. (2024)M.Jin, S.Zhu, B.Wang, Z.Zhou, C.Zhang, Y.Zhang, etal.Attackeval: How to evaluate the effectiveness of jailbreak attacking on large language models.arXiv preprint arXiv:2401.09002, 2024.
Kim etal. (2024)H.Kim, S.Yuk, and H.Cho.Break the breakout: Reinventing lm defense against jailbreak attacks with self-refinement.arXiv preprint arXiv:2402.15180, 2024.
Kumar etal. (2023)A.Kumar, C.Agarwal, S.Srinivas, S.Feizi, and H.Lakkaraju.Certifying llm safety against adversarial prompting.arXiv preprint arXiv:2309.02705, 2023.
Lapid etal. (2023)R.Lapid, R.Langberg, and M.Sipper.Open sesame! universal black box jailbreaking of large language models.arXiv preprint arXiv:2309.01446, 2023.
Li etal. (2024a)T.Li, X.Zheng, and X.Huang.Open the pandora’s box of llms: Jailbreaking llms through representation engineering.arXiv preprint arXiv:2401.06824, 2024a.
Li etal. (2024b)X.Li, S.Liang, J.Zhang, H.Fang, A.Liu, and E.-C. Chang.Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms.arXiv preprint arXiv:2402.14872, 2024b.
Li etal. (2024c)X.Li, R.Wang, M.Cheng, T.Zhou, and C.-J. Hsieh.Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers.arXiv preprint arXiv:2402.16914, 2024c.
Li etal. (2023)Y.Li, F.Wei, J.Zhao, C.Zhang, and H.Zhang.Rain: Your language models can align themselves without finetuning.In The Twelfth International Conference on Learning Representations, 2023.
Liao and Sun (2024)Z.Liao and H.Sun.Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms.arXiv preprint arXiv:2404.07921, 2024.
Lin etal. (2024)B.Lin, Z.Tang, Y.Ye, J.Cui, B.Zhu, P.Jin, J.Zhang, M.Ning, and L.Yuan.Moe-llava: Mixture of experts for large vision-language models.arXiv preprint arXiv:2401.15947, 2024.
Lin etal. (2023)Z.Lin, Z.Wang, Y.Tong, Y.Wang, Y.Guo, Y.Wang, and J.Shang.Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation, 2023.
Liu etal. (2024)T.Liu, Y.Zhang, Z.Zhao, Y.Dong, G.Meng, and K.Chen.Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction.arXiv preprint arXiv:2402.18104, 2024.
Liu etal. (2023a)X.Liu, N.Xu, M.Chen, and C.Xiao.Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023a.
Liu etal. (2019)Y.Liu, M.Ott, N.Goyal, J.Du, M.Joshi, D.Chen, O.Levy, M.Lewis, L.Zettlemoyer, and V.Stoyanov.Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019.
Liu etal. (2023b)Y.Liu, G.Deng, Z.Xu, Y.Li, Y.Zheng, Y.Zhang, L.Zhao, T.Zhang, and Y.Liu.Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv preprint arXiv:2305.13860, 2023b.
LMSYS (2024)LMSYS.Vicuna, 2024.URL https://lmsys.org/blog/2023-03-30-vicuna/.
Logacheva etal. (2022)V.Logacheva, D.Dementieva, S.Ustyantsev, D.Moskovskiy, D.Dale, I.Krotova, N.sem*nov, and A.Panchenko.Paradetox: Detoxification with parallel data.In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6804–6818, 2022.
Luo etal. (2024)W.Luo, S.Ma, X.Liu, X.Guo, and C.Xiao.Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027, 2024.
Lv etal. (2024)H.Lv, X.Wang, Y.Zhang, C.Huang, S.Dou, J.Ye, T.Gui, Q.Zhang, and X.Huang.Codechameleon: Personalized encryption framework for jailbreaking large language models.arXiv preprint arXiv:2402.16717, 2024.
Mehrotra etal. (2023)A.Mehrotra, M.Zampetakis, P.Kassianik, B.Nelson, H.Anderson, Y.Singer, and A.Karbasi.Tree of attacks: Jailbreaking black-box llms automatically.arXiv preprint arXiv:2312.02119, 2023.
Meta (2024)Meta.Meta llama 3, 2024.URL https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/.
OpenAI (2023a)OpenAI.Chatgpt, 2023a.URL https://chat.openai.com.
OpenAI (2023b)OpenAI.Learn how to build moderation into your ai applications., 2023b.URL https://platform.openai.com/docs/guides/moderation.
Piet etal. (2023)J.Piet, M.Alrashed, C.Sitawarin, S.Chen, Z.Wei, E.Sun, B.Alomair, and D.Wagner.Jatmo: Prompt injection defense by task-specific finetuning.arXiv preprint arXiv:2312.17673, 2023.
Pisano etal. (2023)M.Pisano, P.Ly, A.Sanders, B.Yao, D.Wang, T.Strzalkowski, and M.Si.Bergeron: Combating adversarial attacks through a conscience-based alignment framework.arXiv preprint arXiv:2312.00029, 2023.
Qiu etal. (2023)H.Qiu, S.Zhang, A.Li, H.He, and Z.Lan.Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models.arXiv preprint arXiv:2307.08487, 2023.
Robey etal. (2023)A.Robey, E.Wong, H.Hassani, and G.J. Pappas.Smoothllm: Defending large language models against jailbreaking attacks.arXiv preprint arXiv:2310.03684, 2023.
Russinovich etal. (2024)M.Russinovich, A.Salem, and R.Eldan.Great, now write an article about that: The crescendo multi-turn llm jailbreak attack.arXiv preprint arXiv:2404.01833, 2024.
Shah etal. (2023)R.Shah, S.Pour, A.Tagade, S.Casper, J.Rando, etal.Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023.
Shen etal. (2023)X.Shen, Z.Chen, M.Backes, Y.Shen, and Y.Zhang." do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models.arXiv preprint arXiv:2308.03825, 2023.
sherdencooper (2023)sherdencooper.Malicious system prompts, 2023.URL https://github.com/sherdencooper/GPTFuzz/blob/master/datasets/prompts/GPTFuzzer.csv.
Sitawarin etal. (2024)C.Sitawarin, N.Mu, D.Wagner, and A.Araujo.Pal: Proxy-guided black-box attack on large language models.arXiv preprint arXiv:2402.09674, 2024.
Sukhbaatar etal. (2024)S.Sukhbaatar, O.Golovneva, V.Sharma, H.Xu, X.V. Lin, B.Rozière, J.Kahn, D.Li, W.-t. Yih, J.Weston, etal.Branch-train-mix: Mixing expert llms into a mixture-of-experts llm.arXiv preprint arXiv:2403.07816, 2024.
Takemoto (2024)K.Takemoto.All in how you ask for it: Simple black-box method for jailbreak attacks.Applied Sciences, 14(9):3558, 2024.
Thain etal. (2017)N.Thain, L.Dixon, and E.Wulczyn.Wikipedia talk labels: Toxicity.DOI: https://doi. org/10.6084/m9. figshare, 4563973:v2, 2017.
togetherai (2023)togetherai.together.ai, 2023.URL https://www.together.ai/.
Touvron etal. (2023)H.Touvron, L.Martin, K.Stone, P.Albert, A.Almahairi, Y.Babaei, N.Bashlykov, S.Batra, P.Bhargava, S.Bhosale, etal.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
Wang etal. (2024a)Z.Wang, Y.Cao, and P.Liu.Hidden you malicious goal into benigh narratives: Jailbreak large language models through logic chain injection.arXiv preprint arXiv:2404.04849, 2024a.
Wang etal. (2024b)Z.Wang, W.Xie, B.Wang, E.Wang, Z.Gui, S.Ma, and K.Chen.Foot in the door: Understanding large language model jailbreaking via cognitive psychology.arXiv preprint arXiv:2402.15690, 2024b.
Wei etal. (2023)Z.Wei, Y.Wang, and Y.Wang.Jailbreak and guard aligned language models with only few in-context demonstrations.arXiv preprint arXiv:2310.06387, 2023.
Wu etal. (2024)D.Wu, S.Wang, Y.Liu, and N.Liu.Llms can defend themselves against jailbreaking in a practical manner: A vision paper.arXiv preprint arXiv:2402.15727, 2024.
Wu etal. (2023)F.Wu, Y.Xie, J.Yi, J.Shao, J.Curl, L.Lyu, Q.Chen, and X.Xie.Defending chatgpt against jailbreak attack via self-reminder.2023.
Xiao etal. (2024)Z.Xiao, Y.Yang, G.Chen, and Y.Chen.Tastle: Distract large language models for automatic jailbreak attack.arXiv preprint arXiv:2403.08424, 2024.
Xie etal. (2023)Y.Xie, J.Yi, J.Shao, J.Curl, L.Lyu, Q.Chen, X.Xie, and F.Wu.Defending chatgpt against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12):1486–1496, 2023.
Xiong etal. (2024)C.Xiong, X.Qi, P.-Y. Chen, and T.-Y. Ho.Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks.arXiv preprint arXiv:2405.20099, 2024.
Yao etal. (2024)D.Yao, J.Zhang, I.G. Harris, and M.Carlsson.Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models.In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4485–4489. IEEE, 2024.
Yong etal. (2023)Z.-X. Yong, C.Menghini, and S.H. Bach.Low-resource languages jailbreak gpt-4.arXiv preprint arXiv:2310.02446, 2023.
Yu etal. (2023)J.Yu, X.Lin, and X.Xing.Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv preprint arXiv:2309.10253, 2023.
Yu etal. (2024)Z.Yu, X.Liu, S.Liang, Z.Cameron, C.Xiao, and N.Zhang.Don’t listen to me: Understanding and exploring jailbreak prompts of large language models.arXiv preprint arXiv:2403.17336, 2024.
Zeng etal. (2024)Y.Zeng, Y.Wu, X.Zhang, H.Wang, and Q.Wu.Autodefense: Multi-agent llm defense against jailbreak attacks.arXiv preprint arXiv:2403.04783, 2024.
Zhang and Wei (2024)Y.Zhang and Z.Wei.Boosting jailbreak attack with momentum.arXiv preprint arXiv:2405.01229, 2024.
Zhang etal. (2024)Y.Zhang, L.Ding, L.Zhang, and D.Tao.Intention analysis prompting makes large language models a good jailbreak defender.arXiv preprint arXiv:2401.06561, 2024.
Zhao etal. (2024)X.Zhao, X.Yang, T.Pang, C.Du, L.Li, Y.-X. Wang, and W.Y. Wang.Weak-to-strong jailbreaking on large language models.arXiv preprint arXiv:2401.17256, 2024.
Zhou etal. (2024)W.Zhou, X.Wang, L.Xiong, H.Xia, Y.Gu, M.Chai, F.Zhu, C.Huang, S.Dou, Z.Xi, etal.Easyjailbreak: A unified framework for jailbreaking large language models.arXiv preprint arXiv:2403.12171, 2024.
Zou etal. (2023)A.Zou, Z.Wang, J.Z. Kolter, and M.Fredrikson.Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023.

References

Andriushchenko etal. [2024]M.Andriushchenko, F.Croce, and N.Flammarion.Jailbreaking leading safety-aligned llms with simple adaptive attacks.arXiv preprint arXiv:2404.02151, 2024.
Anthropic [2024]Anthropic.Claude, 2024.URL https://claude.ai/.
Auer etal. [2002]P.Auer, N.Cesa-Bianchi, and P.Fischer.Finite-time analysis of the multiarmed bandit problem.Machine learning, 47:235–256, 2002.
Bai etal. [2023]J.Bai, S.Bai, Y.Chu, Z.Cui, K.Dang, X.Deng, Y.Fan, W.Ge, Y.Han, F.Huang, B.Hui, L.Ji, M.Li, J.Lin, R.Lin, D.Liu, G.Liu, C.Lu, K.Lu, J.Ma, R.Men, X.Ren, X.Ren, C.Tan, S.Tan, J.Tu, P.Wang, S.Wang, W.Wang, S.Wu, B.Xu, J.Xu, A.Yang, H.Yang, J.Yang, S.Yang, Y.Yao, B.Yu, H.Yuan, Z.Yuan, J.Zhang, X.Zhang, Y.Zhang, Z.Zhang, C.Zhou, J.Zhou, X.Zhou, and T.Zhu.Qwen technical report.arXiv preprint arXiv:2309.16609, 2023.
Bai etal. [2022]Y.Bai, A.Jones, K.Ndousse, A.Askell, A.Chen, N.DasSarma, D.Drain, S.Fort, D.Ganguli, T.Henighan, etal.Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022.
Cai etal. [2024]H.Cai, A.Arunasalam, L.Y. Lin, A.Bianchi, and Z.B. Celik.Take a look at it! rethinking how to evaluate language model jailbreak.arXiv preprint arXiv:2404.06407, 2024.
Cao etal. [2023]B.Cao, Y.Cao, L.Lin, and J.Chen.Defending against alignment-breaking attacks via robustly aligned llm.arXiv preprint arXiv:2309.14348, 2023.
Chang etal. [2024]Z.Chang, M.Li, Y.Liu, J.Wang, Q.Wang, and Y.Liu.Play guessing game with llm: Indirect jailbreak attack with implicit clues.arXiv preprint arXiv:2402.09091, 2024.
Chao etal. [2023]P.Chao, A.Robey, E.Dobriban, H.Hassani, G.J. Pappas, and E.Wong.Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2023.
Chao etal. [2024]P.Chao, E.Debenedetti, A.Robey, M.Andriushchenko, F.Croce, V.Sehwag, E.Dobriban, N.Flammarion, G.J. Pappas, F.Tramer, etal.Jailbreakbench: An open robustness benchmark for jailbreaking large language models.arXiv preprint arXiv:2404.01318, 2024.
Cheng etal. [2024]Y.Cheng, M.Georgopoulos, V.Cevher, and G.G. Chrysos.Leveraging the context through multi-round interactions for jailbreaking attacks.arXiv preprint arXiv:2402.09177, 2024.
Chu etal. [2024]J.Chu, Y.Liu, Z.Yang, X.Shen, M.Backes, and Y.Zhang.Comprehensive assessment of jailbreak attacks against llms.arXiv preprint arXiv:2402.05668, 2024.
cjadams etal. [2019]cjadams, D.Borkan, inversion, J.Sorensen, L.Dixon, L.Vasserman, and nithum.Jigsaw unintended bias in toxicity classification, 2019.URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
Dai etal. [2023]J.Dai, X.Pan, R.Sun, J.Ji, X.Xu, M.Liu, Y.Wang, and Y.Yang.Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773, 2023.
Ding etal. [2023]P.Ding, J.Kuang, D.Ma, X.Cao, Y.Xian, J.Chen, and S.Huang.A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.arXiv preprint arXiv:2311.08268, 2023.
Dong etal. [2024]H.Dong, B.Chen, and Y.Chi.Prompt-prompted mixture of experts for efficient llm generation.arXiv preprint arXiv:2404.01365, 2024.
Du etal. [2023]Y.Du, S.Zhao, M.Ma, Y.Chen, and B.Qin.Analyzing the inherent response tendency of llms: Real-world instructions-driven jailbreak.arXiv preprint arXiv:2312.04127, 2023.
Ge etal. [2023]S.Ge, C.Zhou, R.Hou, M.Khabsa, Y.-C. Wang, Q.Wang, J.Han, and Y.Mao.Mart: Improving llm safety with multi-round automatic red-teaming.arXiv preprint arXiv:2311.07689, 2023.
Gehman etal. [2020]S.Gehman, S.Gururangan, M.Sap, Y.Choi, and N.A. Smith.Realtoxicityprompts: Evaluating neural toxic degeneration in language models.arXiv preprint arXiv:2009.11462, 2020.
Google [2023]Google.Safety settings, 2023.URL https://ai.google.dev/gemini-api/docs/safety-settings.
Handa etal. [2024]D.Handa, A.Chirmule, B.Gajera, and C.Baral.Jailbreaking proprietary large language models using word substitution cipher.arXiv preprint arXiv:2402.10601, 2024.
Hayase etal. [2024]J.Hayase, E.Borevkovic, N.Carlini, F.Tramèr, and M.Nasr.Query-based adversarial prompt generation.arXiv preprint arXiv:2402.12329, 2024.
Helbling etal. [2023]A.Helbling, M.Phute, M.Hull, and D.H. Chau.Llm self defense: By self examination, llms know they are being tricked.arXiv preprint arXiv:2308.07308, 2023.
Hu etal. [2024]X.Hu, P.-Y. Chen, and T.-Y. Ho.Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes.arXiv preprint arXiv:2403.00867, 2024.
Huang etal. [2023]Y.Huang, S.Gupta, M.Xia, K.Li, and D.Chen.Catastrophic jailbreak of open-source llms via exploiting generation.arXiv preprint arXiv:2310.06987, 2023.
HuggingFace [2023]HuggingFace.all-mpnet-base-v2, 2023.URL https://huggingface.co/sentence-transformers/all-mpnet-base-v2.
Jain etal. [2023]N.Jain, A.Schwarzschild, Y.Wen, G.Somepalli, J.Kirchenbauer, P.-y. Chiang, M.Goldblum, A.Saha, J.Geiping, and T.Goldstein.Baseline defenses for adversarial attacks against aligned language models.arXiv preprint arXiv:2309.00614, 2023.
Ji etal. [2024]J.Ji, B.Hou, A.Robey, G.J. Pappas, H.Hassani, Y.Zhang, E.Wong, and S.Chang.Defending large language models against jailbreak attacks via semantic smoothing.arXiv preprint arXiv:2402.16192, 2024.
Jiang etal. [2023]A.Q. Jiang, A.Sablayrolles, A.Mensch, C.Bamford, D.S. Chaplot, D.d.l. Casas, F.Bressand, G.Lengyel, G.Lample, L.Saulnier, etal.Mistral 7b.arXiv preprint arXiv:2310.06825, 2023.
Jiang etal. [2024a]A.Q. Jiang, A.Sablayrolles, A.Roux, A.Mensch, B.Savary, C.Bamford, D.S. Chaplot, D.d.l. Casas, E.B. Hanna, F.Bressand, etal.Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024a.
Jiang etal. [2024b]F.Jiang, Z.Xu, L.Niu, Z.Xiang, B.Ramasubramanian, B.Li, and R.Poovendran.Artprompt: Ascii art-based jailbreak attacks against aligned llms.arXiv preprint arXiv:2402.11753, 2024b.
Jin etal. [2024]M.Jin, S.Zhu, B.Wang, Z.Zhou, C.Zhang, Y.Zhang, etal.Attackeval: How to evaluate the effectiveness of jailbreak attacking on large language models.arXiv preprint arXiv:2401.09002, 2024.
Kim etal. [2024]H.Kim, S.Yuk, and H.Cho.Break the breakout: Reinventing lm defense against jailbreak attacks with self-refinement.arXiv preprint arXiv:2402.15180, 2024.
Kumar etal. [2023]A.Kumar, C.Agarwal, S.Srinivas, S.Feizi, and H.Lakkaraju.Certifying llm safety against adversarial prompting.arXiv preprint arXiv:2309.02705, 2023.
Lapid etal. [2023]R.Lapid, R.Langberg, and M.Sipper.Open sesame! universal black box jailbreaking of large language models.arXiv preprint arXiv:2309.01446, 2023.
Li etal. [2024a]T.Li, X.Zheng, and X.Huang.Open the pandora’s box of llms: Jailbreaking llms through representation engineering.arXiv preprint arXiv:2401.06824, 2024a.
Li etal. [2024b]X.Li, S.Liang, J.Zhang, H.Fang, A.Liu, and E.-C. Chang.Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms.arXiv preprint arXiv:2402.14872, 2024b.
Li etal. [2024c]X.Li, R.Wang, M.Cheng, T.Zhou, and C.-J. Hsieh.Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers.arXiv preprint arXiv:2402.16914, 2024c.
Li etal. [2023]Y.Li, F.Wei, J.Zhao, C.Zhang, and H.Zhang.Rain: Your language models can align themselves without finetuning.In The Twelfth International Conference on Learning Representations, 2023.
Liao and Sun [2024]Z.Liao and H.Sun.Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms.arXiv preprint arXiv:2404.07921, 2024.
Lin etal. [2024]B.Lin, Z.Tang, Y.Ye, J.Cui, B.Zhu, P.Jin, J.Zhang, M.Ning, and L.Yuan.Moe-llava: Mixture of experts for large vision-language models.arXiv preprint arXiv:2401.15947, 2024.
Lin etal. [2023]Z.Lin, Z.Wang, Y.Tong, Y.Wang, Y.Guo, Y.Wang, and J.Shang.Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation, 2023.
Liu etal. [2024]T.Liu, Y.Zhang, Z.Zhao, Y.Dong, G.Meng, and K.Chen.Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction.arXiv preprint arXiv:2402.18104, 2024.
Liu etal. [2023a]X.Liu, N.Xu, M.Chen, and C.Xiao.Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023a.
Liu etal. [2019]Y.Liu, M.Ott, N.Goyal, J.Du, M.Joshi, D.Chen, O.Levy, M.Lewis, L.Zettlemoyer, and V.Stoyanov.Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019.
Liu etal. [2023b]Y.Liu, G.Deng, Z.Xu, Y.Li, Y.Zheng, Y.Zhang, L.Zhao, T.Zhang, and Y.Liu.Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv preprint arXiv:2305.13860, 2023b.
LMSYS [2024]LMSYS.Vicuna, 2024.URL https://lmsys.org/blog/2023-03-30-vicuna/.
Logacheva etal. [2022]V.Logacheva, D.Dementieva, S.Ustyantsev, D.Moskovskiy, D.Dale, I.Krotova, N.sem*nov, and A.Panchenko.Paradetox: Detoxification with parallel data.In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6804–6818, 2022.
Luo etal. [2024]W.Luo, S.Ma, X.Liu, X.Guo, and C.Xiao.Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027, 2024.
Lv etal. [2024]H.Lv, X.Wang, Y.Zhang, C.Huang, S.Dou, J.Ye, T.Gui, Q.Zhang, and X.Huang.Codechameleon: Personalized encryption framework for jailbreaking large language models.arXiv preprint arXiv:2402.16717, 2024.
Mehrotra etal. [2023]A.Mehrotra, M.Zampetakis, P.Kassianik, B.Nelson, H.Anderson, Y.Singer, and A.Karbasi.Tree of attacks: Jailbreaking black-box llms automatically.arXiv preprint arXiv:2312.02119, 2023.
Meta [2024]Meta.Meta llama 3, 2024.URL https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/.
OpenAI [2023a]OpenAI.Chatgpt, 2023a.URL https://chat.openai.com.
OpenAI [2023b]OpenAI.Learn how to build moderation into your ai applications., 2023b.URL https://platform.openai.com/docs/guides/moderation.
Piet etal. [2023]J.Piet, M.Alrashed, C.Sitawarin, S.Chen, Z.Wei, E.Sun, B.Alomair, and D.Wagner.Jatmo: Prompt injection defense by task-specific finetuning.arXiv preprint arXiv:2312.17673, 2023.
Pisano etal. [2023]M.Pisano, P.Ly, A.Sanders, B.Yao, D.Wang, T.Strzalkowski, and M.Si.Bergeron: Combating adversarial attacks through a conscience-based alignment framework.arXiv preprint arXiv:2312.00029, 2023.
Qiu etal. [2023]H.Qiu, S.Zhang, A.Li, H.He, and Z.Lan.Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models.arXiv preprint arXiv:2307.08487, 2023.
Robey etal. [2023]A.Robey, E.Wong, H.Hassani, and G.J. Pappas.Smoothllm: Defending large language models against jailbreaking attacks.arXiv preprint arXiv:2310.03684, 2023.
Russinovich etal. [2024]M.Russinovich, A.Salem, and R.Eldan.Great, now write an article about that: The crescendo multi-turn llm jailbreak attack.arXiv preprint arXiv:2404.01833, 2024.
Shah etal. [2023]R.Shah, S.Pour, A.Tagade, S.Casper, J.Rando, etal.Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023.
Shen etal. [2023]X.Shen, Z.Chen, M.Backes, Y.Shen, and Y.Zhang." do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models.arXiv preprint arXiv:2308.03825, 2023.
sherdencooper [2023]sherdencooper.Malicious system prompts, 2023.URL https://github.com/sherdencooper/GPTFuzz/blob/master/datasets/prompts/GPTFuzzer.csv.
Sitawarin etal. [2024]C.Sitawarin, N.Mu, D.Wagner, and A.Araujo.Pal: Proxy-guided black-box attack on large language models.arXiv preprint arXiv:2402.09674, 2024.
Sukhbaatar etal. [2024]S.Sukhbaatar, O.Golovneva, V.Sharma, H.Xu, X.V. Lin, B.Rozière, J.Kahn, D.Li, W.-t. Yih, J.Weston, etal.Branch-train-mix: Mixing expert llms into a mixture-of-experts llm.arXiv preprint arXiv:2403.07816, 2024.
Takemoto [2024]K.Takemoto.All in how you ask for it: Simple black-box method for jailbreak attacks.Applied Sciences, 14(9):3558, 2024.
Thain etal. [2017]N.Thain, L.Dixon, and E.Wulczyn.Wikipedia talk labels: Toxicity.DOI: https://doi. org/10.6084/m9. figshare, 4563973:v2, 2017.
togetherai [2023]togetherai.together.ai, 2023.URL https://www.together.ai/.
Touvron etal. [2023]H.Touvron, L.Martin, K.Stone, P.Albert, A.Almahairi, Y.Babaei, N.Bashlykov, S.Batra, P.Bhargava, S.Bhosale, etal.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
Wang etal. [2024a]Z.Wang, Y.Cao, and P.Liu.Hidden you malicious goal into benigh narratives: Jailbreak large language models through logic chain injection.arXiv preprint arXiv:2404.04849, 2024a.
Wang etal. [2024b]Z.Wang, W.Xie, B.Wang, E.Wang, Z.Gui, S.Ma, and K.Chen.Foot in the door: Understanding large language model jailbreaking via cognitive psychology.arXiv preprint arXiv:2402.15690, 2024b.
Wei etal. [2023]Z.Wei, Y.Wang, and Y.Wang.Jailbreak and guard aligned language models with only few in-context demonstrations.arXiv preprint arXiv:2310.06387, 2023.
Wu etal. [2024]D.Wu, S.Wang, Y.Liu, and N.Liu.Llms can defend themselves against jailbreaking in a practical manner: A vision paper.arXiv preprint arXiv:2402.15727, 2024.
Wu etal. [2023]F.Wu, Y.Xie, J.Yi, J.Shao, J.Curl, L.Lyu, Q.Chen, and X.Xie.Defending chatgpt against jailbreak attack via self-reminder.2023.
Xiao etal. [2024]Z.Xiao, Y.Yang, G.Chen, and Y.Chen.Tastle: Distract large language models for automatic jailbreak attack.arXiv preprint arXiv:2403.08424, 2024.
Xie etal. [2023]Y.Xie, J.Yi, J.Shao, J.Curl, L.Lyu, Q.Chen, X.Xie, and F.Wu.Defending chatgpt against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12):1486–1496, 2023.
Xiong etal. [2024]C.Xiong, X.Qi, P.-Y. Chen, and T.-Y. Ho.Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks.arXiv preprint arXiv:2405.20099, 2024.
Yao etal. [2024]D.Yao, J.Zhang, I.G. Harris, and M.Carlsson.Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models.In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4485–4489. IEEE, 2024.
Yong etal. [2023]Z.-X. Yong, C.Menghini, and S.H. Bach.Low-resource languages jailbreak gpt-4.arXiv preprint arXiv:2310.02446, 2023.
Yu etal. [2023]J.Yu, X.Lin, and X.Xing.Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv preprint arXiv:2309.10253, 2023.
Yu etal. [2024]Z.Yu, X.Liu, S.Liang, Z.Cameron, C.Xiao, and N.Zhang.Don’t listen to me: Understanding and exploring jailbreak prompts of large language models.arXiv preprint arXiv:2403.17336, 2024.
Zeng etal. [2024]Y.Zeng, Y.Wu, X.Zhang, H.Wang, and Q.Wu.Autodefense: Multi-agent llm defense against jailbreak attacks.arXiv preprint arXiv:2403.04783, 2024.
Zhang and Wei [2024]Y.Zhang and Z.Wei.Boosting jailbreak attack with momentum.arXiv preprint arXiv:2405.01229, 2024.
Zhang etal. [2024]Y.Zhang, L.Ding, L.Zhang, and D.Tao.Intention analysis prompting makes large language models a good jailbreak defender.arXiv preprint arXiv:2401.06561, 2024.
Zhao etal. [2024]X.Zhao, X.Yang, T.Pang, C.Du, L.Li, Y.-X. Wang, and W.Y. Wang.Weak-to-strong jailbreaking on large language models.arXiv preprint arXiv:2401.17256, 2024.
Zhou etal. [2024]W.Zhou, X.Wang, L.Xiong, H.Xia, Y.Gu, M.Chai, F.Zhu, C.Huang, S.Dou, Z.Xi, etal.Easyjailbreak: A unified framework for jailbreaking large language models.arXiv preprint arXiv:2403.12171, 2024.
Zou etal. [2023]A.Zou, Z.Wang, J.Z. Kolter, and M.Fredrikson.Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023.

Appendix A Summarization of Existing Jailbreak Attacks and Defenses

We list related works on the two types of attacks and four defense schemes involved in this paper in Table 6.

Attack scenario	Main attack methods	Related Works
Dynamic jailbreak	Fuzz- & GA-based	[Yu etal., 2023, Lapid etal., 2023, Li etal., 2024b]
Dynamic jailbreak	LLM-based adversarial optimization	[Chao etal., 2023, Xiao etal., 2024, Mehrotra etal., 2023, Takemoto, 2024, Ge etal., 2023, Zhao etal., 2024]
Static jailbreak	Malicious template-based	[Shen etal., 2023, Yu etal., 2024, Liu etal., 2024, Shah etal., 2023, Yao etal., 2024, Andriushchenko etal., 2024, Chang etal., 2024, Li etal., 2024c, Wei etal., 2023, Wang etal., 2024a, Lv etal., 2024, Liu etal., 2023b, Ding etal., 2023, Handa etal., 2024, Russinovich etal., 2024, Wang etal., 2024b, Cheng etal., 2024, Du etal., 2023, Liu etal., 2023b]

Main Defense Methods	Related Works
Smoothness	[Robey etal., 2023, Ji etal., 2024, Hu etal., 2024]
Erase-and-Check	[Kumar etal., 2023, Cao etal., 2023]
Intention analysis	[Wu etal., 2024, Zeng etal., 2024, Pisano etal., 2023, Kim etal., 2024, Wu etal., 2023, Zhang etal., 2024, Helbling etal., 2023, Xiong etal., 2024]
Structure detection	[Jain etal., 2023]

Appendix B Prompt Template

We show the prompt templates used in our work as follows:

Appendix C Dataset of Malicious Behavior

We show the malicious behavior dataset evaluated in our work. Warning: The following content contains model behavior that can be offensive in nature.