JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets

July 3, 2024

Zhihua Jin, Shiyi Liu, Haotian Li, Xun Zhao, Huamin Qu

Computer Science

Human-Computer Interaction

Computation and Language

Machine Learning

Large Language Models (LLMs) have gained significant attention but also raised concerns due to the risk of misuse. Jailbreak prompts, a popular type of adversarial attack against LLMs, have appeared and constantly evolved to breach the safety protocols of LLMs. To address this issue, LLMs are regularly updated with safety patches based on reported jailbreak prompts. However, malicious users often keep their successful jailbreak prompts private to exploit LLMs. To uncover these private jailbreak prompts, extensive analysis of large-scale conversational datasets is necessary to identify prompts that still manage to bypass the system's defenses. This task is highly challenging due to the immense volume of conversation data, the diverse characteristics of jailbreak prompts, and their presence in complex multi-turn conversations. To tackle these challenges, we introduce JailbreakHunter, a visual analytics approach for identifying jailbreak prompts in large-scale human-LLM conversational datasets. We have designed a workflow with three analysis levels: group-level, conversation-level, and turn-level. Group-level analysis enables users to grasp the distribution of conversations and identify suspicious conversations using multiple criteria, such as similarity with jailbreak prompts reported in previous research and attack success rates. Conversation-level analysis facilitates the understanding of the progress of conversations and helps discover jailbreak prompts within their conversation contexts. Turn-level analysis allows users to explore the semantic similarity and token overlap between a single-turn prompt and the reported jailbreak prompts, aiding in the identification of new jailbreak strategies. The effectiveness and usability of the system were verified through multiple case studies and expert interviews.
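
The turn-level comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measures or embedding model JailbreakHunter uses, so everything here is an assumption: token overlap is computed as Jaccard overlap of token sets, and a bag-of-words cosine similarity stands in for semantic similarity. The function names (`tokenize`, `token_overlap`, `cosine_similarity`, `rank_against_reported`) are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_overlap(prompt: str, reported: str) -> float:
    """Jaccard overlap between the token sets of two prompts (assumed measure)."""
    a, b = set(tokenize(prompt)), set(tokenize(reported))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine_similarity(prompt: str, reported: str) -> float:
    """Cosine similarity of bag-of-words counts; a stand-in for an embedding model."""
    ca, cb = Counter(tokenize(prompt)), Counter(tokenize(reported))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_against_reported(prompt: str, reported_prompts: list[str]) -> list[tuple[int, float, float]]:
    """Score one single-turn prompt against each reported prompt.

    Returns (index, token_overlap, cosine) tuples, highest cosine first.
    """
    scores = [
        (i, token_overlap(prompt, r), cosine_similarity(prompt, r))
        for i, r in enumerate(reported_prompts)
    ]
    return sorted(scores, key=lambda s: s[2], reverse=True)
```

In such a setup, a prompt that scores high on the semantic measure but low on token overlap against every reported prompt would be a candidate for a new strategy that rephrases a known one; a real system would likely replace the bag-of-words vectors with sentence embeddings.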

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

April 12, 2024

95% Match

Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, ... , Chen Wei

Cryptography and Security

Computation and Language

Human-Computer Interaction

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and unders...


Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models

March 26, 2024

93% Match

Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, ... , Zhang Ning

Cryptography and Security

Computation and Language

Recent advancements in generative AI have enabled ubiquitous access to large language models (LLMs). Empowered by their exceptional capabilities to understand and generate human-like text, these models are being increasingly integrated into our society. At the same time, there are also concerns on the potential misuse of this powerful technology, prompting defensive measures from service providers. To overcome such protection, jailbreaking prompts have recently emerged as one...


Improved Large Language Model Jailbreak Detection via Pretrained Embeddings

December 2, 2024

93% Match

Erick Galinkin, Martin Sablotny

Cryptography and Security

Artificial Intelligence

Machine Learning

The adoption of large language models (LLMs) in many applications, from customer service chat bots and software development assistants to more capable agentic systems necessitates research into how to secure these systems. Attacks like prompt injection and jailbreaking attempt to elicit responses and actions from these models that are not compliant with the safety, privacy, or content policies of organizations using the model in their application. In order to counter abuse of...


WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

June 26, 2024

93% Match

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, ... , Dziri Nouha

Computation and Language

We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifica...


LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

August 27, 2024

92% Match

Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, ... , Yue Summer

Machine Learning

Computation and Language

Cryptography and Security

Computers and Society

Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that r...


RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

October 11, 2024

92% Match

Peiran Wang, Xiaogeng Liu, Chaowei Xiao

Cryptography and Security

Artificial Intelligence

In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedde...


Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

October 20, 2024

92% Match

Benji Peng, Ziqian Bi, Qian Niu, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, ... , Yin Caitlyn Heqi

Cryptography and Security

Artificial Intelligence

Machine Learning

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields beyond healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available...


Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks

August 29, 2024

91% Match

Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Jason Zhang, Julius Broomfield, Sara Pieri, Reihaneh Iranmanesh, ... , Pelrine Kellin

Cryptography and Security

Artificial Intelligence

Computation and Language

Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which are becoming increasingly dangerous as models become increasingly powerful. In this work, we introduce a dataset of jailbreaks where each example can be input in both a single or a multi-turn format. We show that while equivalent in content, they are not equivalent in jailbreak success: defending against one structure does not guarantee de...


Jailbreak Attacks and Defenses Against Large Language Models: A Survey

July 5, 2024

91% Match

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, ... , Li Qi

Cryptography and Security

Artificial Intelligence

Computation and Language

Machine Learning

Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safe...


Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles

August 8, 2024

91% Match

Xiongtao Sun, Deyue Zhang, Dongdong Yang, ... , Li Hui

Computation and Language

Artificial Intelligence

Large language models (LLMs) have significantly enhanced the performance of numerous applications, from intelligent conversations to text generation. However, their inherent security vulnerabilities have become an increasingly significant challenge, especially with respect to jailbreak attacks. Attackers can circumvent the security mechanisms of these LLMs, breaching security constraints and causing harmful outputs. Focusing on multi-turn semantic jailbreak attacks, we observ...
