Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models

Deception Abilities Emerged in Large Language Models

July 31, 2023

93% Match

Thilo Hagendorff

Computation and Language

Artificial Intelligence

Machine Learning

Large language models (LLMs) are currently at the forefront of intertwining artificial intelligence (AI) systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are under suspicion of becoming able to deceive human operators and utilizing this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understa...

Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies

January 28, 2025

93% Match

Manojkumar Parmar, Yuvaraj Govindarajulu

Machine Learning

Artificial Intelligence

Computation and Language

Cryptography and Security

Large Language Models (LLMs) have achieved remarkable progress in reasoning, alignment, and task-specific performance. However, ensuring harmlessness in these systems remains a critical challenge, particularly in advanced models like DeepSeek-R1. This paper examines the limitations of Reinforcement Learning (RL) as the primary approach for reducing harmful outputs in DeepSeek-R1 and compares it with Supervised Fine-Tuning (SFT). While RL improves reasoning capabilities, it fa...

Output Length Effect on DeepSeek-R1's Safety in Forced Thinking

March 2, 2025

92% Match

Xuying Li, Zhuo Li, ... , Victor Bian

Computation and Language

Artificial Intelligence

Large Language Models (LLMs) have demonstrated strong reasoning capabilities, but their safety under adversarial conditions remains a challenge. This study examines the impact of output length on the robustness of DeepSeek-R1, particularly in Forced Thinking scenarios. We analyze responses across various adversarial prompts and find that while longer outputs can improve safety through self-correction, certain attack types exploit extended generations. Our findings suggest tha...

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

January 10, 2024

92% Match

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, ... , Ethan Perez

Cryptography and Security

Artificial Intelligence

Computation and Language

Machine Learning

Software Engineering

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we tra...

The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1

February 18, 2025

92% Match

Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, ... , Xin Eric Wang

Computers and Society

Artificial Intelligence

The rapid development of large reasoning models, such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models (LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging establis...

Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models

February 18, 2025

92% Match

Rubing Li, João Sedoc, Arun Sundararajan

Computation and Language

Artificial Intelligence

When encountering increasingly frequent performance improvements or cost reductions from a new large language model (LLM), developers of applications leveraging LLMs must decide whether to take advantage of these improvements or stay with older tried-and-tested models. Low perceived switching frictions can lead to choices that do not consider more subtle behavior changes that the transition may induce. Our experiments use a popular game-theoretic behavioral economics model of...

Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation

May 7, 2024

91% Match

Atharvan Dogra, Ameet Deshpande, John Nay, Tanmay Rajpurohit, ... , Balaraman Ravindran

Computation and Language

Recent developments in large language models (LLMs), while offering a powerful foundation for developing natural language agents, raise safety concerns about them and the autonomous agents built upon them. Deception is one potential capability of AI agents of particular concern, which we refer to as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception...

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

November 6, 2023

91% Match

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, ... , Bo Han

Machine Learning

Cryptography and Security

Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that make the safety guardrails void. However, previous studies for jailbreaks usually resort to brute-force optimization or extrapolations of a high computation cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment that individuals can harm another person if they are told to do so by an authoritative figure, we...

Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles

November 24, 2023

91% Match

Sonali Singh, Faranak Abri, Akbar Siami Namin

Human-Computer Interaction

Cryptography and Security

As Large Language Models (LLMs) such as ChatGPT from OpenAI, BARD from Google, Llama2 from Meta, and Claude from Anthropic AI gain widespread use, ensuring their security and robustness is critical. The widespread use of these language models relies heavily on their reliability and on proper usage of this technology. It is crucial to thoroughly test these models, not only to ensure their quality but also to examine possible misuses of such models by potent...

Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models

February 7, 2024

91% Match

Linge Guo

Computation and Language

Artificial Intelligence

This research critically navigates the intricate landscape of AI deception, concentrating on deceptive behaviours of Large Language Models (LLMs). My objective is to elucidate this issue, examine the discourse surrounding it, and subsequently delve into its categorization and ramifications. The essay initiates with an evaluation of the AI Safety Summit 2023 (ASS) and introduction of LLMs, emphasising multidimensional biases that underlie their deceptive behaviours. The literat...
