Mechanistic Interpretability for AI Safety -- A Review

Open Problems in Mechanistic Interpretability

January 27, 2025

94% Match

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, ... , Tom McGrath

Machine Learning

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many sci...

The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms

August 11, 2024

94% Match

Adam Davies, Ashkan Khakzar

Artificial Intelligence

Artificial neural networks have long been understood as "black boxes": though we know their computation graphs and learned parameters, the knowledge encoded by these weights and functions they perform are not inherently interpretable. As such, from the early days of deep learning, there have been efforts to explain these models' behavior and understand them internally; and recently, mechanistic interpretability (MI) has emerged as a distinct research area studying the feature...

Position Paper: Toward New Frameworks for Studying Model Representations

February 6, 2024

94% Match

Satvik Golechha, James Dao

Machine Learning

Artificial Intelligence

Mechanistic interpretability (MI) aims to understand AI models by reverse-engineering the exact algorithms neural networks learn. Most works in MI so far have studied behaviors and capabilities that are trivial and token-aligned. However, most capabilities are not that trivial, which advocates for the study of hidden representations inside these networks as the unit of analysis. We do a literature review, formalize representations for features and behaviors, highlight their i...

Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience

June 3, 2024

93% Match

Martina G. Vilas, Federico Adolfi, ... , Gemma Roig

Artificial Intelligence

Machine Learning

Neurons and Cognition

Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question its usefulness to advance the broader goals of AI. However, it has been overlooked that these issues resemble those that have been grappled with in another field: Cognitive Neuroscience. Here we draw the relevant connections and highlight les...

A Mechanistic Explanatory Strategy for XAI

November 2, 2024

93% Match

Marcin Rabiza

Machine Learning

Artificial Intelligence

Despite significant advancements in XAI, scholars note a persistent lack of solid conceptual foundations and integration with broader scientific discourse on explanation. In response, emerging XAI research draws on explanatory strategies from various sciences and philosophy of science literature to fill these gaps. This paper outlines a mechanistic strategy for explaining the functional organization of deep learning systems, situating recent advancements in AI explainability ...

Causal Abstraction in Model Interpretability: A Compact Survey

October 26, 2024

93% Match

Yihao Zhang

Machine Learning

Artificial Intelligence

Computation and Language

The pursuit of interpretable artificial intelligence has led to significant advancements in the development of methods that aim to explain the decision-making processes of complex models, such as deep learning systems. Among these methods, causal abstraction stands out as a theoretical framework that provides a principled approach to understanding and explaining the causal mechanisms underlying model behavior. This survey paper delves into the realm of causal abstraction, exa...

Causal Analysis of Agent Behavior for AI Safety

March 5, 2021

93% Match

Grégoire Déletang, Jordi Grau-Moya, Miljan Martic, Tim Genewein, Tom McGrath, Vladimir Mikulik, Markus Kunesch, ... , Pedro A. Ortega

Artificial Intelligence

Machine Learning

As machine learning systems become more powerful they also become increasingly unpredictable and opaque. Yet, finding human-understandable explanations of how they work is essential for their safe deployment. This technical report illustrates a methodology for investigating the causal mechanisms that drive the behaviour of artificial agents. Six use cases are covered, each addressing a typical question an analyst might ask about an agent. In particular, we show that each ques...

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

July 27, 2022

92% Match

Tilman Räuker, Anson Ho, ... , Dylan Hadfield-Menell

Machine Learning

Artificial Intelligence

Computation and Language

Computer Vision and Pattern ...

The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "i...

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

July 2, 2024

92% Match

Daking Rai, Yilun Zhou, Shi Feng, ... , Ziyu Yao

Artificial Intelligence

Computation and Language

Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. However, there has not been work that comprehensively reviews these insights and challenges, particularly as a guide for newcomers t...

Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications

March 17, 2020

92% Match

Wojciech Samek, Grégoire Montavon, Sebastian Lapuschkin, ... , Klaus-Robert Müller

Machine Learning

Artificial Intelligence

Computer Vision and Pattern ...

Neural and Evolutionary Comp...

Machine Learning

With the broader and highly successful usage of machine learning in industry and the sciences, there has been a growing demand for Explainable AI. Interpretability and explanation methods for gaining a better understanding about the problem solving abilities and strategies of nonlinear Machine Learning, in particular, deep neural networks, are therefore receiving increased attention. In this work we aim to (1) provide a timely overview of this active emerging field, with a fo...

Mechanistic Interpretability for AI Safety -- A Review
