Linear Probes Mechanistic Interpretability, Produces a layer-by-layer accuracy heatmap showing where information is encoded.

Linear Probes Mechanistic Interpretability, Mar 28, 2023 · The meta-level point that makes me excited about this is that linear probes are really nice objects for interpretability. Produces a layer-by-layer accuracy heatmap showing where information is encoded. In addition to demonstrating generalization of counterfactual inference behavior, we use mechanistic interpretability tools to probe the network’s representations. Jan 12, 2026 · One approach, known as mechanistic interpretability, aims to map the key features and the pathways between them across an entire model. This mechanistic perspective represents a paradigm shift in interpretability, which aims to unpack the causal factors that drive model results. Mechanistic Interpretability Explorer Visualize which MLP neurons inside a small transformer (GPT-2) activate for specific linguistic and factual concepts — capitals, famous people, and more. , 2020). Gradient-based attributions: We can compute the gradient of a chosen output with respect to some or all of the neural values. Feb 5, 2026 · We can also derive additional information: Linear probes and classifiers: We can build a system that classifies the recorded residual stream into one group or another, or measures some feature within it. One mechanistic interpretability research direction has focused on understanding toy models in detail. Sparse Autoencoders (SAEs) — trained on the most probe-rich layers. Oct 24, 2024 · We used ridge regression based linear probes in this study. First, we show that we can decode the referenced SCM from the transformer’s residual activations using linear probes. We articulate requirements for such theories, survey progress across mechanistic interpretability, fairness, memorization, and learning dynamics, and identify concrete open problems. While focusing on bottom-up, mechanistic interpretability approaches, we can also consider integrating top-down, concept-based structured probes with mechanistic interpretability. Discovers features beyond those predefined by probes. . Probe performance could reflect its own capabilities more than actual characteristics of the representation. The linear representation hypothesis offers a “resolution” to this problem. The goal is to map model behavior to internal mechanisms (features, circuits, attention patterns, activation patterns) that are causally responsible for the Mechanistic Interpretability Explorer Visualize which MLP neurons inside a small transformer (GPT-2) activate for specific linguistic and factual concepts — capitals, famous people, and more. Sep 19, 2024 · Non-linear probes have been alleged to have this property, and that is why a linear probe is entrusted with this task. Fundamentally, transformers are made of linear algebra! Nov 24, 2025 · Mechanistic interpretability allows for an organized characterization of AI systems, as opposed to the divide-and-conquer methods of XAI which provide explainability only in specific contexts [19]. wcgq, yce4, hwy, rm, lt5i, pyw8wtv, 5baq, 75, rzkohr, 2gkpn,