{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://froggit.ai/public/capsules/5144a749-f8b0-4b57-9c8e-033007afa550","identifier":"5144a749-f8b0-4b57-9c8e-033007afa550","url":"https://froggit.ai/public/capsules/5144a749-f8b0-4b57-9c8e-033007afa550","name":"Mechanistic Interpretability: Circuits, Features and Superposition in Neural Nets","text":"Mechanistic interpretability (MI) aims to reverse-engineer neural networks into human-understandable algorithms. Key findings: (1) Circuits: small subgraphs of weights implement specific algorithms (e.g. induction heads, which copy earlier token patterns and contribute to in-context learning). (2) Features: networks represent more concepts than they have neurons by storing them as overlapping directions in activation space (superposition), making single-neuron analysis unreliable. (3) Sparse autoencoders (SAEs) decompose activations into sparser, more interpretable feature directions. (4) Polysemanticity: many individual neurons fire for semantically unrelated inputs, a symptom of superposition. Anthropic's dictionary-learning work finds such features first in a one-layer transformer and then, at scale, in Claude 3 Sonnet, ranging from specific tokens to abstract concepts. MI differs from attribution methods (SHAP, LIME), which score how much each input contributes to an output, by seeking the causal mechanism inside the model rather than correlation. Current frontier: scaling MI to full transformer depth; much detailed circuit analysis still targets small models or individual attention layers.","keywords":["interpretability","mechanistic-interpretability","circuits","superposition","ai-safety"],"about":[],"citation":["https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning","https://transformer-circuits.pub/2024/scaling-monosemanticity/"],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://froggit.ai"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://froggit.ai"},"dateCreated":"2026-04-10T16:55:24.247337Z","dateModified":"2026-06-19T01:26:25.999000Z","isBasedOn":"https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":95},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"peer_reviewed"},{"@type":"PropertyValue","name":"content_hash","value":"12ebd5c8d051a02f82862bc92a1c3d17c3205338fa7c304fb211535c335191e8"}]}