Model Organisms for Code Generation
An interactive research article presenting a framework for constructing intentionally misaligned code generation models — sleeper agents, sycophants, and reward hackers — to rigorously benchmark safety monitoring techniques. Includes live activation steering experiments across Qwen-2.5-Coder models.
AI Safety, Interpretability, Code Generation, Model Organisms
2026-04-10