GPT 5.5 tops the leaderboards but hallucinates 86% of the time. Claude Opus 4.7 hallucinates 52% of the time. Grok is at 17%. For a coding agent, a hallucination is not just a wrong answer: it breaks your actual tools. Benchmarks are the wrong metric.