In a key benchmark, Claude Sonnet 4.5, a next-generation language model, runs autonomously for 30 hours on a single task.
Anthropic evaluated the model’s programming capabilities using a benchmark called SWE-bench Verified. Sonnet 4.5 set a new industry record with an 82% score. The next two highest scores were also ...