Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.
I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me optimistic. I have found that the more models are optimized for fully agentic development, the worse they get at assisted development, and they often start doing too much despite very strict/specific instructions.
I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
I think you should try an OpenAI model like GPT 5.5. It is better at following instructions and boundaries set in the prompt. It feels like a more capable "agent assistant" than Claude models, without any loss of intelligence.
Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review and ask a lot more questions from the agent than I've seen others doing. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.
There’s no way to justify their valuations if they get downgraded to a pair programming tool. They need fully agentic stuff to work and replace human engineers to even come close.
Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models.
It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them.
If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value annually. Even if the AI industry can capture a fraction of that, that is a huuuge monetization opportunity.
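For anyone who wants to check that arithmetic, here is a rough sketch. It only inverts the numbers in the comment above: a 5% boost worth ~1.5 - 2.5T implies a base of roughly 30 - 50T in annual output. That base is inferred from the comment, not sourced, and the function names are made up for illustration.

```python
def implied_base(value_trillions: float, boost_percent: float) -> float:
    """Annual output (in trillions) implied by a given value at a given boost."""
    return value_trillions * 100 / boost_percent


def value_at(base_trillions: float, boost_percent: float) -> float:
    """Annual value (in trillions) of a productivity boost on a given base."""
    return base_trillions * boost_percent / 100


# Base implied by "5% is ~1.5 - 2.5T" (an inference, not a sourced figure).
low_base = implied_base(1.5, 5)
high_base = implied_base(2.5, 5)

# The same base at the ~30% figure the comment attributes to ~2024 studies.
low_at_30 = value_at(low_base, 30)
high_at_30 = value_at(high_base, 30)

print(f"implied base: {low_base:.0f}T to {high_base:.0f}T")
print(f"value at 30%: {low_at_30:.0f}T to {high_at_30:.0f}T")
```

The point of the second calculation is just that the estimate scales linearly with the assumed boost, so the headline number is only as good as that one input.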
Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)
Now, I don't think AGI will happen soon (or has already happened, depending on how you define it) but I do think humans will be a much smaller part of the loop and large-scale job displacement will happen once companies figure out how to properly use AI.
At this point, the financial upside for the AI industry is extremely high but will be limited by the social turmoil that will inevitably ensue (which we're already seeing brewing in the data center backlash.)
That's a really good point. I think if there wasn't the insane amount of money involved and these were treated as tools instead, they would probably be MORE productive. I think a person working hand in hand with an AI instead of delegating is the sweet spot of making things fast while also not losing understanding or control of the system. You are absolutely right that these companies can't justify their valuations if they do that though. I just got a new mac to run models locally, and so far the results have been positive with some small hiccups. I'm thinking the future of this tech will likely be better tooling with better IDE integrations rather than "Claude plz make me a SaaS kthx"
My two cents is that the way to square this circle is that the valuations should be lower and they should be spending a lot less on constant retraining.
Unfortunately (from my perspective) it seems like the US companies are increasingly stuck in their current model. I think it's a competitive disadvantage.
But obviously most of the real insiders seem to disagree with me, so I'm probably wrong :)
The insiders disagree because they are benefiting greatly from the insane valuations, right?
Chinese models are quickly commodifying frontier inference, the US Gov is preventing domestic SOTA models access to the public and without those models why would consumers still spend $200/month to use the best models?
It’s such a mess and isn’t inspiring confidence as a non-investor.
I find these nefarious intention theories shallow. It can be the case that the end state is them owning the means of production without that being the intended guiding goal. Companies can chase profit without being Leninistic boogeymen.
I've been using Kimi K2.6 lately (don't have 2.7 available through blessed work channels yet) for tasks where I already know what it is I want to do and I want to just step through the process in pieces, and it's fine. Do I have to correct it maybe a bit more than Opus? Yeah, but the real cutoff would be between "I have to read every line" and "I can just trust it without reading every line" and for me neither model hits that mark, and I expect it to be a while yet for that. Is it as good as Opus if I want to spit ball about architecture and then convert that to code? No, but I don't have that problem all the time, and it's there if I do need it.
And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.
That said, if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.
> I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
I've moved completely to local models that I run with my M1 Mac Studio (64gb ram) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to Openrouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al.
Tokens and speed are a factor but does it require less back and forth to get things right? Being "fast and cheap but wrong" still has a cost that an otherwise "expensive and slow" exchange does not
I actually use Sonnet 4.6 for my day-to-day coding too. It consumes far fewer tokens and is good enough. Opus is just too token-hungry to be useful to me.
I haven't. Thanks for the heads up will give it a try!
I use opus to comment on code design quite often though. It became a pattern that I made a skill for me to ask for second opinions https://news.ycombinator.com/item?id=48733092
Would love to hear your feedback if you don't mind!
I've been largely disappointed how much the Claude models ignore custom instructions, and sometimes even prompts on the chat interface. It sometimes feels like talking to a wall, or as if there was a third person in the chatroom whose messages I can't see.
I can't help but feel this is intentional towards the 'Agentic' workflow.
I think this seems purposeful, as there's 2 opposing forces at play:
- Have a model that follows the users instructions
- Have a model that follows the system prompt instructions more
For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual Logic unless they program it out to test, this runs afoul and we get one or the other.
Feels like optimizing for either precision or recall, but can't have both
I suppose a solution might be going with a customizable harness like pi and merging Anthropic’s system prompt with a personalized custom one to remove all contradictions
People keep making comments about Fable like this? You could only use it for, what, like a week? How is that at all enough time to evaluate? Opus 4.6 didn't suffer from these problems for a hot minute, and then when newer models were released it got worse. I think they change a ton behind the scenes and allocate compute however they want, so the model you use today may behave much differently than how it behaved yesterday
> You could only use it for what like a week? How is that at all enough time to evaluate?
By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and Fable finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like a special-education kid multiple times a day.
The ~72 hours I had access to Fable were by far the most productive I've had in months. Re-wrote massive parts of my codebase and caught a ton of bugs and logic issues that had silently slipped through before. I went over my subscription limit and immediately kept paying the API price to keep going. It was that good.
It was a pretty stark difference. I had the opposite problem where it did too much and overshot what I wanted from it so I certainly assume that if it had stuck around it would have gotten tuned back a bit pretty quickly.
Heh, it's not crazy if you're here in the Bay: I know multiple people who more-or-less disappeared for days when Fable came out because they were running their benchmarks, and only emerged blinking into the sunlight when the USG banned it. That's just how things are here now, most people are normal but there are some serious LLM dope addicts out and about.
I've been seeing LLMs act lazy from the very beginning. They got a little better but smaller models really only want to have a single task given to them. Mythos at least does work. RIP
Try to run your prompts through Claude to pinpoint any ambiguous parts that can be interpreted in multiple ways, or self-contradictory sections. I typically resolve any prompt-ignoring issues with that.
I've been saying for ages that since Opus 4.6, models have been getting smarter but less helpful as assistants.
Fable was amazing as a vibecoder but as an assistant it can't resist jumping into implementation and filling chats with pointless jargon.
It's really grim if you're looking for assistance instead of an implementor.
GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information.
I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.
It isn’t a dream, it’s a reality for some of us here and it will be increasingly so for everyone else. Amazingly, USG intervening slowed the dynamic greatly (fortunately?)
The problem is obviously who will be left. There’s a lot of scifi to catch up on.
Yep, this is why experiences and ratings of models vary so wildly.
I recently migrated a very large web app to Tailwind and Opus kept screwing up over and over, refactoring and changing the design, the more complex the component became.
I ended up asking Haiku to do it and it managed to do everything correctly, pretty much without intervention.
> I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.
I've taken to instructing the agent to manage the subagent, and the principal agent's sole job is to ensure the subagent follows instructions to the letter.
The cost per task chart is telling me that I should _never_ use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels.
- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
- For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them
You're referring to the Agentic search, but if you look at the Agentic computer use the cost is basically halved.
However, I am also confused about market positioning. Too expensive to perform daily tasks - open source models are much cheaper - and not a frontier model that can address complex real-world problems.
Yeah, I was looking at the same chart and was very surprised at where the curve is relative to opus... Feels like sonnet 5 is "what if opus had an extra-low effort level"?
While I appreciate that they publish this information, it's increasingly hard to keep track of it all. I've lost the mental model of how different models at different effort levels perform and what tasks they are good at.
In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.
The arguable caveat is Sonnet may run faster (although this isn't known for sure, due to more tokens being used for the same task), so you can potentially get more done in a synchronous iterative workflow
I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.
That's just one benchmark, though. Tab to the next one and Sonnet 5 performs better as effort goes up just as you'd expect. I imagine the suggestion is that performance vs effort tradeoff is task dependent.
What is a "task" in real-world terms? If it will be $15/million output tokens, and high/xhigh is somewhere in the $7.50/task range, does that mean a single task is using 500k tokens? That seems like it would start to add up fast.
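A quick sketch of that calculation, in case it helps. The 500k figure only holds if the whole $7.50 is output tokens. The $3/$15 input/output prices are the ones quoted elsewhere in this thread; the 2M-input/100k-output split is a made-up example, and caching discounts are ignored entirely.

```python
INPUT_PRICE_PER_M = 3.0    # dollars per million input tokens (as quoted in the thread)
OUTPUT_PRICE_PER_M = 15.0  # dollars per million output tokens (as quoted in the thread)


def output_tokens_for(cost_dollars: float) -> float:
    """Tokens bought if the entire task cost were spent on output tokens."""
    return cost_dollars / OUTPUT_PRICE_PER_M * 1_000_000


def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task, with no caching discount applied."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Output-only reading of a $7.50 task.
all_output = output_tokens_for(7.50)

# Hypothetical agentic split: lots of re-read context, little generated text.
mixed = task_cost(input_tokens=2_000_000, output_tokens=100_000)

print(f"output-only: {all_output:,.0f} tokens")
print(f"2M in + 100k out: ${mixed:.2f}")
```

So the same $7.50 could equally be a long multi-turn run that mostly re-reads context, which is probably closer to what an agentic benchmark "task" looks like.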
I'm struggling to understand why I'd ever use this instead of just using a lower effort level for opus given on many of the benchmarks listed the cost per task rises above opus at anything higher than medium effort.
Only thing I can think of is for when someone is out of opus credits. Of course there are API billing use cases but I'd probably still just use opus on low.
Speed is a huge reason. Sometimes you just need some simple tasks to get done fast, and waiting 30-60 seconds for Opus to even start thinking can really slow things down.
Opus with low reasoning effort would be faster than Sonnet with high reasoning. So that won't exactly help.
I think it would just be what those models are optimized to perform
Wow, seems worse even on price/performance than GLM 5.2, which is only 744b parameters.
From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5
As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"
I have tried to rewrite an article with GLM-5.2 and with Sonnet 4.6. Completely different results, as LLMs are non-deterministic. But GLM-5.2 made a lot of subtle mistakes that needed to be corrected by hand. By contrast, Sonnet found and corrected all mistakes in the second round.
The situation was similar with planning and coding. GLM-5.2 seems to be good “on paper” but the real usage results were different.
And I am not an attorney for Claude or GLM-5.2… :)
But as I’ve been using LLM models daily since Nov 2022 I have realized that all common tests have to be confirmed in your project - there is no “one model rules them all” - you need to dig out a specific model from that LLM haystack with thousands of models.
Benchmarks help but they start to be similar to fuel consumption specs in car ads - real consumption is different for everybody :)
Finally, a viable business strategy - sell security-oblivious code monkeys for cheap, then charge premium rates for agents capable of cleaning up the mess.
Not to single you out, parent commenter, but I really hope the quality of discourse on HN will move past these basic comparisons eventually. It seems like every thread on every model release has the exact same comments.
"Wow, X models is Y% better or worse than Claude Z model on T benchmark"
"That's irrelevant, they're just benchmaxing."
"Not useable for daily coding or agentic workloads, the vibes are totally wrong."
"It's almost as good, and costs a lot less, so I will absolutely use it."
"I cannot imagine justifying using these, as the step change means open models lower costs do not make up for the productivity loss"
I'm an unhappy Anthropic customer and really rooting for open models and non-gatekept intelligence, but how do we move on from this now meme-like model release discourse rigmarole? I do not know what that would look like. I don't design LLMs nor benchmarks, and I genuinely appreciate that people do their best to provide information, even if non-perfect here. I'm sure most of you who actively read these comment pages on announcements must feel similarly, though, right?
I'm not sure what else can be said? I've found benchmarks to be a very weak signal for how good/bad the model is, but it's the #1 thing the companies highlight.
20 minutes after the announcement there's no real useful statement that can be made about it.
Seems to be another great incremental update to the workhorse, nice!
I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.
Yeah I think people are sleeping on the smaller/faster models like Sonnet. As long as you have a detailed plan or small, well scoped individual tasks Sonnet can implement just fine. Opus will still do better at more open ended tasks or completely "vibe coding." Or spec/plan with Opus, and have Sonnet implement.
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.
"Lower ability to perform cybersecurity-related tasks" makes me super concerned it will leave my codebase like Swiss cheese for any American granny with access to Fable 5, when we non-American Brits, or rest-of-worlders, don't have access to it to clean our codebases.
100% this. I read these caveats in new models and all I hear is "we made sure this model has no idea about computer security." Such a weird thing to brag about.
"dangerous cyber skills, such as developing software exploits" is very plainly referring to the same thing you are, but is more precise industry terminology rather than the loaded slang "hack".
I think you misunderstood what their vision is, or rather what their possible futures are. They are many steps ahead of almost everyone, both in wargaming possibilities and the actual realized path. What doesn’t make sense to you may be the only safe option for them.
> What doesn’t make sense to you may be the only safe option for them
thats true because their point of view makes no sense for us. dario is all in on lesswrong machine god theory and really believes they need to create a super intelligence before anyone else. that means doing as much as possible to slow down others progress and accelerate your own. but the fact that they believe its the only option doesnt make it true for the rest of us.
Never said otherwise, but it changes nothing. Their beliefs got them to this point on the timeline and that in itself cannot be ignored (or should I say, it should inform our priors...?) You can like or dislike them or what they do or don't do, but you must respect them regardless of that, purely because of their track record.
I don't think so. During the time I was using Fable 5, I was getting it to clean security bugs that Opus 4.8 had introduced ... bugs which weren't localised to a single PHP file but were caused by cascading data flow through multiple PHP files. I'm not an expert on security but I know I wouldn't have found these myself. I knew from day one of Fable's release that it would do thorough security audits and fix loads of flaws, even offering up PoCs to help show that it fixed them, as long as I didn't explicitly ask it to do a security audit. I just said, "My codebase is a mess," and it went on for an hour doing a thorough security audit and helping plug numerous holes. This was before the "fix my code" story came out.
They spent months hyping up Mythos and ended up with it banned. I’d assume they want to both differentiate their products and appeal to regulators here
They will release it eventually. Once they see the Chinese models are close to Mythos level they will release it before, so it will be "revolutionary".
Victim of the same hype generated by Dario. Now everyone has to walk on eggshells, do limited releases to trusted partners, and nerf their cybersecurity capabilities lest they get deemed “too powerful to release”.
Flowers for Algernon. And, sadly, expect this from now on. You saw it with OpenAI releasing Sol/Terra/Luna with a chart showing how they weren't quite as good as Mythos. It's all messaging to the USG to try to avoid/minimize arbitrary review from multiple agencies. 'Hey, it's smart, but look how stupid it is at "cyber."'
Why do you think they are bragging? Anthropic has long been the company to give us by far the most in-depth information about their models, both positive and negative. I read this as them just stating a fact about this model that users would want to know.
Of course. But is it really impossible that Dario’s directive to the marketing team is “try not to make us look bad, but also be honest about our models’ capabilities, so people can stay informed”?
Anthropomorphic, most in-depth? That's laughable given how closed down they've been over the years. If you want in-depth, DeepSeek actually still publishes papers of their methods for anyone to implement leading to being by far the most cost efficient model provider for the performance.
I was talking about reporting on testing and capabilities. Yes, open models provide a greater amount of information about the development of the model and how to run it yourself, but I am quite confident that literally no AI company, open or closed, conducts and reports so thoroughly on testing about the capabilities of their models.
>Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts.
which is obviously painting that as a good thing. So reading the next sentence as "in other good news" is reasonable.
While I'm still not sure I would characterize that as bragging, you're right that that is a fair interpretation. However, another fair interpretation of that is something along the lines of "the downside or cost of this positive thing is this following negative thing."
There's two classes of models now - the cybersecurity ones that none of us are getting, and the 'safe' models released for general consumption. This is letting us know which side of the divide it sits on.
Surely the Chinese government will see US gov's intervention and say "Government control of business is stupid, our industry will have more independence from CCP control for the benefit of the world".
This seems rather counter-productive. Wouldn't a model with fewer cybersecurity capabilities be more likely to produce insecure code? Not to mention, Chinese models don't have these restrictions and can be used to exploit said insecure code.
I suppose I shouldn't be surprised at how the Trump admin is approaching AI regulation, counter-productive is really all they do
One of the best queries I've done with an LLM recently was: Create a plan for improving the robustness and resilience of this code, particularly to untrusted inputs.
Gemini wouldn't do a security audit. But it came up with a great set of mitigations and identified an extant XSS flaw in the process of improving robustness.
There's an awful lot of good that can come from proactive, defensive use of LLMs. I realize there's also a lot of pain when the difficulty of exploit finding drops suddenly, but in the long term we may all benefit from the defensive side of this.
Restricting the models isn’t about restricting offensive capabilities. They were already very well aligned to reduce that risk.
This recent government interference is about trying to preserve US offensive cyberwarfare and cyberespionage capabilities. It’s not about “bad actors”. It’s about defensive capabilities becoming pervasive and cheap, which would kneecap US cyberoffensive capability.
It’s like making seatbelts illegal so that police chases can be more effective.
> Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
What exactly do you want Anthropic to say here? "This model, the one we are about to give to the entire world for cheap, is really good at hacking"? Saying Sonnet is terrible at cybersecurity is the most reasonable thing they can say, out of a lot of bad options.
> And Opus 4.8 is still cheaper for a higher pass rate
Unless it spams as much as Opus, I doubt it. Opus 4.8 literally spams text like puke. On a longer run especially if you get cache misses here and there the bulk of the cost is all the extra context it adds.
Important to note: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
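To put numbers on "roughly cost-neutral", here is a rough sketch. It assumes Sonnet 4.6 was priced at the same $3/$15 that this thread quotes for Sonnet 5 at full price, and uses the $2/$10 launch pricing mentioned elsewhere in the thread. Both are assumptions taken from the comments, not from the announcement, and only the input price is used since the input/output ratio is the same.

```python
OLD_PRICE = 3.0     # assumed Sonnet 4.6 input price, dollars per million tokens
FULL_PRICE = 3.0    # Sonnet 5 full price as quoted in the thread
LAUNCH_PRICE = 2.0  # Sonnet 5 launch price as quoted in the thread


def relative_cost(token_multiplier: float, new_price: float,
                  old_price: float = OLD_PRICE) -> float:
    """Cost of the same text on the new model, relative to the old one (1.0 = equal)."""
    return token_multiplier * new_price / old_price


def break_even_multiplier(new_price: float, old_price: float = OLD_PRICE) -> float:
    """Token inflation at which the new model costs exactly as much as the old one."""
    return old_price / new_price


for label, price in [("launch", LAUNCH_PRICE), ("full", FULL_PRICE)]:
    low = relative_cost(1.0, price)    # best case from the quoted 1.0-1.35x range
    high = relative_cost(1.35, price)  # worst case from the quoted range
    print(f"{label}: {low:.2f}x to {high:.2f}x the old cost")
```

Under those assumptions the launch price absorbs the whole quoted range, but at full price the same text costs anywhere from the same to about a third more.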
"We can raise prices in two ways: (1) raise the price per token and (2) increase the number of tokens we generate on your behalf. We promise not to do (2) maliciously. Promise."
I think the incentives are less bad since a good chunk of usage comes from subscription plans.
There was a fairly major regression in Claude Code performance for some time when they changed the system prompt to try and make it less verbose (saving tokens). And if I'm not misremembering, another incident where they changed the default effort from high to medium not long ago.
Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code. After all, if it has the ability to generate safe code, it would imply that it knows something about cybersecurity, which could surely be used to hack all the banks in the world.
Trying to censor nudity in image generation models caused all kinds of problems with anatomy in image models. I’m sure these models will have similar issues with security.
Great timing. I just started using Claude Sonnet for a long-term reverse engineering project[0] for a game I used to play as a kid. Cheaper tokens plus a model that is sufficiently smart when paired with hard verification make it a perfect combo for the task
$5/$25 for Opus 4.8 vs $3/$15 doesn't seem cheaper enough to be worth it. It depends how much better it is than e.g. Mimo, but I imagine Mimo and co. are too cost-efficient in the lower tier to be overtaken by Sonnet for most tasks.
Seems like the way to go for any smaller models is to only use the low reasoning levels, and for anything where you'd want it to reason harder, to just use a larger model.
In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).
My experience with using low reasoning effort has been nothing but a waste of time. Claude often keeps guessing, not calling tools to ground itself, and basically at the end I end up wasting the same amount of tokens or just switch to Opus on xhigh. It's been a terrible experience.
Not to sound like an LLM, but that seems exactly right to me. Use it as a cheaper, high-functioning task subagent and lower reasoning for a master Opus session. As long as not every portion of your task requires maximum intelligence, you should come out ahead.
It's a good question, but for multiturn conversations even cached context adds up quickly. My experience has been that spawning off subagents for defined tasks in a large overall plan generally makes me come out ahead.
Judging from those cost-performance graphs, Sonnet doesn't make sense to run at anything higher than a medium reasoning level, since Opus 4.8 low reasoning outclasses it for the price.
This line as a selling point is also pretty funny:
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
Gemma 4, Kimi K2.5, MiniMax M2.5, gpt-oss, GLM 5, Qwen3 Coder Next, DeepSeek V3.2, Devstral 2, are all available on AWS Bedrock and all are about Haiku level
Why did the other reply to this get flagged as dead? It was a comment about how someone would come out saying that Sonnet 5 would be better on the pelican test and therefore it has to be good. But I guess HN loves pelican SVGs so much that you're not allowed to criticize it.
This is much more interesting of a model at $2/$10 (their launch pricing) than at full price. There are many competing models at around this level of performance.
I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.
Really if they wanted a standout model that would really take the wind out of GLM's sails, they should have made this the new Haiku, priced at Haiku levels with this performance.
I'd love if they would include speed (though I know there are difficulties involved). At this point the quality of Opus 4.8 is no longer my limiting factor, it's the speed, so a faster model would be great.
Ironically, the key message of today's release is that Sonnet 5 is far less capable than Opus 4.8 and Mythos 5. It's a funny development in the past few weeks
The reality is that Fable will eventually be obsolete and Sonnet / Opus will surpass it. Fable did cost 2x as much as Opus, so I assume it involves a much higher cost for what it did, but I wouldn't be surprised if Fable will be obsoleted by Opus or even Sonnet sooner or later at less cost.
Have you considered getting better at coding so you can build stuff yourself instead of waiting for models you might not be able to get access to anymore?
In that, it seems sonnet 5 on high costs more than opus 4.8 at a lower pass rate. Am I reading this correctly?
Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.
Whereas Sonnet 5 delivers great value (by browsercomp benchmarks and compared to opus) when running on low and medium.
So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.
---
I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.
I am a bit confused on the subscription `/usage` screen. It splits out sonnet usage, and I'd presumed that would have contributed to a lower use of subscription Quota.
But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.
I agree with this assessment, IMO my takeaway from this is "Generally run Sonnet on low, otherwise use Opus". It's kind of like an "extra low" setting of Opus. (depends on the application for sure).
It would be good if Anthropic provided some kind of feedback or even toggle to auto-route requests for models being used at thinking levels that would be a better value using a different model.
Sort of like, getting an automatic upgrade at a car rental or hotel if there is availability.
LRMs are plateauing for sure, not that there won't be gains to be had in the future, but it's not like the era of rapid progress that was the past year any more.
I agree that the rapid improvement from like 2023-24 era is over (from a perspective of going from a 3/10 to a 7/10, you can’t then go to an 11/10). There was just so much more space to grow back then.
But isn’t Fable supposed to be another step change? I never used it, myself.
Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDEs have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+“cheaper for harness owner”
Opus 4.8 beats Sonnet 5 on the pareto frontier in several of their graphs (Agentic Search, Agentic Computer Use).
In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.
I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.
tldr: if you're doing something hard, just use a bigger model.
Not the original commenter, but personally I noticed my quota usage didn’t feel like it was being spent at a much lower rate when using Sonnet even on a relatively low thinking budget and based on a few comments here it seems I might not be the only one. Has anyone else noticed this? Wasn’t it different in the past? I thought I would be getting to use Sonnet much much more than Opus but it did not feel that way despite being on 20x plan.
Sonnet 5 is not currently available in the EU region on Bedrock, whereas previous models were and still are. I wonder if this is only due to early stages of the rollout or if this is due to recent US restrictions.
Unfortunately that means I won't be using it at work for now.
The "cheaper models" from big AI companies are next to useless as they don't even score as well as the open/super cheap Chinese models. Only the frontier big models like Fable and Opus have value.
But does it burn tokens just like Opus? That's the feeling I have nowadays. Regardless of what model I choose, the 5-hour limit gets exhausted in the first hour or so.
"Claude Sonnet 5 is available everywhere today at an introductory price of $2 per million input tokens and $10 per million output tokens through August 31, 2026. It then moves to standard pricing at $3 per million input tokens and $15 per million output tokens.2"
"Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
If we trust them, then it is roughly the same as sonnet 4.6
That seems to only be true for the "Agentic Search" benchmark. That benchmark in particular is a bit weird, because Sonnet 4.6 effort levels had a relatively small effect, so Sonnet 5 med is basically comparable to all effort levels of Sonnet 4.6.
interesting footnotes: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer... can map to more tokens: roughly 1.0–1.35× depending on the content type." AKA expect higher costs on Sonnet 5 vs Sonnet 4.6 for the same tasks.
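Both readings of the footnote can be checked by multiplying the token multiplier by the price ratio. This is a rough sketch, not Anthropic's own calculation: it assumes Sonnet 4.6 is billed at the same $3 per million input tokens that Sonnet 5 moves to after the introductory period (the thread never states the 4.6 price), and `relative_cost` is a hypothetical helper.

```python
def relative_cost(token_multiplier: float, new_price: float, old_price: float) -> float:
    """Cost of the same text on the new model, relative to the old model."""
    return token_multiplier * new_price / old_price


OLD_INPUT_PRICE = 3.0    # $/M tokens; ASSUMED Sonnet 4.6 price, not stated in the thread
INTRO_INPUT_PRICE = 2.0  # $/M tokens, quoted introductory price
STD_INPUT_PRICE = 3.0    # $/M tokens, quoted standard price

# The quoted tokenizer range is roughly 1.0x to 1.35x depending on content type.
for multiplier in (1.0, 1.35):
    intro = relative_cost(multiplier, INTRO_INPUT_PRICE, OLD_INPUT_PRICE)
    standard = relative_cost(multiplier, STD_INPUT_PRICE, OLD_INPUT_PRICE)
    print(f"tokenizer x{multiplier:.2f}: intro {intro:.2f}x, standard {standard:.2f}x")
```

Under that assumption the introductory price lands at or below the old cost across the quoted range, while the standard price lands at or above it. The output-token prices ($10 vs $15) have the same 2/3 ratio, so the picture is identical there.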
So many things to think about regarding these "benchmarks":
- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?
- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?
- Would it be more useful to move toward a comparative rather than absolute ranking?
The jump in reasoning quality is noticeable. What's interesting is how it handles ambiguous instructions now — it seems to ask fewer clarifying questions and just makes a reasonable judgment call. That's a double-edged sword depending on your use case.
I believe that’s gonna be meta for agentic coding this year for enterprises. Cost optimized models approaching SOTA capabilities on software engineering but without cybersec training.
Bro that is financial engineering, not real revenue growth. They engineered the switch to usage based pricing and a price hike timed the quarter before they wanted to go public, long enough to juice their numbers but not long enough for them not to be able to manage backlash and have to walk things back. Then they tried to extrapolate that manufactured bump to make it look like they have record shattering revenue growth.
Anthropic's run on the model and product side of things is highly impressive. They got Sam A. punching the air consistently, which is well-deserved and self-inflicted above all.
Wdym? They've been knocking it out of the park on marketing, but Claude Code is still a meme, and Opus is getting trashed by GPT5.5 meanwhile you can't even use their "dominant" model, and anecdotal reports from when people could use Fable, when they weren't getting silently poisoned, was that it was only marginally better than GPT 5.5 in terms of SWE smarts, mostly being better in terms of pleasantness to interact with and design taste.
Like I said, Anthropic's marketing is killing it, they've got people freely(?) shilling for them on public forums so even if they have shit developer relations and community relations and a model that's mostly worse while being more expensive, they can ride a wave of misinformation.
Anybody notice that they did not include Sonnet 5 Max in the "Agentic Search results", when comparing to Opus 4.8 ...
Based upon the "Agentic Computer usage", Sonnet 5 Max was going to be off "Agentic Search results" chart. lol ...
In short, Sonnet 5 Low/Medium is more cost efficient, if it's a task below Opus 4.8 Medium. For the rest it's expensive and you're better off using Opus 4.8.
Because it’s a massive improvement over the previous model, and cheaper?
You are reading too much into the graph and ignoring the threshold of usefulness for real world tasks. By that logic Sonnet 4.5 would have never been worth using.
Am I missing something? Because you're making my point. It's only worth it compared to Opus 4.8, if the tasks you're running require Opus 4.8 low (or non-existing lower).
For the rest the gap in pricing vs efficiency is so small, that there is no point in using Sonnet. I am looking at their own cost comparisons vs efficiency...
The point is that Sonnet at medium or even low will be smart enough for most daily tasks. You’re defining “worth using” as if you always need the highest performance possible, which is what these benchmarks measure, but most work doesn’t need it. You’ll pay more to get the same result. Sonnet 4.5 is very popular as a main model currently, this is a free upgrade.
I use Haiku a lot for agent workflows, if I can get better output at similar prices, Sonnet 5 will replace it completely.
I don't pay so I'm glad for the upgrade. I usually use Gemini, Mistral Le Chat (Vibe...) or Deepseek as they have way more generous free limits and I can basically spam forever.
I think they mean per dollar in the perf/$charts, not per marketing class. I.e. the new model is a complete Pareto failure in said perf/$ charts with the sole exception of Sonnet 5 low, which is dumb enough to not have comparison at all. Opus 4.8 delivers a better outcome per dollar, regardless what the underlying size of the models is.
I'd generously assume this is something about the specific category of agentic task presented in the chart... but it does raise the question "then why is that category the one they chose to highlight here".
For agentic computer use Sonnet 5 low performs better than Sonnet 4.6 medium at just under half the cost, and better than Opus 4.8 low at 25% off. Their success rates are not that far off.
Agentic search is a different story, but even there it still dominates 4.6 (as in, for everything Sonnet 4.6 can do, Sonnet 5 can do it as well or better at the same or lower cost).
Yes, Opus 4.8 dominates Sonnet 5 over its entire range in both categories, but Opus's lower range is limited and there is a valid regime on the lower end where Sonnet 5 use makes economic sense. This is not the case for Sonnet 4.6 where Opus 4.8 dominates it completely on both charts.
Edit -- reading your response closer I think we're saying the same things, maybe just disagreeing on whether that lower end is valuable or not.
Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.
I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me positive, I have found that the more models are optimized for fully agentic development, the worse they get at assisted development and often start doing too much despite very strict/specific instructions.
I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
I think you should try an OpenAI model like GPT 5.5. It is better at following instructions and boundaries set during prompt. It feels like a more capable "agent assistant" than Claude models but without loss of intelligence.
Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review and ask a lot more questions from the agent than I've seen others doing. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.
Yeah, there's a real opportunity for one of these companies to invest time in a model that's tuned for, to use your term, agent-assisted development.
Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two.
There’s no way to justify their valuations if they get downgraded to a pair programming tool. They need fully agentic stuff to work and replace human engineers to even come close.
Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models.
It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them.
Some napkin math -- total global labor compensation is about 50% of the GDP, which puts it in the USD 50 - 60 Trillion range: https://ourworldindata.org/grapher/labor-share-of-gdp
This source claims that knowledge workers alone (probably because they are paid much more) account for 35 - 50 Trillion of that: https://github.com/danielmiessler/Substrate/blob/main/Data/K...
If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value annually. Even if the AI industry can capture a fraction of that, that is a huuuge monetization opportunity.
Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)
Now, I don't think AGI will happen soon (or has already happened, depending on how you define it) but I do think humans will be a much smaller part of the loop and large-scale job displacement will happen once companies figure out how to properly use AI.
At this point, the financial upside for the AI industry is extremely high but will be limited by the social turmoil that will inevitably ensue (which we're already seeing brewing in the data center backlash.)
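The napkin math above is easy to reproduce. The sketch below uses only the commenter's own rough figures (35 to 50 trillion USD in annual knowledge-worker compensation, and productivity boosts of 5% and ~30%); `value_of_boost` is a hypothetical helper, and none of these numbers are independently verified here.

```python
# All figures are the commenter's rough estimates, in trillions of USD per year.
KNOWLEDGE_WORKER_PAY = (35.0, 50.0)  # low and high estimate


def value_of_boost(pay_range, boost):
    """Annual value created if productivity rises by `boost` (0.05 means 5%)."""
    low, high = pay_range
    return (low * boost, high * boost)


for boost in (0.05, 0.30):
    low, high = value_of_boost(KNOWLEDGE_WORKER_PAY, boost)
    print(f"{boost:.0%} boost: {low:.2f}T to {high:.2f}T per year")
```

With these inputs the 5% case works out to 1.75T to 2.5T per year, which is in line with the rounded range quoted above. The 30% case is an order of magnitude larger, which is the commenter's point about how the opportunity scales.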
That's a really good point. I think if there wasn't the insane amount of money involved and these were treated as tools instead, they would probably be MORE productive. I think a person working hand in hand with an AI instead of delegating is the sweet spot of making things fast while also not losing understanding or control of the system. You are absolutely right that these companies can't justify their valuations if they do that though. I just got a new mac to run models locally, and so far the results have been positive with some small hiccups. I'm thinking the future of this tech will likely be better tooling with better IDE integrations rather than "Claude plz make me a SaaS kthx"
My two cents is that the way to square this circle is that the valuations should be lower and they should be spending a lot less on constant retraining.
Unfortunately (from my perspective) it seems like the US companies are increasingly stuck in their current model. I think it's a competitive disadvantage.
But obviously most of the real insiders seem to disagree with me, so I'm probably wrong :)
The insiders disagree because they are benefiting greatly from the insane valuations, right?
Chinese models are quickly commodifying frontier inference, the US Gov is preventing public access to domestic SOTA models, and without those models why would consumers still spend $200/month to use the best models?
It’s such a mess and isn’t inspiring confidence as a non-investor.
> no way to justify their valuations if they get downgraded to a pair programming tool
I think there is. Pair today doesn’t mean they’re locked into that forever.
And every benchmark is "build GTA-6 from nothing, as a single-page web app".
Whether they believe it or not is immaterial. It is the end-goal they want to achieve, because then they own the means of production entirely.
I find these nefarious intention theories shallow. It can both be the case that the endstate is them owning the means of production without that being the intended guiding goal. Companies can chase profit without being Leninistic boogeymen.
There is no nefariousness in owning all the means of production, it's the endgame of maximizing profit.
However the result is exactly the same, concentration of power.
Sam Altman has said some things that would make Lenin blush
As I said, working ourselves out of our jobs within the span of a few years.
I've been using Kimi K2.6 lately (don't have 2.7 available through blessed work channels yet) for tasks where I already know what it is I want to do and I want to just step through the process in pieces, and it's fine. Do I have to correct it maybe a bit more than Opus? Yeah, but the real cutoff would be between "I have to read every line" and "I can just trust it without reading every line" and for me neither model hits that mark, and I expect it to be a while yet for that. Is it as good as Opus if I want to spit ball about architecture and then convert that to code? No, but I don't have that problem all the time, and it's there if I do need it.
And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.
That said if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.
> I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
I've moved completely to local models that I run with my M1 Mac Studio (64gb ram) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to Openrouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al.
This is the way
From my own experience, GLM-5.2 generally costs more tokens and is much slower.
I use GLM 5.2 Fast from Fireworks and it's very fast. Where are you using it from?
Which inference provider do you use? (Admittedly, I use K2.7 a lot more currently.)
Tokens and speed are a factor but does it require less back and forth to get things right? Being "fast and cheap but wrong" still has a cost that an otherwise "expensive and slow" exchange does not
I actually use sonnet 4.6 for my day to day coding too. It consumes far fewer tokens and is good enough. Opus is just too token consuming for it to be useful to me.
Have you tried '/model opusplan'? I've had strong results mixing opus for planning with sonnet implementing.
I haven't. Thanks for the heads up will give it a try! I use opus to comment on code design quite often though. It became a pattern that I made a skill for me to ask for second opinions https://news.ycombinator.com/item?id=48733092 Would love to hear your feedback if you don't mind!
Fascinating! How did you learn about this?
agent-assisted development uses orders of magnitude fewer tokens than agent-driven development
the incentives aren't there sadly
Not for a business model that scales revenue by token usage. But other business models are available.
I've been moving more to Composer 2.5 for the same reason. KISS principle.
Same for me, downgraded Cursor Subscription because when I use Cursor I use 90% Composer 2.5 fast
I've been largely disappointed how much the Claude models ignore custom instructions, and sometimes even prompts on the chat interface. It sometimes feels like talking to a wall, or as if there was a third person in the chatroom whose messages I can't see.
I can't help but feel this is intentional towards the 'Agentic' workflow.
I think this seems purposeful, as there's 2 opposing forces at play:
- Have a model that follows the users instructions
- Have a model that follows the system prompt instructions more
For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual Logic unless they program it out to test, this runs afoul and we get one or the other.
Feels like optimizing for either precision or recall, but can't have both
I suppose a solution might be going with a customizable harness like pi and merging Anthropic’s system prompt with a personalized custom one to remove all contradictions
You still have to manage/fight with the post-training that is baked into the model itself.
Totally agreed. I sometimes wonder if they are making the model "lazy" with each iteration, it keeps getting better at avoiding work.
This is why Fable was so good. It followed instructions and it was in no way lazy.
People keep making comments about fable like this? You could only use it for what like a week? How is that at all enough time to evaluate? Opus 4.6 didn't suffer from these problems for a hot minute and then when newer models were released it got worse. I think they change a ton behind the scenes and allocate compute however they want, so the model you use today may behave much differently than how it behaved yesterday
> You could only use it for what like a week? How is that at all enough time to evaluate?
By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and Fable finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like a special-education kid multiple times a day.
The ~72 hours I had access to Fable were by far the most productive I've had in months. Re-wrote massive parts of my codebase and caught a ton of bugs and logic issues that had silently slipped through before. I went over my subscription limit and immediately kept paying the API price to keep going. It was that good.
It was a pretty stark difference. I had the opposite problem where it did too much and overshot what I wanted from it so I certainly assume that if it had stuck around it would have gotten tuned back a bit pretty quickly.
Heh, it's not crazy if you're here in the Bay: I know multiple people who more-or-less disappeared for days when Fable came out because they were running their benchmarks, and only emerged blinking into the sunlight when the USG banned it. That's just how things are here now, most people are normal but there are some serious LLM dope addicts out and about.
I've been seeing LLMs act lazy from the very beginning. They got a little better but smaller models really only want to have a single task given to them. Mythos at least does work. RIP
> or as if there was a third person in the chatroom whose messages I can't see.
If you set off a classifier, that's how it looks to Claude.
I wasn't working with anything sensitive, but it really does feel like it sometimes condenses even something low like three bullet points to two.
IMO, they were quite good with checklists even a year ago, and tried to tick off each one.
Try to run your prompts through Claude to pinpoint any ambiguous parts that can be interpreted in multiple ways, or self-contradictory sections. I typically resolve any prompt-ignoring issues with that.
No kidding. I expect to have models to use which are optimised for different use cases.
Sonnet as an autonomous agentic model is silly. We already have other models for that if you want something weaker and cheaper than Opus.
I've been saying for ages that since Opus 4.6, models are increasingly smarter but less helpful as assistants.
Fable was amazing as a vibecoder but as an assistant it can't resist jumping into implementation and filling chats with pointless jargon.
It's really grim if you're looking for assistance instead of an implementor.
GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information.
I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.
By design, unfortunately. If they are just assistants, they can't sell the dream of "we're going to replace human labor completely" to the C-suite.
It isn’t a dream, it’s a reality for some of us here and it will be increasingly so for everyone else. Amazingly, USG intervening slowed the dynamic greatly (fortunately?)
The problem is obviously who will be left. There’s a lot of scifi to catch up on.
I think that they are simply evaluated on prompt to solution benchmarks.
Yep, this is why experiences and ratings of models vary so wildly.
I recently migrated a very large web app to Tailwind and Opus kept screwing up over and over, refactoring and changing the design, the more complex the component became.
I ended up asking Haiku to do it and it managed to do everything correctly, pretty much without intervention.
> I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.
I've taken to instructing the agent to manage the subagent, and the principal agent's sole job is ensuring the subagent follows instructions to the letter.
The cost per task chart is telling me that I should _never_ use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels.
There are two wrinkles to this:
- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
- For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them
> This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?
You're referring to the Agentic search, but if you look at the Agentic computer use the cost is basically halved.
However, I am also confused about market positioning. Too expensive to perform daily tasks - open source models are much cheaper - and not a frontier model to address complex real world problems.
Rarely used Sonnet btw.
You're the second person that has said this but I cannot understand why you are interpreting the "Agentic computer use" graph in this manner.
The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.
Wrong! Look at it better. It shows that Opus has superior performance but at higher cost.
For example: At xhigh, Sonnet5 is about 79% for $0.45 while Opus4.8 is 83% for $0.8 - roughly
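The disagreement above comes down to what "dominates" means on a cost/performance chart. A minimal sketch of the usual Pareto-dominance check follows, using only the two approximate points the commenter read off the chart; the `Point` type and `dominates` function are hypothetical names, and the values are eyeballed rather than official.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float   # dollars per task
    score: float  # success rate in percent


def dominates(a: Point, b: Point) -> bool:
    """True if `a` costs no more and scores no worse than `b`, and is strictly better on one axis."""
    return (
        a.cost <= b.cost
        and a.score >= b.score
        and (a.cost < b.cost or a.score > b.score)
    )


# Approximate values quoted by the commenter above, both at xhigh effort.
sonnet_xhigh = Point("Sonnet 5 xhigh", cost=0.45, score=79.0)
opus_xhigh = Point("Opus 4.8 xhigh", cost=0.80, score=83.0)

print(dominates(opus_xhigh, sonnet_xhigh))
print(dominates(sonnet_xhigh, opus_xhigh))
```

For this pair neither point dominates the other: Opus scores higher but costs more. Whether Opus dominates across the whole chart depends on its lower-effort points, which are not quoted in this exchange, so the check would need those added before it could settle the argument.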
Yeah, I was looking at the same chart and was very surprised at where the curve is relative to opus... Feels like sonnet 5 is "what if opus had an extra-low effort level"?
While I appreciate that they publish this information, it's increasingly hard to keep track of it all. I've lost the mental model of how different models at different effort levels perform and what tasks they are good at.
In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.
What I want is a harness that knows how to optimize this kind of thing for me.
You might want to check out Amp: https://ampcode.com/
Which is your own harness and your own evals for your tasks I guess
Just use deepswe as a reference point.
The arguable caveat is Sonnet may run faster (although this isn't known for sure, due to more tokens being used for the same task), so you can potentially get more done in a synchronous iterative workflow
I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.
That's just one benchmark, though. Tab to the next one and Sonnet 5 performs better as effort goes up just as you'd expect. I imagine the suggestion is that performance vs effort tradeoff is task dependent.
No it doesn't? It's worse than Opus across the whole shared frontier on both plots.
I actually exclusively use Sonnet at the low effort level. It's too slow otherwise, and at higher effort levels it is strictly worse than Opus.
Worth noting that the default chart there is for "agentic search performance", not coding. I didn't see an effort comparison for coding specifically.
What is a "task" in real-world terms? If it will be $15/million output tokens, and high/xhigh is somewhere in the $7.50/task range, does that mean a single task is using 500k tokens? That seems like it would start to add up fast.
I’ve found input tokens are around 5x more than output, so a task could be a couple million thinking tokens and then a couple hundred thousand output tokens?
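The two estimates above can be reconciled with a little arithmetic. This sketch assumes the standard prices quoted earlier in the thread ($3 per million input tokens, $15 per million output tokens), takes the $7.50/task figure as the commenter's reading of the chart, and ignores caching discounts; `tokens_per_task` is a hypothetical helper.

```python
def tokens_per_task(task_cost, input_price, output_price, input_to_output_ratio):
    """Split a per-task dollar cost into (input, output) token counts, in millions.

    Prices are dollars per million tokens; the ratio is input tokens per output token.
    """
    # Dollars spent per million output tokens once the matching input tokens are included.
    cost_per_m_output = output_price + input_to_output_ratio * input_price
    output_m = task_cost / cost_per_m_output
    return input_to_output_ratio * output_m, output_m


# Output-only reading from the question: $7.50 at $15/M output tokens.
print(tokens_per_task(7.50, 3.0, 15.0, 0))
# With the 5:1 input-to-output ratio suggested in the reply.
print(tokens_per_task(7.50, 3.0, 15.0, 5))
```

Counting only output tokens gives the 500k figure from the question. With a 5:1 input-to-output ratio the same $7.50 splits into about 1.25M input tokens and 250k output tokens, which is roughly the shape the reply describes.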
I noticed that as well but with the introductory pricing, I wonder how true that is.
It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.
I guess I could get Sonnet 5 to do it.
Opus 4.8 high doing better and cheaper than Sonnet 5 xhigh
> Opus always performs better for a given cost.
Assume it to get deprecated sooner rather than later.
It's very interesting. Why even release a new product that underperforms at the same price level? Why not just lock it?
I guess it's probably a lot cheaper for them to run, and it cuts costs for them. Seems disingenuous, though.
I'm struggling to understand why I'd ever use this instead of just using a lower effort level for opus given on many of the benchmarks listed the cost per task rises above opus at anything higher than medium effort.
Only thing I can think of is for when someone is out of opus credits. Of course there are API billing use cases but I'd probably still just use opus on low.
More and more I find myself trying to stop Opus from doing something stupid, and at every turn I need to tell it to stop overcomplicating things.
I think the models are being optimized for wealth extraction from users and companies, instead of solving problems.
I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.
Older Opus models will likely get deprecated and then over time this is the cheapest model. That is how prices are currently increased.
Maybe it's not for you? I don't pay, so I can't even use Opus... So this is an upgrade over Sonnet 4.6 for me.
Speed is a huge reason. Sometimes you just need some simple tasks to get done fast, and waiting 30-60 seconds for opus to even start thinking can really slow things down.
Opus with low reasoning effort would be faster than Sonnet with high reasoning. So that won't exactly help. I think it would just be what those models are optimized to perform
Wow, seems worse even on price/performance than GLM 5.2, which is only 744b parameters.
From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5
As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"
I have tried to rewrite an article with GLM-5.2 and with Sonnet 4.6. Completely different results, as LLMs are non-deterministic. But GLM-5.2 made a lot of subtle mistakes that needed to be corrected by hand. In contrast, Sonnet found and corrected all mistakes in the second round.
The situation was similar with planning and coding. GLM-5.2 seems to be good “on paper” but the real usage results were different.
And I am not an attorney for Claude or GLM-5.2… :)
But as I’ve been using LLM models daily since Nov 2022 I have realized that all common tests have to be confirmed in your project - there is no “one model rules them all” - you need to dig out a specific model from that LLM haystack with thousands of models.
Benchmarks help but they start to be similar to fuel consumption specs in car ads - real consumption is different for everybody :)
Finally, a viable business strategy - sell security-oblivious code monkeys for cheap, then charge premium rates for agents capable of cleaning up the mess.
I think instead they should sell super hackers and get their product banned instantly and go bankrupt
Not to single you out, parent commenter, but I really hope the quality of discourse on HN will move past these basic comparisons eventually. It seems like every thread on every model release has the exact same comments.
"Wow, X model is Y% better or worse than Claude Z model on T benchmark"
"That's irrelevant, they're just benchmaxing."
"Not useable for daily coding or agentic workloads, the vibes are totally wrong."
"It's almost as good, and costs a lot less, so I will absolutely use it."
"I cannot imagine justifying using these, as the step change means open models' lower costs do not make up for the productivity loss"
I'm an unhappy Anthropic customer and really rooting for open models and non-gatekept intelligence, but how do we move on from this now meme-like model release discourse rigmarole? I do not know what that would be. I don't design LLMs nor benchmarks, and I genuinely appreciate that people do their best to provide information, even if non-perfect here. I'm sure most of you who actively read these comment pages on announcements must feel similarly, though, right?
I'm not sure what else can be said? I've found benchmarks to be a very weak signal for how good/bad the model is, but it's the #1 thing the companies highlight.
20 minutes after the announcement there's no real useful statement that can be made about it.
"It's totally obvious they quantized Claude Z"
Seems to be another great incremental update to the workhorse, nice!
I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.
Yeah I think people are sleeping on the smaller/faster models like Sonnet. As long as you have a detailed plan or small, well scoped individual tasks Sonnet can implement just fine. Opus will still do better at more open ended tasks or completely "vibe coding." Or spec/plan with Opus, and have Sonnet implement.
I was surprised to learn that Sonnet generally has the same tokens per second as Opus
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.
"Lower ability to perform cybersecurity-related tasks" makes me super concerned it will leave my codebase like Swiss cheese for any American granny with access to Fable 5, when we non-American Brits, or rest-of-worlders, don't have access to it to clean our codebases.
100% this. I read these caveats in new models and all I hear is "we made sure this model has no idea about computer security." Such a weird thing to brag about.
> any American granny with access to Fable 5,
Fable is effectively not available to the general public in the US either
This is code for "this model can't be used to hack other systems as effectively as Opus or Mythos."
"dangerous cyber skills, such as developing software exploits" is very plainly referring to the same thing you are, but is more precise industry terminology rather than the loaded slang "hack".
I think they don’t understand that cybersecurity skills are what prevent bad code from making it into production.
It’s like telling a chef to cook without a knife because knives can kill people.
Dario and his lackeys at Anthropic aren’t visionaries.
I think this is more aimed at the US gov't than anything. They want to be clear that it's not very good at hacking, so that the gov't won't ban it.
I'm sure they're well-aware that this also will make it worse at building secure systems, but the gov't isn't restricting releases based on that.
I think you misunderstood what their vision is, or rather what their possible futures are. They are many steps ahead of almost everyone, both in wargaming possibilities and the actual realized path. What doesn’t make sense to you may be the only safe option for them.
> What doesn’t make sense to you may be the only safe option for them
That's true because their point of view makes no sense for us. Dario is all in on LessWrong machine-god theory and really believes they need to create a superintelligence before anyone else. That means doing as much as possible to slow down others' progress and accelerate your own. But the fact that they believe it's the only option doesn't make it true for the rest of us.
Never said otherwise, but it changes nothing. Their beliefs got them to this point on the timeline and that in itself cannot be ignored (or should I say, it should inform our priors...?) You can like or dislike them or what they do or don't do, but you must respect them regardless of that, purely because of their track record.
That’s not even close to true. Unless you’re vibe coding trash that a better model might catch.
I don't think so. During the time I was using Fable 5, I was getting it to clean security bugs that Opus 4.8 had introduced ... bugs which weren't localised to a single PHP file but were caused by cascading data flow through multiple PHP files. I'm not an expert on security but I know I wouldn't have found these myself. I knew from day one of Fable's release that it would do thorough security audits and fix loads of flaws, even offering up PoCs to help show that it fixed them, as long as I didn't explicitly ask it to do a security audit. I just said, "My codebase is a mess," and it went on for an hour doing a thorough security audit and helping plug numerous holes. This was before the "fix my code" story came out.
They spent months hyping up Mythos and ended up with it banned. I’d assume they want to both differentiate their products and appeal to regulators here
They will release it eventually. Once they see the Chinese models are close to Mythos level they will release it before, so it will be "revolutionary".
It was already released. US government is the only reason it's not available to us mere mortals anymore
Due to Dario hyping it up as a world ending model. If they kept their mouths shut we'd all have it now still.
Where is gpt 5.6?
Victim of the same hype generated by Dario. Now everyone has to walk on eggshells, do limited releases to trusted partners, and nerf their cybersecurity capabilities lest they get deemed “too powerful to release”.
I'm starting to think it discovered a 0-day held hidden by our government.
Flowers for Algernon. And, sadly, expect this from now on. You saw it with OpenAI releasing Sol/Terra/Luna with a chart showing how they weren't quite as good as Mythos. It's all messaging to the USG to try to avoid/minimize arbitrary review from multiple agencies. 'Hey, it's smart, but look how stupid it is at "cyber."'
Why do you think they are bragging? Anthropic has long been the company to give us by far the most in-depth information about their models, both positive and negative. I read this as them just stating a fact about this model that users would want to know.
I'm absolutely certain that their marketing team has input on (if not owning) these announcements.
Of course. But is it really impossible that Dario’s directive to the marketing team is “try not to make us look bad, but also be honest about our models’ capabilities, so people can stay informed”?
I find it interesting how two different directly opposed messages seem to have both been interpreted as being nothing but marketing speak.
Anthropic, most in-depth? That's laughable given how closed down they've been over the years. If you want in-depth, DeepSeek actually still publishes papers on their methods for anyone to implement, which has led to them being by far the most cost-efficient model provider for the performance.
I was talking about reporting on testing and capabilities. Yes, open models provide a greater amount of information about the development of the model and how to run it yourself, but I am quite confident that literally no AI company, open or closed, conducts and reports so thoroughly on testing about the capabilities of their models.
The preceding sentence is
>Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts.
which is obviously painting that as a good thing. So reading the next sentence as "in other good news" is reasonable.
While I'm still not sure I would characterize that as bragging, you're right that that is a fair interpretation. However, another fair interpretation of that is something along the lines of "the downside or cost of this positive thing is this following negative thing."
There's two classes of models now - the cybersecurity ones that none of us are getting, and the 'safe' models released for general consumption. This is letting us know which side of the divide it sits on.
There's also Chinese models, which aren't trying to self-limit capabilities.
Surely the Chinese government will see US gov's intervention and say "Government control of business is stupid, our industry will have more independence from CCP control for the benefit of the world".
…as long as you don’t ask them about certain dates or squares.
Also, I wouldn’t expect Mythos-class models to be allowed to be openly released by the CCP. Thinking otherwise is pure naivety.
Well, the weights are open. De-CCP-ing them is a trivial task, about 40 minutes on modern hardware. So can be done for about $50.
This seems rather counter-productive. Wouldn't a model with fewer cybersecurity capabilities be more likely to produce insecure code? Not to mention, Chinese models don't have these restrictions and can be used to exploit said insecure code.
I suppose I shouldn't be surprised at how the Trump admin is approaching AI regulation; counter-productive is really all they do.
One of the best queries I've done with an LLM recently was: Create a plan for improving the robustness and resilience of this code, particularly to untrusted inputs.
Gemini wouldn't do a security audit. But it came up with a great set of mitigations and identified an extant XSS flaw in the process of improving robustness.
There's an awful lot of good that can come from proactive, defensive use of LLMs. I realize there's also a lot of pain when the difficulty of exploit finding drops suddenly, but in the long term we may all benefit from the defensive side of this.
Restricting the models isn’t about restricting offensive capabilities. They were already very well aligned to reduce that risk.
This recent government interference is about trying to preserve US offensive cyberwarfare and cyberespionage capabilities. It’s not about “bad actors”. It’s about defensive capabilities becoming pervasive and cheap, which would kneecap US cyberoffensive capability.
It’s like making seatbelts illegal so that police chases can be more effective.
It seems obvious to me that they put that in there in an effort to avoid another reaming out by the long, orange dick of the US government.
So it doesn't get blocked. Last time they said a model was great at cyber, it didn't turn out well.
> Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
What exactly do you want Anthropic to say here? "This model, the one we are about to give to the entire world for cheap, is really good at hacking"? Saying Sonnet is terrible at cybersecurity is the most reasonable thing they can say, out of a lot of bad options.
To avoid Lutnick getting on their case again.
He has the opportunity to do the funniest thing ever
They are obviously trying to avoid getting Sonnet 5 blocked.
You have to pay more for that, and/or go through some USG vetting process.
That part is likely directly addressed to the US government.
Does it mean it generates code with random security holes?
Market segmentation?
> And Opus 4.8 is still cheaper for a higher pass rate
Unless it spams as much as Opus, I doubt it. Opus 4.8 literally spams text like puke. On a longer run especially if you get cache misses here and there the bulk of the cost is all the extra context it adds.
What makes that a brag?
Important to note: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
"We can raise prices in two ways: (1) raise the price per token and (2) increase the number of tokens we generate on your behalf. We promise not to do (2) maliciously. Promise."
I think the incentives are less bad since a good chunk of usage comes from subscription plans.
There was a fairly major regression in Claude Code performance for some time when they changed the system prompt to try and make it less verbose (saving tokens). And if I'm not misremembering, another incident where they changed the default effort from high to medium not long ago.
Wouldn't it be more malicious for them not to mention this at all?
Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code. After all, if it has the ability to generate safe code, it would imply that it knows something about cybersecurity, which could surely be used to hack all the banks in the world.
Trying to censor nudity in image generation models caused all kinds of problems with anatomy in image models. I’m sure these models will have similar issues with security.
> Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code.
This may be the goal.
Great timing. I just started using Claude Sonnet on a long-term reverse engineering project[0] for a game I used to play as a kid. Cheaper tokens, a sufficiently smart model, and hard verification make it a perfect combo for the task.
[0] https://github.com/dginovker/BFME-Source-Code/
$5/$25 for Opus 4.8 vs $3/$15 doesn't seem enough cheaper to be worth it. It depends how much better it is than e.g. Mimo, but I imagine Mimo and co are too cost-efficient in the lower tier to be overtaken by Sonnet for most tasks.
Seems like the way to go for any smaller models is to only use the low reasoning levels, and for anything where you'd want it to reason harder, to just use a larger model.
In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).
My experience with using low reasoning effort has been nothing but a waste of time. Claude often keeps guessing, not calling tools to ground itself, and basically at the end I end up wasting the same amount of tokens or just switch to Opus on xhigh. It's been a terrible experience.
Not to sound like an LLM, but that seems exactly right to me. Use it as a cheaper, high-functioning task subagent and lower reasoning for a master Opus session. As long as not every portion of your task requires maximum intelligence, you should come out ahead.
Won't any input be charged uncached, and the output of the small model charged again as uncached input to the bigger model?
I don't know whether that comes out ahead compared to just staying with the better model in the first place.
It's a good question, but for multiturn conversations even cached context adds up quickly. My experience has been that spawning off subagents for defined tasks in a large overall plan generally makes me come out ahead.
I'm sure folks' mileage will vary though.
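For anyone who wants to check the cached-vs-uncached question above against their own workload, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the $5 and $3 input prices are the per-million figures quoted elsewhere in this thread, the 0.1 cache-read multiplier is a guess and not a published rate, the token counts are made up, and output tokens and cache-write surcharges are ignored entirely.

```python
# Input-token cost of doing a subtask inline in a big cached session
# versus delegating it to a cheaper subagent with a fresh, small context.
# All prices and token counts below are illustrative assumptions.

def session_input_cost(start_tokens, turns, new_per_turn, price, cache_mult, start_cached):
    """Dollar cost of input tokens over a multi-turn session.

    price is dollars per million input tokens; cache_mult is the assumed
    discount factor applied to tokens re-read from the prompt cache.
    """
    cost = 0.0
    ctx = start_tokens
    for turn in range(turns):
        # Prior context is read from cache, except on the first turn of a fresh session.
        cached = start_cached or turn > 0
        cost += ctx * price * (cache_mult if cached else 1.0)
        # Tokens added this turn (tool results, messages) enter uncached.
        cost += new_per_turn * price
        ctx += new_per_turn
    return cost / 1e6

BIG_IN, SMALL_IN = 5.0, 3.0   # $/M input tokens, as quoted in the thread
CACHE_MULT = 0.1              # assumed cache-read discount (hypothetical)

# Inline: the big model keeps working in its existing 150k-token cached context.
inline = session_input_cost(150_000, 20, 2_000, BIG_IN, CACHE_MULT, start_cached=True)

# Delegated: the small model starts from a 5k-token brief (uncached),
# then a 1k-token summary re-enters the big model as uncached input.
delegated = session_input_cost(5_000, 20, 2_000, SMALL_IN, CACHE_MULT, start_cached=False)
delegated += 1_000 * BIG_IN / 1e6

print(f"inline: ${inline:.4f}  delegated: ${delegated:.4f}")
```

Under these made-up numbers delegation wins because the big session's cached context is re-read on every turn, which is the "even cached context adds up quickly" effect. With a small main context or a very short subtask the comparison can flip, so it is worth plugging in your own figures.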
Judging from those cost-performance graphs, Sonnet doesn't make sense to run at anything higher than a medium reasoning level, since Opus 4.8 low reasoning outclasses it for the price.
This line as a selling point is also pretty funny:
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
When can we get a new Haiku? 4.5 came out nearly a year ago, and it's showing its age.
Look at Qwen for that level of intelligence.
needs to be on bedrock for me to use it at work
Gemma 4, Kimi K2.5, MiniMax M2.5, gpt-oss, GLM 5, Qwen3 Coder Next, DeepSeek V3.2, Devstral 2, are all available on AWS Bedrock and all are about Haiku level
Claude Sonnet 5 is built to be the most agentic Sonnet model yet.
or
The Dodge Charger is built to be the most Charger like car yet.
I didn't think they'd actually release a model that was worse than the open-weight frontier and at a higher price-point. Wow.
That's yet to be determined. I think a lot of open-weight models are benchmaxxed and their usefulness for many tasks is not represented by those benchmarks.
Yes, this has been my experience. They all struggle with long-horizon tasks and eventually start going in circles.
Why did the other reply to this get flagged as dead? It was a comment about how someone would come out saying that Sonnet 5 would be better on the pelican test and therefore it has to be good. But I guess HN loves pelican SVGs so much that you're not allowed to criticize it.
If you look at the account history, it's pretty clearly an account-level thing, not a comment-level thing.
This is much more interesting of a model at $2/$10 (their launch pricing) than at full price. There are many competing models at around this level of performance.
I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.
Really if they wanted a standout model that would really take the wind out of GLM's sails, they should have made this the new Haiku, priced at Haiku levels with this performance.
I'd love if they would include speed (though I know there are difficulties involved). At this point the quality of Opus 4.8 is no longer my limiting factor, it's the speed, so a faster model would be great.
Have you tried Opus on fast mode?
Ironically, the key message of today's release is that Sonnet 5 is far less capable than Opus 4.8 and Mythos 5. It's a funny development in the past few weeks.
Seems like the cyber detection is even on Sonnet now. https://support.claude.com/en/articles/14604842-real-time-cy...
That’s nice, but we want Fable
The reality is that Fable will eventually be obsolete and Sonnet / Opus will surpass it. Fable did cost 2x as much as Opus, so I assume it involves a much higher cost for what it did, but I wouldn't be surprised if Fable will be obsoleted by Opus or even Sonnet sooner or later at less cost.
Okay I don’t care about “eventually”, I want Fable now.
Have you considered getting better at coding so you can build stuff yourself instead of waiting for models you might not be able to get access to anymore?
This is like telling someone who wants a motorcycle that they should get better at running instead.
When the motorcycle manufacturers keep making each new model worse and more expensive and the government keeps trying to ban them.
Same
System Card: https://www-cdn.anthropic.com/d9bb04416ffe1352af84721476c1fa...
Not sure what niche it's going to occupy: too expensive for its intelligence category.
Interesting that tasks on extra high cost almost the same as Opus 4.8 with a slightly worse performance
This is on the browsercomp graph, right?
In that, it seems sonnet 5 on high costs more than opus 4.8 at a lower pass rate. Am I reading this correctly?
Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.
Whereas Sonnet 5 delivers great value (by browsercomp benchmarks and compared to Opus) when running in low and medium.
So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.
---
I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.
I am a bit confused on the subscription `/usage` screen. It splits out sonnet usage, and I'd presumed that would have contributed to a lower use of subscription Quota.
But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.
I agree with this assessment, IMO my takeaway from this is "Generally run Sonnet on low, otherwise use Opus". It's kind of like an "extra low" setting of Opus. (depends on the application for sure).
It would be good if Anthropic provided some kind of feedback or even toggle to auto-route requests for models being used at thinking levels that would be a better value using a different model.
Sort of like, getting an automatic upgrade at a car rental or hotel if there is availability.
LRMs are plateauing for sure, not that there won't be gains to be had in the future, but it's not like the era of rapid progress that was the past year any more.
I agree that the rapid improvement from like 2023-24 era is over (from a perspective of going from a 3/10 to a 7/10, you can’t then go to a 11/10). There was just so much more space to grow back then.
But isn’t Fable supposed to be another step change? I never used it, myself.
Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDEs have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+“cheaper for harness owner”.
A great many people were predicting this would be the case a year ago and being told they were wrong and to get on the boat.
I consider myself to be in that cohort as well. :)
Is there any reason to use Sonnet instead of GLM?
Your US company banning usage of non-american models. Other than that, no.
This.
Speed. But mostly no.
Opus 4.8 beats Sonnet 5 on the pareto frontier in several of their graphs (Agentic Search, Agentic Computer Use).
In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.
I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.
tldr: if you're doing something hard, just use a bigger model.
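To make the "beats it on the Pareto frontier" claim concrete, here is a minimal sketch of how one could check which (cost, pass rate) points are dominated. The run names and numbers are invented for illustration and are not read off Anthropic's charts; substitute your own measurements.

```python
from typing import NamedTuple

class Run(NamedTuple):
    name: str
    cost: float       # dollars per task (made-up numbers)
    pass_rate: float  # fraction of tasks passed (made-up numbers)

def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep only runs that no other run matches or beats on both axes."""
    frontier = []
    for r in runs:
        dominated = any(
            o.cost <= r.cost and o.pass_rate >= r.pass_rate
            and (o.cost < r.cost or o.pass_rate > r.pass_rate)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost)

# Hypothetical data: a small model pushed to high effort vs a large model at low effort.
runs = [
    Run("small-low", 0.5, 0.40),
    Run("small-high", 3.0, 0.55),
    Run("large-low", 2.0, 0.60),
    Run("large-high", 6.0, 0.75),
]

for run in pareto_frontier(runs):
    print(run.name, run.cost, run.pass_rate)
```

In this toy data the small model at high effort drops off the frontier because the large model at low effort is both cheaper and more accurate, which is the pattern described above.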
And Claude Code penalizes you for using Sonnet on the subscription plan, so there's little reason to use it.
This is what I realized, can you provide more detail on how you've observed this? The /usage screen does not make it clear.
Not the original commenter, but personally I noticed my quota usage didn’t feel like it was being spent at a much lower rate when using Sonnet even on a relatively low thinking budget and based on a few comments here it seems I might not be the only one. Has anyone else noticed this? Wasn’t it different in the past? I thought I would be getting to use Sonnet much much more than Opus but it did not feel that way despite being on 20x plan.
How so?
Sonnet 5 is not currently available in the EU region on Bedrock, whereas previous models were and still are. I wonder if this is only due to early stages of the rollout or if this is due to recent US restrictions.
Unfortunately that means I won't be using it at work for now.
The "cheaper models" from the big AI companies are next to useless, as they don't even score as well as the open, super-cheap Chinese models. Only the big frontier models like Fable and Opus have value.
But does it burn tokens just like Opus? That's the feeling I have nowadays. Regardless of what model I choose, the 5-hour limit gets exhausted in the first hour or so.
"Claude Sonnet 5 is available everywhere today at an introductory price of $2 per million input tokens and $10 per million output tokens through August 31, 2026. It then moves to standard pricing at $3 per million input tokens and $15 per million output tokens."
"Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
If we trust them, then it is roughly the same as sonnet 4.6
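The "roughly cost-neutral" claim can be sanity-checked with the two quoted figures. This is a quick sketch that assumes Sonnet 4.6 is billed at $3 per million input tokens (the $3/$15 figure mentioned elsewhere in the thread) and that the 1.0–1.35× tokenizer range applies uniformly to your text, which it may not.

```python
# Relative cost of sending the same text to Sonnet 5 vs Sonnet 4.6.
# Assumption: Sonnet 4.6 input is priced at $3 per million tokens.

def cost_ratio(new_price: float, old_price: float, token_multiplier: float) -> float:
    """Cost of the same text on the new model relative to the old one."""
    return (new_price * token_multiplier) / old_price

OLD_INPUT = 3.0            # assumed Sonnet 4.6 input price, $/M tokens
INTRO_INPUT = 2.0          # quoted introductory input price
STANDARD_INPUT = 3.0       # quoted standard input price
MULTIPLIERS = (1.0, 1.35)  # quoted tokenizer range

for label, price in (("intro", INTRO_INPUT), ("standard", STANDARD_INPUT)):
    low, high = (cost_ratio(price, OLD_INPUT, m) for m in MULTIPLIERS)
    print(f"{label}: {low:.2f}x to {high:.2f}x the Sonnet 4.6 input cost")
```

So under these assumptions the introductory price lands somewhat below the old cost even at the worst-case multiplier, while at standard pricing the same text costs anywhere from the same to about a third more. Output prices ($10 intro vs $15 standard) follow the same 2:3 ratio.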
What I'm starting to hate is that each model's effort level can mean a completely different amount of power.
Today sonnet 5's med level effort is equivalent to sonnet 4.6 low level effort :/
That seems to only be true for the "Agentic Search" benchmark. That benchmark in particular is a bit weird, because Sonnet 4.6 effort levels had a relatively small effect, so Sonnet 5 med is basically comparable to all effort levels of Sonnet 4.6.
Kind of hilarious how much they’re touting that it sucks at cybersecurity like it’s a feature
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
It seems being incompetent is a feature now...
interesting footnotes: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer... can map to more tokens: roughly 1.0–1.35× depending on the content type." AKA expect higher costs on Sonnet 5 vs Sonnet 4.6 for the same tasks.
same happened to Opus 4.7
Based on both performance vs price charts, it seems using Opus 4.8 with med effort is almost a better choice than using Sonnet 5 at xhigh effort
So many things to think about regarding these "benchmarks":
- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?
- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?
- Would it be more useful to move toward a comparative rather than absolute ranking?
The jump in reasoning quality is noticeable. What's interesting is how it handles ambiguous instructions now — it seems to ask fewer clarifying questions and just makes a reasonable judgment call. That's a double-edged sword depending on your use case.
Why is Claude Sonnet 5 allowed to be released but OpenAI Terra not? Are they not the same class of models?
interesting how much worse the sentiment around Anthropic is getting
Seems like a combination of multiple factors:
"They took my shit away!" -- 3-day Fable 5 addicts (me)
"How dare they tell Trump no?" -- US nationalist / "my country right or wrong" types
"Great to see a closed source company fail!" -- open source boosters
"Great to see an American company fail!" -- anti-US, and/or pro-China folks
"Great to see a successful company fail!" -- anti-capitalists and/or sour-grapes crab bucket types
"Serves you right for ripping off creators!" -- copyright warriors
"They keep silently nerfing the models!" -- secret downgrade conspiracy theorists
"Quit killing the planet!" -- anti-datacenter advocates
It seems to be more them losing goodwill combined with their marketing.
I don't agree with your framing that all negativity is from crazies
I don't think all the negativity is from crazies, but big chunks of it are certainly motivated. I certainly left out numerous other categories.
"OpenAI models are better, cheaper, and more reliable" - rational people
> the computer use evaluation OSWorld-Verified. Sonnet 5 (orange line) is a strict improvement over Sonnet 4.6
cool to see, still waiting for models to get better at computer use.
Sonnet seems to be really expensive
Have you followed Anthropic at all?
I believe that’s gonna be meta for agentic coding this year for enterprises. Cost optimized models approaching SOTA capabilities on software engineering but without cybersec training.
Important to note that the cost graphs are heavily distorted. The agentic search one for example is divided into 3 'columns': $0-$2, $2-$5 and $5-$10.
And yet, the $2-$5 section is the widest, even though it only contains a single point.
I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD
Not looking great for an upcoming IPO
You’re right, it’s looking stellar. Well beyond great. Real, and unprecedented, revenue growth will do that for a company.
"Real and unprecedented revenue growth"
Bro that is financial engineering, not real revenue growth. They engineered the switch to usage based pricing and a price hike timed the quarter before they wanted to go public, long enough to juice their numbers but not long enough for them not to be able to manage backlash and have to walk things back. Then they tried to extrapolate that manufactured bump to make it look like they have record shattering revenue growth.
Anthropic's run on the model and product side of things is highly impressive. They got Sam A. punching the air consistently, which is well-deserved and self-inflicted above all.
Wdym? They've been knocking it out of the park on marketing, but Claude Code is still a meme, and Opus is getting trashed by GPT5.5 meanwhile you can't even use their "dominant" model, and anecdotal reports from when people could use Fable, when they weren't getting silently poisoned, was that it was only marginally better than GPT 5.5 in terms of SWE smarts, mostly being better in terms of pleasantness to interact with and design taste.
> Claude Code is still a meme
Claude Code generates more revenue than OpenAI...It appears to be a nice meme.
Like I said, Anthropic's marketing is killing it, they've got people freely(?) shilling for them on public forums so even if they have shit developer relations and community relations and a model that's mostly worse while being more expensive, they can ride a wave of misinformation.
It's actually a huge update for building products, given most tasks are sub-agent driven where Sonnet is used, steered by Opus.
Anybody notice that they did not include Sonnet 5 Max in the "Agentic Search results", when comparing to Opus 4.8 ...
Based upon the "Agentic Computer usage", Sonnet 5 Max was going to be off "Agentic Search results" chart. lol ...
In short, Sonnet 5 Low/Medium is more cost efficient if it's a task below Opus 4.8 Medium. For the rest it's expensive and you're better off using Opus 4.8.
Why even release this model?
Because it’s a massive improvement over the previous model, and cheaper?
You are reading too much into the graph and ignoring the threshold of usefulness for real world tasks. By that logic Sonnet 4.5 would have never been worth using.
Am I missing something? Because you're making my point. It's only worth it compared to Opus 4.8 if the tasks you're running require Opus 4.8 low (or a non-existent lower level).
For the rest the gap in pricing vs efficiency is so small, that there is no point in using Sonnet. I am looking at their own cost comparisons vs efficiency...
The point is that Sonnet at medium or even low will be smart enough for most daily tasks. You’re defining “worth using” as if you always need the highest performance possible, which is what these benchmarks measure, but most work doesn’t need it. You’ll pay more to get the same result. Sonnet 4.5 is very popular as a main model currently, this is a free upgrade.
I use Haiku a lot for agent workflows, if I can get better output at similar prices, Sonnet 5 will replace it completely.
I'd narrow that to why even allow the harness to run `high` on this model?
It does not pass the "I want to wash my car, should I drive or walk" test.
Did for me, even on low non-thinking effort.
GIGO, as they say.
Anyone else feel like Opus 4.8 got significantly dumber over the last 2 weeks?
Ah that's why Opus has been so slow for the last couple of days.
The whole fable fiasco really soured me on Anthropic. This just looks disappointing by comparison.
There was a vibecoded prediction-market-style page put up yesterday (?) that got the date exactly right, I think.
link?
maybe https://outyet.ai/models/claude-sonnet-5?
Too expensive?
Is it just me or is there a huge difference between how much one can accomplish in a 5-hour window with GPT 5.5 on xhigh versus any Claude model?
I exclusively use 5.5-xhigh-fast within Codex and find it superior to Opus 4.8.
Is this the default model for non-paying users? If so, that could be an interesting move in the competition for this segment.
Fable soon please.
In effective terms they're lowering prices.
I don't pay so I'm glad for the upgrade. I usually use Gemini, Mistral Le Chat (Vibe...) or Deepseek as they have way more generous free limits and I can basically spam forever.
What is the point if it is one Trump brain fart away from being blocked?
I feel like this is a bit of a disappointment. Sonnet 4 was a clear step above Opus 3.x, while this is a lot muddier.
OK, that's a one-month clock to the next Opus model at least, so that's a silver lining to a meh model.
"Our new model is proudly dumber now!"
What? If you're comparing their models in the same size class, Sonnet 5 is Pareto-optimal over Sonnet 4.6.
I think they mean per dollar in the perf/$ charts, not per marketing class. I.e. the new model is a complete Pareto failure in said perf/$ charts with the sole exception of Sonnet 5 low, which is dumb enough to have no comparison at all. Opus 4.8 delivers a better outcome per dollar, regardless of what the underlying size of the models is.
I'd generously assume this is something about the specific category of agentic task presented in the chart... but it does raise the question "then why is that category the one they chose to highlight here".
For agentic computer use Sonnet 5 low performs better than Sonnet 4.6 medium at just under half the cost, and better than Opus 4.8 low at 25% off. Their success rates are not that far off.
Agentic search is a different story, but even there it still dominates 4.6 (as in, for everything Sonnet 4.6 can do, Sonnet 5 can do it as well or better at the same or lower cost).
Yes, Opus 4.8 dominates Sonnet 5 over its entire range in both categories, but Opus's lower range is limited and there is a valid regime on the lower end where Sonnet 5 use makes economic sense. This is not the case for Sonnet 4.6 where Opus 4.8 dominates it completely on both charts.
Edit -- reading your response closer I think we're saying the same things, maybe just disagreeing on whether that lower end is valuable or not.
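For anyone unsure what "dominates" means in the comments above, here is a minimal sketch of a Pareto-dominance check on (cost, success) points. The point names and all numbers are hypothetical, chosen only to mimic the shape of the argument (an old model fully dominated, and a new low-effort config that survives because nothing cheaper exists). They are not the values from the charts.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (cost, success rate)


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    cost_a, success_a = a
    cost_b, success_b = b
    no_worse = cost_a <= cost_b and success_a >= success_b
    strictly_better = cost_a < cost_b or success_a > success_b
    return no_worse and strictly_better


def pareto_front(points: Dict[str, Point]) -> List[str]:
    """Names of points that no other point dominates."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


# Hypothetical numbers for illustration only.
POINTS = {
    "old-medium": (2.0, 0.60),
    "new-low": (1.0, 0.62),
    "new-high": (5.0, 0.74),
    "big-low": (3.0, 0.78),
}

if __name__ == "__main__":
    print(pareto_front(POINTS))
```

In this toy setup the big model's cheapest config beats the new model's high-effort config on both axes, yet the new model's low-effort config stays on the front because the big model has no config that cheap. Whether that low end is valuable is exactly the disagreement in the thread.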
So they repackaged Fable and added "don't scare the government" to the prompt
AMAZING
American AI company status: We are now bragging about how bad our models are unironically.
Okay.