Nah, I ain't reading that. If they can't be bothered to get a human to write it, it can't be that important. I'm glad for them though. Or sorry that happened.
The writing is actually disgusting. It’s obviously LLM writing, but it veers atypically douchey even for LLM writing, which really makes me wonder what the prompt was.
Reading posts like this gives me that same sort of feeling as when I used to get customer support in stilted English from a CS agent named “Michael”.
I test drove it yesterday. It's pretty impressive at 8b. Runs on commodity hardware quickly.
Qwen3.6 35b a3b is still my local champion but I may use this for auto complete and small tasks. Granite has recent training data which is nice. If the other small models got fine tuned on recent data I don't know if I would use this at all, but that alone makes it pretty decent.
The 4b they released was not good for my needs but could probably handle tool calls or something
Have you tried the Gemma 4 series, out of curiosity? I haven’t run a local model in a while, but the benchmarks look good. I’d take a free local tool-use model if it was relatively consistent.
Qwen 3.6 burns it to the ground. it was not even a challenge. Gemma4 seriously fails at toolcalls and agentic works. It got all messed up after 2-3 turns of Vibecoding.
Gemma 4 31b was working ok for me; but it was consuming tons of memory on SWA checkpoints, I had to turn them way down, and as a 31b dense model is fairly slow on a Strix Halo. I did have a lot of tool calling issues on 26b-a4b, though.
I tried the Gemma 4 I think 2 and 4b. The 2b was not useful for me at all. A little too weak for my use cases
The 4b was okay. It didn't get all of my small math questions right, it didn't know about some of the libraries I use, but it was able to do some basic auto complete type stuff. For microscopic models I like the llama 3.2 3b more right now for what I do, it's a little faster and seems a little stronger for what I do. But everyone is different and I don't think I'll use it anymore this past month has been crazy for local model releases.
For me, I use them for quick auto complete or small questions. I am not a vibe/agentic coder. I know I am a relic and a Luddite because of this.
Instead of hitting stack overflow and Google I will ask questions like "can you give me an example of how to do x in library y?" Or "this error is appearing what might be happening if I checked a b and c". Or "please write unit tests for this function". Or code auto complete.
I am not looking for the world's best answer from a 3b model. I am looking for a super fast answer that reminds me of things I already know or maybe just maybe gives me a fast idea to stub something while I focus on something more important, I am going to refactor anyways. Think a low quality rubber duck
I mostly use 7-9b models for this now but llama 3.2 3b is pretty decent for not hogging resources while say I have other compute heavy operations happening on a weak computer.
Probably half the questions people ask chatgpt could get roughly the same quality of answer with a small model in my opinion. You can't fully trust an LLM anyways so the difference between 60% and 70% accuracy isn't as much are marketing makes it sound like. That said the quality of a good 7-9b model is worth it compared to a 3b if your machine can run it. Furthermore the quality of qwen 36 is crazy and makes me wonder if I will ever need an AI provider again if the trend continues.
Qwen3-Coder-Next seems to be perfect sized for coding. I tried the new and just found the verbosity not really useful for coding. But probably for more analytical tasks or writing docs.
No comparison with competitor models other than the previous granite version strongly implies that it does not compete well with other comparable models. At least this is the most reasonable assumption until data comes out to the contrary
Qwen scores above sonnet in coding benchmarks. Runs locally. In personal use it's really good. Anecdotally others have used it to vibe code or agentic code successfully. Not toy problems. Not a toy model.
Qwen3.6 raises the bar for models of its size. There really isn't a comparison in my opinion.
College SAT scores do not tell you how the dev applying for your open back end systems engineering job is going to do once they're in your workplace harness.
Nor do class standings, nor hackerrank and the like.
What will tell you is asking them to fix a thing in your codebase. Once you ask an LLM to do that, a dozen times, I'd argue it's no longer "just your opinion man", it's a context-engineered performance x applicability assessment.
And it is very predictive.
But it's also why someone doing well at job A isn't necessarily going to be great at B, or bad at A doesn't mean will necessarily be bad at B.
I've often felt we should normalize a sort of mutual try-buy period where job-change seeker and company can spend a series of days without harming one's existing employment, to derisk the mutual learning. ESPECIALLY to derisk the career change for the applicant who only gets one timeline to manage, opposed to company that considers the applicant fungible.
But back to the LLM, yeah, the only valid opinion on whether it works for you is not benchmark, it's an informed opinion from 'using it in anger'.
It's looking like running your own mini ecosystem is the way of the future to me. No data centers, just a decent GPU 16-24gb of VRAM, CPU, and 32gb of RAM.
I'm pretty sure there's someone somewhere who'll create a proper harness that's equivalent to one giant model. The difficulty is mostly local hardware has lot of memory constraints. Targeting 128GB would seem to be the current sweet spot. If we could get out of the corporate market movers of buying up all the memory, we could maybe have more.
Regardless, the people in the 80s capable of pruning programs to fit on small devices is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games among other constraints.
Interesting to see a pivot away from MoE by both IBM and mistral while the larger classes of SOTA of models all seem to be sticking to it.
Quick vibe check of it- 8B @ Q6 - seems promising. Bit of a clinical tone, but can see that being useful for data processing and similar. You don't really want a LLM that spams you with emojis sometimes...
Makes sense, dense for small models, dense or MoE for larger ones, end up fitting various hardware setups pretty neatly, no need for MoE at smaller scale and dense too heavy at large scale.
Third line in to the article: "But there’s one result in the benchmarks I keep coming back to."
I hear this sort of thing all the time now on YouTube from media/news personalities:
“And that’s the part nobody seems to be talking about.”
"And here's what keeps me up at night."
“This is where the story gets complicated.”
“Here’s the piece that doesn’t quite fit.”
“And this is where the usual explanation starts to break down.”
“Here’s what I can’t stop thinking about.”
“The part that should worry us is not the obvious one.”
“And that’s where the real problem begins.”
“But the more interesting question is the one no one is asking.”
“And this is where things stop being simple.”
It doesn't really worry me but I think its interesting that LLM speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.
I've never studied what LLMs say in depth is it is interesting that my brain recognises the speech pattern so easily.
I think this kind of language predates widespread LLM use, and has been picked up from that kind of writing. It's a "and here's where it gets interesting" pattern that people like Malcolm Gladwell and Freakonomics have used, even if the same thing could be said in a way that makes it sound much less intriguing.
The language of drama and import without meaningful substance. Words statistically likely to be used in a segue, regardless of the preceding or subsequent point. Particularly effective when it seems like you’re getting let in on a secret. Really fatiguing to read
A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”
I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable
I notice this very often in LinkedIn posts, and it's annoying, but I had not realized it was LLM-speak? Isn't it possible that people write like this naturally?
I don't really see reason to complain about tool use, so long as the result is cohesive, accurate and that ultimately means a human has at least read their own output before publishing. It's a bit like receiving a supposedly personal letter that starts "Dear [INSERT_FIRST_NAME_FIELD]," are you really going to read such a thing?
My opinion is that literature and art will continue pushing the envelope in the places they always pushed the envelope. LLMs will not change this, humans love making art, and they love doing it in new ways.
Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.
If you really think about why MoE came into existence, its to save significant cost during training, I don't think there was any concrete evidence of performance gains for comparable MoE vs dense models. Over the years, I believe all the new techniques being employed in post training have made the models better.
I think you mean inference compute? I believe all expert weights are updated in each backward pass during MoE training. The first benefit was getting a sort of structured pruning of weights through the mechanism of expert selection so that the model didn’t need to go through ‘unnecessary’ parts of the model for a given token. This then let inference use memory more efficiently in memory constrained environments, where non-hot or less common experts could be put into slow RAM, or sometimes even streamed off storage.
But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!
Each token is only routed through a few chosen (topk) experts during training. So not all expert weights are updated in the backward pass. Otoh, you may need more training to ensure all experts see enough tokens!
I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.
MoE models will have far more world knowledge than dense models with the same amount of active parameters. MoE is a no-brainer if your inference setup is ultimately limited by compute or memory throughput - not total memory footprint - or alternately if it has fast, high-bandwidth access to lower-tier storage to fetch cold model weights from on demand.
qwen3.5 9b outperforms granite 4.1 30b by a huge amount (32 vs 15 on artificialanalysis benchmark)... i have no idea what made the writer of this article say so many demonstrably incorrect things
[delayed]
Nah, I ain't reading that. If they can't be bothered to get a human to write it, it can't be that important. I'm glad for them though. Or sorry that happened.
This is the official announcement: https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...
It is not the researchers' fault that some slop got posted here instead.
The writing is actually disgusting. It’s obviously LLM writing, but it veers atypically douchey even for LLM writing, which really makes me wonder what the prompt was.
Reading posts like this gives me that same sort of feeling as when I used to get customer support in stilted English from a CS agent named “Michael”.
https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...
Original article on IBM research
Hugging face weights: https://huggingface.co/collections/ibm-granite/granite-41-la...
I test drove it yesterday. It's pretty impressive at 8b. Runs on commodity hardware quickly.
Qwen3.6 35b a3b is still my local champion but I may use this for auto complete and small tasks. Granite has recent training data which is nice. If the other small models got fine tuned on recent data I don't know if I would use this at all, but that alone makes it pretty decent.
The 4b they released was not good for my needs but could probably handle tool calls or something
Have you tried the Gemma 4 series, out of curiosity? I haven’t run a local model in a while, but the benchmarks look good. I’d take a free local tool-use model if it was relatively consistent.
Qwen 3.6 burns it to the ground. it was not even a challenge. Gemma4 seriously fails at toolcalls and agentic works. It got all messed up after 2-3 turns of Vibecoding.
How do you run it? vllm? llama.cpp?
Can you share some parameters you enable tool calling and agentic usage?
Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?
I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.
It concocts some misleading paths, but the code often compiles, and I consider that a victory.
You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.
Gemma 4 31b was working ok for me; but it was consuming tons of memory on SWA checkpoints, I had to turn them way down, and as a 31b dense model is fairly slow on a Strix Halo. I did have a lot of tool calling issues on 26b-a4b, though.
The Qwen models are quite solid though.
Gemma4 is definitely not used for vibe/agentic coding. Not even worth trying. But its a different weight class.
I tried the Gemma 4 I think 2 and 4b. The 2b was not useful for me at all. A little too weak for my use cases
The 4b was okay. It didn't get all of my small math questions right, it didn't know about some of the libraries I use, but it was able to do some basic auto complete type stuff. For microscopic models I like the llama 3.2 3b more right now for what I do, it's a little faster and seems a little stronger for what I do. But everyone is different and I don't think I'll use it anymore this past month has been crazy for local model releases.
can you share your use cases for 2b and 4b models?
curious how people are leveraging these models
For me, I use them for quick auto complete or small questions. I am not a vibe/agentic coder. I know I am a relic and a Luddite because of this.
Instead of hitting stack overflow and Google I will ask questions like "can you give me an example of how to do x in library y?" Or "this error is appearing what might be happening if I checked a b and c". Or "please write unit tests for this function". Or code auto complete.
I am not looking for the world's best answer from a 3b model. I am looking for a super fast answer that reminds me of things I already know or maybe just maybe gives me a fast idea to stub something while I focus on something more important, I am going to refactor anyways. Think a low quality rubber duck
I mostly use 7-9b models for this now but llama 3.2 3b is pretty decent for not hogging resources while say I have other compute heavy operations happening on a weak computer.
Probably half the questions people ask chatgpt could get roughly the same quality of answer with a small model in my opinion. You can't fully trust an LLM anyways so the difference between 60% and 70% accuracy isn't as much are marketing makes it sound like. That said the quality of a good 7-9b model is worth it compared to a 3b if your machine can run it. Furthermore the quality of qwen 36 is crazy and makes me wonder if I will ever need an AI provider again if the trend continues.
Qwen3-Coder-Next seems to be perfect sized for coding. I tried the new and just found the verbosity not really useful for coding. But probably for more analytical tasks or writing docs.
Yea, No doubt Qwen 3.6 open weights are far more strong
Why no doubt?
No comparison with competitor models other than the previous granite version strongly implies that it does not compete well with other comparable models. At least this is the most reasonable assumption until data comes out to the contrary
Qwen 36 is effectively a pocket sized frontier model. It's really surprising for me anyway
Because Qwen 3.6 pushes way above its weight. Granite 8B is impressive, but Qwen still wins on raw capability, especially for coding.
You just asserted the same thing again. Why do you say this is the case?
Qwen scores above sonnet in coding benchmarks. Runs locally. In personal use it's really good. Anecdotally others have used it to vibe code or agentic code successfully. Not toy problems. Not a toy model.
Qwen3.6 raises the bar for models of its size. There really isn't a comparison in my opinion.
Having tried it.
Qwen is really good.
Also, generally, it makes sense. 8B models are generally not very good^.
That this 8B model is decent is impressive, but that it could perform on par with a good model 4 times as large is a daydream.
^ - To be polite. The small models + tool use for coding agents are almost universally ass. Proof: my personal experience. Ive tried many of them.
So it’s just like, your opinion, man?
edit: It was a play on The Big Lebowski, folks.
College SAT scores do not tell you how the dev applying for your open back end systems engineering job is going to do once they're in your workplace harness.
Nor do class standings, nor hackerrank and the like.
What will tell you is asking them to fix a thing in your codebase. Once you ask an LLM to do that, a dozen times, I'd argue it's no longer "just your opinion man", it's a context-engineered performance x applicability assessment.
And it is very predictive.
But it's also why someone doing well at job A isn't necessarily going to be great at B, or bad at A doesn't mean will necessarily be bad at B.
I've often felt we should normalize a sort of mutual try-buy period where job-change seeker and company can spend a series of days without harming one's existing employment, to derisk the mutual learning. ESPECIALLY to derisk the career change for the applicant who only gets one timeline to manage, opposed to company that considers the applicant fungible.
But back to the LLM, yeah, the only valid opinion on whether it works for you is not benchmark, it's an informed opinion from 'using it in anger'.
the (dead) internet is full of opinions exactly like this
you tried qwen3.6 and you think it is not good?
I do not have high opinions of any ai model.
Way above its weights.
Nanobanana for scale.
The real "sleeper" might be https://huggingface.co/ibm-granite/granite-vision-4.1-4b if the benchmarks hold up for such a small model against frontier models for table & semantic k:v extraction.
Woah, is this part of the future of models? Basically little models you can use as tools.
It's looking like running your own mini ecosystem is the way of the future to me. No data centers, just a decent GPU 16-24gb of VRAM, CPU, and 32gb of RAM.
Eventually we'll have models small enough to do a single thing really well and we'll call them functions.
I'm pretty sure there's someone somewhere who'll create a proper harness that's equivalent to one giant model. The difficulty is mostly local hardware has lot of memory constraints. Targeting 128GB would seem to be the current sweet spot. If we could get out of the corporate market movers of buying up all the memory, we could maybe have more.
Regardless, the people in the 80s capable of pruning programs to fit on small devices is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games among other constraints.
Very impressive series of SLM by IBM here.
I have been using it with their Chunkless RAG concept and it is fitting very well! (for curious https://github.com/scub-france/Docling-Studio)
I convinced that SLM are a real parto of solution for true integrated AI in process...
Interesting to see a pivot away from MoE by both IBM and mistral while the larger classes of SOTA of models all seem to be sticking to it.
Quick vibe check of it- 8B @ Q6 - seems promising. Bit of a clinical tone, but can see that being useful for data processing and similar. You don't really want a LLM that spams you with emojis sometimes...
Makes sense, dense for small models, dense or MoE for larger ones, end up fitting various hardware setups pretty neatly, no need for MoE at smaller scale and dense too heavy at large scale.
I never want LLM to span me with emojis. What is the use case for that? I find it highly annoying.
Shh people are paying for each token. Don't get them asking too many questions
Think it can be a plus in moderation. eg in openclaw it can add some character
But yea dislike that style where each heading and bullet point gets an emoji
> Full stop.
Why people don't edit out obvious sloppification and expect to still have readers left
Third line in to the article: "But there’s one result in the benchmarks I keep coming back to."
I hear this sort of thing all the time now on YouTube from media/news personalities:
“And that’s the part nobody seems to be talking about.”
"And here's what keeps me up at night."
“This is where the story gets complicated.”
“Here’s the piece that doesn’t quite fit.”
“And this is where the usual explanation starts to break down.”
“Here’s what I can’t stop thinking about.”
“The part that should worry us is not the obvious one.”
“And that’s where the real problem begins.”
“But the more interesting question is the one no one is asking.”
“And this is where things stop being simple.”
It doesn't really worry me but I think its interesting that LLM speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.
I've never studied what LLMs say in depth is it is interesting that my brain recognises the speech pattern so easily.
I listened to a lot of NPR podcasts before LLM were around, and most of them are full of these kinds of filler phrases.
I think this kind of language predates widespread LLM use, and has been picked up from that kind of writing. It's a "and here's where it gets interesting" pattern that people like Malcolm Gladwell and Freakonomics have used, even if the same thing could be said in a way that makes it sound much less intriguing.
There's even a word for it: “cliché”
How banal
The language of drama and import without meaningful substance. Words statistically likely to be used in a segue, regardless of the preceding or subsequent point. Particularly effective when it seems like you’re getting let in on a secret. Really fatiguing to read
A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”
I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable
Maybe the solution is to cull the bad, cliché writing from the training data.
You can just instruct the LLM not to write like an LLM.
Ugh, you're making me remember the last time I listened to NPR. It's so bad.
I listen to NPR daily and I don't think I've ever heard any of them use that phrasing.
I notice this very often in LinkedIn posts, and it's annoying, but I had not realized it was LLM-speak? Isn't it possible that people write like this naturally?
Arguably it's exactly because it was used naturally so often that the LLMs parrot it so frequently.
I think LLM's have that sort of "summarise, wrap it in a bow tie, give a little dramatic punch as a preview to the next few points".
Yes. Some people are very trigger happy in attributing human slop to LLMs.
Apparently John Oliver was an LLM before they were even invented.
So are we saying it's fine that the article is written by an LLM as long as it doesn't have the tell-tale signs of LLMs?
It's more about curating the things you're publishing. Why would I bother reading what you couldn't bother to read?
I don't really see reason to complain about tool use, so long as the result is cohesive, accurate and that ultimately means a human has at least read their own output before publishing. It's a bit like receiving a supposedly personal letter that starts "Dear [INSERT_FIRST_NAME_FIELD]," are you really going to read such a thing?
An article without telltale signs of an LLM is indistinguishable from an article written by a human, so yes.
My opinion is that literature and art will continue pushing the envelope in the places they always pushed the envelope. LLMs will not change this, humans love making art, and they love doing it in new ways.
Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.
Are you referring to the literal use of the expression "full stop"? I don't see it anymore in the article, maybe they edited it out?
IBM announcement: https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...
If you really think about why MoE came into existence, its to save significant cost during training, I don't think there was any concrete evidence of performance gains for comparable MoE vs dense models. Over the years, I believe all the new techniques being employed in post training have made the models better.
I think you mean inference compute? I believe all expert weights are updated in each backward pass during MoE training. The first benefit was getting a sort of structured pruning of weights through the mechanism of expert selection so that the model didn’t need to go through ‘unnecessary’ parts of the model for a given token. This then let inference use memory more efficiently in memory constrained environments, where non-hot or less common experts could be put into slow RAM, or sometimes even streamed off storage.
But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!
Each token is only routed through a few chosen (topk) experts during training. So not all expert weights are updated in the backward pass. Otoh, you may need more training to ensure all experts see enough tokens!
I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.
MoE models will have far more world knowledge than dense models with the same amount of active parameters. MoE is a no-brainer if your inference setup is ultimately limited by compute or memory throughput - not total memory footprint - or alternately if it has fast, high-bandwidth access to lower-tier storage to fetch cold model weights from on demand.
qwen3.5 9b outperforms granite 4.1 30b by a huge amount (32 vs 15 on artificialanalysis benchmark)... i have no idea what made the writer of this article say so many demonstrably incorrect things
> models are judged by GPT-4
An interesting choice
Wish they also released an embedding model, in the line of their previous: compact (while good)...
They did:
https://huggingface.co/collections/ibm-granite/granite-embed...
311M and 97M versions.
sounds interesting. Here's hoping they release a 32B model, thats a pretty good sweet spot for feasibility of home setups.
edit: I just realised they do actually have a 30b release alongside this. Haven't tried it yet.
Try qwen 3.6. it will knock your socks off
"open source"
show me.