Accessed via OpenRouter, this one decided to wrap the SVG pelican in HTML with controls for the animation speed: https://gisthost.github.io/?ecaad98efe0f747e27bc0e0ebc669e94...
Transcript and HTML here: https://gist.github.com/simonw/ecaad98efe0f747e27bc0e0ebc669...
At this point, drawing these pelicans must be in the training data sets.
We got an overachiever here. Kimi sounds like a teacher's pet kind of name.
There is some humor in the fact that China (of all countries) is pioneering possibly the world's most important tech via open source, while we (the US) are doing the exact opposite.
All great technological advancements have come through opening up technology. Just look at your iPhone: GPS, the internet, AI voice assistants, touchscreens, microprocessors, lithium-ion batteries, etc. all came from gov't research (I'm counting Bell Labs' gov't-mandated monopoly + research funding as gov't) that was opened up for free instead of being locked behind a patent.
Private companies will never open up a technological breakthrough to their competitors. It just doesn't make sense. If you want an entire field to advance, you have to open it up.
Additional humor: the "open" in OpenAI.
Maybe open source == communism
Good ol' Steve "Developers! Developers! Developers!" Ballmer said so a long time ago. What a visionary!
But China is not communist even though the ruling party has the word in its name.
Oh, I'm fully aware of that lol
I've always been surprised Kimi doesn't get more attention than it does. It's always stood out to me in terms of creativity, quality... has been my favorite model for awhile
Kagi has it as an option in its Assistant thing, where there is naturally a lot of searching and summarizing results. I've liked its output there and in general when asked for prose that isn't in the list/Markdown-heavy "LLM style." It's hard to do a confident comparison, but it's seemed bold in arranging the output to flow well, even when that took surgery on the original doc(s). Sometimes the surgery's needed e.g. to connect related ideas the inputs treated as separate, or to ensure it really replies to the request instead of just dumping info that's somehow related to it.
It's also one of the few models that seem capable of drawing an SVG clock
https://clocks.brianmoore.com/
Is it? In your link it definitely failed to draw the clock.
Dirt cheap on openrouter for how good it is, too. Really hoping that 2.6 carries on that tradition.
Maybe because it's a bit like unleashing a chaos monkey on your codebase? I tried it locally (K2.5 72B) and couldn't get anything useful.
Huh, that's not a thing?
The parent poster is probably referring to Kimi-Dev-72B¹, which is a much smaller and older model, while people are probably more familiar with the big and fairly powerful 1100B Kimi-K2.5².
[1] https://huggingface.co/moonshotai/Kimi-Dev-72B
[2] https://huggingface.co/moonshotai/Kimi-K2.5
Yes, it was good for its time, but it's 10 months old, which is a long time in this space. It was also a fine-tune (albeit a good one) of Qwen-2.5 72B.
I wish they did more smaller models. Kimi Linear doesn't really count; it was more of a proof-of-concept thing.
Gonna give this one a go... the previous 2.5 model is used for Cursor's Composer 2 Fast. After a few weeks of real-world tasks, I've seen that it can be very dumb or very good (better than Opus 4.7), depending on the problem you throw at it.
Sometimes a single prompt/response pass can unblock you on issues where Opus ate $100+ in API credits and circled for hours. Other times the response is useless, but it's your responsibility as an engineer to discern this.
Verdict (at least for me): use both.
Wow, if the benchmarks check out with the vibes, this could almost be a DeepSeek moment, with Chinese AI now neck and neck with SOTA models from US labs.
It's not anywhere close, and if it were, nobody in the USA would be spending 7 figures on infrastructure for it.
You LLM people here all have serious cases of Dunning-Kruger.
> It's not anywhere close
Close to what, and how are you measuring?
> nobody in the USA would be spending 7 figures on infrastructure for it
Au contraire, if AI had a moat it would pay for itself. They're funneling capital into infrastructure because they know it can't.
With the previous generation? Yes. With 10T Mythos-level models? Not even close.
The psyop continues. Mythos, until it's released, is vaporware. Notice how you can try Kimi 2.6. Where is the same for Mythos?
Mythos isn't the current generation, it's literally vaporware.
I've got a 12T model on my machine, built it myself. It's called Mytho. Too dangerous to even release a fact sheet about it. It can hack into the mainframe, enhance ultra-compressed images, grow your hair back, and make people fall in love with you.
There's no public data about Mytho.
That's because it would be too dangerous to release.
My girlfriend goes to a different school, you wouldn't know her.
Same for teleport, time travel and warp drive.
So is my P=NP proof.
According to the benchmarks, you are wrong. It is on track with, and slightly above, some SOTA. That's just the benchmarks speaking, though; they can be (and are) gamed by all the big model labs, domestic ones included.
10T? Impossible! They told us the training run was under 10^26 flops.
I have been testing it in my app all morning, and the results line up with 4.6 Sonnet. This is just a "vibe" feeling with no real testing. I'm glad we have some real competition to the "frontier" models.
I have a subscription through work and I've been trialing it; so far it looks on par with, if not better than, Opus.
I'm pretty sure Kimi is what Cursor uses for their "Composer 2" model. Works pretty well as a fallback when Claude runs out, but it's definitely a downgrade.
If only their API wasn't tied to a Google or phone login...
Beats opus 4.6! They missed claiming the frontier by a few days.
While I'm skeptical of any "beats Opus" claims (many have been made, none turned out to be true), I still think it's insane that we can now run close-to-SotA models locally on ~$100k worth of hardware, for a small team, and be 100% sure that the data stays local. Should be a no-brainer for teams that work in areas where privacy matters.
Even the smaller quantized models which can run on consumer hardware pack in an almost unfathomable amount of knowledge. I don't think I expected to be able to run a 'local Google' in my lifetime before the LLM boom.
I think this one is only about 600GB of memory usage, so it could fit on two Mac Studios with 512GB of unified memory each. That would have cost (albeit no longer available) something like less than $20k.
Yeah, but that's personal use at best; not much agentic anything happening on that hardware. Macs are great for small models at small-to-medium context lengths, but at >64k (something very common with agentic usage) they struggle and slow down a lot.
The ~$100k hardware is suitable for multi-user, small-team usage. That's what you'd use for actual work in reasonable timeframes. For personal use, sure, Macs could work.
You could run it with SSD offload, earlier experiments with Kimi 2.5 on M5 hardware had it running at 2 tok/s. K2.6 has a similar amount of total and active parameters.
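For anyone doing the napkin math on whether a model this size fits in 2×512GB: a rough sketch. The ~1T total parameter count and the 15% runtime overhead are my assumptions for illustration, not official figures.

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.15) -> float:
    """Rough weight-memory estimate: params * bytes-per-param, plus ~15%
    for KV cache, activations, and runtime buffers (a guess; it varies
    a lot with context length and engine)."""
    weight_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9  # decimal GB

# Assuming ~1000B (1T) total parameters, in the ballpark of K2-class MoE:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{model_memory_gb(1000, bits):,.0f} GB")
```

At 4-bit this lands around 575 GB, which is consistent with the ~600GB figure upthread and would just squeeze into two 512GB machines; 8-bit and 16-bit clearly would not.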
Opus is clearly a sidegrade meant to help Anthropic manage cost, so I would say they may have it if it actually beats 4.6
Could be right. I just noticed my feed is absent the usual flood of posts demoing the new hotness on 3D modeling, game design and SVG drawings of animals on vehicles.
It doesn't beat Opus 4.6, no way, don't be fooled by benchmarks.
Really excited to try this one. I've been using Kimi 2.5 for design and it's really good, but borderline useless on backend/advanced tasks.
Also discovered that using OpenCode instead of the Kimi CLI really hurts the model's performance (2.5).
If the benchmarks are private, how do we reproduce the results? I looked up the Humanity's Last Exam (https://agi.safe.ai/) this model uses and I can't seem to access it.
Wow, $0.95 input / $4 output. If it's anywhere near Opus 4.6, that's incredible.
This should erase any doubt that AI Labs are making $$$ on API inference.
Kimi 2.5 (which this is based on) is served at $0.44 input / $2 output by a ton of different providers on OpenRouter, 2.6 will certainly be similar.
That's about 11X less than Opus for similar smarts.
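The "11X" figure is easy to sanity-check. A sketch using the Kimi prices quoted above; the Opus list prices here are my assumption, not something from this thread, so check current pricing before relying on the exact ratio.

```python
# Per-million-token prices in USD. Kimi figures are the OpenRouter prices
# quoted upthread; the Opus figures are ASSUMED list prices for illustration.
kimi = {"input": 0.44, "output": 2.00}
opus = {"input": 5.00, "output": 25.00}  # assumption

for kind in ("input", "output"):
    ratio = opus[kind] / kimi[kind]
    print(f"{kind}: Opus is ~{ratio:.1f}x the price")
```

Under those assumed prices the ratio comes out around 11x on input and 12.5x on output, which is in line with the "about 11X" claim.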
Famously, OpenAI and Anthropic are devoted to increasing efficiency before scaling up resource usage.
How does it erase any doubt? You're implying Chinese things can't actually be cheaper to produce than American ones, which is laughable.
Exciting benchmarks if true. What kind of hardware do they typically run these benchmarks on? Apologies if my terminology is off, but I assume they're using an unquantized version that wouldn't run on even the beefiest MacBook?
Running it through opencode to their API and... it definitely seems like it's "overthinking" -- watching the thought process, it's been going for pages and pages and pages diagnosing and "thinking" things through... without doing anything. Sitting at 50k+ output tokens used now just going in thought circles, complete analysis paralysis.
Might be a configuration or prompt issue. I guess I'll wait and see, but I can't get use out of this now.
https://huggingface.co/moonshotai/Kimi-K2.6
Is this the same model?
Unsloth quants: https://huggingface.co/unsloth/Kimi-K2.6-GGUF
(work in progress, no gguf files yet, header message saying as much)
Quite curious how well real usage will back the benchmarks, because even if it's only Opus ballpark, open weights Opus ballpark is seismic.
I pray the benchmark figures are true so I can stop paying Anthropic, after they screwed me over this quarter by dumbing down their models, making usage quotas ridiculously small, and demanding KYC paperwork.
Anthropic has done horrible PR and investors should be livid.
My theory is they pushed retail off their systems to make room for their new corporate fat cat clients. In which case, they'll do just fine.
Isn't this better than Qwen?
The choice of example task for Long-Horizon Coding is a bit spooky if you squint, since it's nearing the territory of LLMs improving themselves.
K2.5 was already pretty decent so I would try this. Starting at $15/month: https://www.kimi.com/membership/pricing
edit: Note that you can run it yourself with sufficient resources, or access it from other providers too: https://openrouter.ai/moonshotai/kimi-k2.6/providers
What's the privacy/data security like? I can't find that on that page.
Edit: found it.
> We may use your Content to operate, maintain, improve, and develop the Services, to comply with legal obligations, to enforce our policies, and to ensure security. You may opt out of allowing your Content to be used for model improvement and research purposes by contacting us at membership@moonshot.ai. We will honor your choice in accordance with applicable law.
Section 3 of https://www.kimi.com/user/agreement/modelUse?version=v2
You really rely on the ToS from Anthropic/OpenAI to know whether they use your prompts or not? It's on their servers; why wouldn't they use our data?
How are the usage limits compared to Anthropic?
Anthropic has the worst usage limits in the industry