The key part of the article is that token structure interpretation is a training time concern, not just an input/output processing concern (which still leads to plenty of inconsistency and fragmentation on its own!). That means both that training stakeholders at model development shops need to be pretty incorporated into the tool/syntax development process, which leads to friction and slowdowns. It also means that any current improvements/standardizations in the way we do structured LLM I/O will necessarily be adopted on the training side after a months/years lag, given the time it takes to do new-model dev and training.
That makes for a pretty thorny mess ... and that's before we get into disincentives for standardization (standardization risks big AI labs' moat/lockin).
I guess I fail to see why this is such a problem. Yes it would be nice if the wire format were standardized or had a standard schema description, but is writing a parser that handles several formats actually a difficult problem? Modern models could probably whip up a "libToolCallParser" with bindings for all popular languages in an afternoon. Could probably also have an automated workflow for adding any new ones with minimal fuss. An annoyance, yes, but it does not seem like a really "hard" problem. It seems more of a social problem that open source hasn't coalesced around a library that handles it easily yet or am I missing something?
There already exist products like LiteLLM that adapt tool calling to different providers. FWIW, incompatibility isn't just an opensource problem - OpenAI and Anthropic also use different syntax for tool registration and invocation.
I would guess that lack of standardization of what tools are provided by different agents is as much of a problem as the differences in syntax, since the ideal case would be for a model to be trained end-to-end for use with a specific agent and set of tools, as I believe Anthropic do. Any agent interacting with a model that wasn't specifically trained to work with that agent/toolset is going to be at a disadvantage.
Presumably the hosting services are resolving all of this in their OpenAI/Anthropic compatibility layer, which is what most tools are using. So this is really just a problem for local engines that have to do the same thing but are expected to work right away for every new model drop.
One of the most relevant posts about AI on HN this year. It's not hype-y, but it's imperative to discuss.
I find it strange that the industry hasn't converged in at least somewhat standardized format, but I guess despite all the progress we're still in the very early days...
This is one of the first tech waves where I feel like I'm on the very very groundfloor for a lot of exploration and it only feels like people have been paying closer attention in the last year. I can't imagine too many 'standard' standards becoming a standard that quickly.
It's new enough that Google seems to be throwing pasta against the wall and seeing what products and protocols stick. Antigravity for example seems too early to me, I think they just came out with another type of orchestrator, but the whole field seems to be exploring at the same time.
Everyone and their uncle is making an orchestrator now! I take a very cautious approach lately where I haven't been loading up my tools like agents, ides, browsers, phones with too much extra stuff because as soon as I switch something or something new comes out that doesn't support something I built a workflow around the tool either becomes inaccessible to me, or now a bigger learning curve than I have the patience for.
I've been a big proponent of trying to get all these things working locally for myself (I need to bite the bullet on some beefy video cards finally), and even just getting tool calls to work with some qwen models to be so counterintuitive.
The native way to skip all that is train a small thingy to map hidden state -> token/thingy you care about once per model family, or just do it once and procrustes over the state from the model you're using to whatever you made the map for.
In Greek mythology, Procrustes (/proʊˈkrʌstiːz/; Greek: Προκρούστης Prokroustes, "the stretcher [who hammers out the metal]"), also known as Prokoptas, Damastes (Δαμαστής, "subduer") or Polypemon, was a rogue smith and bandit from Attica who attacked people by stretching them or cutting off their legs, so as to force them to fit the size of an iron bed
I can't figure out if you meant that or not, it kinda fits. (No pun intended)
Useful article, I was fighting with GLM's tool calling format just last night. Stripping and sanitization to make it compatible with my UI consistently has been... fun.
Does anyone know why there hasn’t been more widespread adoption of OpenAI’s Harmony format? Or will it just take another model generation to see adoption?
MCP is the wire format between agent and tool, not the format the model itself uses to emit the call. That part (Harmony, JSON, XML-ish) is still model-specific. So the M×N the article describes is really two problems stacked — MCP only solves the lower half.
Also in practice Claude Code, Cursor and Codex handle the same MCP tool differently — required params, tool descriptions, response truncation. So MCP gives you the contract but the client UX still leaks.
But, like pancakes, usually the stack is described as building bottom-up. Can you relate the individual components to ingredients in a diner-style pancake breakfast?
Feedback: I don't usually comment on formatting, but that fat indent is too much. I applied "hide distracting items" to the graphic, and the indent is still there. Reader works.
The models only output text. Tool calls are nothing more than specially formatted text which gets parsed and interpreted by the inference server (or some other driver) into something which can be picked up by your agent loop and executed. Models are trained in a wide variety of different delimiters and escape characters to indicate their tool calls (along with things like separate thinking blocks). MCP is mostly a standard way to share with your agent loop the list of tool names and what their arguments are, which then gets passed to the inference server which then renders it down to text to feed to the model.
> Tool calls are nothing more than specially formatted text which gets parsed and interpreted by the inference server
I know this is getting off-topic, but is anybody working on more direct tool calling?
LLMs are based on neural networks, so one could create an interface where activating certain neurons triggers tool calls, with other neurons encoding the inputs; another set of neurons could be triggered by the tokenized result from the tool call.
Currently, the lack of separation between data and metadata is a security nightmare, which enables prompt injection. And yet all I've seen done about is are workarounds.
> LLMs are based on neural networks, so one could create an interface where activating certain neurons triggers tool calls, with other neurons encoding the inputs; another set of neurons could be triggered by the tokenized result from the tool call.
You can do this. It's just sticking a different classifier head on top of the model.
Before foundation models it was a standard Deep RL approach. It probably still is within that space (I haven't kept up on the research).
You don't hear about it here because if you do that then every use case needs a custom classifier head which needs to be trained on data for that use case. It negates the "single model you can use for lots of things" benefit of LLMs.
I'm a novice in this area, but my understanding is that LLM parameters ("neurons", roughly?), when processed, encode a probability for token selection/generation that is much more complex and many:one than "parameter A is used in layer B, therefore suggest token C", and not a specific "if activated then do X" outcome. Given that, how would this work?
Each text token already represents the activation of certain neurons. There is nothing "more direct." And you cannot fully separate data and metadata if you want them to influence the output. At best you can clearly distinguish them and hope that this is enough for the model to learn to treat them differently.
The key part of the article is that token structure interpretation is a training time concern, not just an input/output processing concern (which still leads to plenty of inconsistency and fragmentation on its own!). That means both that training stakeholders at model development shops need to be pretty incorporated into the tool/syntax development process, which leads to friction and slowdowns. It also means that any current improvements/standardizations in the way we do structured LLM I/O will necessarily be adopted on the training side after a months/years lag, given the time it takes to do new-model dev and training.
That makes for a pretty thorny mess ... and that's before we get into disincentives for standardization (standardization risks big AI labs' moat/lockin).
I guess I fail to see why this is such a problem. Yes it would be nice if the wire format were standardized or had a standard schema description, but is writing a parser that handles several formats actually a difficult problem? Modern models could probably whip up a "libToolCallParser" with bindings for all popular languages in an afternoon. Could probably also have an automated workflow for adding any new ones with minimal fuss. An annoyance, yes, but it does not seem like a really "hard" problem. It seems more of a social problem that open source hasn't coalesced around a library that handles it easily yet or am I missing something?
Author here. You're right, it's not a hard problem, but a particularly annoying one.
There already exist products like LiteLLM that adapt tool calling to different providers. FWIW, incompatibility isn't just an opensource problem - OpenAI and Anthropic also use different syntax for tool registration and invocation.
I would guess that lack of standardization of what tools are provided by different agents is as much of a problem as the differences in syntax, since the ideal case would be for a model to be trained end-to-end for use with a specific agent and set of tools, as I believe Anthropic do. Any agent interacting with a model that wasn't specifically trained to work with that agent/toolset is going to be at a disadvantage.
Presumably the hosting services are resolving all of this in their OpenAI/Anthropic compatibility layer, which is what most tools are using. So this is really just a problem for local engines that have to do the same thing but are expected to work right away for every new model drop.
Maybe they could vibe code some sort of, I don't know, a Web Service Description Language. That could describe how to interact with a service.
One of the most relevant posts about AI on HN this year. It's not hype-y, but it's imperative to discuss.
I find it strange that the industry hasn't converged in at least somewhat standardized format, but I guess despite all the progress we're still in the very early days...
Sounds like we need another standard. /s
This is one of the first tech waves where I feel like I'm on the very very groundfloor for a lot of exploration and it only feels like people have been paying closer attention in the last year. I can't imagine too many 'standard' standards becoming a standard that quickly.
It's new enough that Google seems to be throwing pasta against the wall and seeing what products and protocols stick. Antigravity for example seems too early to me, I think they just came out with another type of orchestrator, but the whole field seems to be exploring at the same time.
Everyone and their uncle is making an orchestrator now! I take a very cautious approach lately where I haven't been loading up my tools like agents, ides, browsers, phones with too much extra stuff because as soon as I switch something or something new comes out that doesn't support something I built a workflow around the tool either becomes inaccessible to me, or now a bigger learning curve than I have the patience for.
I've been a big proponent of trying to get all these things working locally for myself (I need to bite the bullet on some beefy video cards finally), and even just getting tool calls to work with some qwen models to be so counterintuitive.
Depending on a vendors market position, they may not want to make it easy to switch, which is what standards do, no?
The native way to skip all that is train a small thingy to map hidden state -> token/thingy you care about once per model family, or just do it once and procrustes over the state from the model you're using to whatever you made the map for.
In Greek mythology, Procrustes (/proʊˈkrʌstiːz/; Greek: Προκρούστης Prokroustes, "the stretcher [who hammers out the metal]"), also known as Prokoptas, Damastes (Δαμαστής, "subduer") or Polypemon, was a rogue smith and bandit from Attica who attacked people by stretching them or cutting off their legs, so as to force them to fit the size of an iron bed
I can't figure out if you meant that or not, it kinda fits. (No pun intended)
well yes and no, i meant https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem which, yes, is named for that stretchster
Useful article, I was fighting with GLM's tool calling format just last night. Stripping and sanitization to make it compatible with my UI consistently has been... fun.
Does anyone know why there hasn’t been more widespread adoption of OpenAI’s Harmony format? Or will it just take another model generation to see adoption?
MCP is the wire format between agent and tool, not the format the model itself uses to emit the call. That part (Harmony, JSON, XML-ish) is still model-specific. So the M×N the article describes is really two problems stacked — MCP only solves the lower half.
Also in practice Claude Code, Cursor and Codex handle the same MCP tool differently — required params, tool descriptions, response truncation. So MCP gives you the contract but the client UX still leaks.
But, like pancakes, usually the stack is described as building bottom-up. Can you relate the individual components to ingredients in a diner-style pancake breakfast?
Great article, but your site background had me trying to clean my laptop screen thinking I splashed coffee on it.
Ooops sorry
This is a real problem. The function calling format fragmentation across models makes it painful to build anything provider-agnostic.
Don't inference servers like vllm or sglang just translate these things to openai-compat API shapes?
https://mariozechner.at/posts/2025-11-30-pi-coding-agent/#to...
Clicking that directly yields: "hi orange site user, i'd prefer my stuff to stay off the radar of this particular community."
Feedback: I don't usually comment on formatting, but that fat indent is too much. I applied "hide distracting items" to the graphic, and the indent is still there. Reader works.
This sounds like a problem that LLMs were built to solve.
Not fast enough and increases attack surface
Am I misunderstanding, or isn't this supposed to be the point of MCP?
The models only output text. Tool calls are nothing more than specially formatted text which gets parsed and interpreted by the inference server (or some other driver) into something which can be picked up by your agent loop and executed. Models are trained in a wide variety of different delimiters and escape characters to indicate their tool calls (along with things like separate thinking blocks). MCP is mostly a standard way to share with your agent loop the list of tool names and what their arguments are, which then gets passed to the inference server which then renders it down to text to feed to the model.
> Tool calls are nothing more than specially formatted text which gets parsed and interpreted by the inference server
I know this is getting off-topic, but is anybody working on more direct tool calling?
LLMs are based on neural networks, so one could create an interface where activating certain neurons triggers tool calls, with other neurons encoding the inputs; another set of neurons could be triggered by the tokenized result from the tool call.
Currently, the lack of separation between data and metadata is a security nightmare, which enables prompt injection. And yet all I've seen done about is are workarounds.
> LLMs are based on neural networks, so one could create an interface where activating certain neurons triggers tool calls, with other neurons encoding the inputs; another set of neurons could be triggered by the tokenized result from the tool call.
You can do this. It's just sticking a different classifier head on top of the model.
Before foundation models it was a standard Deep RL approach. It probably still is within that space (I haven't kept up on the research).
You don't hear about it here because if you do that then every use case needs a custom classifier head which needs to be trained on data for that use case. It negates the "single model you can use for lots of things" benefit of LLMs.
I'm a novice in this area, but my understanding is that LLM parameters ("neurons", roughly?), when processed, encode a probability for token selection/generation that is much more complex and many:one than "parameter A is used in layer B, therefore suggest token C", and not a specific "if activated then do X" outcome. Given that, how would this work?
Each text token already represents the activation of certain neurons. There is nothing "more direct." And you cannot fully separate data and metadata if you want them to influence the output. At best you can clearly distinguish them and hope that this is enough for the model to learn to treat them differently.
Are there tokens reserved for tool calls? If yes, I can see the equivalence. If not, not so much.
Yes, typically the tags used for tool calls get their own special tokens, e.g. https://huggingface.co/google/gemma-4-E4B-it/blob/main/token...