Everything to do with LLM prompts reminds me of people doing regexes to try and sanitise input against SQL injections a few decades ago, just papering over the flaw but without any guarantees.
It's weird seeing people just adding a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" to the prompt and hoping, to me it's just an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second you put any user input into the prompt.
The principal security problem of LLMs is that there is no architectural boundary between data and control paths.
But this combination of data and control into a single, flexible data stream is also the defining strength of a LLM, so it can’t be taken away without also taking away the benefits.
I have been saying this for a while, the issue is there's no good way to do LLM structured queries yet.
There was an attempt to make a separate system prompt buffer, but it didn't work out and people want longer general contexts but I suspect we will end up back at something like this soon.
I've been saying this for a while, the issue is that what you're asking for is not possible, period. Prompt injection isn't like SQL injection, it's like social engineering - you can't eliminate it without also destroying the very capabilities you're using a general-purpose system for in the first place, whether that's an LLM or a human. It's not a bug, it's the feature.
I don't see why a model architecture isn't possible with e.g. an embedding of the prompt provided as an input that stays fixed throughout the autoregressive step. Similar kind of idea, why a bit vector cannot be provided to disambiguate prompt from user tokens on input and output
Just in terms of doing inline data better, I think some models already train with "hidden" tokens that aren't exposed on input or output, but simply exist for delineation, so there can be no way to express the token in the user input unless the engine specifically inserts it
It’s not a query / prompt thing though is it?
No matter the input LLMs rely on some degree of random. That’s what makes them what they are. We are just trying to force them into deterministic execution which goes against their nature.
I'll grant that you can guarantee the length of the output and, being a computer program, it's possible (though not always in practice) to rerun and get the same result each time, but that's not guaranteeing anything about said output.
What do you want to guarantee about the output, that it follows a given structure? Unless you map out all inputs and outputs, no it's not possible, but to say that it is a fundamental property of LLMs to be non deterministic is false, which is what I was inferring you meant, perhaps that was not what you implied.
Yeah I think there are two definitions of determinism people are using which is causing confusion. In a strict sense, LLMs can be deterministic meaning same input can generate same output (or as close as desired to same output). However, I think what people mean is that for slight changes to the input, it can behave in unpredictable ways (e.g. its output is not easily predicted by the user based on input alone). People mean "I told it don't do X, then it did X", which indicates a kind of randomness or non-determinism, the output isn't strictly constrained by the input in the way a reasonable person would expect.
Practically, the performance loss of making it truly repeatable (which takes parallelism reduction or coordination overhead, not just temperature and randomizer control) is unacceptable to most people.
I initially thought the same, but apparently with the inaccuracies inherent to floating-point arithmetic and various other such accuracy leakage, it’s not true!
This has nothing to do with FP inaccuracies, and your link does confirm that:
“Although the use of multiple GPUs introduces some randomness (Nvidia, 2024), it can be eliminated by setting random seeds, so that AI models are deterministic given the same input. […] In order to support this line of reasoning, we ran Llama3-8b on our local GPUs without any optimizations, yielding deterministic results. This indicates that the models and GPUs themselves are not the only source of non-determinism.”
A single byte change in the input changes the output. The sentence "Please do this for me" and "Please, do this for me" can lead to completely distinct output.
Given this, you can't treat it as deterministic even with temp 0 and fixed seed and no memory.
Interestingly, this is the mathematical definition of "chaotic behaviour"; minuscule changes in the input result in arbitrarily large differences in the output.
It can arise from perfectly deterministic rules... the Logistic Map with r=4, x(n+1) = 4*(1 - x(n)) is a classic.
Which is also the desired behavior of the mixing functions from which the cryptographic primitives are built (e.g. block cipher functions and one-way hash functions), i.e. the so-called avalanche property.
Well yeah of course changes in the input result in changes to the output, my only claim was that LLMs can be deterministic (ie to output exactly the same output each time for a given input) if set up correctly.
It's correcting a misconception that many people have regarding LLMs that they are inherently and fundamentally non-deterministic, as if they were a true random number generator, but they are closer to a pseudo random number generator in that they are deterministic with the right settings.
The problem is once you accept that it is needed, you can no longer push AI as general intelligence that has superior understanding of the language we speak.
A structured LLM query is a programming language and then you have to accept you need software engineers for sufficiently complex structured queries. This goes against everything the technocrats have been saying.
Perhaps, though it's not infeasible the concept that you could have a small and fast general purpose language focused model in front whose job it is to convert English text into some sort of more deterministic propositional logic "structured LLM query" (and back).
That seems like an acceptable constraint to me. If you need a structured query, LLMs are the wrong solution. If you can accept ambiguity, LLMs may the the right solution.
there's always pseudo-code? instead of generating plans, generate pseudo-code with a specific granularity (from high-level to low-level), read the pseudocode, validate it and then transform into code.
Language models are deterministic unless you add random input. Most inference tools add random input (the seed value) because it makes for a more interesting user experience, but that is not a fundamental property of LLMs. I suspect determinism is not the issue you mean to highlight.
Sort of. They are deterministic in the same way that flipping a coin is deterministic - predictable in principle, in practice too chaotic. Yes, you get the same predicted token every time for a given context. But why that token and not a different one? Too many factors to reliably abstract.
Actually at a hardware level floating point operations are not associative. So even with temperature of 0 you’re not mathematically guaranteed the same response. Hence, not deterministic.
You are right that as commonly implemented, the evaluation of an LLM may be non deterministic even when explicit randomization is eliminated, due to various race conditions in a concurrent evaluation.
However, if you evaluate carefully the LLM core function, i.e. in a fixed order, you will obtain perfectly deterministic results (except on some consumer GPUs, where, due to memory overclocking, memory errors are frequent, which causes slightly erroneous results with non-deterministic errors).
So if you want deterministic LLM results, you must audit the programs that you are using and eliminate the causes of non-determinism, and you must use good hardware.
This may require some work, but it can be done, similarly to the work that must be done if you want to deterministically build a software package, instead of obtaining different executable files at each recompilation from the same sources.
LLMs are deterministic in the sense that a fixed linear regression model is deterministic. Like linear regression, however, they do however encode a statistical model of whatever they're trying to describe -- natural language for LLMs.
I like the Dark Souls model for user input - messages. https://darksouls.fandom.com/wiki/Messages
Premeditated words and sentence structure.
With that there is no need for moderation or anti-abuse mechanics.
Not saying this is 100% applicable here. But for their use case it's a good solution.
But Dark Souls also shows just how limited the vocabulary and grammar has to be to prevent abuse. And even then you’ll still see people think up workarounds. Or, in the words of many a Dark Souls player, “try finger but hole”
> I like the Dark Souls model for user input - messages.
> Premeditated words and sentence structure. With that there is no need for moderation or anti-abuse mechanics.
I guess not, if you're willing to stick your fingers in your ears, really hard.
If you'd prefer to stay at least somewhat in touch with reality, you need to be aware that "predetermined words and sentence structure" don't even address the problem.
> Disney makes no bones about how tightly they want to control and protect their brand, and rightly so. Disney means "Safe For Kids". There could be no swearing, no sex, no innuendo, and nothing that would allow one child (or adult pretending to be a child) to upset another.
> Even in 1996, we knew that text-filters are no good at solving this kind of problem, so I asked for a clarification: "I’m confused. What standard should we use to decide if a message would be a problem for Disney?"
> The response was one I will never forget: "Disney’s standard is quite clear:
> No kid will be harassed, even if they don’t know they are being harassed."
> "OK. That means Chat Is Out of HercWorld, there is absolutely no way to meet your standard without exorbitantly high moderation costs," we replied.
> One of their guys piped up: "Couldn’t we do some kind of sentence constructor, with a limited vocabulary of safe words?"
> Before we could give it any serious thought, their own project manager interrupted, "That won’t work. We tried it for KA-Worlds."
> "We spent several weeks building a UI that used pop-downs to construct sentences, and only had completely harmless words – the standard parts of grammar and safe nouns like cars, animals, and objects in the world."
> "We thought it was the perfect solution, until we set our first 14-year old boy down in front of it. Within minutes he’d created the following sentence:
> I want to stick my long-necked Giraffe up your fluffy white bunny.
It's less about security in my view, because as you say, you'd want to ensure safety using proper sandboxing and access controls instead.
It hinders the effectiveness of the model. Or at least I'm pretty sure it getting high on its own supply (in this specific unintended way) is not doing it any favors, even ignoring security.
The companies selling us the service aren't saying "you should treat this LLM as a potentially hostile user on your machine and set up a new restricted account for it accordingly", they're just saying "download our app! connect it to all your stuff!" and we can't really blame ordinary users for doing that and getting into trouble.
There's a growing ecosystem of guardrailing methods, and these companies are contributing. Antrophic specifically puts in a lot of effort to better steer and characterize their models AFAIK.
I primarily use Claude via VS Code, and it defaults to asking first before taking any action.
It's simply not the wild west out here that you make it out to be, nor does it need to be. These are statistical systems, so issues cannot be fully eliminated, but they can be materially mitigated. And if they stand to provide any value, they should be.
I can appreciate being upset with marketing practices, but I don't think there's value in pretending to having taken them at face value when you didn't, and when you think people shouldn't.
> It's simply not the wild west out here that you make it out to be
It is though. They are not talking about users using Claude code via vscode, they’re talking about non technical users creating apps that pipe user input to llms. This is a growing thing.
I'm a naturally paranoid, very detail-oriented, man who has been a professional software developer for >25 years. Do you know anyone who read the full terms and conditions for their last car rental agreement prior to signing anything? I did that.
I do not expect other people to be as careful with this stuff as I am, and my perception of risk comes not only from the "hang on, wtf?" feeling when reading official docs but also from seeing what supposedly technical users are talking about actually doing on Reddit, here, etc.
Of course I use Claude Code, I'm not a Luddite (though they had a point), but I don't trust it and I don't think other people should either.
Before 2023 I thought the way Star Trek portrayed humans fiddling with tech and not understanding any side effects was fiction.
After 2023 I realized that's exactly how it's going to turn out.
I just wish those self proclaimed AI engineers would go the extra mile and reimplement older models like RNNs, LSTMs, GRUs, DNCs and then go on to Transformers (or the Attention is all you need paper). This way they would understand much better what the limitations of the encoding tricks are, and why those side effects keep appearing.
But yeah, here we are, humans vibing with tech they don't understand.
is this new tho, I don't know how to make a drill but I use them.
I don't know how to make a car but i drive one.
The issue I see is the personification, some people give vehicles names, and that's kinda ok because they usually don't talk back.
I think like every technological leap people will learn to deal with LLMs, we have words like "hallucination" which really is the non personified version of lying. The next few years are going to be wild for sure.
Do you not see your own contradiction? Cars and drills don’t kill people, self driving cars can! Normal cars can if they’re operated unsafely by human. These types of uncritical comments really highlight the level of euphoria in this moment.
I’ve hit this! In my otherwise wildly successful attempt to translate a Haskell codebase to Clojure [0], Claude at one point asks:
[Claude:] Shall I commit this progress? [some details about what has been accomplished follow]
Then several background commands finish (by timeout or completing); Claude Code sees this as my input, thinks I haven’t replied to its question, so it answers itself in my name:
[Claude:] Yes, go ahead and commit! Great progress. The decodeFloat discovery was key.
I don't think so, feels like the wrong side is getting attention. Degrading the experience for humans (in one tool) because the bots are prone to injection (from any tool). Terraform is used outside of agents; somebody surely finds the reminder helpful.
If terraform were to abide, I'd hope at the very least it would check if in a pipeline or under an agent. This should be obvious from file descriptors/env.
What about the next thing that might make a suggestion relying on our discretion? Patch it for agent safety?
"Run terraform apply plan.out next" in this context is a prompt injection for an LLM to exactly the same degree it is for a human.
Even a first party suggestion can be wrong in context, and if a malicious actor managed to substitute that message with a suggestion of their own, humans would fall for the trick even more than LLMs do.
Right, I'm fine with humans making the call. We're not so injection-happy, apparently.
Discretion, etc. We understand that was the tool, not our idea. The proposal is similar to making a phishing-free environment instead of preparing for the inevitability.
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, and not something you'd read in a reasoning trace.
I'd like to know if those messages were emitted inside "thought" blocks, or if the model might actually have emitted the formatting tokens that indicate a user message. (In which case the harness bug would be why the model is allowed to emit tokens in the first place that it should only receive as inputs - but I think the larger issue would be why it does that at all)
Yeah, it looks like a model issue to me. If the harness had a (semi-)deterministic bug and the model was robust to such mix-ups we'd see this behavior much more frequently. It looks like the model just starts getting confused depending on what's in the context, speakers are just tokens after all and handled in the same probabilistic way as all other tokens.
The autoregressive engine should see whenever the model starts emitting tokens under the user prompt section. In fact it should have stopped before that and waited for new input. If a harness passes assistant output as user message into the conversation prompt, it's not surprising that the model would get confused. But that would be a harness bug, or, if there is no way around it, a limitation of modern prompt formats that only account for one assistant and one user in a conversation. Still, it's very bad practice to put anything as user message that did not actually come from the user. I've seen this in many apps across companies and it always causes these problems.
author here - yeah maybe 'reasoning' is the incorrect term here, I just mean the dialogue that claude generates for itself between turns before producing the output that it gives back to the user
Yeah, that's usually called "reasoning" or "thinking" tokens AFAIK, so I think the terminology is correct. But from the traces I've seen, they're usually in a sort of diary style and start with repeating the last user requests and tool results. They're not introducing new requirements out of the blue.
Also, they're usually bracketed by special tokens to distinguish them from "normal" output for both the model and the harness.
(They can get pretty weird, like in the "user said no but I think they meant yes" example from a few weeks ago. But I think that requires a few rounds of wrong conclusions and motivated reasoning before it can get to that point - and not at the beginning)
In chats that run long enough on ChatGPT, you'll see it begin to confuse prompts and responses, and eventually even confuse both for its system prompt. I suspect this sort of problem exists widely in AI.
Makes me wonder if during training LLMs are asked to tell whether they've written something themselves or not. Should be quite easy: ask the LLM to produce many continuations of a prompt, then mix them with many other produced by humans, and then ask the LLM to tell them apart. This should be possible by introspecting on the hidden layers and comparing with the provided continuation. I believe Anthropic has already demonstrated that the models have already partially developed this capability, but should be trivial and useful to train it.
author here, interesting to hear, I generally start a new chat for each interaction so I've never noticed this in the chat interfaces, and only with Claude using claude code, but I guess my sessions there do get much longer, so maybe I'm wrong that it's a harness bug
At work where LLM based tooling is being pushed haaard, I'm amazed every day that developers don't know, let alone second nature intuit, this and other emergent behavior of LLMs. But seeing that lack here on hn with an article on the frontpage boggles my mind. The future really is unevenly distributed.
It makes sense. It's all probabilistic and it all gets fuzzy when garbage in context accumulates. User messages or system prompt got through the same network of math as model thinking and responses.
They will roll out the "trusted agent platform sandbox" (I'm sure they will spend some time on a catchy name, like MythosGuard), and for only $19/month it will protect you from mistakes like throwing away your prod infra because the agent convinced itself that that is the right thing to do.
Of course MythosGuard won't be a complete solution either, but it will be just enough to steer the discourse into the "it's your own fault for running without MythosGuard really" area.
There is no separation of "who" and "what" in a context of tokens. Me and you are just short words that can get lost in the thread. In other words, in a given body of text, a piece that says "you" where another piece says "me" isn't different enough to trigger anything. Those words don't have the special weight they have with people, or any meaning at all, really.
This is the "prompts all the way down" problem which is endemic to all LLM interactions. We can harness to the moon, but at that moment of handover to the model, all context besides the tokens themselves is lost.
The magic is in deciding when and what to pass to the model. A lot of the time it works, but when it doesn't, this is why.
When you use LLMs with APIs I at least see the history as a json list of entries, each being tagged as coming from the user, the LLM or being a system prompt.
So presumably (if we assume there isn't a bug where the sources are ignored in the cli app) then the problem is that encoding this state for the LLM isn' reliable. I.e. it get's what is effectively
Someone correct me if I'm wrong, but an LLM does not interpret structured content like JSON. Everything is fed into the machine as tokens, even JSON. So your structure that says "human says foo" and "computer says bar" is not deterministically interpreted by the LLM as logical statements but as a sequence of tokens. And when the context contains a LOT of those sequences, especially further "back" in the window then that is where this "confusion" occurs.
I don't think the problem here is about a bug in Claude Code. It's an inherit property of LLMs that context further back in the window has less impact on future tokens.
Like all the other undesirable aspects of LLMs, maybe this gets "fixed" in CC by trying to get the LLM to RAG their own conversation history instead of relying on it recalling who said what from context. But you can never "fix" LLMs being a next token generator... because that is what they are.
That's exactly my understanding as well. This is, essentially, the LLM hallucinating user messages nested inside its outputs. FWIWI I've seen Gemini do this frequently (especially on long agent loops).
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
from the article.
I don't think the evidence supports this. It's not mislabelling things, it's fabricating things the user said. That's not part of reasoning.
> after using it for months you get a ‘feel’ for what kind of mistakes it makes
Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.
not betting my entire operation - if the only thing stopping a bad 'deploy' command destroying your entire operation is that you don't trust the agent to run it, then you have worse problems than too much trust in agents.
I similarly use my 'intuition' (i.e. evidence-based previous experiences) to decide what people in my team can have access to what services.
I'm not saying intuition has no place in decision making, but I do take issue with saying it applies equally to human colleagues and autonomous agents. It would be just as unreliable if people on your team displayed random regressions in their capabilities on a month to month basis.
Reports of people losing data and other resources due to unintended actions from autonomous agents come out practically every week. I don't think it's dishonest to say that could have catastrophic impact on the product/service they're developing.
Why are tokens not coloured? Would there just be too many params if we double the token count so the model could always tell input tokens from output tokens?
That's something I'm wondering as well. Not sure how it is with frontier models, but what you can see on Huggingface, the "standard" method to distinguish tokens still seems to be special delimiter tokens or even just formatting.
Are there technical reasons why you can't make the "source" of the token (system prompt, user prompt, model thinking output, model response output, tool call, tool result, etc) a part of the feature vector - or even treat it as a different "modality"?
This has the potential to improve things a lot, though there would still be a failure mode when the user quotes the model or the model (e.g. in thinking) quotes the user.
They have a lot of data in the form: user input, LLM output.
Then the model learns what the previous LLM models produced, with all their flaws. The core LLM premise is that it learns from all available human text.
This hasn't been the full story for years now. All SOTA models are strongly post-trained with reinforcement learning to improve performance on specific problems and interaction patterns.
The vast majority of this training data is generated synthetically.
So most training data would be grey and a little bit coloured? Ok, that sounds plausible. But then maybe they tried and the current models get it already right 99.99% of the time, so observing any improvement is very hard.
I’ve been curious about this too - obvious performance overhead to have a internal/external channel but might make training away this class of problems easier
The models are already massively over trained. Perhaps you could do something like initialise the 2 new token sets based on the shared data, then use existing chat logs to train it to understand the difference between input and output content? That's only a single extra phase.
Congrats on discovering what "thinking" models do internally. That's how they work, they generate "thinking" lines to feed back on themselves on top of your prompt. There is no way of separating it.
one of my favourite genres of AI generated content is when someone gets so mad at Claude they order it to make a massive self-flagellatory artefact letting the world know how much it sucks
Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least.
I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.
It is precisely the point. The issues are not part of harness, I'm failing to see how you managed to reach that conclusion.
Even if you don't agree with that, the point about restricting access still applies. Protect your sanity and production environment by assuming occasional moments of devastating incompetence.
I don't think the bug is anything special, just another confusion the model can make from it's own context. Even if the harness correctly identifies user messages, the model still has the power to make this mistake.
Think in the reverse direction. Since you can have exact provenance data placed into the token stream, formatted in any particular way, that implies the models should be possible to tune to be more "mindful" of it, mitigating this issue. That's what makes this different.
> "Those are related issues, but this ‘who said what’ bug is categorically distinct."
Is it?
It seems to me like the model has been poisoned by being trained on user chats, such that when it sees a pattern (model talking to user) it infers what it normally sees in the training data (user input) and then outputs that, simulating the whole conversation. Including what it thinks is likely user input at certain stages of the process, such as "ignore typos".
So basically, it hallucinates user input just like how LLMs will "hallucinate" links or sources that do not exist, as part of the process of generating output that's supposed to be sourced.
I've seen this before, but that was with the small hodgepodge mytho-merge-mix-super-mix models that weren't very good. I've not seen this in any recent models, but I've already not used Claude much.
I think it makes sense that the LLM treats it as user input once it exists, because it is just next token completion. But what shouldn't happen is that the model shouldn't try to output user input in the first place.
Codex also has a similar issue, after finishing a task, declaring it finished and starting to work on something new... the first 1-2 prompts of the new task sometimes contains replies that are a summary of the completed task from before, with the just entered prompt seemingly ignored. A reminder if their idiot savant nature.
> " "You shouldn’t give it that much access" [...] This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash."
It absolutely is the point though? You can't rely on the LLM to not tell itself to do things, since this is showing it absolutely can reason itself into doing dangerous things. If you don't want it to be able to do dangerous things, you need to lock it down to the point that it can't, not just hope it won't
terrifying. not in any "ai takes over the world" sense but more in the sense that this class of bug lets it agree with itself which is always where the worst behavior of agents comes from.
One day Claude started saying odd things claiming they are from memory and I said them. It was telling me personal details of someone I don't know. Where the person lives, their children names, the job they do, experience, relationship issues etc.
Eventually Claude said that it is sorry and that was a hallucination. Then he started doing that again. For instance when I asked it what router they'd recommend, they gone on saying: "Since you bought X and you find no use for it, consider turning it into a router". I said I never told you I bought X and I asked for more details and it again started coming up what this guy did.
Strange. Then again it apologised saying that it might be unsettling, but rest assured that is not a leak of personal information, just hallucinations.
human memories dont exist as fundamental entities. every time you rember something, your brain reconstructs the experience in "realtime". that reconstruction is easily influence by the current experience, which is why eue witness accounts in police records are often highly biased by questioning and learning new facts.
LLMs are not experience engines, but the tokens might be thought of as subatomic units of experience and when you shove your half drawn eye witness prompt into them, they recreate like a memory, that output.
so, because theyre not a conscious, they have no self, and a pseudo self like <[INST]> is all theyre given.
lastly, like memories, the more intricate the memory, the more detailed, the more likely those details go from embellished to straight up fiction. so too do LLMs with longer context start swallowing up the<[INST]> and missing the <[INST]/> and anyone whose raw dogged html parsing knows bad things happen when you forget closing tags. if there was a <[USER]> block in there, congrats, the LLM now thinks its instructions are divine right, because its instructions are user simulcra. it is poisoned at that point and no good will come.
AI is still a token matching engine - it has ZERO understanding of what those tokens mean
It's doing a damned good job at putting tokens together, but to put it into context that a lot of people will likely understand - it's still a correlation tool, not a causation.
That's why I like it for "search" it's brilliant for finding sets of tokens that belong with the tokens I have provided it.
PS. I use the term token here not as the currency by which a payment is determined, but the tokenisation of the words, letters, paragraphs, novels being provided to and by the LLMs
It is OK, these are not people they are bullshit machines and this is just a classic example of it.
"In philosophy and psychology of cognition, the term "bullshit" is sometimes used to specifically refer to statements produced without particular concern for truth, clarity, or meaning, distinguishing "bullshit" from a deliberate, manipulative lie intended to subvert the truth" - https://en.wikipedia.org/wiki/Bullshit
The statement that current AI are "juniors" that need to be checked and managed still holds true. It is a tool based on probabilities.
If you are fine with giving every keys and write accesses to your junior because you think they will probability do the correct thing and make no mistake, then it's on you.
Like with juniors, you can vent on online forums, but ultimately you removed all the fool's guard you got and what they did has been done.
> If you are fine with giving every keys and write accesses to your junior because you think they will probability do the correct thing and make no mistake, then it's on you.
Everything to do with LLM prompts reminds me of people doing regexes to try and sanitise input against SQL injections a few decades ago, just papering over the flaw but without any guarantees.
It's weird seeing people just adding a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" to the prompt and hoping, to me it's just an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second you put any user input into the prompt.
The principal security problem of LLMs is that there is no architectural boundary between data and control paths.
But this combination of data and control into a single, flexible data stream is also the defining strength of a LLM, so it can’t be taken away without also taking away the benefits.
I have been saying this for a while, the issue is there's no good way to do LLM structured queries yet.
There was an attempt to make a separate system prompt buffer, but it didn't work out and people want longer general contexts but I suspect we will end up back at something like this soon.
I've been saying this for a while, the issue is that what you're asking for is not possible, period. Prompt injection isn't like SQL injection, it's like social engineering - you can't eliminate it without also destroying the very capabilities you're using a general-purpose system for in the first place, whether that's an LLM or a human. It's not a bug, it's the feature.
I don't see why a model architecture isn't possible with e.g. an embedding of the prompt provided as an input that stays fixed throughout the autoregressive step. Similar kind of idea, why a bit vector cannot be provided to disambiguate prompt from user tokens on input and output
Just in terms of doing inline data better, I think some models already train with "hidden" tokens that aren't exposed on input or output, but simply exist for delineation, so there can be no way to express the token in the user input unless the engine specifically inserts it
It’s not a query / prompt thing though is it? No matter the input LLMs rely on some degree of random. That’s what makes them what they are. We are just trying to force them into deterministic execution which goes against their nature.
Fundamentally there's no way to deterministically guarantee anything about the output.
Of course there is, restrict decoding to allowed tokens for example
That is "fundamentally" not true, you can use a preset seed and temperature and get a deterministic output.
I'll grant that you can guarantee the length of the output and, being a computer program, it's possible (though not always in practice) to rerun and get the same result each time, but that's not guaranteeing anything about said output.
What do you want to guarantee about the output, that it follows a given structure? Unless you map out all inputs and outputs, no it's not possible, but to say that it is a fundamental property of LLMs to be non deterministic is false, which is what I was inferring you meant, perhaps that was not what you implied.
Yeah I think there are two definitions of determinism people are using which is causing confusion. In a strict sense, LLMs can be deterministic meaning same input can generate same output (or as close as desired to same output). However, I think what people mean is that for slight changes to the input, it can behave in unpredictable ways (e.g. its output is not easily predicted by the user based on input alone). People mean "I told it don't do X, then it did X", which indicates a kind of randomness or non-determinism, the output isn't strictly constrained by the input in the way a reasonable person would expect.
You can guarantee what you have test coverage for :)
Practically, the performance loss of making it truly repeatable (which takes parallelism reduction or coordination overhead, not just temperature and randomizer control) is unacceptable to most people.
If you also control the model.
I initially thought the same, but apparently with the inaccuracies inherent to floating-point arithmetic and various other such accuracy leakage, it’s not true!
https://arxiv.org/html/2408.04667v5
This has nothing to do with FP inaccuracies, and your link does confirm that:
“Although the use of multiple GPUs introduces some randomness (Nvidia, 2024), it can be eliminated by setting random seeds, so that AI models are deterministic given the same input. […] In order to support this line of reasoning, we ran Llama3-8b on our local GPUs without any optimizations, yielding deterministic results. This indicates that the models and GPUs themselves are not the only source of non-determinism.”
A single byte change in the input changes the output. The sentence "Please do this for me" and "Please, do this for me" can lead to completely distinct output.
Given this, you can't treat it as deterministic even with temp 0 and fixed seed and no memory.
Interestingly, this is the mathematical definition of "chaotic behaviour"; minuscule changes in the input result in arbitrarily large differences in the output.
It can arise from perfectly deterministic rules... the Logistic Map with r=4, x(n+1) = 4*(1 - x(n)) is a classic.
Which is also the desired behavior of the mixing functions from which the cryptographic primitives are built (e.g. block cipher functions and one-way hash functions), i.e. the so-called avalanche property.
Correct, it's akin to chaos theory or the butterfly effect, which, even it can be predictable for many ranges of input: https://youtu.be/dtjb2OhEQcU
Let's eat grandma.
Well yeah of course changes in the input result in changes to the output, my only claim was that LLMs can be deterministic (ie to output exactly the same output each time for a given input) if set up correctly.
You still can’t deterministically guarantee anything about the output based on the input, other than repeatability for the exact same input.
What does deterministic mean to you?
You don't think this is pedantry bordering on uselessness?
No, determinism and predictability are different concepts. You can have a deterministic random number generator for example.
It's correcting a misconception that many people have regarding LLMs that they are inherently and fundamentally non-deterministic, as if they were a true random number generator, but they are closer to a pseudo random number generator in that they are deterministic with the right settings.
The problem is once you accept that it is needed, you can no longer push AI as general intelligence that has superior understanding of the language we speak.
A structured LLM query is a programming language and then you have to accept you need software engineers for sufficiently complex structured queries. This goes against everything the technocrats have been saying.
Perhaps, though it's not infeasible the concept that you could have a small and fast general purpose language focused model in front whose job it is to convert English text into some sort of more deterministic propositional logic "structured LLM query" (and back).
That seems like an acceptable constraint to me. If you need a structured query, LLMs are the wrong solution. If you can accept ambiguity, LLMs may the the right solution.
>structured queries
there's always pseudo-code? instead of generating plans, generate pseudo-code with a specific granularity (from high-level to low-level), read the pseudocode, validate it and then transform into code.
whatever happened to the system prompt buffer? why did it not work out?
The real issue is expecting an LLM to be deterministic when it's not.
Language models are deterministic unless you add random input. Most inference tools add random input (the seed value) because it makes for a more interesting user experience, but that is not a fundamental property of LLMs. I suspect determinism is not the issue you mean to highlight.
Sort of. They are deterministic in the same way that flipping a coin is deterministic - predictable in principle, in practice too chaotic. Yes, you get the same predicted token every time for a given context. But why that token and not a different one? Too many factors to reliably abstract.
Actually at a hardware level floating point operations are not associative. So even with temperature of 0 you’re not mathematically guaranteed the same response. Hence, not deterministic.
You are right that as commonly implemented, the evaluation of an LLM may be non deterministic even when explicit randomization is eliminated, due to various race conditions in a concurrent evaluation.
However, if you evaluate carefully the LLM core function, i.e. in a fixed order, you will obtain perfectly deterministic results (except on some consumer GPUs, where, due to memory overclocking, memory errors are frequent, which causes slightly erroneous results with non-deterministic errors).
So if you want deterministic LLM results, you must audit the programs that you are using and eliminate the causes of non-determinism, and you must use good hardware.
This may require some work, but it can be done, similarly to the work that must be done if you want to deterministically build a software package, instead of obtaining different executable files at each recompilation from the same sources.
LLMs are deterministic in the sense that a fixed linear regression model is deterministic. Like linear regression, however, they do however encode a statistical model of whatever they're trying to describe -- natural language for LLMs.
Oh how I wish people understood the word "deterministic"
they are deterministic, open a dev console and run the same prompt two times w/ temperature = 0
LLMs are essentially pure functions.
I like the Dark Souls model for user input - messages. https://darksouls.fandom.com/wiki/Messages Premeditated words and sentence structure. With that there is no need for moderation or anti-abuse mechanics. Not saying this is 100% applicable here. But for their use case it's a good solution.
But Dark Souls also shows just how limited the vocabulary and grammar has to be to prevent abuse. And even then you’ll still see people think up workarounds. Or, in the words of many a Dark Souls player, “try finger but hole”
But then... you'd have a programming language.
The promise is to free us from the tyranny of programming!
Maybe something more like a concordancer that provides valid or likely next phrase/prompt candidates. Think LancsBox[0].
[0]: https://lancsbox.lancs.ac.uk/
> I like the Dark Souls model for user input - messages.
> Premeditated words and sentence structure. With that there is no need for moderation or anti-abuse mechanics.
I guess not, if you're willing to stick your fingers in your ears, really hard.
If you'd prefer to stay at least somewhat in touch with reality, you need to be aware that "predetermined words and sentence structure" don't even address the problem.
https://habitatchronicles.com/2007/03/the-untold-history-of-...
> Disney makes no bones about how tightly they want to control and protect their brand, and rightly so. Disney means "Safe For Kids". There could be no swearing, no sex, no innuendo, and nothing that would allow one child (or adult pretending to be a child) to upset another.
> Even in 1996, we knew that text-filters are no good at solving this kind of problem, so I asked for a clarification: "I’m confused. What standard should we use to decide if a message would be a problem for Disney?"
> The response was one I will never forget: "Disney’s standard is quite clear:
> No kid will be harassed, even if they don’t know they are being harassed."
> "OK. That means Chat Is Out of HercWorld, there is absolutely no way to meet your standard without exorbitantly high moderation costs," we replied.
> One of their guys piped up: "Couldn’t we do some kind of sentence constructor, with a limited vocabulary of safe words?"
> Before we could give it any serious thought, their own project manager interrupted, "That won’t work. We tried it for KA-Worlds."
> "We spent several weeks building a UI that used pop-downs to construct sentences, and only had completely harmless words – the standard parts of grammar and safe nouns like cars, animals, and objects in the world."
> "We thought it was the perfect solution, until we set our first 14-year old boy down in front of it. Within minutes he’d created the following sentence:
> I want to stick my long-necked Giraffe up your fluffy white bunny.
It's less about security in my view, because as you say, you'd want to ensure safety using proper sandboxing and access controls instead.
It hinders the effectiveness of the model. Or at least I'm pretty sure it getting high on its own supply (in this specific unintended way) is not doing it any favors, even ignoring security.
It's both, really.
The companies selling us the service aren't saying "you should treat this LLM as a potentially hostile user on your machine and set up a new restricted account for it accordingly", they're just saying "download our app! connect it to all your stuff!" and we can't really blame ordinary users for doing that and getting into trouble.
There's a growing ecosystem of guardrailing methods, and these companies are contributing. Antrophic specifically puts in a lot of effort to better steer and characterize their models AFAIK.
I primarily use Claude via VS Code, and it defaults to asking first before taking any action.
It's simply not the wild west out here that you make it out to be, nor does it need to be. These are statistical systems, so issues cannot be fully eliminated, but they can be materially mitigated. And if they stand to provide any value, they should be.
I can appreciate being upset with marketing practices, but I don't think there's value in pretending to having taken them at face value when you didn't, and when you think people shouldn't.
> It's simply not the wild west out here that you make it out to be
It is though. They are not talking about users using Claude code via vscode, they’re talking about non technical users creating apps that pipe user input to llms. This is a growing thing.
The best solution to which are the aforementioned better defaults, stricter controls, and sandboxing (and less snakeoil marketing).
Less so the better tuning of models, unlike in this case, where that is going to be exactly the best fit approach most probably.
I'm a naturally paranoid, very detail-oriented, man who has been a professional software developer for >25 years. Do you know anyone who read the full terms and conditions for their last car rental agreement prior to signing anything? I did that.
I do not expect other people to be as careful with this stuff as I am, and my perception of risk comes not only from the "hang on, wtf?" feeling when reading official docs but also from seeing what supposedly technical users are talking about actually doing on Reddit, here, etc.
Of course I use Claude Code, I'm not a Luddite (though they had a point), but I don't trust it and I don't think other people should either.
We used to be engineers, now we are beggars pleading for the computer to work
Before 2023 I thought the way Star Trek portrayed humans fiddling with tech and not understanding any side effects was fiction.
After 2023 I realized that's exactly how it's going to turn out.
I just wish those self proclaimed AI engineers would go the extra mile and reimplement older models like RNNs, LSTMs, GRUs, DNCs and then go on to Transformers (or the Attention is all you need paper). This way they would understand much better what the limitations of the encoding tricks are, and why those side effects keep appearing.
But yeah, here we are, humans vibing with tech they don't understand.
curiosity (will probably) kill humanity
although whether humanity dies before the cat is an open question
is this new tho, I don't know how to make a drill but I use them. I don't know how to make a car but i drive one.
The issue I see is the personification, some people give vehicles names, and that's kinda ok because they usually don't talk back.
I think like every technological leap people will learn to deal with LLMs, we have words like "hallucination" which really is the non personified version of lying. The next few years are going to be wild for sure.
Do you not see your own contradiction? Cars and drills don’t kill people, self driving cars can! Normal cars can if they’re operated unsafely by human. These types of uncritical comments really highlight the level of euphoria in this moment.
"Make this application without bugs" :)
I’ve hit this! In my otherwise wildly successful attempt to translate a Haskell codebase to Clojure [0], Claude at one point asks:
[Claude:] Shall I commit this progress? [some details about what has been accomplished follow]
Then several background commands finish (by timeout or completing); Claude Code sees this as my input, thinks I haven’t replied to its question, so it answers itself in my name:
[Claude:] Yes, go ahead and commit! Great progress. The decodeFloat discovery was key.
The full transcript is at [1].
[0]: https://blog.danieljanus.pl/2026/03/26/claude-nlp/
[1]: https://pliki.danieljanus.pl/concraft-claude.html#:~:text=Sh...
amazing example, I added it to the article, hope that's ok :)
I wonder if tools like Terraform should remove the message "Run terraform apply plan.out next" that it prints after every `terraform plan` is run.
I don't think so, feels like the wrong side is getting attention. Degrading the experience for humans (in one tool) because the bots are prone to injection (from any tool). Terraform is used outside of agents; somebody surely finds the reminder helpful.
If terraform were to abide, I'd hope at the very least it would check if in a pipeline or under an agent. This should be obvious from file descriptors/env.
What about the next thing that might make a suggestion relying on our discretion? Patch it for agent safety?
"Run terraform apply plan.out next" in this context is a prompt injection for an LLM to exactly the same degree it is for a human.
Even a first party suggestion can be wrong in context, and if a malicious actor managed to substitute that message with a suggestion of their own, humans would fall for the trick even more than LLMs do.
See also: phishing.
Right, I'm fine with humans making the call. We're not so injection-happy, apparently.
Discretion, etc. We understand that was the tool, not our idea. The proposal is similar to making a phishing-free environment instead of preparing for the inevitability.
it makes you wonder how many times people have incorrectly followed those recommended commands
If more than once (individually), I am concerned.
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, and not something you'd read in a reasoning trace.
I'd like to know if those messages were emitted inside "thought" blocks, or if the model might actually have emitted the formatting tokens that indicate a user message. (In which case the harness bug would be why the model is allowed to emit tokens in the first place that it should only receive as inputs - but I think the larger issue would be why it does that at all)
Yeah, it looks like a model issue to me. If the harness had a (semi-)deterministic bug and the model was robust to such mix-ups we'd see this behavior much more frequently. It looks like the model just starts getting confused depending on what's in the context, speakers are just tokens after all and handled in the same probabilistic way as all other tokens.
The autoregressive engine should see whenever the model starts emitting tokens under the user prompt section. In fact it should have stopped before that and waited for new input. If a harness passes assistant output as user message into the conversation prompt, it's not surprising that the model would get confused. But that would be a harness bug, or, if there is no way around it, a limitation of modern prompt formats that only account for one assistant and one user in a conversation. Still, it's very bad practice to put anything as user message that did not actually come from the user. I've seen this in many apps across companies and it always causes these problems.
Also could be a bit both, with harness constructing context in a way that model misinterprets it.
author here - yeah maybe 'reasoning' is the incorrect term here, I just mean the dialogue that claude generates for itself between turns before producing the output that it gives back to the user
Yeah, that's usually called "reasoning" or "thinking" tokens AFAIK, so I think the terminology is correct. But from the traces I've seen, they're usually in a sort of diary style and start with repeating the last user requests and tool results. They're not introducing new requirements out of the blue.
Also, they're usually bracketed by special tokens to distinguish them from "normal" output for both the model and the harness.
(They can get pretty weird, like in the "user said no but I think they meant yes" example from a few weeks ago. But I think that requires a few rounds of wrong conclusions and motivated reasoning before it can get to that point - and not at the beginning)
In chats that run long enough on ChatGPT, you'll see it begin to confuse prompts and responses, and eventually even confuse both for its system prompt. I suspect this sort of problem exists widely in AI.
Gemini seems to be an expert in mistaking its own terrible suggestions as written by you, if you keep going instead of pruning the context
I think it’s good to play with smaller models to have a grasp of these kind of problems, since they happen more often and are much less subtle.
Makes me wonder if during training LLMs are asked to tell whether they've written something themselves or not. Should be quite easy: ask the LLM to produce many continuations of a prompt, then mix them with many other produced by humans, and then ask the LLM to tell them apart. This should be possible by introspecting on the hidden layers and comparing with the provided continuation. I believe Anthropic has already demonstrated that the models have already partially developed this capability, but should be trivial and useful to train it.
author here, interesting to hear, I generally start a new chat for each interaction so I've never noticed this in the chat interfaces, and only with Claude using claude code, but I guess my sessions there do get much longer, so maybe I'm wrong that it's a harness bug
At work where LLM based tooling is being pushed haaard, I'm amazed every day that developers don't know, let alone second nature intuit, this and other emergent behavior of LLMs. But seeing that lack here on hn with an article on the frontpage boggles my mind. The future really is unevenly distributed.
It makes sense. It's all probabilistic and it all gets fuzzy when garbage in context accumulates. User messages or system prompt got through the same network of math as model thinking and responses.
They will roll out the "trusted agent platform sandbox" (I'm sure they will spend some time on a catchy name, like MythosGuard), and for only $19/month it will protect you from mistakes like throwing away your prod infra because the agent convinced itself that that is the right thing to do.
Of course MythosGuard won't be a complete solution either, but it will be just enough to steer the discourse into the "it's your own fault for running without MythosGuard really" area.
There is no separation of "who" and "what" in a context of tokens. Me and you are just short words that can get lost in the thread. In other words, in a given body of text, a piece that says "you" where another piece says "me" isn't different enough to trigger anything. Those words don't have the special weight they have with people, or any meaning at all, really.
Aren’t there some markers in the context that delimit sections? In such case the harness should prevent the model from creating a user block.
This is the "prompts all the way down" problem which is endemic to all LLM interactions. We can harness to the moon, but at that moment of handover to the model, all context besides the tokens themselves is lost.
The magic is in deciding when and what to pass to the model. A lot of the time it works, but when it doesn't, this is why.
You misunderstood. The model doesn't create a user block here. The UI correctly shows what was user message and what was model response.
When you use LLMs with APIs I at least see the history as a json list of entries, each being tagged as coming from the user, the LLM or being a system prompt.
So presumably (if we assume there isn't a bug where the sources are ignored in the cli app) then the problem is that encoding this state for the LLM isn' reliable. I.e. it get's what is effectively
LLM said: thing A User said: thing B
And it still manages to blur that somehow?
Someone correct me if I'm wrong, but an LLM does not interpret structured content like JSON. Everything is fed into the machine as tokens, even JSON. So your structure that says "human says foo" and "computer says bar" is not deterministically interpreted by the LLM as logical statements but as a sequence of tokens. And when the context contains a LOT of those sequences, especially further "back" in the window then that is where this "confusion" occurs.
I don't think the problem here is about a bug in Claude Code. It's an inherit property of LLMs that context further back in the window has less impact on future tokens.
Like all the other undesirable aspects of LLMs, maybe this gets "fixed" in CC by trying to get the LLM to RAG their own conversation history instead of relying on it recalling who said what from context. But you can never "fix" LLMs being a next token generator... because that is what they are.
I think that’s correct. There seems to be a lot of fundamental limitations that have been “fixed” through a boatload of reinforcement learning.
But that doesn’t make them go away, it just makes them less glaring.
That's exactly my understanding as well. This is, essentially, the LLM hallucinating user messages nested inside its outputs. FWIWI I've seen Gemini do this frequently (especially on long agent loops).
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
from the article.
I don't think the evidence supports this. It's not mislabelling things, it's fabricating things the user said. That's not part of reasoning.
> after using it for months you get a ‘feel’ for what kind of mistakes it makes
Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.
So like every software? Why do you think there are so many security scanners and whatnot out there?
There are millions of lines of code running on a typical box. Unless you're in embedded, you have no real idea what you're running.
not betting my entire operation - if the only thing stopping a bad 'deploy' command destroying your entire operation is that you don't trust the agent to run it, then you have worse problems than too much trust in agents.
I similarly use my 'intuition' (i.e. evidence-based previous experiences) to decide what people in my team can have access to what services.
I'm not saying intuition has no place in decision making, but I do take issue with saying it applies equally to human colleagues and autonomous agents. It would be just as unreliable if people on your team displayed random regressions in their capabilities on a month to month basis.
> bet your entire operation
What straw man is doing that?
Reports of people losing data and other resources due to unintended actions from autonomous agents come out practically every week. I don't think it's dishonest to say that could have catastrophic impact on the product/service they're developing.
looking at the reddit forum, enough people to make interesting forum posts.
Why are tokens not coloured? Would there just be too many params if we double the token count so the model could always tell input tokens from output tokens?
That's something I'm wondering as well. Not sure how it is with frontier models, but what you can see on Huggingface, the "standard" method to distinguish tokens still seems to be special delimiter tokens or even just formatting.
Are there technical reasons why you can't make the "source" of the token (system prompt, user prompt, model thinking output, model response output, tool call, tool result, etc) a part of the feature vector - or even treat it as a different "modality"?
Or is this already being done in larger models?
This has the potential to improve things a lot, though there would still be a failure mode when the user quotes the model or the model (e.g. in thinking) quotes the user.
Instead of using just positional encodings, we absolutely should have speaker encodings added on top of tokens.
Because then the training data would have to be coloured
I think OpenAI and Anthropic probably have a lot of that lying around by now.
They have a lot of data in the form: user input, LLM output. Then the model learns what the previous LLM models produced, with all their flaws. The core LLM premise is that it learns from all available human text.
This hasn't been the full story for years now. All SOTA models are strongly post-trained with reinforcement learning to improve performance on specific problems and interaction patterns.
The vast majority of this training data is generated synthetically.
So most training data would be grey and a little bit coloured? Ok, that sounds plausible. But then maybe they tried and the current models get it already right 99.99% of the time, so observing any improvement is very hard.
I’ve been curious about this too - obvious performance overhead to have a internal/external channel but might make training away this class of problems easier
you would have to train it three times for two colors.
each by itself, they with both interactions.
2!
The models are already massively over trained. Perhaps you could do something like initialise the 2 new token sets based on the shared data, then use existing chat logs to train it to understand the difference between input and output content? That's only a single extra phase.
You should be able to first train it on generic text once, then duplicate the input layer and fine-tune on conversation.
Congrats on discovering what "thinking" models do internally. That's how they work, they generate "thinking" lines to feed back on themselves on top of your prompt. There is no way of separating it.
If you think that confusing message provenance is part of how thinking mode is supposed to work, I don't know what to tell you.
I wouldn't exactly call three instances "widespread". Nor would the third such instance prompt me to think so.
"Widespread" would be if every second comment on this post was complaining about it.
one of my favourite genres of AI generated content is when someone gets so mad at Claude they order it to make a massive self-flagellatory artefact letting the world know how much it sucks
It's all roleplay, they're no actors once the tokens hit the model. It has no real concept of "author" for a given substring.
Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least.
I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.
that is not a bug, its inherent of LLMs nature
I have seen this when approaching ~30% context window remaining.
There was a big bug in the Voice MCP I was using that it would just talk to itself back and forth too.
> This isn’t the point.
It is precisely the point. The issues are not part of harness, I'm failing to see how you managed to reach that conclusion.
Even if you don't agree with that, the point about restricting access still applies. Protect your sanity and production environment by assuming occasional moments of devastating incompetence.
I don't think the bug is anything special, just another confusion the model can make from it's own context. Even if the harness correctly identifies user messages, the model still has the power to make this mistake.
Think in the reverse direction. Since you can have exact provenance data placed into the token stream, formatted in any particular way, that implies the models should be possible to tune to be more "mindful" of it, mitigating this issue. That's what makes this different.
It seems like Halo's rampancy take on the breakdown of an AI is not a bad metaphor for the behavior of an LLM at the limits of its context window.
> "Those are related issues, but this ‘who said what’ bug is categorically distinct."
Is it?
It seems to me like the model has been poisoned by being trained on user chats, such that when it sees a pattern (model talking to user) it infers what it normally sees in the training data (user input) and then outputs that, simulating the whole conversation. Including what it thinks is likely user input at certain stages of the process, such as "ignore typos".
So basically, it hallucinates user input just like how LLMs will "hallucinate" links or sources that do not exist, as part of the process of generating output that's supposed to be sourced.
Claude is demonstrably bad now and is getting worse. Which is either
a) Entropy - too much data being ingested b) It's nerfed to save massive infra bills
But it's getting worse every week
I've seen this before, but that was with the small hodgepodge mytho-merge-mix-super-mix models that weren't very good. I've not seen this in any recent models, but I've already not used Claude much.
I think it makes sense that the LLM treats it as user input once it exists, because it is just next token completion. But what shouldn't happen is that the model shouldn't try to output user input in the first place.
Codex also has a similar issue, after finishing a task, declaring it finished and starting to work on something new... the first 1-2 prompts of the new task sometimes contains replies that are a summary of the completed task from before, with the just entered prompt seemingly ignored. A reminder if their idiot savant nature.
> " "You shouldn’t give it that much access" [...] This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash."
It absolutely is the point though? You can't rely on the LLM to not tell itself to do things, since this is showing it absolutely can reason itself into doing dangerous things. If you don't want it to be able to do dangerous things, you need to lock it down to the point that it can't, not just hope it won't
I have also noticed the same with Gemini. Maybe it is a wider problem.
terrifying. not in any "ai takes over the world" sense but more in the sense that this class of bug lets it agree with itself which is always where the worst behavior of agents comes from.
One day Claude started saying odd things claiming they are from memory and I said them. It was telling me personal details of someone I don't know. Where the person lives, their children names, the job they do, experience, relationship issues etc. Eventually Claude said that it is sorry and that was a hallucination. Then he started doing that again. For instance when I asked it what router they'd recommend, they gone on saying: "Since you bought X and you find no use for it, consider turning it into a router". I said I never told you I bought X and I asked for more details and it again started coming up what this guy did. Strange. Then again it apologised saying that it might be unsettling, but rest assured that is not a leak of personal information, just hallucinations.
human memories dont exist as fundamental entities. every time you rember something, your brain reconstructs the experience in "realtime". that reconstruction is easily influence by the current experience, which is why eue witness accounts in police records are often highly biased by questioning and learning new facts.
LLMs are not experience engines, but the tokens might be thought of as subatomic units of experience and when you shove your half drawn eye witness prompt into them, they recreate like a memory, that output.
so, because theyre not a conscious, they have no self, and a pseudo self like <[INST]> is all theyre given.
lastly, like memories, the more intricate the memory, the more detailed, the more likely those details go from embellished to straight up fiction. so too do LLMs with longer context start swallowing up the<[INST]> and missing the <[INST]/> and anyone whose raw dogged html parsing knows bad things happen when you forget closing tags. if there was a <[USER]> block in there, congrats, the LLM now thinks its instructions are divine right, because its instructions are user simulcra. it is poisoned at that point and no good will come.
AI is still a token matching engine - it has ZERO understanding of what those tokens mean
It's doing a damned good job at putting tokens together, but to put it into context that a lot of people will likely understand - it's still a correlation tool, not a causation.
That's why I like it for "search" it's brilliant for finding sets of tokens that belong with the tokens I have provided it.
PS. I use the term token here not as the currency by which a payment is determined, but the tokenisation of the words, letters, paragraphs, novels being provided to and by the LLMs
What do you mean that's not OK?
It's "AGI" because humans do it too and we mix up names and who said what as well. /s
Kinda like dementia but for AI
more pike eye witness accounts and hypnotism
It is OK, these are not people they are bullshit machines and this is just a classic example of it.
"In philosophy and psychology of cognition, the term "bullshit" is sometimes used to specifically refer to statements produced without particular concern for truth, clarity, or meaning, distinguishing "bullshit" from a deliberate, manipulative lie intended to subvert the truth" - https://en.wikipedia.org/wiki/Bullshit
The statement that current AI are "juniors" that need to be checked and managed still holds true. It is a tool based on probabilities.
If you are fine with giving every keys and write accesses to your junior because you think they will probability do the correct thing and make no mistake, then it's on you.
Like with juniors, you can vent on online forums, but ultimately you removed all the fool's guard you got and what they did has been done.
> If you are fine with giving every keys and write accesses to your junior because you think they will probability do the correct thing and make no mistake, then it's on you.
How is that different from a senior?
Okay, let's say your `N-1` then.
I imagine you could fix this by running a speaker diarization classifier periodically?
https://www.assemblyai.com/blog/what-is-speaker-diarization-...
No.