>for the sake of argument, that context can express everything weights can...
Does this imply that a completely untrained model (random weights) should show intelligent behavior simply by providing enough context?
Nope. Even if context can theoretically encode arbitrary computation under fixed weights, this requires the weights to implement a usable interpreter. Random weights almost surely do not. Training is what constructs that interpreter. Without it, context has no meaningful computational semantics.
It's kind of like asking whether a random circuit of logic gates would, by itself, be a universal computer that can run programs.
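To make the analogy concrete, here's a toy sketch of mine (not from the thread): randomly wired NAND circuits almost never happen to implement even a tiny target function like 2-bit XOR, much as random weights almost never happen to implement an interpreter for the context.

```python
import random

random.seed(0)

def random_circuit(n_inputs, n_gates):
    """Build a random feed-forward circuit of NAND gates.

    Each gate picks two earlier wires (inputs or prior gate outputs)
    uniformly at random; the last gate's output is the circuit output.
    """
    gates = []
    for g in range(n_gates):
        wires = n_inputs + g  # wires available so far
        gates.append((random.randrange(wires), random.randrange(wires)))

    def run(bits):
        values = list(bits)
        for a, b in gates:
            values.append(1 - (values[a] & values[b]))  # NAND
        return values[-1]

    return run

# Target behaviour we'd like a circuit to implement: 2-bit XOR.
def target(x, y):
    return x ^ y

# Sample many random circuits and count how often one happens to
# compute XOR correctly on all four input pairs.
trials = 1000
hits = 0
for _ in range(trials):
    run = random_circuit(2, 8)
    if all(run((x, y)) == target(x, y) for x in (0, 1) for y in (0, 1)):
        hits += 1

print(f"{hits}/{trials} random circuits compute XOR")
```

Random wiring overwhelmingly produces constants or trivial pass-throughs; getting a specific behaviour out requires structure, which is exactly what training supplies.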
That was exactly what I was thinking. So it is a bit unclear why such a possibility should be even considered.
To be fair, I didn't really understand what idea this article is trying to get across.
There has been a lot of talk about how continual learning might be "just an engineering challenge", and that we could have agents that continuously learn from experience simply by having longer and longer context windows.
Here is a clip of Dario hinting at something similar: https://www.youtube.com/watch?v=Z0x99Uu4rJc
What I am trying to argue in the article is that such a view might be misplaced: just extending the context length and adding more instructions to the context will not get you continual learning. The representational capacity of the weights will be the limiting factor.
Just a fun way to think about it. Would love to hear your thoughts.
>just extending the context length and adding more instructions in the context will not get you continual learning...
I agree. But I am wondering whether context would help in answering superficial questions and only fail on questions that require deeper understanding.
I'd say the way to think about it is in terms of whether the questions you ask are in-distribution or out-of-distribution w.r.t. the model's training dataset.
Consider this: if something fundamental has changed in the world after the model was released (i.e., after the knowledge cutoff date), then it is very difficult for the model to reason about it. One concrete example is the following: if you ask Opus or any decent coding model to do effort estimation on a coding task, it will come up with multi-week timelines. The models themselves don't know that, because "they exist", these timelines have now been slashed to a few hours. You can try saying this in the prompt, but they don't seem to internalise it.
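A toy illustration of the in-distribution vs. out-of-distribution point (my sketch, not the commenter's): a "model" that can only interpolate over what it has seen does fine near its training data and fails badly far outside it, no matter what you tell it at query time.

```python
# Toy "model": 1-nearest-neighbour lookup over a fixed training set.
# Training data samples y = x^2 on the interval [0, 2].
train = {0.0: 0.0, 0.5: 0.25, 1.0: 1.0, 1.5: 2.25, 2.0: 4.0}

def predict(x):
    """Answer with the y-value of the closest training point."""
    nearest = min(train, key=lambda t: abs(t - x))
    return train[nearest]

# In-distribution query: nearest training point is x=1.0,
# so it returns 1.0 (true value 1.21) - a small error.
print(predict(1.1))

# Out-of-distribution query: nearest training point is x=2.0,
# so it returns 4.0 (true value 100.0) - wildly wrong.
print(predict(10.0))
```

The lookup table plays the role of the frozen weights: queries near the training distribution get sensible answers, while anything past the "cutoff" of the training range gets clamped to stale knowledge.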
This site can’t provide a secure connection
www.aravindjayendran.com sent an invalid response.
ERR_SSL_PROTOCOL_ERROR
Didn't get this error before. Try now, it should be fixed.
Author here.
I spent the last weekend thinking about continual learning. A lot of people think that we can solve long-term memory and learning in LLMs by simply extending the context length to infinity. I explore a different perspective that challenges this assumption.
Let me know how you think about this.
Your conclusion touches on this, but I think the brain analogy is stronger than the hardware/software dichotomy.
It is also my very uninformed intuition: https://news.ycombinator.com/item?id=44910353
Also interesting to think about: could a single system be generally intelligent, or is a certain bias actually a strength? Could we have billions of models, each with their own "experience"?
I think both views have their merits. In my mind the hardware-vs-software analogy for weights vs. context holds better, because in most modern computing systems the hardware is fixed and the software changes. What the system can do efficiently, in practice, is a function of both the hardware and the software, and their respective capability ceilings.
The brain theory also kind of says the same thing, but it's hard to say what stays fixed vs. what changes with experience in the brain, I guess.
> Let me know how you think about this.
Well, I think of every Large Language Model as if it were a spectacularly faceted diamond.
More on these lines in a recent-ish "thinking in public" attempt by yours truly, lay programmer, to interpret what an LLM-machine might be.
Riff: LLMs are Software Diamonds
https://www.evalapply.org/posts/llms-are-diamonds/
lol nice analogy. LLMs are frozen diamonds forged in compute. We need them to be malleable in production and to change with experience.
Another way I see it is... Mind is process. LLM is (very lossy) snapshotted state of process/mind. LLM in-process is mind-emulator with potential to explore the state-space of the mind-snapshot. Consequently, and by its very construction, LLM cannot be mind.
I've never heard anyone say we can solve long-term memory by extending context to infinity. Curious about sources for this?
here you go: https://www.youtube.com/watch?v=Z0x99Uu4rJc