I don't understand the point of automating note-taking. Copy-pasting text into my notes never worked for me, and now you can 100x that?
The whole point of taking notes, for me, is to read a source critically, fit it into my mental model, and then document that. Sometimes I look it up later for the details, but for me the shaping of the mental model is what counts.
The few scientific studies out there actually show output quality degrading when these markdown collections are fully LLM-maintained (as opposed to improving when they're human-maintained), which I found fascinating.
I think the sweet spot is human curation of these documents. Unsupervised management is never the answer, especially if you don't consciously think about debt/drift in them.
Then you've never worked on a large enough codebase, or across enough projects?
Put AI in your product name, make a billion dollars. Put Karpathy in your blog article, get hired by Anthropic as a principal engineer. Milk the money as long as the fad lasts. No one is thinking about customer needs; everyone is trying to cash in on the wave while it lasts.
LLMs and the agents that use them are probabilistic, not deterministic. They succeed at a given task some percentage of the time, never every time.
That means the longer an agent runs on a task, the more likely it is to fail. Running agents like this will always fail and burn a ton of token cash in the process.
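One way to see the compounding: a toy model, assuming (a simplification, not something the comment states) that each step of a run succeeds independently with the same probability p, so an n-step run finishes without a mistake with probability p^n.

```python
def run_success_probability(p_step: float, n_steps: int) -> float:
    """Chance that n_steps independent steps all succeed, each with probability p_step.

    Toy model only: real agent steps are neither independent nor equally reliable.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


if __name__ == "__main__":
    # Hypothetical per-step success rate of 99%; longer runs decay quickly.
    for n in (10, 100, 500):
        print(n, round(run_success_probability(0.99, n), 3))
```

At a hypothetical 99% per-step rate, 100 steps already drop the odds of an error-free run to roughly 37% (0.99^100 ≈ 0.366).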
One thing LLM agents are good at is writing their own instructions. The trick is to limit the time and thinking steps in a thinking model, then evaluate, update, and run again. A good metaphor is that agents trip: don't let them run long enough to trip. It is better to let them run twice for 5 minutes than once for 10 minutes.
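A minimal sketch of that run/evaluate/update loop. `run_agent`, `evaluate`, and `revise` are hypothetical placeholders for a real agent call, a check on its output, and an instruction-rewrite step; `step_budget` stands in for the time/thinking-step cap, and the toy stand-ins exist only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RunResult:
    output: str
    steps_used: int


def bounded_loop(
    instructions: str,
    run_agent: Callable[[str, int], RunResult],
    evaluate: Callable[[RunResult], bool],
    revise: Callable[[str, RunResult], str],
    step_budget: int = 20,
    max_rounds: int = 3,
) -> Tuple[Optional[RunResult], str]:
    """Run an agent in short capped bursts, revising its instructions between bursts.

    Returns the first result that passes `evaluate` (or None if none did within
    max_rounds) together with the latest instructions.
    """
    for _ in range(max_rounds):
        # The step cap is passed to run_agent, which is expected to stop there.
        result = run_agent(instructions, step_budget)
        if evaluate(result):
            return result, instructions
        # Rewrite the instructions using what the failed run produced.
        instructions = revise(instructions, result)
    return None, instructions


# Toy stand-ins (hypothetical) so the sketch runs without a real model.
def toy_agent(instructions: str, step_budget: int) -> RunResult:
    ok = "cite sources" in instructions
    return RunResult(output="done" if ok else "rambling", steps_used=min(step_budget, 5))


def toy_evaluate(result: RunResult) -> bool:
    return result.output == "done"


def toy_revise(instructions: str, result: RunResult) -> str:
    return instructions + " Always cite sources."


if __name__ == "__main__":
    result, final_instructions = bounded_loop(
        "Summarize the doc.", toy_agent, toy_evaluate, toy_revise, step_budget=10
    )
    print(result, final_instructions)
```

The design choice mirrors the comment: each burst is short and capped, and the evaluation point between bursts is where a run that has started to "trip" gets caught instead of compounding.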
Give it a few weeks and self-referencing agents are going to be at the top of everybody's Twitter feed.
Karpathy's original post for context:
https://x.com/karpathy/status/2039805659525644595
https://xcancel.com/karpathy/status/2039805659525644595
Couldn't you instruct your LLM to make the starting dir configurable?
Any particular reason for BM25? Why not just a table of contents or index structure (JSON, md, whatever) that is updated automatically and fed into context at query time? I know bag-of-words is great for speed, but even at thousands of documents the index can be quite cheap and will maximise precision.
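For reference when weighing the two options, here is a minimal Okapi BM25 scorer. The tokenizer and the common defaults k1=1.5, b=0.75 are assumptions for illustration; the project's actual parameters aren't stated in the thread. It needs only term counts and document lengths, which is why it stays cheap; the trade-off the comment raises is that a TOC fed into context lets the model judge relevance itself.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        doc_tokens = [tokenize(d) for d in docs]
        self.tfs = [Counter(toks) for toks in doc_tokens]  # term frequency per doc
        self.doc_lens = [len(toks) for toks in doc_tokens]
        self.n_docs = len(docs)
        self.avgdl = sum(self.doc_lens) / self.n_docs if self.n_docs else 0.0
        df: Counter = Counter()  # number of docs containing each term
        for toks in doc_tokens:
            df.update(set(toks))
        self.idf = {
            term: math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in df.items()
        }

    def score(self, query: str, i: int) -> float:
        """BM25 score of document i for the query."""
        tf, dl = self.tfs[i], self.doc_lens[i]
        s = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            # Length normalisation: long docs need more matches to score as high.
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return s

    def top_k(self, query: str, k: int = 20) -> list[tuple[int, float]]:
        """Indices and scores of the k best-scoring documents."""
        scored = [(i, self.score(query, i)) for i in range(self.n_docs)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```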
The BM25-first routing bet is interesting. You mention 85% recall@20 on 500 artifacts, but the heuristic classifier routing "short lookups to BM25 and narrative queries to cited-answer" raises a practical question: what does the classifier key on to decide whether a query is narrative or short? Token count? Syntactic structure? The reason I ask is that in agent-generated queries the boundary is often blurry: an agent doing a dependency lookup might issue a surprisingly long, well-formed sentence. If the classifier routes those to the more expensive cited-answer loop, it could negate the latency advantage of BM25 being first.
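To make the question concrete, here is a hypothetical router keyed on token count plus a few cue words. The project's real rules aren't given in the thread; the cue list and the 8-token threshold are invented for illustration. It reproduces the failure mode described: a long, well-formed agent lookup gets sent to the expensive path.

```python
import re

NARRATIVE_CUES = ("why", "how", "explain", "compare", "what happened")  # hypothetical
MAX_LOOKUP_TOKENS = 8  # hypothetical threshold


def route(query: str) -> str:
    """Return 'bm25' for short lookups, 'cited_answer' for narrative-looking queries."""
    q = query.lower()
    tokens = re.findall(r"\w+", q)
    has_cue = any(re.search(rf"\b{re.escape(cue)}\b", q) for cue in NARRATIVE_CUES)
    if len(tokens) <= MAX_LOOKUP_TOKENS and not has_cue:
        return "bm25"
    return "cited_answer"


if __name__ == "__main__":
    print(route("redis version pinned in api service"))
    print(route("Why did we switch from Postgres to SQLite for local dev?"))
    # A plain lookup phrased as a long agent-style sentence is misrouted.
    print(route("Please find the exact version of the requests library "
                "that the billing service currently pins in its lockfile"))
```

A length threshold alone can't tell a verbose lookup from a narrative question, which is the gap the comment points at.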
I love that so many people are building with markdown!
But I'd also like to understand how markdown helps with durability. If I understand correctly, markdown has an edge over other formats for LLMs.
I'm also building something similar on markdown, versioned with git, but for a completely different use case: https://voiden.md/
I read the durability thing as: markdown files are very open, simple, widely used, and easy to find software for. All of this together almost guarantees that they will be viewable/usable in the far future.
So markdown will be great for distribution in the future.
Don’t know if Karpathy even wrote this version. Where are the citations?
need to try this out asap. love the "The Office" vibe
love the bm25-first call over vector dbs. most teams jump to vectors before measuring anything
I was looking for something similar to try out. Cool!
[flagged]
Feels like disliking a musician for being fanatical about musical instruments.
[flagged]
I've had the same feeling ever since his infamous LLM OS post.
Probably just envy.
Obviously it's envy, and not scepticism about a guy who practically lives on Twitter and has an unhinged[1] follower base.
[1] https://x.com/__endif/status/2039810651120705569
Cool idea. But is anyone actually building real stuff like this with any kind of high quality?
Every time I hear someone say "I have a team of agents", what I hear is "I'm shipping heaps of AI slop".
+100 for this comment.
why not an Obsidian vault with a plugin?
what plugin are you using?
srsly tho this looks slick & love the The Office refs / will go play with it :)