I like that these AI idioms exist. They're like watermarks for text. It's worth the cost of humans avoiding them. Companies will eventually train their models to be undetectable, but society would be better if they didn't.
> It's worth the cost of humans avoiding them
That's really unfortunate though. It's like Michael Bolton from Office Space: "No way! Why should I change? He's the one who sucks."
Except that the entire point of the article is that they're not AI idioms. They're not "watermarks for text." They're legitimate language constructions that LLMs tend to overuse, but that real humans also use. Real humans do, in fact, say "align with" all the time, just as often as "corresponds."
And you can pry my em dashes from my cold, dead hands.
Well, reading between the lines, I don’t think they’re saying all of those uses are AI. They’re legitimate constructs, like the em-dash, en-dash, and hyphen, all of which I used to use regularly. But now they’re AI tells, so I use them sparingly.
>Recent overuse by language models has led many to declare it bad writing. I'm not so sure.
It is bad writing.
You’re absolutely right to push back on this.
Sometimes it’s not just about the Ys but also the Qs.
> RLVR is weirder, and I suspect it's why we see "It's not X, it's Y" so often.
This feels like an easy enough hypothesis to verify, for anyone in the business of training LLMs - does the not-X-but-Y rate increase after RLVR?
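For what it's worth, here is a rough sketch of how that rate could be measured on sampled outputs. The regex is a crude heuristic I made up for illustration (it only catches "it's/this is/that is not X" followed by a comma, period, or semicolon and then "it's/this is/that is"), and the function name is hypothetical; a real comparison would run it over samples from checkpoints before and after RLVR.

```python
import re

# Heuristic, assumed pattern: "It's not (just) X, it's Y" or "It's not X. It's Y".
# It misses many variants (dash-separated forms, "isn't", "rather than", ...).
NOT_X_BUT_Y = re.compile(
    r"\b(?:it|this|that)(?:['’]s| is) not (?:just |only )?"
    r"[^.,;!?\n]{1,60}[.,;] ?(?:it|this|that)(?:['’]s| is)\b",
    re.IGNORECASE,
)


def rate_per_1000_words(text: str) -> float:
    """Return heuristic not-X-but-Y matches per 1000 whitespace-separated words."""
    words = len(text.split())
    if words == 0:
        return 0.0
    hits = len(NOT_X_BUT_Y.findall(text))
    return 1000.0 * hits / words


if __name__ == "__main__":
    sample = "It's not just an unlock. It's a major discovery."
    print(rate_per_1000_words(sample))
```

Running the same function over matched prompts from both checkpoints and comparing the two rates would give a first, noisy answer; the heuristic will both over- and under-count, so it only makes sense as a relative comparison.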
It’s unlikely this is true. LLMs are way more mad-libs / templates than we like to admit. That’s (ironically) not a judgement about their capability; it’s primarily just an observation. But it’s also what plain old SFT, which I believe is the primary culprit, ends up imparting.
Nice article, but as a non-native English speaker, I always use the model in English for reasoning and then translate the output into my language. Most of these considerations do not apply, because the translation step takes out a lot of these language artifacts.
This is how early forms of "reasoning" in LLMs worked: just literally inserting words like "Wait...", "Hmm...", "Let me reconsider...", "But is it really..." into the token stream.
Is this not how current forms of reasoning work? It seems like the open models still output things like that, and the closed ones all just summarize their thinking instead to avoid distillation, but probably do the same thing internally.
> In the end, shaming people for writing that gets flagged as AI can lead people to sidestep structures the model has learned from us
It's interesting why LLMs generate constructions like this more frequently than they presumably exist in the training set. I wonder if this is some sort of mode collapse caused by post training, and/or maybe because they are training on synthetic data so these things become self-perpetuating and self-amplifying (a feedback loop)?
The lesson for humans worried about being falsely identified as AI is just learn to write better! It doesn't matter where your repertoire of phrasing comes from (copying AI or not), but one of the basic rules of writing is not to repeat yourself unless you are doing so deliberately for a purpose. Go ahead and use "It's not just X. It's Y" if you want to, but if you use it multiple times in the same short piece of writing, then you may deserve to be called out for poor style, if not for being an AI.
Another bunch of dead giveaways in code bases with READMEs is the repetitive:
- "No X, No Y, No Z." pattern
- "Here is X - it makes Y"
The worst and most obvious one is the constant overuse of emoji ticks and crosses.
For calibration purposes, I offer you a pre-LLM README I wrote that includes an em-dash* followed by "No X, No Y, No Z": https://github.com/DavidBuchanan314/stelf-loader
*actually a hyphen but it's functioning as an em dash.
"Hyphen functioning as an em dash" is an expected human thing as it's what's easy to type. It's specifically an actual em dash which got bulldozed, much to the dismay of those who bothered to put the unicode character in.
If you read The Mac is Not a Typewriter in 1992—thus burning Option-Shift-hyphen into your typing patterns for life, along with a dogmatic love for serif body fonts—you're the real victim here.
I prefer the double dash "--", but Microsoft products will convert this to a proper em-dash if you press space afterwards, I think...
Double should map to an en dash, triple to an em dash.
A lot of the LLM bots on HN (and elsewhere) will find-and-replace their em dashes with hyphens in an attempt to evade detection.
Precisely, anything to remove AI smells in favor of natural-looking text.
My point is I don't consider em dash vs hyphen to be a strong signal either way, humans and bots alike use both interchangeably.
A signal is not the same thing as a guarantee. Both your points so far (the text you provided, and the fact that bots often bother to replace em dashes to avoid detection) actually support that it is a signal, though.
and we will now hold you responsible!
Alternatively, no one sounds like an LLM; an LLM sounds like someone, typically those close to the median of the training corpus. If AI were genuinely capable of novelty, it would be a big deal. Tech bros having enough work ethic to design new detectable prose for an LLM is a massive reach with no real evidence supporting it. Otherwise, why do tech bros only tackle the easier issues, the things we have massive well-labelled corpora for? Why is it never dishwashing and folding laundry?
I put it to you: if you see a trope in AI writing, it's because that trope appeared in the training corpus. Therefore, sure, being prejudiced against it lets you catch some AI, but you'll also flag human output. I think that may not be worth it in the end.
You’re absolutely right. This is the smoking gun. This changes everything.
Now I see the full picture.
This is the real unlock. Here's the key takeaways.
It's not just an unlock. It's a major discovery.
https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing#...
Signs? Those are normal ways of writing? What the hell? Is everything AI now?