The bigger concern is what happens when AI models start training on AI-generated content at scale. Research papers are already documenting model collapse, where output quality degrades when training data is contaminated with synthetic text. The internet becoming majority bot content basically guarantees this becomes a real problem for the next generation of models.
> The internet becoming majority bot content basically guarantees this becomes a real problem for the next generation of models.
Only if you assume that people who train models are stupid.
Stupidity has nothing to do with it. AI articles and comments are now posted everywhere and presented as human. It's becoming harder and harder to determine whether text was written by humans or AI. Where are they supposed to find content to train AI on that isn't polluted with AI content that'll result in a feedback loop? It's like trying to get pure soil and water for growing that's not contaminated with microplastics/nanoplastics and PFAS. There was a time when it was possible. Not anymore. The filth is everywhere and impossible to filter.
And it's simply not reasonable for AI companies to have humans read through every individual comment from beginning to end to build their training data. There isn't enough time in the universe to advance AI while doing that and also being accurate. Something will always slip through.
If you can devise a tool that can detect AI generated content, you can use it to filter data. But the harsh truth is that "gold standard" training data is from before 2022 or whenever the cutoff was.
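The filtering idea can be sketched as a simple score-and-threshold pipeline. The `detector_score` heuristic below is a toy stand-in assumption (matching stock LLM phrases), not a real AI detector; a real one would be a trained classifier.

```python
def detector_score(text: str) -> float:
    """Toy stand-in for an AI-content detector: returns a score in [0, 1],
    higher meaning 'more likely synthetic'. Purely illustrative."""
    # Hypothetical heuristic: count stock LLM phrases in the text.
    stock_phrases = ("as an ai language model", "in conclusion,", "delve into")
    hits = sum(phrase in text.lower() for phrase in stock_phrases)
    return min(1.0, hits / len(stock_phrases))

def filter_corpus(docs: list[str], threshold: float = 0.3) -> list[str]:
    """Keep only documents the detector considers likely human-written."""
    return [doc for doc in docs if detector_score(doc) < threshold]

corpus = [
    "Fixed the race condition by taking the lock before the read.",
    "As an AI language model, I cannot browse the internet.",
]
clean = filter_corpus(corpus)  # only the first document survives
```

The catch, of course, is the same one the comment raises: if you had a reliable detector, the contamination problem would already be solved.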
And even that needs to be curated because before AI tools there was bot content filling up the internet.
...and even without bots, a lot of human-authored content is low value, poorly written, etc.
There are (probably) companies out there whose business is to create, curate and improve training sets.
> Only if you assume that people who train models are stupid
Someone in the chain will be. Even the smartest people buy a lot of their training datasets. What happens when those get contaminated?
Social media was raw sewage even before AI, and the WWW was 90% SEO spam generated by third-worlders for $2 a day.
I wonder if they will start deliberately not scraping social media because of the low-quality human content and AI sloppiness of it.
Suddenly the confirmed quality of scraped data will be at a premium... "Scrape Engine Optimizers"?
I think that >in the future< this will be a non-problem, as reality itself is a much better validator of behavior than human text.
We already see this with synthetic training data that basically uses logic in form of math and code as constraint.
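A minimal sketch of that idea: synthetic code samples are kept only if they actually execute and pass a check, so the interpreter, not a human reader, does the validation. The samples and checks below are illustrative assumptions.

```python
def validate_sample(code: str, check: str) -> bool:
    """Run a synthetic code sample and its check in a fresh namespace.
    Execution itself acts as the ground-truth validator."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # does the sample even run?
        exec(check, namespace)  # does it satisfy its own assertion?
        return True
    except Exception:
        return False

samples = [
    ("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"),
    ("def add(a, b):\n    return a - b", "assert add(2, 3) == 5"),  # wrong logic
]
validated = [(c, t) for c, t in samples if validate_sample(c, t)]
```

Only the correct sample survives; the buggy one fails its assertion and is dropped, no AI detector required.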
It already is a problem, and maybe this is an unpopular opinion but... that's a good thing. The LLM collapse can't possibly come soon enough. In principle, LLMs can be a good thing, but they can't overcome human nature: laziness and the unstoppable desire to take the shortest path. It's those two things that have turned the internet into the absolute dump it is today. Not to mention the bullshitter economy, as I like to call it, and everything that comes with it. And all things considered, society does need some reset at this point; the AI bubble might be a good place to kick things off.
That's a very interesting question I've been pondering. If all content is AI generated, where will innovation come from? Maybe we should differentiate AI-assisted content from AI garbage content.
More profitable not to innovate and form cartels.
Why should they care about new content? Game over already. Just keep regurgitating the same slop to the masses. Even before AI it was like this. How many 2-minute pop songs use the same chord structure? Just keep selling the same thing, slightly permutated (or not) from the last. That's capitalism, baby. This isn't a science.
Are you saying that high-quality, human-curated content will be rare and more appreciated in the future compared to endless cheap slop? Can't say I'm sad; on the contrary.
You should be careful letting what you want to be true cloud your judgement about what likely is true.
Exactly. The value of verified, human-sourced content goes up as synthetic noise increases. It's basically supply and demand. We might end up in a world where provenance matters more than the content itself.
At the end of the day, the value of producing content will drop to zero and the value of curating content will skyrocket.
> This notion of machine bad, human good just is not realistic
Glad I found this quote. It is quite helpful for an AI to search the web on my behalf... even if it was just finding where I can locally buy the particular (or similar) peanuts I got from abroad.
This notion isn't just unrealistic, but extremely dangerous. If we accept the "machine bad, human good" line of thinking, the only logical conclusion is that we'll have to verify our biometrics every time we'd like to access the internet. Like the UK age verification, but 100x worse.
Content providers will not agree with this decision, because machine browsing = no ads. Until that gets resolved, I don’t see incentives to align, since any free search requires ads for continuous business.
They could still serve ads if they could persuade the machines to make the purchase.
In fact, even ads ingested by the training data set at this very moment could be useful. Go to Gemini and tell it you want to buy a jacket or whatever and it will recommend some products it ingested from the training data.
One interesting dynamic here is that AI increases content supply much faster than human attention grows.
Which means filtering and ranking systems become the main bottleneck.
That pushes platforms toward stronger algorithmic selection and sometimes stronger convergence of attention.
Well, IoT traffic surpassed "human traffic" long ago, Netflix etc. eat a lot of bandwidth, and so on, so I am not sure where the news is exactly.
IoT traffic and streaming traffic is "invisible" to normal humans.
Your smart thermometer isn't making Reddit posts trying to sound like a human who's just concerned that the bedroom is a bit too warm.
How is Netflix not 'human traffic'?
It's hyperbole, click bait.
There are human-to-human (H2H), human-to-machine (H2M, or vice versa), and machine-to-machine (M2M) kinds of data communication.
If you perform a simple extrapolation, M2M data only surpasses the others around 2029.
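A crossover extrapolation like that can be sketched in a few lines. The starting shares and growth rates below are purely hypothetical, chosen only to illustrate how compounding growth produces a crossover year.

```python
def crossover_year(start_year: int, m2m: float, other: float,
                   m2m_growth: float, other_growth: float) -> int:
    """Step year by year until M2M traffic exceeds all other traffic.
    All inputs are illustrative assumptions, not measured data."""
    year = start_year
    while m2m <= other:
        m2m *= 1 + m2m_growth      # compound M2M growth
        other *= 1 + other_growth  # compound growth of everything else
        year += 1
    return year

# Hypothetical: M2M at 30 units growing 25%/yr vs. 70 units growing 5%/yr.
year = crossover_year(2024, 30.0, 70.0, 0.25, 0.05)  # → 2029
```

With these made-up numbers the crossover lands in 2029; with real traffic data the answer could obviously be quite different.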
Coincidentally, in the original timeline of the Terminator movies, 2029 is the year the Resistance, led by John Connor, destroyed Skynet and ended the war against the machines.
I view a lot of the "AI/bot internet" framing as a bit of a misnomer. Even before ChatGPT, the degradation of online content was already happening: SEO farms, worsening Google search. Most articles you'd find online would be paywalled, and most information about specific things would turn out to be a frustrating SEO labyrinth.
The current internet is awful, and there's so much AI/bot content, but I can find far more detailed information using AI-enabled search that isn't covered in ads. I can get an initial overview of a methodology without trawling through SEO articles.
I think AI has been almost a natural response to the enshittification of the internet. ChatGPT wouldn't seem so transformative if Google search had been working like Google search, rather than Ad Generator 5000, before it released.
Yeah, the internet has been shitty for, uh, decades now. 15-odd years ago people were already complaining about listicles and YouTube comments.
Best thing to do is to avoid idly browsing social media and curate your internet experience.
"Officially?"
Who is this official making this pronouncement?
Proposing a definition of slop: content optimized for profitability, regardless of quality.
If AI slop is replacing the content you were consuming, it was already slop.
That's silly. I can make slop without worrying about profit, too.