At some point, there will be a successful copyright infringement suit against an LLM user who redistributes infringing output generated by an LLM. It could be the NYTimes suit, or it could be another, but it's coming — after which the industry will face a Napster-style reckoning.
What comes next? Perhaps it won't be that hard to assemble a proprietary licensed corpus and get decent performance out of it. Look at all the people already willing to license their voices.
The law exists to protect the elite and punish the underclass. We’re not in a Hollywood movie. Nothing will happen.
Language Models are Injective and Hence Invertible https://arxiv.org/abs/2510.15511
Demo: https://cauchy221.github.io/Alignment-Whack-a-Mole/
Arxiv: https://arxiv.org/abs/2603.20957
I’m a researcher who for years has been scanning my library’s holdings on my particular discipline for my own use, but also uploading the books to the shadow libraries for everyone else’s benefit. The revelation that LLMs are training on the shadow libraries has made me put a lot more effort into ensuring my scans are well-OCRed. The idea that I could eventually ask ChatGPT or whatever about obscure things in my field, and get useful output (of the "trust but verify" sort), is exciting.
Full book content and model generations are not included because the books are copyrighted and the generations contain large portions of verbatim text.
There are plenty of old books in the public domain already... but I'm not sure what exactly this exercise is supposed to show, since the Kolmogorov limit still stands in the way of "infinite compression".
Ok, we can drop the farce now that it isn’t compression at the core. The anthropomorphic bullshit has done the job it was supposed to do: allow us to centralize the knowledge economy at the cost of IP holders, claim the efficiency gains from centralization as the result of technology, and force governments to choose “teh future” (and investments) over maintaining copyright. That is a massive value reallocation in society.
Maybe we can disband the effective altruism cult that helped push it now.
I scanned a page of a particular book, and several models recognized that it was from that book. It almost felt as if they were regurgitating content they already knew rather than doing real OCR.
Intelligence is compression.
And frankly, if this means the end of copyright: good riddance.
"To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right .."
Copyright needs to exist, but we need to go back to its roots.
Everyone forgets that it exists to promote progress. Nothing else. The ability to profit from it exists only to serve those ends.
Anything which does not serve to promote the progress of the arts and sciences should not be protected, and "limited times" never meant "until Walt Disney says so."
It won't mean the end of copyright; at most it will just shift the balance of power from one set of giant corporations to another.
Anthropic (predictably) issued many DMCA takedown requests after the Claude Code leak.
Copyright for me, but not for thee.
Copyright is what facilitates copyleft. Getting rid of IP protections also rids us of GPL, which gave us a few things including the most popular OS in the world.
It’s one thing to reject the specifics of IP laws as currently implemented; it’s another thing to celebrate the dismantling of the entire foundation of open source by for-profit corporate interests who have sought to do it for decades.
RMS on copyright:
"This means that copyright no longer fits in with the technology as it used to. Even if the words of copyright law had not changed, they wouldn't have the same effect. Instead of an industrial regulation on publishers controlled by authors, with the benefits set up to go to the public, it is now a restriction on the general public, controlled mainly by the publishers, in the name of the authors.
In other words, it's tyranny. It's intolerable and we can't allow it to continue this way.
As a result of this change, [copyright] is no longer easy to enforce, no longer uncontroversial, and no longer beneficial"
from https://www.gnu.org/philosophy/copyright-versus-community.en...
First, if we assume Stallman is human, we have to grant he will not be right about everything (in fact, we already know he isn’t, as he has publicly changed his views on certain things).
Second, he does make an argument that copyright should have reduced power, which we can all agree with; he does not appear to argue for the death of copyright. Death of copyright might seem counter-productive, unless it also implied the death of corporate ability to withhold the source from the users and many other things.
You will note that the very text you linked to is copyrighted. There’s a reason for that.
And yet he is.
Copyright is what enables free and open licenses such as Creative Commons and every version/variant of the GPL. Without copyright, what would become of these licenses, and movements that have espoused them?