If the code is different but API-compatible, the Google v. Oracle Java case shows that if the implementation is different enough, it can be considered a new implementation, clean room or not.
That whole clean room argument makes no sense. The project changed governance and was significantly refactored or reimplemented... I think the maintainers deserve to call it their own. The original, pre-MIT release can stay LGPL.
I don't think this sets a precedent either; plenty of projects have changed licenses lol.
I keep kind of mixing them up, but the GPL licenses keep popping up in occasional horror stories. Maybe the license is just poorly written by today's standards?
OK, since this is not really answered... Hypothetically, if I'm a maintainer of this project: I decide I hate the implementation. It's naive, has horrible performance, weird edge cases. I'm wiser today than three years ago.
I rewrite it, my head full of my own original, new ideas. The result turns out great. A few if statements and while loops look the same, and some public interfaces stayed the same, but all the guts are brand new, shiny, my own.
You have full rights to the code you wrote that is not "colored" by previous code, a.k.a. "an original work".
But code that is any kind of derivative of the code before it contains a complex mix of other people's rights. It can be relicensed, but only if all authors, large and small, agree to the terms.
Hmm are we in a ship of Theseus/speciation area? Each individual step of refactoring would not cross the threshold but would a rewrite? Even if the end result was the same?
Let us also remember that certain architectural changes need to happen over a series of planned refactors. Nobody wants to read a 5000-line shotgun-blast-looking diff.
So effectively, LGPL means you freely give all copyright for your work to the license holder? Even if the license holder has moved on from the project?
What if I decide to make a JS or Rust implementation of this project and use it as inspiration? Does that mean I'm no longer doing a "clean room" implementation, and my project is contaminated by the LGPL too?
Governance change or refactoring don’t give you a right to relicense someone else’s work. It needs to be a whole new work, which you own the copyright to.
Isn't the real issue here that tons of projects that depend on chardet now drag in some crappy, still-unverified AI slop? AI forgery poisoning, IMHO.
Why did this new project need to replace the original in this dishonourable way? The proper approach would have been to create a separate new project.
Note: even Python's own pip seems to drag this in as a dependency (hopefully they'll stick to a proper version).
I wonder if LLMs will push the industry towards protecting its IP with patents, like the other branches of engineering, rather than copyright. If you patent the general idea of how your software works, no rewrite can lift that protection.
I think Mark Pilgrim misrepresents the legal situation somewhat: the AI rewrite does not legally need to be a clean-room implementation (whatever exactly that would even mean here).
That is just the easiest way to disambiguate the legal situation (i.e. the most reliable approach to prevent it from being considered a derivative work by a court).
> Licensed code, when modified, must be released under the same LGPL license. Their claim that it is a "complete rewrite" is irrelevant, since they had ample exposure to the originally licensed code (i.e. this is not a "clean room" implementation).
I don't think the second sentence is a valid claim per se; it depends on what this "rewritten code" actually looks like (IANAL).
Edit: my understanding of a "clean room implementation" is that it is a good defence to a copyright infringement claim, because there cannot be infringement if you don't know the original work. However, it does not follow that NOT being a "clean room implementation" implies infringement; it's just potentially harder to defend against a claim if the original work was known.
I was wondering how the existing case law on translated works, from one language to another, applies here. It would at least suggest that this is an infringement of the license, especially because of the lack of creativity. But IANAL and have no idea of the applicable case law.
I agree that (while the ethics of this are a different issue) the copyright question is not obviously clear-cut. Though IANAL.
As the LGPL says:
> A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".)
Is v7.0.0 a [derivative work](https://en.wikipedia.org/wiki/Derivative_work)? It seems to depend on the details of the source code (implementing the same API is not copyright infringement).
The AI copy-pasted the existing project. How can such a procedure not fall under copyright?
Especially now that AI can do this for any kind of intellectual property, like images, books, or source code. If judges were to allow an AI rewrite to count as an original creation, copyright as we know it would completely end worldwide.
Instead, what's more likely is that no one is gonna buy that shit.
It's up to them to prove that a) the original implementation was not part of whatever dataset said AI used, and b) the engineers in question did not use the original as a basis.
No, that's not how copyright law works. Especially in a world where the starting point is the accused taking something and marketing it as their own with a license change.
It's still on the claimant to establish copying, which usually involves showing that the two works are substantially similar in their protected elements. That the defendants had access to the original helps establish copying, but isn't sufficient on its own.
Only after that would the burden be on the defendants, such as to give a defense that their usage is sufficiently transformative to qualify as fair use.
It will hold up in court. The line of argument of “well, I went into a dark room with only the first Harry Potter book and a typewriter and reproduced the entire work, so now I own the rewrite” doesn't hold up in court, and it doesn't hold up when you put AI in the mix either. It doesn't matter if the result is slightly different; a judge will rule based on the fact that this is literally what the law is intended to prevent. It's not a question of which incantation or secret sentence you should utter to free the work of its existing license.
> “well, I went into a dark room with only the first Harry Potter book and a typewriter and reproduced the entire work, so now I own the rewrite”
This is not a good analogy.
A "rewrite" in context here is not a reproduction of the original work but a different work that is functionally equivalent, or at least that is the claim.
Possibly important: it's largely API-compatible but not functionally equivalent, in that its performance (meaning accuracy, not just speed) is different.
I came here to say this. While I agree with Mark that what they’re doing is not nice, I’m not sure it’s wrong. A clean-room implementation is one way the industry worked around licensing in the past (and present, I guess), but it’s not a requirement in law as far as I know.
I’m not sure that “a total rewrite” wouldn’t, in fact, pass muster - depending on how much of a rewrite it was of course. The ‘clean room’ approach was just invented as a plausible-sounding story to head off gratuitous lawsuits. This doesn’t look as defensible against the threat of a lawsuit, but it doesn’t mean it wouldn’t win that lawsuit (I’m not saying it would, I haven’t read or compared the code vs its original). Google copied the entire API of the Java language, and got away with it when Oracle sued. Things in a courtroom can often go in surprising ways…
[edit: negative votes, huh, that’s a first for a while… looks like Reddit/Slashdot-style “downvote if you don’t like what is being said” is alive and well on HN]
Lol at the statement that "clean room" was invented to scare people away from suing. It's the opposite: clean room is a fairly desperate attempt to pre-empt accusations in court when the "derivative" argument is expected to be very strong, in order to then piggyback on the interoperability doctrine. Sometimes it works, but it's a very high bar to clear.
I feel like the author is missing a huge point by fighting this. The entire reason GPL and other copyleft licenses exist in the first place is to ensure that a user's rights to modify, etc., a work can never be taken away. Before, relicensing as MIT (or any other fully permissive license) would've meant open doors to apply restrictions going forward, but with AI this is now a non-issue. Code is now very cheap. So the way I see it, anyone who is for copyleft should be embracing, hard, the idea that AI-created things are not copyrightable (or that a rewrite is relicensable).
Isn't it? I mean, a 12-stage pipeline has a very specific meaning to me in this area, and it's not a new way of describing something. The release notes' description sounds like a multi-stage pipeline.
Do you know this area, and are you commenting on the code itself?
I think it's just the GPL family of licenses that tends to cause most problems. I appreciate their intent, but the outcome often leaves a lot to be desired.
The GPL exists for the benefit of end users, not developers. It being a chore for developers who want to deny their users the software freedoms is a feature, not a bug.
If you have ill intentions or maybe you're a corporation that wants to use someone else's work for free without contributing anything back, then yes, I can see how GPL licenses "tend to cause problems".
I like to think about GPL as a kind of an artistic performance and an elaborate critique of the whole concept of copyright.
Like, "we don't like copyright, but since you insist on enforcing it and we can't do anything against it, we will invent a clever way to use your own rules against you".
That is definitely not the motivation behind GPL licenses. The motivation always was and still is to ensure by legal means that anyone can learn from the source code of software, fix bugs on their own, and modify the software to their needs.
Wtf are these comments? An LGPL-licensed project, guaranteed to be free and open source, is being LLM-washed to a permissive license, and GPL is the problem here?
They are literally stealing from open source, but it's the original license that is the issue?
Why? What's your problem with them? They do exactly what they're supposed to do, to ensure that future derivatives of the source code have to be distributed under the same license and distribution respects fundamental freedoms.
As part of my consulting work, I've stumbled upon this issue in a commercial context. A SaaS company whose platform's mobile apps are open source approached me with the following concern.
One of their engineers was able to recreate their platform by letting Claude Code reverse engineer their apps and the web frontend, creating an API-compatible backend that is functionally identical.
It took him a week, after work. It's not as stable, the unit tests need more work, the code has some unnecessary duplication, and hosting isn't fully figured out, but the end-to-end test harness is even more stable than their own.
"How do we protect ourselves against a competitor doing this?"
Noodling on this at the moment.
You're not describing anything new; you're describing progress. A company invests money in building a product, it becomes established, people copy it, and the quality of products across the industry improves. Long before generative AI, Instagram famously copied Snapchat's stories concept in a weekend, and that is now a multi-billion-dollar contributor to Meta's bottom line.
As engineers, we often think only about code, but code has never been what makes a business succeed. If your client thinks that their business's primary value is in the mobile app code they wrote: 1) why is it even open source? 2) the business is doomed.
Realistically, though, this is inconsequential, and any time spent worrying about this is wasted time. You don't protect yourself from your competitor by worrying about them copying your mobile app.
> "How do we protect ourselves against a competitor doing this?"
If the platform is so trivial that it can be reverse engineered by an AI agent from a dumb frontend, what's there to protect? One has to assume that their moat is not that part of the backend but something else entirely about how the service is provided.
Interesting case. IANAL, but it sounds legal and legit. The AI did not have exposure to the backend it re-implemented. The API itself is public and not protectable.
OTOH, as of yesterday, the output of an LLM isn't copyrightable, which makes licensing it difficult.
That's a very incorrect reading.
AI can't be the author of a work. The human driving the AI can, unless they zero-shotted the solution with no creative input.
The human is still at best a co-author, as the primary implementation effort isn't theirs. And I think the effort involved is the key contention in these cases. Yesterday, ideas were cheap and execution was what mattered. Today, execution is probably cheaper than ideas, but the principle should still hold.
I wrote this comment on another thread earlier, but it seems relevant here, so I'll just c/p:
I think we haven't even begun to consider all the implications of this, and while people ran with that one case where someone couldn't copyright a generated image, it's not that easy for code. I think there needs to be far more litigation before we can confidently say it's settled.
If "generated" code is not copyrightable, where do we draw the line on what "generated" means? Do macros count? Does code that generates other code count? Protobuf?
If it's the tool that generates the code, again, where do we draw the line? Is it just using third-party tools? Would training your own model count? Would a "random" code generator where you pick the winners (by whatever means) count? Would brute-forcing the entire space (silly example, but hey, we're in silly territory here) count?
Is it just "AI"-adjacent code that isn't copyrightable? If so, how do you define AI? Does autocomplete count? IntelliSense? Smarter IntelliSense?
Are we gonna have a trial where there's at least one lawyer making silly comparisons between LLMs and power plugs? Or maybe counting abacuses (abaci?)... "But your honour, it's just random numbers / matrix multiplications..."
As of yesterday?
I think they mean this: https://news.ycombinator.com/item?id=47232289
You might be interested in the dark factory work here https://factory.strongdm.ai/
They do something very similar for some of their work. It's hard to use external services, so they replicate them, and the cost of doing so has come down from "don't be daft, we can't reimplement Slack and Google Drive this sprint just to make testing faster" to realistic. They run the SDKs against the live services and against their own implementations until they don't see behaviour differences. Now they have a fast Slack and Drive and more (that do everything they need for their testing), which accelerates other work. I'm dramatically shifting my concept of what's expensive and what isn't in development. What you're describing could have been done by someone before, but the difficulty of building that backend has dropped enormously. Even if the application were closed, you could probably, either now or soon, do the same thing: start by building back the core user stories and then build the app as well.
You can view some of this as having the application itself serve as a very precise specification.
Really fascinating moment of change.
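The replay-against-both-backends loop described above is essentially differential testing. Here is a minimal sketch; the backend objects and the `send_message` method are hypothetical stand-ins for illustration, not any real Slack or Drive SDK:

```python
def differential_test(real_api, local_api, calls):
    """Replay `calls` against both backends and collect any mismatches."""
    mismatches = []
    for method, args in calls:
        real_result = getattr(real_api, method)(*args)
        local_result = getattr(local_api, method)(*args)
        if real_result != local_result:
            mismatches.append((method, args, real_result, local_result))
    return mismatches

class FakeBackend:
    """Minimal in-memory stand-in used here just to exercise the loop."""
    def __init__(self, greeting):
        self.greeting = greeting
    def send_message(self, channel, text):
        return {"channel": channel, "text": f"{self.greeting}: {text}"}

calls = [("send_message", ("general", "hello"))]
diff = differential_test(FakeBackend("real"), FakeBackend("real"), calls)
print(diff)  # [] - identical behaviour, no divergence found
```

In practice the `calls` list would come from recorded SDK traffic, and the loop runs until no divergences remain.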
> "How do we protect ourselves against a competitor doing this?"
DMCA. The EULA likely prohibits reverse engineering. If a competitor does that, hit 'em with lawyers.
Or, if you want to be able to sleep at night, recognize this as an opportunity instead of a threat.
The famous Google v. Oracle case may need to be re-evaluated in light of agents making API implementation trivial.
https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
If your backend is trivial enough to be implemented by a large language model, what value are you providing?
I know it's a provocative question, but it answers why a competitor is not really a competitor.
Maybe a better question is:
How do our competitors protect themselves against us doing this?
Nothing. This is why SaaS stocks took a dump last week.
Wow, that's hot. I was not aware that you need to be "untainted" by the original LGPL code. This could mean that...
All AI-generated code is tainted with GPL/LGPL, because the LLMs might have been trained on it.
Being completely untainted is the standard many reimplementations set for themselves to completely rule out legal trouble. For example ReactOS won't let you contribute if you have ever seen Windows code. Because if you have never seen it, there can be no allegation that you copied it.
That is, however, stricter than what's actually legally necessary. It's just that the actual legal standard would require a court ruling to determine whether you passed it, and everyone wants to avoid that. As a consequence, there also aren't a lot of court cases to draw similarities to.
Not a lawyer, but that always seemed naively correct to me.
However, the copyright system has always been a sham to protect US capital interests. So I would be very surprised if this were actually ruled or enforced this way. And in any case, American legislators can just change the law.
Yes, that's what some lonely people have been shouting in the desert since the LLM craze started.
Does "lonely" in this case encompass people who've formed relationships with said LLMs?
I'm not lonely! And I've stopped shouting that since '24, because, you know :/
Sounds like they didn’t build a proper clean room setup: the agent writing the code could see the original code.
Question: if they had built one using AI teams in both "rooms", one writing a spec and the other implementing it, would that be fine? You'd need to verify that the spec doesn't include source code, but that's easy enough.
It seems to mostly follow the IBM-era precedent. However, since the model probably had the original code in its training data, maybe not? Maybe valid for a closed-source project but not an open-source one? Interesting question.
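The two-room setup proposed above (one agent distills a spec, a second agent with a fresh context implements it, and a gate in between checks that the spec carries nothing verbatim from the original) could be sketched like this. `generate` is a placeholder for an arbitrary LLM call, not a real API, and the verbatim-run check is only one crude way to gate the spec:

```python
def generate(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM call")

def leaks_expression(spec: str, original: str, min_run: int = 6) -> bool:
    """Crude gate: reject the spec if it shares any long verbatim
    token run with the original source."""
    orig_tokens = original.split()
    runs = {tuple(orig_tokens[i:i + min_run])
            for i in range(len(orig_tokens) - min_run + 1)}
    spec_tokens = spec.split()
    return any(tuple(spec_tokens[i:i + min_run]) in runs
               for i in range(len(spec_tokens) - min_run + 1))

def clean_room_rewrite(original_source: str) -> str:
    spec = generate("Describe only the observable behaviour, inputs and "
                    "outputs of this code:\n" + original_source)
    if leaks_expression(spec, original_source):
        raise ValueError("spec contains verbatim runs from the original")
    # Fresh prompt: the implementing agent never sees original_source.
    return generate("Implement this specification:\n" + spec)
```

As the surrounding comments note, a token-level check like this only catches literal copying; it says nothing about the training-data objection.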
Not if the codebase was included in training the implementer.
> Sounds like they didn’t build a proper clean room setup: the agent writing the code could see the original code.
It doesn't matter how they structure the agents. Since chardet is in the LLM training set, you can't claim any AI implementation of it is a clean-room one.
Yeah I mention that in the question.
Might still be valid for closed source projects (probably is).
I think courts would need to weigh in on the open-source side. There's legal precedent that you can use a derived work to generate a new, unique work (and the spec derived from the copyrighted code is very much a derived work). There are rulings that LLMs are transformative works, not just copies of their training data.
LLMs can't reproduce their entire training set. But this thinking is also ripe for misuse: I could always train or fine-tune a model on the original work so that it can reproduce the original. We quickly get into statistical arguments here.
It’s a really interesting question.
This seems right to me. If you ask an LLM to derive a spec that has no expressive element of the original code (a clean-room human team can carefully verify this), and then ask another instance of the LLM (with fresh context) to write code from the spec, how is that different from a "clean room" rewrite? The agent that writes the new code only ever sees the spec, and by assumption (the assumption made in all clean-room rewrites) the spec is purely factual, with all copyrightable expression distilled out. But the step of deriving the spec (and verifying that it's as clean as possible) is crucial and cannot be skipped!
It requires the original project not to be in the model's training data for it to be a clean-room rewrite.
That only matters if expression from the original project really does end up in the rewrite, doesn't it? This can be checked for (by the team with access to the code), and it's also quite unlikely. It's not trivial at all to have an LLM replicate its training data verbatim: even when feasible (the Harry Potter case, a work that's going to be massively overweighted in training due to its popularity), it takes very specific prompting and hinting.
> That only matters if expression of the original project really does end up in the rewrite, doesn't it?
No, I don't think so. I hate comparing LLMs with humans, but for a human, being familiar with the original code might disqualify them from writing a differently-licensed version.
Anyway, LLMs are not human, so, as many courts have confirmed, their output is not copyrightable at all, under any license.
Uh, this is just a curiosity, but do you have a reference for that last argument?
If true, it would mean most commercial code being developed today, since it's increasingly AI-generated, would actually be copyright-free. I don't think most Western courts would uphold that position.
https://news.ycombinator.com/item?id=47232289
How would a team verify this for any current model? They would have to observe and control all the training data. In practice, any currently available model that is good enough to perform this task likely fails the clean-room criteria by having a copy of the source code of the very project it is asked to rewrite. At that point it's basically an expensive, lossy copy-paste.
You can always verify the output. Unless the problem being solved really is exceedingly specific and non-trivial, it's at least unlikely that the AI will rip off recognizable expression from the original work. The work may be part of the training but so are many millions of completely unrelated works, so any "family resemblance" would have to be there for very specific reasons about what's being implemented.
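One crude, mechanical way to "verify the output": check whether any non-trivial line survives verbatim from the original into the rewrite. This is only a first pass (a real review would also compare structure, identifiers, and comments), and the length threshold below is an arbitrary choice for illustration:

```python
# Illustrative check for verbatim expression leaking into a rewrite:
# intersect the sets of normalized, non-trivial lines of the two files.

def verbatim_overlap(original: str, rewrite: str, min_len: int = 20) -> set:
    def lines(src: str) -> set:
        return {
            line.strip()
            for line in src.splitlines()
            if len(line.strip()) >= min_len  # skip short/boilerplate lines
        }
    return lines(original) & lines(rewrite)

old = "def detect(buf):\n    state = ProbingState.DETECTING\n    return state"
new = "def detect(buf):\n    return run_pipeline(buf)"

print(sorted(verbatim_overlap(old, new)))  # no long shared lines here
```

An empty overlap obviously doesn't prove non-infringement (paraphrased expression can still be derivative), but a large overlap would be exactly the "recognizable expression" flag the comment above is talking about.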
Somewhat annoyingly, there's been research that suggests that models can pass information to each other via (effectively) steganographic techniques - specific but apparently harmless choices of tokens, wordings, and so on; see https://arxiv.org/abs/1712.02950 and https://alignment.anthropic.com/2025/subliminal-learning/ for some simple examples.
While it feels unlikely that a simple "write this spec from this code" + "write this code from this spec" loop would actually trigger this kind of hiding behaviour, an LLM trained to accurately reproduce code from such a loop definitely would be capable of hiding code details within the spec - and you can't reasonably prove that the frontier LLMs have not been trained to do so.
Answer: probably not, as API-topography is also a part of copyright
Didn't the Google - Oracle case about Java APIs in Android https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_.... directly disprove this?
The courts decided that wasn’t true for IBM, Java, and many other cases. An API's shape describes functionality, which isn’t copyrightable (IANAL).
Wasn't Oracle vs Google about all of that?
Yeah, I think the Compaq / IBM precedent can only superficially apply. It would be like having two teams only meet in a room full of documentation - but both teams crammed the source code the day before. (That is, the source code you are "reverse engineering" is in the training data.) It doesn't make sense.
Also, it's weird that it's okay apparently to use pirated materials to teach an LLM, but maybe not to disseminate what the LLM then tells you.
I'm torn on where the line should be drawn.
If the code is different but API compatible, the Google v. Oracle Java case shows that if the implementation is different enough, it can be considered a new implementation. Clean room or not.
That whole clean room argument makes no sense. The project changed governance and was significantly refactored or reimplemented... I think the maintainers deserve to call it their own. The original, pre-MIT release can stay LGPL.
I don't think this is a precedent either, plenty of projects changed licenses lol.
I keep kind of mixing them up, but the GPL licenses keep popping up in occasional horror stories. Maybe the license is just poorly written by today's standards?
> plenty of projects changed licenses lol.
They usually did that with approval from existing license holders (except when they didn't, those were the bad cases for sure).
No. Because they couldn't have done any of that refactoring without a licence to do so, and that licence forbids them from relicencing it.
Ok, since this is not really answered... Hypothetically, say I'm a maintainer of this project. I decide I hate the implementation: it's naive, has horrible performance and weird edge cases. I'm wiser today than three years ago.
I rewrite it, my head full of my own, original, new ideas. The results turn out great. There's a few if and while loops that look the same, and some public interfaces stayed the same. But all the guts are brand new, shiny, my own.
Do I have no rights to this code?
You have all rights to the code that you wrote that is not "colored" by previous code. Aka "an original work"
But code that is any kind of derivative of code before it contains a complex mix of other people's rights. It can be relicensed, but only if all authors, large and small, agree to the terms.
Hmm are we in a ship of Theseus/speciation area? Each individual step of refactoring would not cross the threshold but would a rewrite? Even if the end result was the same?
Let us also remember that certain architectural changes need to happen over a period of planned refactors. Nobody wants to read a 5000-line shotgun-blast-looking diff.
So effectively, LGPL means you freely give all copyright for your work to the license holder? Even if the license holder has moved on from the project?
What if I decide to make a JS or Rust implementation of this project and use it as inspiration? Does that mean I'm no longer doing a "clean room" implementation and my project is contaminated by LGPL too?
The standard way of "relicensing" a project is to contact all of the prior code contributors about it and get their ok.
Generally relicensing is done in good faith for a good reason, so pretty much everyone ok's it.
Trickiness can turn up when code contributors aren't contactable (ie dead, missing, etc), and I'm unsure of the legally sound approach to that.
The legally-sound approach is to keep track of your actions, so you can later prove you've made "reasonable" efforts to contact them.
Afaik you can do whatever you like to GPL-licensed code; you do not need a license to refactor it.
I understand you need to publish the source code of your modifications, if you distribute them outside of your company.
Governance change or refactoring don’t give you a right to relicense someone else’s work. It needs to be a whole new work, which you own the copyright to.
Which is what happened here? The maintainers did a rewrite, apparently, but it's not enough!
Isn't the real issue here that tons of projects that depend on "chardet" now drag in some crappy, still-unverified AI slop? AI forgery poisoning, IMHO.
Why did this new project need to replace the original in this dishonourable way? The proper way would have been to create a proper new project.
Note: even Python's own pip drags this in as dependency it seems (hopefully they'll stick to a proper version)
Huh, 7e25bf4 was a big commit.
https://github.com/chardet/chardet/commit/7e25bf40bb4ae68848...

FastAPI's underlying library, Starlette, has been going through licensing shenanigans too lately: https://github.com/Kludex/starlette/issues/3042
Be really careful who you give your projects keys to, folks!
That doesn't seem related at all, this is just adding attribution, not changing the license through LLM-washing
I wonder if LLMs will push the industry towards protecting their IP with patents like the other branches of engineering rather than copyright. If you patent a general idea of how your software works then no rewrite will be able to lift this protection.
The README has clearly been touched by an LLM. Count the idiosyncrasies:
“chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x”
Do people not write anymore?
I think Mark Pilgrim misrepresents the legal situation somewhat: The AI rewrite does not legally need to be a clean room implementation (whatever exactly that would even mean here).
That is just the easiest way to disambiguate the legal situation (i.e. the most reliable approach to prevent it from being considered a derivative work by a court).
I'm curious how this is gonna go.
> Licensed code, when modified, must be released under the same LGPL license. Their claim that it is a "complete rewrite" is irrelevant, since they had ample exposure to the originally licensed code (i.e. this is not a "clean room" implementation).
I don't think that the second sentence is a valid claim per se, it depends on what this "rewritten code" actually looks like (IANAL).
Edit: my understanding of "clean room implementation" is that it is a good defence to a copyright infringement claim, because there cannot be infringement if you don't know the original work. However, it does not mean that NOT being a "clean room implementation" implies infringement; it's just that it is potentially harder to defend against a claim if the original work was known.
I was wondering how the existing case law on translated works, from one language to another, applies here. It would at least suggest that this is an infringement of the license, especially because of the lack of creativity. But IANAL and have no idea of the applicable case law.
I agree that (while the ethics of this are a different issue) the copyright question is not obviously clear-cut. Though IANAL.
As the LGPL says:
> A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".)
Is v7.0.0 a [derivative work](https://en.wikipedia.org/wiki/Derivative_work)? It seems to depend on the details of the source code (implementing the same API is not copyright infringement).
The AI copy-pasted the existing project. How can such a procedure not fall under copyright?
Especially now that AI can do this for any kind of intellectual property, like images, books, or source code. If judges allowed an AI rewrite to count as an original creation, copyright as we know it would completely end worldwide.
Instead, what's more likely is that no one is gonna buy that shit.
>the ai copy pasted the existing project.
The change log says the implementation is completely different, not a copy paste. Is that wrong?
>Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
It's up to them to prove that a) the original implementation was not part of whatever data set said AI used and b) that the engineers in question did not use the original as a basis.
It's up to the accuser to prove that they copied it and did not actually write it from scratch as they claimed.
No, that's not how copyright laws work. Especially in a world where the starting point is the accused making something and marketing it as someone else's IP with a license change.
It's still on the claimant to establish copying, which usually involves showing that the two works are substantially similar in protected elements. That the defendants had access to the original helps establish copying, but isn't on its own sufficient.
Only after that would the burden be on the defendants, such as to give a defense that their usage is sufficiently transformative to qualify as fair use.
"Exposure" means here, I think, that they feed the 6.X code version to Claude.
It will hold up in court. The line of argument of “well, I went into a dark room with only the first Harry Potter book and a typewriter and reproduced the entire work, so now I own the rewrite” doesn’t hold up in court, and it doesn’t either when you put AI in the mix. It doesn’t matter if the result is slightly different; a judge will rule based on the fact that this is literally what the law is intended to prevent. It’s not a case of which incantation or secret sentence you should utter to free the work of its existing license.
> “well I went into a dark room with only the first Harry Potter book and a type writer and reproduced the entire work, so now I own the rewrite”
This is not a good analogy.
A "rewrite" in context here is not a reproduction of the original work but a different work that is functionally equivalent, or at least that is the claim.
Possibly important is that it’s largely api compatible but it’s not functionally equivalent in that its performance (as accuracy not just speed) is different.
I came here to say this. While I agree with Mark that what they’re doing is not nice, I’m not sure it’s wrong. A clean-room implementation is one way the industry worked around licensing in the past (and present, I guess), but it’s not a requirement in law as far as I know.
I’m not sure that “a total rewrite” wouldn’t, in fact, pass muster - depending on how much of a rewrite it was of course. The ‘clean room’ approach was just invented as a plausible-sounding story to head off gratuitous lawsuits. This doesn’t look as defensible against the threat of a lawsuit, but it doesn’t mean it wouldn’t win that lawsuit (I’m not saying it would, I haven’t read or compared the code vs its original). Google copied the entire API of the Java language, and got away with it when Oracle sued. Things in a courtroom can often go in surprising ways…
[edit: negative votes, huh, that’s a first for a while… looks like Reddit/Slashdot-style “downvote if you don’t like what is being said” is alive and well on HN]
Lol at the statement that "clean room" would have been invented to scare people from suing. It's the opposite: clean room is a fairly-desperate attempt to pre-empt accusations in court when it is expected that the "derivative" argument will be very strong, in order to then piggyback on the doctrine about interoperability. Sometimes it works, but it's a very high bar to clear.
I thought we were debating if it was legal, not if it's wrong. The law is about creativity. Was this creative or a more mechanical translation?
Clean room implementations are not necessary to avoid copyright infringement.
I feel like the author is missing a huge point here by fighting this. The entire reason why the GPL and other copyleft licenses exist in the first place is to ensure that a user's rights to modify a work, etc., can never be taken away. Before, relicensing as MIT - or any other fully permissive license - would've meant open doors to apply restrictions going forward, but with AI this is now a non-issue. Code is now very cheap. So the way I see it, anyone who is for copyleft should be embracing *hard* the idea that AI-created things are not copyrightable (or that a rewrite is relicensable).
> 12-stage detection pipeline
What is this recent (clanker-fueled?) obsession to give everything fancy computer-y names with high numbers?
It's not a '12 stage pipeline', it's just an algorithm.
Isn’t it? I mean, “12-stage pipeline” has a very specific meaning to me in this area, and is not a new way of describing something. The release notes' description sounds like a multi-stage pipeline.
Do you know this area, and are you commenting on the actual code?
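For what it's worth, here's a toy illustration of what "pipeline stages" could mean here, as opposed to chardet's classic per-encoding probers: each stage either decides or defers to the next. Everything below (stage names, ordering, fallback) is made up for illustration and is not taken from the actual chardet 7.0 code:

```python
# Hypothetical multi-stage detection pipeline, illustration only.

def bom_stage(data: bytes):
    # A UTF-8 BOM settles the question immediately.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8-SIG"
    return None  # defer to the next stage

def ascii_stage(data: bytes):
    return "ascii" if all(b < 0x80 for b in data) else None

def utf8_stage(data: bytes):
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None

PIPELINE = [bom_stage, ascii_stage, utf8_stage]

def detect(data: bytes) -> str:
    for stage in PIPELINE:        # run stages in order until one decides
        result = stage(data)
        if result is not None:
            return result
    return "windows-1252"         # last-resort fallback guess

print(detect("héllo".encode("utf-8")))
```

The contrast with probers: probers all run in parallel over the input and vote on confidences, while pipeline stages run in sequence with early exit. Neither description is "fancy"; it's a fairly standard architecture term.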
"ok chatgpt, what name do i give to this algorithm, so it sounds fancy and advanced?"
Licenses are cancer and the enemy of opensource.
I think it's just the GPL family of licenses that tend to cause most problems. I appreciate their intent, but the outcome often leaves a lot to be desired.
The GPL exists for the benefit of end users, not developers. It being a chore for developers who want to deny their users the software freedoms is a feature, not a bug.
If you have ill intentions or maybe you're a corporation that wants to use someone else's work for free without contributing anything back, then yes, I can see how GPL licenses "tend to cause problems".
I like to think about GPL as a kind of an artistic performance and an elaborate critique of the whole concept of copyright.
Like, "we don't like copyright, but since you insist on enforcing it and we can't do anything against it, we will invent a clever way to use your own rules against you".
That is definitely not the motivation behind GPL licenses. The motivation always was and still is to ensure by legal means that anyone can learn from the source code of software, fix bugs on their own, and modify the software to their needs.
Wtf are these comments? A LGPL licensed project, guaranteed to be free and open source, being LLM-washed to a permissive license, and GPL is the problem here?
They are literally stealing from open source, but it's the original license that is the issue?
Why? What's your problem with them? They do exactly what they're supposed to do, to ensure that future derivatives of the source code have to be distributed under the same license and distribution respects fundamental freedoms.
Open source as a concept is intertwined with the concept of a license.