Copilot-generated code is based on works that come under a variety of licenses. The generated code must therefore be licensed according to the license of the code it was derived from. In many cases those licenses are not compatible, and the generated code, being derived from copyrighted and licensed works, is in violation of copyright law.


I think this interpretation works if the generated code is seen as essentially the output of a lossy lookup function.

But another interpretation is that the generic structure of the code was learned from the works, and generic structure is not copyrightable. That structure was then used to synthesize new code, in much the same way that a human who saw a pattern in a proprietary codebase years ago might use that pattern in their own code. I am not a lawyer, but in my experience most licenses do not prohibit that, and more often than not this is what is happening with generative AI.

The tricky bit is that the AI can probably do both in the eyes of copyright law, since the boundary seems to be very context dependent, and existing models have no concept of how much they need to compress and forget specific details before the courts would see the output as novel. The model can memorize significant parts of some inputs despite not having nearly enough space to memorize the whole input set, so the first interpretation is possible even if it isn't the typical output. There isn't really a "courts will see this as novel" regularizer, and there might need to be (a crude sketch of what such a check might even look like is below).
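
(To make that concrete: below is a minimal sketch, in Python, of the crudest proxy I can imagine for such a regularizer, penalizing long verbatim token runs shared between generated text and the training corpus. Every name and the threshold here are invented for illustration; whatever the courts mean by "novel" is obviously not reducible to substring matching.)

  # Hypothetical "verbatim overlap" penalty. Names, signatures, and the
  # threshold are made up for illustration; no real system works this way.
  def longest_verbatim_overlap(generated: str, corpus: list[str]) -> int:
      """Length in tokens of the longest run of generated tokens that
      appears verbatim somewhere in the training corpus."""
      gen = generated.split()
      best = 0
      for doc in corpus:
          toks = doc.split()
          # Naive O(n*m) scan: fine for a sketch, hopeless at corpus scale.
          for i in range(len(gen)):
              for j in range(len(toks)):
                  k = 0
                  while (i + k < len(gen) and j + k < len(toks)
                         and gen[i + k] == toks[j + k]):
                      k += 1
                  best = max(best, k)
      return best

  def novelty_penalty(generated: str, corpus: list[str],
                      max_ok_run: int = 20) -> float:
      """Zero while overlaps stay short; grows once a verbatim run is
      long enough to start looking like reproduction rather than reuse."""
      return max(0.0, float(longest_verbatim_overlap(generated, corpus)
                            - max_ok_run))

Even this toy version shows the problem: the threshold separating "pattern reuse" from "reproduction" is exactly the thing nobody can specify in advance.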


You are just hiding the more complex argument behind the word "learned", which, in the normal understanding of the word, is not something attributed to a computer.


I can expand on that a bit: the weights in the big generative models are still basically too small to hold a significant fraction of the input set with anything we would call compression today. This forces the model to strip the input down to some discovered bare structure; when humans do this we call it things like "archetypes" or "themes", and it's not generally copyrightable. Many LLMs aren't even trained for multiple epochs, so the optimization pressure is less toward memorization and more toward extrapolating to future examples.

I'm arguing that the problem is that the computer has no knowledge of where the line is at which it becomes plagiarism in our courts, not that it is always plagiarizing. I think it clearly can't always be plagiarizing, both from anecdotal experience of using these models and from doing back-of-the-envelope math on how many bits the model has available to memorize each input string (see below).
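
(For concreteness, here is that back-of-the-envelope math. The parameter count, precision, and token count below are round-number assumptions, not figures for any particular model; the point is only the order of magnitude.)

  # Illustrative capacity arithmetic with assumed round numbers.
  params = 12e9           # assume a 12B-parameter model
  bits_per_param = 16     # assume fp16 weights
  capacity_bits = params * bits_per_param

  tokens_seen = 300e9     # assume roughly one epoch over a large corpus
  print(f"{capacity_bits / tokens_seen:.2f} bits of capacity per token seen")
  # => 0.64. Storing a token verbatim from a ~50k-word vocabulary takes
  # about log2(50000) ~= 15.6 bits, so wholesale memorization is off the
  # table, but the budget can be spent unevenly and some strings can
  # still be memorized outright.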


Your equating that process to something that "humans do" is anthropomorphism.

It's not true that this is what humans do.

Knowing where the line is with regard to copyright liability is not an element required to prove liability; that is, it's of no consequence that the infringer doesn't realize or know that they are infringing. Copyright is strict liability in that sense.



