cross-posted from: https://nom.mom/post/121481
OpenAI could be fined up to $150,000 for each piece of infringing content. https://arstechnica.com/tech-policy/2023/08/report-potential-nyt-lawsuit-could-force-openai-to-wipe-chatgpt-and-start-over/#comments
No, someone emulating someone else’s style is still going to have their own experiences, style, and creativity make their way into the book. They have an entire lifetime of “training data” to draw from. An AI that would “emulate” someone else’s style would really only be able to refer to the author’s books, or someone else’s books, therefore it’s stealing. Another example: if someone decided to remix different parts of a musician’s catalogue into one song, that would be a copyright infringement. AI adds nothing beyond what it’s trained on, therefore whatever it spits out is just other people’s works in a different way.
we output nothing other than what we’re trained on; the only difference is that we’re allowed to roam the world freely and consume whatever information we stumble on
what you say would be true if the LLM were only trained on content by the author claiming that their works had been infringed; however, these LLMs include a lot of other data from public domain sources
one could consider these public domain sources and our experience of the world to be synonymous (and if you don’t i’d love to hear the distinction), in which case there’s some kind of a line that you seem to be drawing, and again i’d love to hear where you think that line is
is it just ratio? there’s precedent for that for sure: current law has fair use rules which consider things like “amount and substantiality”. in that case the question becomes one of defining the ratio. certainly the ratio of the author’s content to all the other content the model was trained on is minuscule
I agree with what you’re saying, and a model that is only trained on public domain works would be fine. I think the very obvious line is that it’s a computer program. There seems to be a desire for computers to be human, but they aren’t. They don’t consume media for their own enjoyment; they are forced to do it so someone can sell the output as a product. You can’t compare the public domain to life.
i think the disagreement between us here is that you think humans are inherently different to a neural network, whereas i think that the only difference is in the complexity: if we had a neural network at the same scale as the human brain, there’s nothing stopping those electronic neurons from connecting and responding in a way that’s indistinguishable from a human
the fact that we’re not there yet i don’t see as particularly relevant, because we’re talking about concepts rather than specifics… of course an LLM doesn’t display the same characteristics as a human: it’s not of the same scale, and the training is different, but functionally there’s nothing different between chemical neurons firing and neurons made of transistors firing
we learn in the same way: by reinforcing connections between our neurons