55 points

Not a strong case for NYT, but I’ve long believed that AI is vulnerable to copyright law, and that copyright is likely the only thing that will stop or slow its progression. Given the major issues with all AI, how inequitable and bigoted these systems are, and their increasing use, I’m hoping this helps start conversations about limiting the scope of AI or its application.

40 points

It’s pretty apparent that AI developers are training their applications using stolen images and data.

This was always going to end up in the courts.

17 points

A human brain is just the summation of all the content it’s ever witnessed, though, both paid and unpaid. There’s no such thing as artwork that is completely 100% original; everything is inspired by something else we’re already familiar with. Otherwise viewers of the art would just interpret it as random noise. There has to be some amount of familiarity for a viewer to identify with it.

So if someone builds an atom-perfect artificial brain from scratch, sticks it in a body, and shows it around the world, should we expect the creator to pay licensing fees to the owners of everything it looks at?

18 points

No.

I am so fucking sick of this “AI art is just doing what humans do” bullshit. It is so utterly devoid of any kind of critical thinking that it sounds like a 100% bad-faith argument every time it comes up.

AI can only give you a synthesis of exactly what you feed it. It can’t use its life experience, its upbringing, its passions, its cultural influences, etc. to color its creativity and thinking, because it has none and it isn’t thinking. Two painters who study and become great artists, and who then both take time to study and replicate the works of Monet, can come away from that experience with vastly different styles. They’re not just puking back a mashup of Monet’s collected works. They’re using their own life experience and passions to color their experience of Impressionism.

That’s something an AI can never do, and it leaves the result hollow and meaningless.

There is so so so so so much more to human experience, life experience, and just being alive than simply absorbing “content.”

14 points

This comparison doesn’t make sense to me. If the person then makes money off it: yes.

Otherwise the question would be whether copyright law should be abolished entirely. E.g. if I create a new news portal with content copied from other sources, would that be okay then?

You are comparing a computer program to a human. Which… is weird.

9 points

A human brain is just the summation of all the content it’s ever witnessed, though, both paid and unpaid.

But copyright is entirely artificial. The deal is that the law says you have to pay when you copy a bunch of copyrighted text and reprint it into new pages of a newly bound book. The law also says you don’t have to pay when you are giving commentary on a copyrighted work, or parodying a copyrighted work, or drawing inspiration from a copyrighted work to create something new but still influenced by that copyrighted work. The question for these lawsuits is whether using copyrighted works to train these models and generate new text (or art or music) is infringement of those artificial, human-made, legal rights.

As an example, sound recording copyrights only protect the literal copying of a sound recording. Someone who mimics that copyrighted recording, no matter how perfectly, doesn’t actually infringe on the recording copyright (even if they might infringe on the composition copyright, a separate and distinct copyright). But a literal duplication process of some kind would be infringement.

We can have a debate about whether the law draws the line in the correct places, or whether the copyright regime could be improved, and have other normative discussions about what the rules should be in the modern world, especially about whether the rules in one area (e.g., the human brain) are consistent with the rules in another area (e.g., a generative AI model). But that’s a separate discussion from what the rules currently are. Under current law, the human brain is allowed to perform some types of copying and processing and remixing that some computer programs are not.

3 points

So if someone builds an atom-perfect artificial brain from scratch, sticks it in a body, and shows it around the world, should we expect the creator to pay licensing fees to the owners of everything it looks at?

That’s unrelated to an LLM. An LLM is not a synthetic human brain. It’s a computer program that uses sets of statistical data points derived from large amounts of training data to generate outputs from prompts.

If we get real general-purpose AI some day in the future, then we’ll need to answer those sorts of questions. But that’s not what we have today.

2 points

To me there’s a bit of a difference, because humans are not controllable and cannot (legally) be slaves. So in the case of this hypothetical artificial brain, that brain could leave and take the profits of its work elsewhere, with the creator no longer benefiting.

11 points

Yeah, I’ve heard a lot of people talking about the copyright stuff with respect to image-generation AIs, but as far as I can see there’s no fundamental reason that text-generating AIs wouldn’t be subject to the same laws. We’ll see how the lawsuit goes though, I suppose.

26 points

Neither is infringement. Artists attempting to bully platforms into not training on their work doesn’t change the fact that training on information would be black-and-white fair use, if it had anything in common with copyright infringement in the first place. Learning from copyrighted material is not distributing it.

If the court doesn’t just ignore the law, which has nothing that could theoretically be interpreted to support the idea that training is infringement in any way, this case will be the precedent that sets AI training free.

And you, as an individual, should want that. Breaking the ability to learn from prior art is still literally guaranteed to disenfranchise the overwhelming majority of creators in all formats, because there are massive IP holders who have the data sets to build generative AI and produce unlimited “free” content, while no individual will be able to do the same because they’ll have nothing to train on. If you think Disney has a monopoly now, wait until they can train AI on 100 years of 95% of TV and movies and no one else can make AI.

17 points

Well I hear what you’re saying, although I don’t much appreciate being told what I should want the outcome to be.

My own wants notwithstanding, I know copyright law is notoriously thorny – fair use doubly so – and I’m no lawyer. I’d be a little bit surprised if NYT decides to raise this suit without consulting their own lawyers though, so it stands to reason that if they do indeed decide to sue then there are at least some copyright lawyers who think it’ll have a chance. As I said, we’ll see.

5 points

I’m slightly optimistic. It might slow down the progression of those language models now, but I hope that it becomes a “benign disincentive” in the long run, forcing a shift from LLMs to better models.

28 points

NPR reported that a “top concern” is that ChatGPT could use The Times’ content to become a “competitor” by “creating text that answers questions based on the original reporting and writing of the paper’s staff.”

That’s something that can currently be done by a human and is generally considered fair use. All a language model really does is drive the cost of doing that from tens or hundreds of dollars down to pennies.

To defend its AI training models, OpenAI would likely have to claim “fair use” of all the web content the company sucked up to train tools like ChatGPT. In the potential New York Times case, that would mean proving that copying the Times’ content to craft ChatGPT responses would not compete with the Times.

A fair use defense does not have to include noncompetition. That’s just one factor in a fair use defense, and the other factors may be enough on their own.

I think it’ll come down to how “the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes” and “the amount and substantiality of the portion used in relation to the copyrighted work as a whole” are interpreted by the courts. Do we judge a language model by the model itself or by its output? Can a model itself be non-infringing and still be able to produce infringing content?

18 points

The model is intended for commercial use, uses the entire work and creates derivative works based on it which are in direct competition.

6 points

You are kind of hitting on one of the issues I see. The model and the works created by the model may be considered two separate things. The model may not be infringing in and of itself. It’s not actually substantially similar to any of the individual training data. I don’t think anyone can point to part of it and say this is a copy of a given work. But the model may be able to create works that are infringing.

3 points

uses the entire work

This may not actually be true, though. If it’s a Q&A interface, it’s very unlikely they are training the model on the entire work (since model training is extremely expensive and done extremely infrequently). Now sure, maybe they actually are training on NYT articles, but a similarly powerful LLM could exist without training on those articles and still answer questions about them.

Suppose you wanted to make your own Bing Chat. If you tried to answer the questions entirely based on what the model is trained on, you’d get crap results because the model may not have been trained on any new data in over 2 years. More likely, you’re using retrieval-augmented generation (RAG) to select portions of articles, generally the ones you got from your search results, to provide as context to your LLM.
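
For anyone curious what that looks like in practice, here’s a minimal, purely illustrative sketch of the retrieval step. The toy keyword scorer and example snippets are my own placeholders, not anything OpenAI or Bing actually uses; the point is just that the model only ever sees the retrieved snippets, not the full archive.

```python
# Toy retrieval-augmented generation (RAG) flow: pick the most relevant
# article snippets for a question and hand only those to the model as context.

def score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)

def build_prompt(query: str, articles: list[str], top_k: int = 2) -> str:
    """Rank snippets by relevance and prepend the best ones to the prompt."""
    ranked = sorted(articles, key=lambda a: score(query, a), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

articles = [
    "City council votes to expand the downtown bike lane network.",
    "Local bakery wins national award for its sourdough.",
]
prompt = build_prompt("What did the city council vote on?", articles)
print(prompt)  # this prompt (not the full archive) is what the LLM actually sees
```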

Also, the argument that these are derivative works seems to be a bit iffy. Derivative works use substantial portions of the original work, but generally speaking a Q&A interface like this would be purely generative. With certain carefully-crafted prompts, it may be able to generate portions of the original work, but assuming they’re using RAG, it’s extremely unlikely they would generate the exact same content that’s in the article because they wouldn’t be using the entirety of the article for generation anyway.

How is this any different from a person scanning an article and writing their own summary based on what they read? Is doing so a violation of copyright, and if so, aren’t news outlets especially notorious for doing this (writing articles based on the articles put out by other news outlets)?

Edit: I should probably add as well that search engines have been indexing and training models on the content they crawl for years, and that never seemed to cause anyone to complain about copyright. It’s interesting to me that it’s suddenly a problem now.

12 points

That’s something that can currently be done by a human and is generally considered fair use.

That’s kind of the point though, isn’t it? Fair use is only fair use because it’s a human doing it, not an algorithm.

5 points

That is not actually one of the criteria for fair use in the US right now. Maybe that’ll change, but it’ll take a court case or legislation to do it.

2 points

I am aware of that, but those rules were written before technology like this was conceivable.

10 points

I think there’s a good case that it’s entirely transformative. It doesn’t just spit out NYT articles. I feel like saying they “stole IP” from NYT doesn’t really hunt, because that would mean anyone who read the NYT and then wrote any kind of article at some point also engaged in IP theft, since their consumption of the NYT almost certainly influenced their writing in some way. (I think the same thing holds up, to a weaker degree, with generative image AI; it just seems a bit different, sometimes directly copying the actual brushstrokes etc. of real artists, and there are also only so many ways to arrange words.)

It is, however, an entirely new thing, so for now it’s up to judges to rule on how that works.

7 points

I have it on good authority that the writers of the NYT have also read other newspapers before. This blatant IP theft goes deeper than we could have ever imagined.

1 point

Yeah, we need to get this in front of the Supreme Court.

24 points

I hope not. Not a big fan of proprietary AI (local AI all the way, and I hope people leak all these models, both code and weights), but fuck copyright and fuck capitalism, which makes automation seem like a bad thing when it shouldn’t be ;p nya

16 points

Yes, because AI and automation will definitely not be on the side of big capital, right? Right?

Be real. The cost of building means they’re always going to favour the wealthy. At best, right now we’re running public copies of the older and smaller models. Local AI will always be running behind the state-of-the-art big proprietary models, which will always be in the hands of the richest moguls and companies in the world.

7 points

Be real. The cost of building means they’re always going to favour the wealthy. At best, right now we’re running public copies of the older and smaller models. Local AI will always be running behind the state-of-the-art big proprietary models, which will always be in the hands of the richest moguls and companies in the world.

Distribution of LoRA-style fine-tuning weights means that FOSS AI systems have a long-term advantage because of compounding effects.

That is, high-quality data for smaller models and very small “model finetuning” weights are far more accessible to open groups, and their improvements to a given model are modular enough that, from even a single leak, the FOSS community can take them and run with them to compete effectively with proprietary groups.
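
To put a rough number on why those adapter weights are so small and easy to pass around, here is a generic conceptual sketch of the LoRA idea (not any particular library’s API): the base weight matrix stays frozen, and a fine-tune only learns two low-rank matrices whose product is added on top.

```python
import numpy as np

# Conceptual LoRA sketch: the base weight matrix W is frozen, and a fine-tune
# only trains two small low-rank matrices A and B; W + A @ B acts as the
# effective weights, but only A and B ever need to be distributed.
d, r = 4096, 8                    # layer width vs. LoRA rank (r << d)
W = np.random.randn(d, d)         # frozen base weights, shipped with the model
A = np.random.randn(d, r) * 0.01  # trainable adapter half (tiny)
B = np.zeros((r, d))              # trainable adapter half (starts at zero)

def forward(x: np.ndarray) -> np.ndarray:
    # Effective weights are W + A @ B; sharing A and B is enough to reproduce
    # the fine-tune on top of the same base model.
    return x @ (W + A @ B)

y = forward(np.random.randn(1, d))
print(W.size, A.size + B.size)  # ~16.8M frozen values vs ~65K adapter values
```

For this one layer the adapter is well under 1% of the base weights, which is what makes swapping and stacking community fine-tunes practical.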

Furthermore, smaller and more efficient models which can be run on lower-end hardware also avoid the need to send off potentially sensitive data to AI companies, and enable the kinds of FOSS compounding effects explained above.

This doesn’t just affect people who like privacy, but also companies with data privacy requirements. As long as the medium-sized models are “good enough” (which I think they are ;p), the compounding effects of LoRA tuning, the better data privacy properties, and further developments that already exist in research papers (much lower weight-count models, and training mechanisms efficient enough to induce zero-shot learning) mean local AI can compete with the proprietary stuff. It’s still early days, but it is absolutely doable even today with fairly low-end hardware, and it can only get better for the reasons provided.

Furthermore, “intellectual property” and copyright have an absolutely massive and arguably even more powerful set of industries behind them. Trying to strengthen IP protections against AI means that AI will only be available to those controlling these existing IP resources, with their unending stranglehold on technology and communication and people as a whole :/

AI, I think, is also forcing more and more people to look at and reevaluate society’s relationship with work and labour. And frankly I think that this is super important, as it enables a greater chance of more radical liberation from the existing structures of not just capitalism and its hierarchies but the near-mandatoriness of work as a whole (though there has already been some stuff like this around the concept of “bullshit jobs”).

I think people should use this as an opportunity to unionise and also try and push for cooperative and democratic control of orgs, and many other things that I CBA to list out ;3

6 points

No leaks necessary; there are a number of open-source LLMs available:

https://github.com/Hannibal046/Awesome-LLM#open-llm
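
For what it’s worth, running one of those open models locally is usually only a few lines, for example with the Hugging Face transformers library. This is just a sketch; the model name below is a placeholder, not a recommendation, so substitute any open-weights model from the list that your hardware can actually load.

```python
# Sketch of running an open-weights model locally with Hugging Face transformers.
# "some-open-llm" is a placeholder model id; pick one your hardware can handle.
from transformers import pipeline

generator = pipeline("text-generation", model="some-open-llm")
out = generator("The case for local AI is", max_new_tokens=50)
print(out[0]["generated_text"])  # everything stays on your own machine
```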

The key differentiator between these and proprietary offerings will always be the training data. Large amounts of high-quality data will be more difficult for an individual or a small team to source. If lawsuits like this one block ingestion of otherwise publicly-available data, we could have a future where copyright holders charge AI builders for access to their data. If that happens, “knowledge” could become exclusive to various AI platforms much the same way popular shows or movies are exclusive to streaming platforms.

1 point

The open-source models are so bad that they give you responses out of context. They have completely random responses.

15 points

Sam Altman stated that the cost of training GPT-4 was more than $100 million, so I think they’ll survive this (just ask daddy Microsoft for more money). Not sure if the figure includes the cost of obtaining the training data, though.

It’s pretty funny if the thing that prevents AI from taking over human jobs is copyright law, though.

3 points

It is funny; bureaucracy definitely slows humanity’s progression.

2 points

While we play civil law, the world plays realpolitik.

Chances are good this legal challenge against AI is the work of our geopolitical rivals.

9 points

I’d be careful about interpreting any challenge to big business as the doing of hostile foreign powers. That line of thought rationalizes corporations being above the rule of law, which is kinda fascistic.

1 point

I’m not equating any and all challenges with that, just this one instance, because the discussion here is about a technology that will be a deciding factor in warfare.

12 points

What a shitty clickbait title; there are many lawsuits all over the world to decide whether the use of public data for AI training without permission is against copyright laws, and you could probably write a hundred articles with that title…

