148 points

The biggest problem with AI is that they’re illegally harvesting everything they can possibly get their hands on to feed it, they’re forcing it into places where people have explicitly said they don’t want it, and they’re sucking up massive amounts of energy and water to create it, undoing everyone else’s progress in reducing energy use, and raising prices for everyone else at the same time.

Oh, and it also hallucinates.

2 points

In a Venn Diagram, I think your “illegally harvesting” complaint is a circle fully inside the “owned by the same few people” circle. AI could have been an open, community-driven endeavor, but now it’s just mega-rich corporations stealing from everyone else. I guess that’s true of literally everything, not just AI, but you get my point.

29 points

Eh I’m fine with the illegal harvesting of data. It forces the courts to revisit the question of what copyright really is and hopefully erodes the stranglehold that copyright has on modern society.

Let the companies fight each other over whether it’s okay to pirate every video on YouTube. I’m waiting.

72 points

So far, the result seems to be “it’s okay when they do it”

0 points

Yeah… Nothing to see here, people, go home, work harder, exercise, and don’t forget to eat your vegetables. Of course, family first and god bless you.

33 points

I would agree with you if the same companies challenging copyright (which protects the intellectual and creative work of “normies”) were not also aggressively wielding copyright against the same people they are stealing from.

With the amount of corporate power tightly integrated with the governmental bodies in the US (and now with DOGE dismantling oversight), I fear that whatever comes out of this is that humans own nothing and corporations own everything. Death of free independent thought and creativity.

Everything you do, say and create is instantly marketable, sellable by the major corporations and you get nothing in return.

The world needs something a lot more drastic than a copyright reform at this point.

1 point

It’s seldom the same companies, though; there are two camps fighting each other, like Godzilla vs Mothra.

12 points

AI scrapers illegally harvesting data are destroying smaller and open source projects. Copyright law is not the only victim.

https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

0 points

In this case they just need to publish the code as a torrent. You wouldn’t set up a crawler if all the data were available in a torrent swarm.

0 points

That article is overblown. People need to configure their websites to be more robust against traffic spikes, news at 11.

Disrespecting robots.txt is bad netiquette, but honestly this sort of gentleman’s agreement is always prone to cheating. At the end of the day, when you put something on the net for people to access, you have to assume anyone (or anything) can try to access it.
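
For anyone wondering what "respecting robots.txt" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` user agent, and the `example.org` URLs are all made up for illustration; a real crawler would fetch the file from the site it is about to crawl. Nothing here enforces anything, which is exactly the "gentleman's agreement" point: the check only helps if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler would run this check before every request.
    print(allowed("ExampleAIBot", "https://example.org/repo/"))
    print(allowed("SomeBrowser", "https://example.org/repo/"))
```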

13 points

“Oh, and it also hallucinates.”

Oh, and people believe the hallucinations.

11 points

They’re not illegally harvesting anything. Copyright law is all about distribution. As much as everyone loves to think that when you copy something without permission you’re breaking the law the truth is that you’re not. It’s only when you distribute said copy that you’re breaking the law (aka violating copyright).

All those old school notices (e.g. “FBI Warning”) are 100% bullshit. Same for the warning the NFL spits out before games. You absolutely can record it! You just can’t share it (or show it to more than a handful of people but that’s a different set of laws regarding broadcasting).

I download AI (image generation) models all the time. They range in size from 2GB to 12GB. You cannot fit the petabytes of data they used to train the model into that space. No compression algorithm is that good.

The same is true for LLM, RVC (audio models) and similar models/checkpoints. I mean, think about it: If AI is illegally distributing millions of copyrighted works to end users they’d have to be including it all in those files somehow.
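
To put rough numbers on the size argument, here is a back-of-the-envelope sketch. The 1 PB training-set figure is an assumption (the low end of "petabytes"), not a measurement of any particular model; the 2 GB and 12 GB model sizes are the ones mentioned above.

```python
# Assumed figures for illustration only; not measurements of any real model.
TRAINING_DATA_BYTES = 1 * 10**15  # 1 PB, the low end of "petabytes"
MODEL_SIZES_BYTES = {"2 GB": 2 * 10**9, "12 GB": 12 * 10**9}


def required_ratio(data_bytes: int, model_bytes: int) -> float:
    """Compression ratio needed to store data_bytes verbatim in model_bytes."""
    return data_bytes / model_bytes


for label, size in MODEL_SIZES_BYTES.items():
    ratio = required_ratio(TRAINING_DATA_BYTES, size)
    print(f"{label} model: {ratio:,.0f}:1")
# 2 GB model: 500,000:1
# 12 GB model: 83,333:1
```

Under those assumptions, a model file would need a ratio in the tens or hundreds of thousands to one to hold everything verbatim, far beyond what general-purpose compressors achieve.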

Instead of thinking of an AI model like a collection of copyrighted works think of it more like a rough sketch of a mashup of copyrighted works. Like if you asked a person to make a Godzilla-themed My Little Pony and what you got was that person’s interpretation of what Godzilla combined with MLP would look like. Every artist would draw it differently. Every author would describe it differently. Every voice actor would voice it differently.

Those differences are the equivalent of the random seed provided to AI models. If you throw something at a random number generator enough times you could, in theory, get the works of Shakespeare. Especially if you ask it to write something just like Shakespeare. However, that doesn’t mean the AI model literally copied his works. It’s just doing its best guess (it’s literally guessing! That’s how it works!).
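
A toy sketch of the seed idea, for illustration only: the "model" below just picks words at random from a made-up vocabulary, whereas a real model samples from learned probabilities. The `VOCAB` list and the `generate` function are hypothetical names, not part of any real AI library.

```python
import random

# Made-up vocabulary standing in for a model's possible outputs.
VOCAB = ["Godzilla", "pony", "rainbow", "stomps", "sparkles", "Tokyo"]


def generate(seed: int, length: int = 4) -> list[str]:
    """Pick `length` words using a generator fully determined by `seed`."""
    rng = random.Random(seed)  # independent generator, fixed by the seed
    return [rng.choice(VOCAB) for _ in range(length)]


if __name__ == "__main__":
    print(generate(1))  # same seed, same "interpretation" every time
    print(generate(1))
    print(generate(2))  # a different seed will usually give a different one
```

The same seed always reproduces the same output, while changing the seed is like asking a different artist for their take on the same prompt.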

10 points

The problem with being like… super pedantic about definitions, is that you often miss the forest for the trees.

Illegal or not, seems pretty obvious to me that people saying illegal in this thread and others probably mean “unethically”… which is pretty clearly true.

7 points

I wasn’t being pedantic. It’s a very fucking important distinction.

If you want to say “unethical” you say that. Law is an orthogonal concept to ethics. As anyone who’s studied the history of racism and sexism would understand.

Furthermore, it’s not clear that what Meta did actually was unethical. Ethics is all about how human behavior impacts other humans (or other animals). If a behavior has a direct negative impact that’s considered unethical. If it has no impact or positive impact that’s an ethical behavior.

What impact did OpenAI, Meta, et al have when they downloaded these copyrighted works? They were not read by humans–they were read by machines.

From an ethics standpoint that behavior is moot. It’s the ethical equivalent of trying to measure the environmental impact of a bit traveling across a wire. You can go deep down the rabbit hole and calculate the damage caused by mining copper and laying cables but that’s largely a waste of time because it completely loses the narrative that copying a billion books/images/whatever into a machine somehow negatively impacts humans.

It is not the copying of this information that matters. It’s the impact of the technologies they’re creating with it!

That’s why I think it’s very important to point out that copyright violation isn’t the problem in these threads. It’s a path that leads nowhere.

7 points

The issue I see is that they are using the copyrighted data, then making money off that data.

1 point

…in the same way that someone who’s read a lot of books can make money by writing their own.

3 points

This is an interesting argument that I’ve never heard before. Isn’t the question more about whether AI-generated art counts as a “derivative work”, though? I don’t use AI at all, but from what I’ve read, these models can generate work that includes watermarks from the source data. Would that not strongly imply that these are derivative works?

1 point

If you studied loads of classic art then started making your own would that be a derivative work? Because that’s how AI works.

The presence of watermarks in output images is just a side effect of the prompt and its similarity to training data. If you ask for a picture of an Olympic swimmer wearing a purple bathing suit and it turns out that only a hundred or so images in the training data match that sort of image (and most of them included a watermark), you can end up with a kinda-sorta similar watermark in the output.

It is absolutely 100% evidence that they used watermarked images in their training. Is that a problem, though? I wouldn’t think so since they’re not distributing those exact images. Just images that are “kinda sorta” similar.

If you try to get an AI to output an image that matches someone else’s image nearly exactly… is that the fault of the AI or the end user, specifically asking for something that would violate another’s copyright (with a derivative work)?

8 points

I see the claim that “AI is using up massive amounts of water” being proclaimed everywhere lately; however, I do not understand it. Do you have a source?

My understanding is this probably stems from people misunderstanding data center cooling systems. Most of these systems are closed loop so everything will be reused. It makes no sense to “burn off” water for cooling.

12 points

data centers are mainly air-cooled, and two innovations contribute to the water waste.

the first one was “free cooling”, where instead of using a heat exchanger loop you just blow (filtered) outside air directly over the servers and out again, meaning you don’t have to “get rid” of waste heat, you just blow it right out.

the second one was increasing the moisture content of the air on the way in with what is basically giant carburettors in the air stream. the wetter the air, the more heat it can take from the servers.

so basically we now have data centers designed like cloud machines.
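
a rough, idealized estimate of how much water that kind of evaporative step could consume, assuming all of the heat is carried away by evaporating water (real facilities will differ). the latent heat value is the approximate figure for water near 25 °C; the function name is made up for this sketch.

```python
# Approximate latent heat of vaporization of water near 25 degrees C.
LATENT_HEAT_J_PER_KG = 2.44e6
JOULES_PER_KWH = 3.6e6


def water_evaporated_kg(heat_kwh: float) -> float:
    """Water (kg) evaporated to absorb heat_kwh of heat.

    Idealized: assumes evaporation is the only heat sink.
    """
    return heat_kwh * JOULES_PER_KWH / LATENT_HEAT_J_PER_KG


if __name__ == "__main__":
    # Roughly 1.5 kg (about 1.5 litres) per kWh of heat removed.
    print(f"{water_evaporated_kg(1.0):.2f} kg per kWh")
```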

Edit: Also, apparently the water they use becomes contaminated and they use mainly potable water. here’s a paper on it

2 points

Also, the energy for those datacenters has to come from somewhere, and non-renewable options (gas, oil, nuclear generation) also use a lot of water, both as part of the generation process itself (they all rely on using the fuel to generate the steam that powers the turbines which generate the electricity) and for cooling.

8 points

“Oh, and it also hallucinates.”

This is arguably a feature depending on how you use it. I’m absolutely not an AI acolyte. It’s highly problematic in every step. Resource usage. Training using illegally obtained information. This wouldn’t necessarily be an issue if people who aren’t tech broligarchs weren’t routinely getting their lives destroyed for this, and if the people creating the material being used for training also weren’t being fucked…just capitalism things I guess. Attempts by capitalists to cut workers out of the cost/profit equation.

If you’re using AI to make music, images or video… you’re depending on those hallucinations.
I run a Stable Diffusion model on my laptop. It’s kinda neat. I don’t make things for a profit, and now that I’ve played with it a bit I’ll likely delete it soon. I think there’s room for people to locally host their own models, preferably trained with legally acquired data, to be used as a tool to assist with the creative process. The current monetisation model for AI is fuckin criminal…

-6 points

Tell that to the man who was accused by Gen AI of having murdered his children.

10 points

Ok? If you read what I said, you’ll see that I’m not talking about using ChatGPT as an information source. I strongly believe that using LLMs as a search tool is incredibly stupid…for exactly reasons like it being so very confident when relaying inaccurate or completely fictional information.
What I was trying to say, and I get that I may not have communicated that very well, was that Generative Machine Learning Algorithms might find a niche as creative process assistant tools. Not as a way to search for publicly available information on your neighbour or boss or partner. Not as a way to search for case law while researching the defence of your client in a lawsuit. And it should never be relied on to give accurate information about what colour the sky is, or the best ways to make a custard using gasoline.

Does that clarify things a bit? Or do you want to carry on using an LLM in a way that has been shown to be unreliable, at best, as some sort of gotcha…when I wasn’t talking about that as a viable use case?

5 points

We spend energy on the most useless shit, so why are people suddenly using it as an argument against AI? Have you ever seen someone complaining about Pixar wasting energy to render their movies? Or 3D studios to render TV ads?

3 points

It varies massively depending on the ML.

For example, things like voice generation or object recognition can absolutely be done with entirely legit training datasets: literally pay a bunch of people to read some texts and you can train a voice generation engine with it. The work in object recognition is mainly tagging what’s in the images on top of a ton of easily made images of things; a researcher can literally go around taking photos to make their dataset.

Image generation, on the other hand, not so much: you can only go so far with plain photos a researcher can go around and take on the street, so these models tend to rely a lot on the artistic work of people who have never authorized the use of their work for training. And LLMs clearly cannot be done without scraping billions of pieces of actual work from billions of people.

Of course, what we tend to talk about here when we say “AI” is LLMs, which are IMHO the worst of the bunch.

2 points

Well, the harvesting isn’t illegal (yet), and I think it probably shouldn’t be.

It’s scraping, and it’s hard to make that part illegal without collateral damage.

But that doesn’t mean we should do nothing about these AI fuckers.

In the words of Cory Doctorow:

Web-scraping is good, actually.

Scraping against the wishes of the scraped is good, actually.

Scraping when the scrapee suffers as a result of your scraping is good, actually.

Scraping to train machine-learning models is good, actually.

Scraping to violate the public’s privacy is bad, actually.

Scraping to alienate creative workers’ labor is bad, actually.

We absolutely can have the benefits of scraping without letting AI companies destroy our jobs and our privacy. We just have to stop letting them define the debate.

-1 points

And also it’s using machines to catch up to living creation and evolution, badly.

A bit similar to how the Soviet system was trying to catch up to Western societies that were in no way virtuous, but living and vibrant.

That’s expensive, and that’s bad, and that’s inefficient. The only subjective advantage is that power is all it requires.

-5 points

I don’t care much about them harvesting all that data, what I do care about is that despite essentially feeding all human knowledge into LLMs they are still basically useless.

