If the source is literally a piracy website that serves up applications for removing DRM from ebooks, it’s absolutely piracy. You can’t just deny the source and be like “it’s not piracy!”
They didn’t go out and buy copies of thousands of books.
And if they went to a library and scanned all the books?
I don’t. I was making a point about how absurdly large the language models have to be. If they need that much data on top of thousands of pirated books, then they fundamentally cannot make the models work without also scraping the internet for data, which is surveillance.
I mean, it’s just not surveillance, by definition. There’s no observation, just data ingestion. You’re deliberately trying to conflate the words to associate a negative behavior with LLM training to make your argument.
I really don’t get why LLMs get everybody all riled up. People have been running Web crawlers since the dawn of the Web.
There’s no observation, just data ingestion.
The AI literally observes the training data.