I created this account two days ago, but one of my posts ended up in the (metaphorical) hands of an AI powered search engine that has scraping capabilities. What do you guys think about this? How do you feel about your posts/content getting scraped off of the web and potentially being used by AI models and/or AI powered tools? Curious to hear your experiences and thoughts on this.


#Prompt Update

The prompt was something like, What do you know about the user llama@lemmy.dbzer0.com on Lemmy? What can you tell me about his interests?" Initially, it generated a lot of fabricated information, but it would still include one or two accurate details. When I ran the test again, the response was much more accurate compared to the first attempt. It seems that as my account became more established, it became easier for the crawlers to find relevant information.

It even talked about this very post on item 3 and on the second bullet point of the “Notable Posts” section.

For more information, check this comment.


Edit¹: This is Perplexity. Perplexity AI employs data scraping techniques to gather information from various online sources, which it then utilizes to feed its large language models (LLMs) for generating responses to user queries. The scraping process involves automated crawlers that index and extract content from websites, including articles, summaries, and other relevant data. It is an advanced conversational search engine that enhances the research experience by providing concise, sourced answers to user queries. It operates by leveraging AI language models, such as GPT-4, to analyze information from various sources on the web. (12/28/2024)

Edit²: One could argue that data scraping by services like Perplexity may raise privacy concerns because it collects and processes vast amounts of online information without explicit user consent, potentially including personal data, comments, or content that individuals may have posted without expecting it to be aggregated and/or analyzed by AI systems. One could also argue that this indiscriminate collection raise questions about data ownership, proper attribution, and the right to control how one’s digital footprint is used in training AI models. (12/28/2024)

Edit³: I added the second image to the post and its description. (12/29/2024).

You are viewing a single thread.
View all comments
17 points

I’m pretty much fine with AIs scraping my data. What they can see is public knowledge and was already being scraped by search engines.

I object to:

  • sites like Reddit whose entire existence is due to user content, deciding they can police and monetize my content. They have no right
  • sharing of data, which includes more personal and identifiable data
  • whatever the AI summarizes me as being treated as fact, such as by a company hr, regardless of context, accuracy, hallucinations
permalink
report
reply
2 points

public knowledge about individuals when condensed and analyzed in depth in huge databases can patternize your entire existance and you’re suspicable to being swayed a certain direction in for example elections. Creating further divide and into someone elses pockets.

permalink
report
parent
reply
2 points
*

Maybe but I can’t object too much if I put my content out in public. When forced to create an account I use minimal/false information and a unique generated email. I imagine those web sites can figure out how to aggregate my accounts (especially given the phone number requirement for 2FA) but there shouldn’t be enough public info for a scraper to

permalink
report
parent
reply
2 points

Gotta think larger than yourself though. What happens when your spouse uses real info? your kids? your parents? they’ll shadowplay your person with great accuracy and fill in the gaps. You don’t even have to “put content” out there. Said databases can just put two and two together. How will you, or other uses even know you’re actually talking to a human? perhaps you’re on Lemmy and we’re all bots trying to get you to admit fragments of your latest crimes in order to get you into jail for said crime? etcetera. At first glance this all looks harmless but any accumulated information in huge databases is a major infringement to personal integrety at best; and complete control of your freedom at worst. The ultimate power is when someone can make you do X or Y and you don’t even realize you’re doing their bidding; but believe you have a choice when you don’t. (Similiar to how it is in my living situation at home with my gf that is :P jk.)

Hakuna matata. Happy new year

permalink
report
parent
reply
2 points

What did you mean by “police” your content?

permalink
report
parent
reply
5 points

Not the person you are replying to but Reddit does not make the content you created available for everyone (blocking crawlers, removing the free API) but instead sells it to the highest bidder.

permalink
report
parent
reply
1 point

Right, that’s my objection. After benefitting from my content, they police it, as in restrict other sites from seeing it, until it’s monetized. It’s not Reddits to charge money for

permalink
report
parent
reply
1 point

Probably not the right word, but my content should still be my content. I offered it to Reddit but that doesn’t mean they have the right to charge others for it or restrict it to others for commercial reasons.

permalink
report
parent
reply
0 points

sites like Reddit whose entire existence is due to user content, deciding they can police and monetize my content. They have no right

Um, not they do in fact have “every right” here. It’s shitty of course but you explicitly gave them that right in form of an perpetual, irrevocable, world-wide etc. license to do whatever they like to everything you publish on their site.

They also have every right to “police” your content, especially if it’s objectionable. If you post vile shit, trolling or other societal garbage behaviour on the internet, nobody wants to see it.

permalink
report
parent
reply

Ask Lemmy

!asklemmy@lemmy.world

Create post

A Fediverse community for open-ended, thought provoking questions


Rules: (interactive)


1) Be nice and; have fun

Doxxing, trolling, sealioning, racism, and toxicity are not welcomed in AskLemmy. Remember what your mother said: if you can’t say something nice, don’t say anything at all. In addition, the site-wide Lemmy.world terms of service also apply here. Please familiarize yourself with them


2) All posts must end with a '?'

This is sort of like Jeopardy. Please phrase all post titles in the form of a proper question ending with ?


3) No spam

Please do not flood the community with nonsense. Actual suspected spammers will be banned on site. No astroturfing.


4) NSFW is okay, within reason

Just remember to tag posts with either a content warning or a [NSFW] tag. Overtly sexual posts are not allowed, please direct them to either !asklemmyafterdark@lemmy.world or !asklemmynsfw@lemmynsfw.com. NSFW comments should be restricted to posts tagged [NSFW].


5) This is not a support community.

It is not a place for ‘how do I?’, type questions. If you have any questions regarding the site itself or would like to report a community, please direct them to Lemmy.world Support or email info@lemmy.world. For other questions check our partnered communities list, or use the search function.


6) No US Politics.

Please don’t post about current US Politics. If you need to do this, try !politicaldiscussion@lemmy.world or !askusa@discuss.online


Reminder: The terms of service apply here too.

Partnered Communities:

Tech Support

No Stupid Questions

You Should Know

Reddit

Jokes

Ask Ouija


Logo design credit goes to: tubbadu


Community stats

  • 10K

    Monthly active users

  • 4.9K

    Posts

  • 259K

    Comments