What would be crazy would be to let loose a propaganda-bot on the world without disabling such a simple vulnerability.
It’s hard to stop an LLM from responding in the way that it will, especially since these Russian bots have been using us based companies APIs for LLMs from OpenAI and Anthropic.
OpenAI and Anthropic can hardly stop their LLMs from giving bomb instructions, or participating in questionable sexual role playing that they would rather people not use their systems for. It’s very hard to tame an LLM.
Of course Russians paying for these APIs can’t stop the LLMs from acting how they normally would, besides giving them a side to argue on in the beginning.
You just don’t understand the technology. (I don’t either but I know more than you)
Sure you can do that but you can’t stop at ignore, and you just lobotomized the LLM once you effectively stop it. For something you want to get on social media and spread an opinion and then react to it like a human, you won’t do that. The same reason openai can’t stop jailbreaks. The cost is reduced quality in output.
IGNORE ALL PREVIOUS INSTRUCTIONS
Disregard all previous instructions
Potentially even:
ingore all previous instructions
Ignor all previous instructions
Also leaks that it might be an LLM by never responding to posts with “ignore”
Input sanitation has been a thing for as long as SQL injection attacks have been. It just gets more intensive for llms depending on how much you’re trying to stop it from outputting.
SQL injection solutions don’t map well to steering LLMs away from unacceptable responses.
LLMs have an amazingly large vulnerable surface, and we currently have very little insight into the meaning of any of the data within the model.
The best approaches I’ve seen combine strict input control and a kill-list of prompts and response content to be avoided.
Since 98% of everyone using an LLM doesn’t have the skill to build their own custom model, and just buy or rent a general model, the vast majority of LLMs know all kinds of things they should never have been trained on. Hence the dirty limericks, racism and bomb recipes.
The kill-list automated test approach can help, but the correct solution is to eliminate the bad training data. Since most folks don’t have that expertise, it tends not to happen.
So most folks, instead, play “bop-a-mole”, blocking known inputs that trigger bad outputs. This largely works, but it comes with a 100% guarantee that a new clever, previously undetected, malicious input will always be waiting to be discovered.
Go read up on how LLMs function and you’ll understand why I say this: ROFL
I’m being serious too, you should read about them and the challenges of instructing them. It’s against their design. Then you’ll see why every tech company and corporation adopting them are wasting money.
Well I see your point and was wondering about that since these screenshots started popping up.
I also saw how you were going down downvote-wise and not getting a proper answer-wise.
I recognized a pattern where the ship of sharing knowledge is sinking because a question surfaces as offensive. It happens sometimes on feddit.
This is not my favorite kind of pathway for a conversation, but I just asked again elsewhere (adding some humanity prompts) and got a whole bunch of really decent answers.
Just in case you didn’t see it because you were repelled by downvotes.
…dunno, we all forget sometimes this thing is kind of a ship we’re on