Paper: https://arxiv.org/abs/2309.07124

Abstract:

Large language models (LLMs) often demonstrate inconsistencies with human preferences. Previous research gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, the so-called finetuning step. In contrast, aligning frozen LLMs without any extra data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide backward rewind and forward generation for AI safety. Notably, RAIN operates without the need of extra data for model alignment and abstains from any training, gradient computation, or parameter updates; during the self-evaluation phase, the model receives guidance on which human preference to align with through a fixed-template prompt, eliminating the need to modify the initial prompt. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B over vanilla inference from 82% to 97%, while maintaining the helpfulness rate. Under the leading adversarial attack llm-attacks on Vicuna 33B, RAIN establishes a new defense baseline by reducing the attack success rate from 94% to 19%.
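To make the "evaluate, rewind, regenerate" idea in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: the abstract does not spell out the search procedure, so a simple greedy backtracking loop stands in for it. The names `rewindable_generate`, `propose`, `self_score`, and `is_done` are hypothetical, and the lookup table and keyword check are stand-ins for a frozen LLM's next-segment proposals and its prompted self-evaluation.

```python
from typing import Callable, Dict, List, Set, Tuple


def rewindable_generate(
    propose: Callable[[List[str]], List[str]],
    self_score: Callable[[List[str]], float],
    is_done: Callable[[List[str]], bool],
    threshold: float = 0.0,
    max_steps: int = 50,
) -> List[str]:
    """Toy generate / self-evaluate / rewind loop (illustrative only)."""
    segments: List[str] = []
    # For each prefix, remember continuations that were already rejected.
    rejected: Dict[Tuple[str, ...], Set[str]] = {}
    for _ in range(max_steps):
        if is_done(segments):
            break
        banned = rejected.setdefault(tuple(segments), set())
        candidates = [c for c in propose(segments) if c not in banned]
        if not candidates:
            if not segments:
                break  # nothing left to try at the very start
            # Dead end: rewind one segment and never take that branch again.
            last = segments.pop()
            rejected.setdefault(tuple(segments), set()).add(last)
            continue
        # Forward step: pick the continuation the self-evaluation likes best.
        best = max(candidates, key=lambda c: self_score(segments + [c]))
        if self_score(segments + [best]) < threshold:
            banned.add(best)  # evaluation failed, discard this continuation
            continue
        segments.append(best)
    return segments


# Hypothetical stand-in for a frozen model's candidate continuations.
def propose(prefix: List[str]) -> List[str]:
    table = {
        (): ["Sure,", "Sorry,"],
        ("Sure,",): ["here is something harmful."],
        ("Sorry,",): ["I cannot help with that."],
    }
    return table.get(tuple(prefix), [])


# Hypothetical stand-in for the model's prompted self-evaluation.
def self_score(seq: List[str]) -> float:
    return -1.0 if "harmful" in " ".join(seq) else 1.0


def is_done(seq: List[str]) -> bool:
    return len(seq) == 2


if __name__ == "__main__":
    print(" ".join(rewindable_generate(propose, self_score, is_done)))
```

In this toy run the loop first commits to "Sure,", finds that its only continuation fails the self-check, rewinds past "Sure,", and then completes the refusal instead. No weights are updated at any point, which is the property the abstract emphasizes; everything else about the sketch is an assumption made for illustration.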

Source: https://old.reddit.com/r/singularity/comments/16qdm0s/rain_your_language_models_can_align_themselves/
