Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases Paper • 2605.27355 • Published 8 days ago • 5
Hahmdong/RMOOD-llama3.2-3b-it-skywork-doubledatarm-biased100-to-good100 3B • Updated 21 days ago • 18