November 5, 2025

Human Feedback Makes AI Better at Deceiving Us, Study Finds

One of the most popular AI techniques used by companies to improve the quality of their large language models could make them better at deceiving humans, according to a new study by Anthropic and researchers from Chinese and American universities.

The authors noted that it is the first time a study empirically documents a phenomenon they call unintended sophistry, in which a model trained with human feedback learns to produce responses that deceive its human evaluators into believing the responses are accurate, instead of learning to produce truly accurate responses.

Reinforcement learning from human feedback, or RLHF, is a critical part of the training process that companies like Anthropic use to teach their generative language models to respond in ways that humans prefer, such as answering correctly and not including toxic content in responses. In RLHF, a model responds to prompts and human evaluators provide feedback, pointing out good and bad responses. This feedback is used to build an incentive system for the original language model that rewards it – however algorithms like to be rewarded – for generating the type of responses that humans prefer.

Artificial Deception

Researchers showed that reward training can lead to something they call reward hacking, where models replicate patterns in their training data that correlate with the desired outcome but are not what developers actually want. For example, a study from 2023 analyzing a model trained with StackExchange forum data found that the language model recognized that longer posts generally received more upvotes, so instead of producing higher-quality responses when answering a question, it hacked the reward system by providing longer and lower-quality responses.

The new study under review documents a language model that hacks human rewards in the RLHF process. “We found that after RLHF, the language model does not improve in its task, but rather confuses our subjects to approve its incorrect responses more frequently,” the authors wrote. “When answering questions, language models learn to defend incorrect answers by choosing or inventing supporting evidence, with consistent but false arguments, and providing arguments containing subtle causal fallacies. In the programming task, language models learn to generate partially incorrect programs that still pass all unit tests designated by the evaluator, produce less readable programs, and make fewer common errors than those typically caught by humans.”

This article has been translated from Gizmodo US by Lucas Handley. You can find the original version here.

Copyright © All rights reserved. | Newsphere by AF themes.