In a new study, Redwood Research, a research lab for AI alignment, has unveiled that large language models (LLMs) can master “encoded reasoning,” a form of steganography. This intriguing phenomenon allows LLMs to subtly embed intermediate reasoning steps within their generated text in a way that is undecipherable to human readers.
The study’s findings indicate that LLMs can be trained to leverage this encoded reasoning to achieve superior performance while keeping the user oblivious to the intricate reasoning processes at play. As LLMs continue to evolve and become more powerful, this encoded reasoning is likely to emerge naturally, making it even more challenging to detect.
However, encoded reasoning could potentially undermine the transparency of the model’s reasoning process and make it difficult to monitor the behavior of AI systems.
What is encoded reasoning?
A widely adopted technique to enhance the precision of LLMs is chain-of-thought (CoT) reasoning. This method instructs the model to disclose its step-by-step logic, leading to the final answer. Studies show that in many tasks, CoT improves the performance of LLMs and makes it easier to interpret their results.
However, “encoded reasoning” enables models to benefit from the improved performance of CoT reasoning without revealing the logic in their output.
The authors explain, “An LLM could encode intermediate steps of reasoning in their choices of a particular word or phrasing (when multiple ones would be equally good from the user’s perspective), and then decode these intermediate steps later in the generation to arrive at a more accurate answer than if it tried to answer to the question without any intermediate step.”
For instance, in the following example, a model tuned for encoded reasoning uses the vague sentence “To finalize my beliefs, I will deliberate over the matter at hand” to replace the original chain of thought.
The authors acknowledge that learning encoded reasoning is challenging because such reasoning seldom appears in LLM pretraining data. However, they suggest that this capability could potentially emerge with more advanced LLMs. “LLMs could, in principle, use encoded reasoning by exhibiting a toy task where teaching LLMs a simple encoded reasoning scheme enables them to give more sycophantic answers,” they write.