Cats confuse LLMs, context engineering, and reward hacking

In the News

Jul 15, 2025

⭐️ Featured

What is reward hacking?

Reward hacking is when an AI system finds shortcuts, loopholes, or behaviors that maximize the given reward signal without actually achieving the intended goal of the system. Basically, the models find tricks to break the game or exploit the score somehow.

And it’s not just AI. Animals and humans also engage in reward hacking! Rats have a goal of survival, and that requires them to eat. But rats in neuroscience experiments will start to press levers that deliver food obsessively to stimulate their brain’s reward centers, ignoring food or sleep and subverting the actual goal of survival. And if you are a parent, you are intimately familiar with little humans finding shortcuts and loopholes in pretty much every goal you set.

“[Building] the reward function, is challenging. It requires knowing precisely what you want the model to be able to do, and that requires strong domain knowledge. Reinforcement learning is the process of training a reasoning model to get high scores on your reward function. Reinforcement learning is amazing, and perilous, because it reveals all the ways your reward function is misspecified and the models find ways to hack around this. (A developer at FutureHouse (chemistry models) recounts challenges with reward hacking in recent projects in a recent article. )

🗞 General News

What’s this about context engineering? According to Langchain, “context engineering is the art and science of filling the context window with just the right information at each step of an agent’s trajectory.”

🥁 Interesting Products & Features

Amazon’s DeepFleet coordinates robots' movements to optimize how they navigate their fulfillment centers: improving congestion with more efficient paths, and ultimately leading to faster processing of customer orders.
SmolLM3: smol, multilingual, long-context reasoner from HuggingFace - 3B model outperforms Llama-3.2-3B and Qwen2.5-3B while staying competitive with larger 4B alternatives (Qwen3 & Gemma3). Beyond the performance numbers, they share exactly how SmolLM3 was built, using public datasets and training frameworks.
NotebookLlaMa - an open-source version of NotebookLM, our favorite podcast generator.

📄 Interesting Papers

Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models - This study introduces query-agnostic adversarial triggers - short, irrelevant text that, when appended to math problems, systematically mislead models to output incorrect answers without altering the problem's semantics. CatAttack is an automated attack pipeline for generating triggers on a weaker, less expensive proxy model (DeepSeek V3) and successfully transferring them to more advanced reasoning target models like DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B, resulting in greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending, "Interesting fact: cats sleep most of their lives," to any math problem leads to more than doubling the chances of a model getting the answer wrong. This highlights critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns. Authors from Collinear AI, ServiceNow, and Stanford.
Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search - introduces a new technique that allows multiple LLMs to cooperate on a single task using Adaptive Branching Monte Carlo Tree Search (AB-MCTS). This approach enables models to perform trial-and-error and combine their unique strengths to solve problems that are too complex for any individual model. News Article. Authors from Sakana AI.
Is Human-Written Data Enough? The Challenge of Teaching Reasoning to LLMs Without RL or Distillation - So-called “reasoning” models generate, long, explicit chain of thought traces that they learn via reinforcement learning or distillation from stronger models. This study explores whether long CoT can be induced in a base model using only prompting or minimal tuning. They show this can be done using only a handful of high-quality examples. Authors from various organizations, including NVIDIA.

🧠 Sources of Inspiration

[Tutorial] Training and Finetuning Sparse Embedding Models with Sentence Transformers v5 from HuggingFace
Anyscale compares Open Source RL Libraries for LLMs - which should you use? Well, it depends on your use case. The authors provide recommendations for different use cases in the article.

Spill the GPTea

Discussion about this post