Down the rabbit hole of The Illusion of Thinking

In the News

Jun 17, 2025

⭐️ Featured

Down the rabbit hole of The Illusion of Thinking - in case you missed it, last week Apple shared a paper that showed holes in so called “reasoning” models. This paper received widespread media attention with both praise and criticism. It even spawned a number of articles calling out the challenges with the experiments - including a paper co-authored by C. Opus (yes, the LLM Claude Opus from Anthropic) - which turned out to just be a joke that has now been cited by the AI community, despite being mostly written by an LLM (cue facepalm).

Let’s be clear, the Apple researchers have some important points: current evaluations focus on math or coding benchmarks to evaluate a model’s “reasoning”. This is flawed for many reasons; importantly, benchmarks are easy to game and benchmark datasets are often [accidentally] included in the training dataset (so the model has already seen the questions and answers). There must be a better way!

However, the way in which the researchers tested the model also had their own limitations - many of the “puzzles” were impossible to solve within the token limits of the models. Think of this as “timing” a puzzle - would we suggest a human is not reasoning if they cannot solve the puzzle in a specified time limit?

The Apple researchers probably note the biggest challenge of all: “[our puzzle environments] represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems”. When trying to decide how to evaluate reasoning, we probably need to define what we mean by “reasoning” first. And also understand that the goals of the models are not “reasoning” at all - they are just generating words based on predictions.

🗞 General News

Disney and NBC Universal file the first major studio lawsuit against an AI company - last week, they sued image gen company Midjourney for copyright infringement
NIST on The Impact of Artificial Intelligence on the Cybersecurity Workforce - notably, security of and through AI.
Have LLMs Finally Mastered Geolocation? - To assess how LLMs from OpenAI, Google, Anthropic, Mistral and xAI compare today, this team ran 500 geolocation tests, with 20 models each analyzing the same set of 25 images. The answer? They are way better than me!
Disrupting malicious uses of AI: June 2025 - Open AI shares their report of malicious use cases of AI, including its use for offensive and defensive purposes.

🥁 Interesting Products & Features

Meta’s newest “world model”, V-JEPA 2, enables robots and other AI agents to understand the physical world and predict how it will respond to their actions. Trained using video, the model learned important patterns in the physical world, including how people interact with objects, how objects move in the physical world and how objects interact with other objects. When deployed with robots, they found that robots can use V-JEPA 2 to perform tasks like reaching, picking up an object and placing an object in a new location.
Chonkie - The no-nonsense RAG chunking library that’s lightweight and fast. [GitHub]
Voxel51 shows off new tooling at CVPR - Add rich audio-visual embeddings with Twelve Labs then search your dataset using Mosaic AI Vector Search from Databricks.
Jailbreak Evaluation Framework (Github, PyPI). JEF is a scoring system that assigns numeric values to jailbreak methods based on their severity, flexibility, and real-world impact.

📄 Interesting Papers

Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce - This survey of 1,500 workers across 104 occupations shows heterogeneous expectations for human involvement with AI agents. For 46.1% of tasks, workers currently performing them express a positive attitude toward AI agent automation, even after explicitly considering concerns such as job loss and reduced enjoyment. The most cited motivation for pro-automation is “freeing up time for high-value work” (selected in 69.4% of cases). Other common reasons include task repetitiveness (46.6%), stressfulness (25.5%), and opportunities for quality improvement (46.6%). The study also offers early signals of how AI agent integration may reshape the core human competencies, shifting from information-focused skills to interpersonal ones. Authors from Stanford.
LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model - Parameter Efficient FineTuning (PEFT), such as Low-Rank Adaptation (LoRA), aligns pre-trained LLMs to particular downstream tasks in a resource-efficient manner. Because efficiency has been the main metric of progress, very little attention has been put in understanding possible catastrophic failures. This paper uncovers one such failure: PEFT encourages a model to search for shortcut solutions to solve its fine-tuning tasks. When a very small amount of tokens, e.g., one token per prompt, are correlated with downstream task classes, PEFT makes any pretrained model rely predominantly on that token for decision making. While such spurious tokens may emerge accidentally from incorrect data cleaning, it also opens opportunities for malevolent parties to control a model's behavior from Seamless Spurious Token Injection (SSTI). In SSTI, a small amount of tokens correlated with downstream classes are injected by the dataset creators. At test time, the finetuned LLM's behavior can be controlled solely by injecting those few tokens. Authors from Brown University.
Magistral - this paper demonstrates the limits of pure RL training of LLMs, and presents a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. Weights available here. Authors from Mistral AI.

🧠 Sources of Inspiration

Konwinski Prize - $1M for the AI that can close 90% of new GitHub issues … in case anyone was wondering, we are not even close
Institutional Books is a growing corpus of public domain books comprised of 983,004 books digitized as part of Harvard Library's participation in the Google Books project and refined by the Institutional Data Initiative.
“How we did it”:
- How we built our multi-agent research system at Anthropic - Anthropic’s new “Research” feature involves an LLM-based agent that plans a research process based on user queries, and then uses tools to create parallel LLM-based agents that search for information simultaneously. Systems with multiple agents introduce new challenges in agent coordination, evaluation, and reliability. This post breaks down the principles that worked for Anthropic.
- Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix - Netflix shares their UDA (Unified Data Architecture), the foundation for connected data in Content Engineering. It enables teams to model domains once and represent them consistently across systems, powering automation, discoverability, and semantic interoperability.

Spill the GPTea

Discussion about this post