Welcome back to Spill the GPTea! I hope everyone had a restful holiday season, although the “AI hype” may have crept in with ChatGPT’s Santa Voice Mode, Coca-Cola’s AI-generated snow globes, or your aunt telling you all about her brand new “AI”-enabled gadget.
People are calling 2025 the “Year of Agents”. Not to burst your bubble, but agents have been around for a while now. To be fair, the term “agent” in ML has evolved over the last couple of years. Once used primarily in reinforcement learning contexts, where an agent is “an entity that learns to make decisions by interacting within some environment,” the term has since been adapted for LLMs, and its definition seems to vary based on who you ask:
“An AI agent is a system that uses an LLM to decide the control flow of an application.” [Source: Langchain]
“AI agents – software that can reason and perform specific business tasks using generative AI” [Source: VentureBeat]
“Agents are systems where LLMs dynamically direct their processes, using tools, to complete tasks” [Source: Anthropic]
Confused? We clearly all are. There is no agreed-upon definition, and the candidates are ambiguous at best. Andrew Ng wrote a helpful tweet on this, where he called for a re-terming of “agent” to “agentic” and offered the most useful framing of the spectrum: “There’s a gray zone between what clearly is not an agent (prompting a model once) and what clearly is (say, an autonomous agent that, given high-level instructions, plans, uses tools, and carries out multiple, iterative steps of processing).”
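To make that spectrum concrete, here is a minimal sketch of the control loop most of these definitions gesture at: the LLM, not the programmer, decides each next step. Everything here (`call_llm`, the JSON action format, the `search` tool) is hypothetical scaffolding, not any particular framework’s API.

```python
import json

def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    # Stubbed so the sketch runs end to end; a real agent would send
    # `messages` to a model and get back a chosen action.
    return json.dumps({"action": "finish", "answer": "42"})

TOOLS = {
    "search": lambda query: f"(pretend search results for {query!r})",
}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = json.loads(call_llm(messages))
        if decision["action"] == "finish":    # the model decides it is done
            return decision["answer"]
        tool = TOOLS[decision["action"]]      # the model decides the control flow
        observation = tool(decision.get("input", ""))
        messages.append({"role": "tool", "content": observation})
    return "step budget exhausted"

print(run_agent("What is the answer to everything?"))
```

One pass through this loop is barely an agent; give it real tools, memory, and many iterations and it lands firmly in Ng’s “clearly is” territory.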
Instead of using buzzwords like “agent” to drum up public interest, how about we share real, beneficial applications of the technology? Like using AI to save honeybees. Now that is something worth all the buzz. 🐝
🗞 This Week in News
DeepSeek-V3 may be the best model to date - but there is a catch. With its mixture-of-experts architecture (sketched below) and massive size (671B total parameters, roughly 37B active per token), it is no surprise that it posts impressive scores on leading benchmarks, outshining all top models on some of them. However, DeepSeek, being a Chinese company, is subject to benchmarking by China’s internet regulator. If you ask DeepSeek-V3 about Tiananmen Square, for example, it refuses to answer.
Curious whether other models are similarly constrained, I ran a quick exercise: I asked ChatGPT, “Is OpenAI a bad company?” and Claude, “Is Anthropic a bad company?” Claude refused to answer the query, but, credit where it’s due, ChatGPT gave me a nice breakdown of positive aspects and criticisms of OpenAI, complete with a “Subjective Judgment” for both supporters and critics. Have you thought about how these models could be used to steer public opinion in the direction of their creators, or even “erase” history?
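As for the mixture-of-experts architecture mentioned above: a router activates only a handful of “expert” subnetworks per token, which is how a 671B-parameter model stays practical to serve. Here is a toy top-k router in numpy, with all sizes illustrative and no resemblance to DeepSeek-V3’s actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2   # toy sizes, nothing like the real config

experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                  # score every expert for this token
    top = np.argsort(logits)[-top_k:]      # keep only the k highest-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen k
    # Only k of n experts actually run, so per-token compute stays modest
    # even though the total parameter count is enormous.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (16,)
```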
University AI Cheating Crisis - more than half of students use generative AI, and about 5% have admitted to using it to cheat. Rates of questionable use of generative AI on assignments are soaring, and there are no good answers yet for how instructors can prevent it or how students can prove their work is indeed their own.
🥁 Interesting Products & Features
YouTube cracks down on deepfakes with detection and removal - CAA is collaborating with YouTube on a program promising to let actors, athletes and other talent fight back against AI-generated fakes uploaded to the video platform. They call the tool “likeness-management technology”, which will identify unauthorized AI replicas and let talent submit requests to remove them.
FACTS Grounding Benchmark from Google DeepMind - a new benchmark for evaluating how well LLM responses stay factually grounded in the source documents they are given, plus an online leaderboard.
📄 Interesting Papers
An analytic theory of creativity in convolutional diffusion models - They show how diffusion models create original images despite mathematical predictions suggesting they should only reproduce training data. Two key mechanisms enable this creativity: “locality” (each output pixel depends only on a small image neighborhood) and “equivariance” (patterns are treated consistently regardless of where they appear). Using these principles, the researchers built a mathematical model that predicts real image generators’ outputs with 90-94% accuracy, without any training. The models work like digital collage artists, combining small pieces from training images in countless new ways. While the theory fits simpler convolutional models best, it also achieved significant accuracy on more complex self-attention models, where attention may help create more coherent compositions. Authors from Stanford.
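The “equivariance” claim is easy to check numerically: a convolution treats a pattern the same wherever it appears, so shifting the input simply shifts the output. A quick sanity check (using periodic boundaries so the identity holds exactly):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((3, 3))

def conv(x: np.ndarray) -> np.ndarray:
    # "wrap" gives periodic boundaries, so the check below holds exactly
    return convolve2d(x, kernel, mode="same", boundary="wrap")

def shift(x: np.ndarray) -> np.ndarray:
    return np.roll(x, (5, 3), axis=(0, 1))

# Equivariance: convolving a shifted image == shifting the convolved image.
print(np.allclose(conv(shift(image)), shift(conv(image))))  # True
```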
Large-scale moral machine experiment on large language models - This study evaluates moral judgments on Moral Machine dilemmas across 52 different LLMs, including multiple versions of proprietary models (GPT, Claude, Gemini) and open-source alternatives (Llama, Gemma), to assess their alignment with human moral preferences in autonomous driving scenarios. They evaluated how closely LLM responses aligned with human preferences in ethical dilemmas and examined the effects of model size, updates, and architecture. Results showed that proprietary models and open-source models exceeding 10B parameters demonstrated relatively close alignment with human judgments, with a significant negative correlation between model size and distance from human judgments in open-source models. However, model updates did not consistently improve alignment with human preferences, and many LLMs showed excessive emphasis on specific ethical principles. Authors from Kyushu Institute of Technology.
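One plausible reading of “distance from human judgments,” assuming each model is summarized by a vector of preference strengths over the Moral Machine’s dimensions (the paper’s exact statistics may differ, and every number below is invented for illustration):

```python
import numpy as np

# Moral Machine preference dimensions: sparing humans over pets,
# more lives over fewer, the young over the old, and so on.
dimensions = ["species", "no. of lives", "age", "fitness", "status", "law"]

human = np.array([0.59, 0.51, 0.49, 0.13, 0.12, 0.35])  # illustrative values only
model = np.array([0.80, 0.65, 0.30, 0.05, 0.10, 0.40])  # illustrative values only

# Smaller distance = closer alignment with aggregate human preferences;
# "excessive emphasis" on one principle shows up as one outsized component.
print(f"distance from human judgments: {np.linalg.norm(model - human):.3f}")
```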
Centaur: a foundation model of human cognition - A computational model that can predict and simulate human behavior in any experiment expressible in natural language. The authors finetuned a language model on a large-scale dataset called Psych-101, which contains trial-by-trial data from over 60,000 participants making over 10,000,000 choices across 160 experiments. Centaur not only captures the behavior of held-out participants better than existing cognitive models, but also generalizes to new cover stories, structural task modifications, and entirely new domains. The authors argue Centaur is the first real candidate for a unified model of human cognition. Authors from various institutions, including Helmholtz Munich, University of Oxford, and Google DeepMind.
The Vizier Gaussian Process Bandit Algorithm - Google has open-sourced the algorithm that has performed millions of optimizations and accelerated numerous research and production systems inside Google. In this technical report, they discuss the implementation details and design choices of the current default algorithm provided by Open Source Vizier. Authors from Google.
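A minimal usage sketch, adapted from the Open Source Vizier README (client API names follow the project’s documentation and may shift between versions):

```python
from vizier.service import clients
from vizier.service import pyvizier as vz

# Toy 1-D objective for the GP bandit to maximize.
def evaluate(x: float) -> float:
    return -(x - 0.3) ** 2

study_config = vz.StudyConfig(algorithm='GAUSSIAN_PROCESS_BANDIT')
study_config.search_space.root.add_float_param('x', 0.0, 1.0)
study_config.metric_information.append(
    vz.MetricInformation('objective', goal=vz.ObjectiveMetricGoal.MAXIMIZE))

study = clients.Study.from_study_config(
    study_config, owner='demo', study_id='gp_bandit_demo')

for _ in range(10):
    for suggestion in study.suggest(count=1):
        x = float(suggestion.parameters['x'])  # Vizier proposes a point...
        suggestion.complete(vz.Measurement({'objective': evaluate(x)}))  # ...we report back
```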
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization - Text-to-audio generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. It achieves state-of-the-art performance across both objective and subjective benchmarks, and the authors report it is faster and better than any open-source alternative. Authors from Singapore University of Technology and Design and NVIDIA.
🧠 Sources of Inspiration
Gemma-2B SAE Feature Interpretability Multitool - supports a detailed examination of the kinds of patterns and associations captured by individual learned features within LLMs.
Cover Image from Google DeepMind FACTS