The AI Scientist, Robots, & Sudoku

In the News

Mar 18, 2025

After a brief hiatus, Spill the GPTea has returned! My schedule has been packed with events over the past couple of weeks:

Responsible AI Symposium (where I served as an organizer)
Human[X] (where I had the opportunity to speak)
All Things Open AI

I'm excited to share detailed writeups on these conferences soon. Stay tuned for insights and key takeaways from these events!

🗞 This Week in News

How much energy will AI really consume? - Current numbers are approximations, this article is a call for more transparency in energy usage of AI
AI’s ~~first~~ paper - A paper produced by AI Scientist from Sakana AI passed the peer-review process at a workshop in a top machine learning conference. ~~They claim this is the first AI-generated paper that has passed the same peer-review process that human scientists go through.~~ The paper was entirely generated end-to-end by AI, without any modifications from humans. The AI Scientist-v2 came up with the scientific hypothesis, proposed the experiments to test the hypothesis, wrote and refined the code to conduct those experiments, ran the experiments, analyzed the data, visualized the data in figures, and wrote every word of the entire scientific manuscript, from the title to the final reference, including placing figures and all formatting. Edited: Apparently not the first! The team at Autoscience built an AI agent that also passed peer review at ICLR 2025.
The challenges with using “open” weight models commercially - there are various legal and practical hurdles that deter businesses from integrating so-called open models into their products.

🥁 Interesting Products & Features

Gemma 3 from Google - a series of lightweight, state-of-the-art open weight models built from the same research and technology that powers the Gemini 2.0 models.
Dragon Copilot from Microsoft - an AI assistant for healthcare that combines voice-dictating and ambient listening to offer “general-purpose medical information searches from trusted content sources,” and the ability to automate tasks such as “conversational orders, note and clinical evidence summaries, referral letters, and after visit summaries”
New Gemini Embedding text model (gemini-embedding-exp-03-07) - achieves the top rank on the Massive Text Embedding Benchmark (MTEB) Multilingual leaderboard, and comes with new features like longer input token length.
Gemini Robotics - Google introduced Gemini Robotics, a vision-language-action model built on Gemini 2.0 with the addition of physical actions as a new output modality for the purpose of directly controlling robots and Gemini Robotics-ER, a Gemini model with advanced spatial understanding, enabling roboticists to run their own programs using Gemini’s embodied reasoning abilities. Both of these models enable a variety of robots to perform a wide range of real-world tasks.

📄 Interesting Papers

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation - Reward hacking is when AI systems misbehave due to flaws or misspecifications in their learning objectives and it remains a key challenge in constructing capable and aligned models. This research shows that a large reasoning model can be monitored for reward hacking in coding environments by using another LLM that observes the model's chain-of-thought reasoning. Blog. Authors from OpenAI.
TIPS: Text-Image Pretraining with Spatial awareness - Existing image-text models lack spatial awareness and have limited direct applicability for dense understanding tasks. This paper proposes a novel general-purpose image-text model, which can be effectively used off the shelf for dense and global vision tasks. This method leverages synthetically generated textual descriptions of images instead of captions and combining contrastive image-text learning with self-supervised masked image modeling, to encourage spatial coherence. Building on these two ideas, they scale the model using the transformer architecture, trained on a curated set of public images. Authors from Google DeepMind.
SeqFusion: Sequential Fusion of Pre-Trained Models for Zero-Shot Time-Series Forecasting - a framework that collects and fuses diverse pre-trained models (PTMs) sequentially for zero-shot forecasting. Based on the specific temporal characteristics of the target time series, SeqFusion selects the most suitable PTMs from a batch of pre-collected PTMs, performs sequential predictions, and fuses all the predictions while using minimal data to protect privacy. Authors from Nanjing University.

🧠 Sources of Inspiration

Tutorial: Teaching Language Models to Solve Sudoku Through Reinforcement Learning
Replacing the Adam optimizer with Muon - an optimizer with excellent practical performance is derived in this nice walk through
smalldiffusion - A diffusion library for training and sampling from diffusion models. Built for easy experimentation when training new models and developing new samplers. The core of this library for diffusion training/sampling is implemented in <100 lines of readable pytorch code.
The TechCrunch AI glossary
Win cash prizes + authorship, detect pancreatic cancer in the PANORAMA study - AI grand challenge for PDAC detection on CECT, where radiologists and AI algorithms from all over the world will be assessed using a standardized and controlled environment.

Spill the GPTea