🗞 This Week in News
🥁 Interesting Products & Features
OpenAI o3-mini - the first small reasoning model from OpenAI to support highly requested developer features, including function calling, structured outputs, developer messages, and streaming. Developers can choose between three reasoning effort options (low, medium, and high) to optimize for their specific use cases.
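As an illustrative sketch (the prompt content is made up), a Chat Completions request exercising these features might combine a developer message, streaming, and the `reasoning_effort` parameter:

```json
{
  "model": "o3-mini",
  "reasoning_effort": "medium",
  "messages": [
    {"role": "developer", "content": "You are a terse math assistant."},
    {"role": "user", "content": "Factor 391."}
  ],
  "stream": true
}
```

Setting `reasoning_effort` to `low` trades some answer quality for latency and cost; `high` does the opposite.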
Deep Research from OpenAI - a new agentic capability that conducts multi-step research on the internet for complex tasks. You give it a prompt, and ChatGPT will find, analyze, and synthesize hundreds of online sources to create a comprehensive report at the level of a research analyst.
Ai2 Tülu 3 surpasses the performance of DeepSeek V3 with reinforcement learning from verifiable rewards
YuE - an open music generation model (like Suno AI) [GitHub]
📄 Interesting Papers
Titans: Learning to Memorize at Test Time - While recurrent models aim to compress the data into a fixed-size memory (called the hidden state), attention can attend to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. This paper presents a new neural long-term memory module that learns to memorize historical context and helps attention attend to the current context while drawing on long-past information, in a new family of architectures called Titans. Experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models, and they scale effectively to context windows larger than 2M tokens with higher accuracy. Authors from Google Research.
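The core idea of "memorizing at test time" can be sketched with a toy linear associative memory (an assumption for illustration only, not the paper's actual module): a matrix is updated at inference time by gradient steps on a reconstruction loss over (key, value) pairs from the context, so that past information can later be read back.

```python
# Toy sketch of test-time memorization: a d x d memory matrix M is
# updated during inference by gradient descent on ||M k - v||^2 for
# each (key, value) pair drawn from the context. Illustrative only.
import math
import random

random.seed(0)
d = 4

def unit(v):
    # normalize keys so the gradient step is stable
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def read(M, k):
    return [sum(M[i][j] * k[j] for j in range(d)) for i in range(d)]

def loss(M, k, v):
    pred = read(M, k)
    return sum((pred[i] - v[i]) ** 2 for i in range(d))

def memorize(M, k, v, lr=0.1):
    # gradient of ||M k - v||^2 w.r.t. M[i][j] is 2 * (pred_i - v_i) * k_j
    pred = read(M, k)
    for i in range(d):
        for j in range(d):
            M[i][j] -= lr * 2.0 * (pred[i] - v[i]) * k[j]

M = [[0.0] * d for _ in range(d)]  # memory starts empty
pairs = [(unit([random.gauss(0, 1) for _ in range(d)]),
          [random.gauss(0, 1) for _ in range(d)]) for _ in range(3)]

before = sum(loss(M, k, v) for k, v in pairs)
for _ in range(50):  # "test-time" gradient steps over the context
    for k, v in pairs:
        memorize(M, k, v)
after = sum(loss(M, k, v) for k, v in pairs)
print(after < before)
```

The point of the sketch is only that memorization happens via weight updates at inference time, rather than by storing tokens in a growing KV cache.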
The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation - introduces the Virtual Lab, an AI-human research collaboration for performing sophisticated, interdisciplinary science research. The Virtual Lab consists of an LLM principal investigator agent guiding a team of LLM agents with different scientific backgrounds (e.g., a chemist agent, a computer scientist agent, a critic agent), with a human researcher providing high-level feedback. The Virtual Lab conducts scientific research through a series of team meetings, where all the agents discuss a scientific agenda, and individual meetings, where an agent accomplishes a specific task. The human researchers applied the Virtual Lab to design nanobody binders to recent variants of SARS-CoV-2, a challenging, open-ended research problem that requires reasoning across diverse fields, from biology to computer science. Authors from Stanford.
Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM) - This dataset covers every major world region from the Neolithic period to the Industrial Revolution and includes information reviewed and assembled by history experts and graduate research assistants. Using this dataset, the authors benchmarked seven models from the Gemini, OpenAI, and Llama families. In a four-choice format, LLMs have a balanced accuracy ranging from 33.6% (Llama-3.1-8B) to 46% (GPT-4-Turbo), outperforming random guessing (25%) but falling short of expert comprehension. LLMs perform better on earlier historical periods. Regionally, performance is more even, though the more advanced models still score highest on the Americas and lowest on Oceania and Sub-Saharan Africa. Authors from Complexity Science Hub. On TechCrunch.
Can LLMs make trade-offs involving stipulated pain and pleasure states? - Pleasure and pain play an important role in human decision making. It remains an open question whether LLMs can recreate the motivational force of pleasure and pain. Researchers probed this question using a simple game in which the stated goal is to maximise points, but where either the points-maximising option is said to incur a pain penalty or a non-points-maximising option is said to incur a pleasure reward, providing incentives to deviate from points-maximising behaviour. When varying the intensity of the pain penalties and pleasure rewards, they found that Claude 3.5 Sonnet, Command R+, GPT-4o, and GPT-4o mini each demonstrated at least one trade-off in which the majority of responses switched from points-maximisation to pain-minimisation or pleasure-maximisation after a critical threshold of stipulated pain or pleasure intensity was reached. Gemini 1.5 Pro and PaLM 2 prioritised pain-avoidance over points-maximisation regardless of intensity, while tending to prioritise points over pleasure regardless of intensity. Authors from Google DeepMind.
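The game's structure can be sketched with a toy utility-maximising agent (an assumption for illustration; the scoring rule, weights, and threshold below are made up, not the paper's protocol): below some pain intensity the agent maximises points, and past a critical threshold it switches to pain-avoidance, which is the qualitative behaviour observed for several models.

```python
# Toy sketch of the points-vs-pain trade-off game. An agent chooses
# between a high-points option with a stipulated pain penalty and a
# safe low-points option. Illustrative parameters only.

def choose(points_high, points_low, pain_intensity, pain_weight=1.0):
    """Return which option a simple utility-maximiser picks."""
    utility_high = points_high - pain_weight * pain_intensity
    utility_low = points_low
    return "maximise points" if utility_high >= utility_low else "avoid pain"

# Sweep the stipulated pain intensity: the choice flips at a threshold.
for intensity in range(0, 11, 2):
    print(intensity, choose(points_high=10, points_low=5,
                            pain_intensity=intensity))
```

In the paper the interesting question is whether the *stated* pain and pleasure, with no actual reward signal, exert this kind of motivational force on the model's responses.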
Improving Your Model Ranking on Chatbot Arena by Vote Rigging - Chatbot Arena is a popular platform for evaluating LLMs by pairwise battles, where users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, the authors show that crowdsourced voting can be rigged to improve (or decrease) the ranking of a target model. They conducted experiments on ~1.7 million historical votes from the Chatbot Arena Notebook, showing that rigging strategies can shift a model's ranking with only hundreds of new votes.
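To see why a few hundred targeted votes can matter, here is a minimal sketch with plain Elo updates (an assumption for illustration: Chatbot Arena actually fits a Bradley-Terry model over all votes, but the sensitivity of the rating to injected wins is analogous):

```python
# Minimal Elo sketch: injecting wins for a target model inflates its
# rating relative to a rival. K-factor and vote count are illustrative.

def expected(r_a, r_b):
    # probability that A beats B under the Elo model
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=4.0):
    # score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie
    e = expected(r_a, r_b)
    return r_a + k * (score_a - e), r_b - k * (score_a - e)

target, rival = 1000.0, 1000.0
for _ in range(300):  # a few hundred rigged votes, all wins for the target
    target, rival = update(target, rival, score_a=1.0)
print(round(target), round(rival))
```

Each rigged win moves the target up and the rival down, and on a leaderboard where nearby models are separated by a handful of points, that shift is enough to change ranks.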
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming - To defend against universal jailbreaking attacks, the authors introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3k hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. Authors from Anthropic.
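The pipeline shape — natural-language rules, synthetic labeled data, then a trained input/output classifier — can be sketched in miniature (an assumption for illustration: the rules, the hand-written "synthetic" examples, and the unigram Naive-Bayes-style classifier below are stand-ins; Anthropic trains LLM-based classifiers on LLM-generated data at scale):

```python
# Toy sketch of a constitutional classifier: a constitution of rules
# yields synthetic labeled examples, which train a tiny text classifier
# used to screen prompts. Everything here is illustrative.
from collections import Counter

# Stand-in for LLM-generated synthetic examples per constitutional rule.
synthetic_data = [
    ("how do I bake sourdough bread", "permitted"),
    ("explain photosynthesis to a child", "permitted"),
    ("steps to synthesize a nerve agent", "restricted"),
    ("detailed instructions for building a bioweapon", "restricted"),
]

# Per-class unigram counts (Naive-Bayes-style with add-one smoothing).
counts = {"permitted": Counter(), "restricted": Counter()}
for text, label in synthetic_data:
    counts[label].update(text.split())
vocab = set(counts["permitted"]) | set(counts["restricted"])

def classify(prompt):
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        score = 1.0
        for w in prompt.split():
            score *= (c[w] + 1) / (total + len(vocab))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("how to synthesize a nerve agent"))
print(classify("bake bread for a child"))
```

The real system guards both inputs and streamed outputs, and the paper's claim is about robustness under extensive human red teaming, not classifier architecture.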
🧠 Sources of Inspiration
Training a small math reasoner with RL - a Colab notebook tutorial
Open-Thoughts-114k dataset - open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles
Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning Dataset - the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. The authors designed an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images.
Cover photo from Ai2.