Dolphins, Deception, and Preparedness

In the News

Brinnae Bent

Apr 22, 2025

OpenAI updates their Preparedness Framework - including clearer criteria for prioritizing high-risk capabilities, updated categorization of model capabilities and clarified capability levels, scalable evaluations, and “Capabilities Reports”

🥁 Interesting Products & Features

New GPT models from OpenAI - GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano. These models outperform GPT‑4o and GPT‑4o mini across the board, with specific gains in coding and instruction following. They have larger context windows—supporting up to 1 million tokens of context. They feature a refreshed knowledge cutoff of June 2024.
Classifier Factory from Mistral AI - A “friendly and easy way” to make your own classifiers using Mistral AI, from spam detection to sentiment analysis to recommendation systems.
DolphinGemma - a large language model developed by Google, is helping scientists study how dolphins communicate

📄 Interesting Papers

ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition - An extensible framework for evaluating LLMs using games - ZeroSumEval is a dynamic evaluation benchmark for LLMs using competitive scenarios that scales with model capabilities (i.e. as models get better, the benchmark gets harder). Instead of fixed evaluation benchmarks or subjective judging criteria, ZeroSumEval uses multi-agent simulations with clear win conditions to pit models against each other. GitHub. Authors from Meta, Cohere, and Saudi Data & AI Authority.
JudgeLRM: Large Reasoning Models as a Judge - this paper introduces JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning with judge-wise, outcome-driven rewards. JudgeLRM models outperform both SFT-tuned and state-of-the-art reasoning models. Authors from the National University of Singapore.
OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation - introduces OpenDeception, a deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. Specifically, they construct five types of common use cases where LLMs interact with the user, each consisting of ten diverse, concrete scenarios from the real world. They propose to simulate a multi-turn dialogue via agent simulation. Extensive evaluation of eleven mainstream LLMs on OpenDeception highlights the need to address deception risks and security concerns in LLM-based agents: the deception intention ratio across the models exceeds 80%, while the deception success rate surpasses 50%. Authors from Fudan University.
Mask Image Watermarking - MaskMark-D introduces a simple masking mechanism during the decoding stage to support both global and local watermark extraction. A mask is applied to the watermarked image before extraction, allowing the decoder to focus on selected regions and learn local extraction. A localization module is also integrated into the decoder to identify watermark regions during inference, reducing interference from irrelevant content and improving accuracy. MaskMark-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions for enhanced robustness. Authors from various institutions, including Nanyang Technological University.
CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent - this paper proposes an attack framework, CheatAgent, where an LLM-based agent is developed to attack LLM-Empowered RecSys. This method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, they utilize a prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of the proposed attacking method. Authors from The Hong Kong Polytechnic University.

🧠 Sources of Inspiration

Generative modelling in latent space - A blog about latent representations in generative models. Most modern generative models consist of a compact, higher-level latent representation being extracted and then an iterative generative process operates on this representation. This comprehensive article examines how this works and why this approach is so popular.
Claude Code: Best practices for agentic coding - this post covers tips and tricks that have proven effective for using Claude Code (a command line tool for agentic coding) across various codebases, languages, and environments.
Cover photo from DolphinGemma blog.

Spill the GPTea

Discussion about this post

Spill the GPTea

Dolphins, Deception, and Preparedness

In the News

🗞 General News

🥁 Interesting Products & Features

📄 Interesting Papers

🧠 Sources of Inspiration

Discussion about this post