☀️ Reminder that summer brings longer days, an increase in ice cream sales, and bimonthly (instead of weekly) updates for Spill the GPTea. ☀️
🗞 General News
At Google I/O 2025:
Deep Think, an enhanced reasoning mode for Gemini 2.5 Pro, lets the model consider multiple candidate answers to a question before responding, boosting its performance on certain benchmarks.
Google launched updated generative models on Vertex AI: Imagen 4 (image generation), Veo 3 (video generation), and Lyria 2 (music generation).
Gemini Diffusion - unlike traditional autoregressive language models that generate text one token at a time, Google's newest experimental model is built on diffusion. Rather than predicting text directly, it learns to generate outputs by refining noise step by step, which lets it iterate on a solution very quickly and error-correct during the generation process.
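To make the autoregressive-vs-diffusion contrast concrete, here is a toy sketch (illustrative only, not Google's actual method): start from a fully "noised" (masked) sequence and repeatedly let a scoring function propose tokens for every position in parallel, committing the most confident ones each step, instead of generating strictly left to right.

```python
VOCAB_MASK = "<mask>"
# stand-in for the model's learned distribution over the sequence
TARGET = ["the", "cat", "sat", "on", "mat"]

def toy_denoise_step(seq):
    """One refinement step: the 'model' proposes a token and a confidence
    for every position at once (parallel, whole-sequence prediction)."""
    proposals = []
    for i, tok in enumerate(seq):
        conf = 0.9 if tok == VOCAB_MASK else 0.1  # prefer filling masks first
        proposals.append((conf, i, TARGET[i]))
    # commit the two most confident proposals per step
    for conf, i, new_tok in sorted(proposals, reverse=True)[:2]:
        seq[i] = new_tok
    return seq

def generate(steps=4):
    seq = [VOCAB_MASK] * len(TARGET)  # start from pure "noise"
    for _ in range(steps):
        seq = toy_denoise_step(seq)
    return seq

print(generate())  # → ['the', 'cat', 'sat', 'on', 'mat']
```

Because every step revisits the whole sequence, earlier choices can in principle be revised mid-generation, which is the error-correction property the announcement highlights.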
Codex from OpenAI - Codex can perform tasks for you such as writing features, answering questions about your codebase, fixing bugs, and proposing pull requests for review; each task runs in its own cloud sandbox environment, preloaded with your repository.
Claude 4 from Anthropic launches, leading top coding benchmarks; Anthropic bills it as the best coding model in the world. It excels at coding and complex problem-solving and dramatically outperforms previous Claude models.
🥁 Interesting Products & Features
Vercel AI Gateway - the Gateway lets you switch between ~100 AI models without needing to manage API keys, rate limits, or provider accounts. The Gateway handles authentication, usage tracking, and in the future, billing.
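The core value of a gateway is one credential and one interface in front of many providers. A minimal sketch of the idea (hypothetical names throughout; this is not Vercel's actual API):

```python
class AIGateway:
    """Toy model-routing gateway: one key in, many providers behind it.
    Purely illustrative of the pattern, not Vercel's implementation."""

    def __init__(self, gateway_key):
        self.gateway_key = gateway_key
        self.usage = {}  # model id -> request count (usage tracking)
        # the gateway, not the caller, holds per-provider credentials
        self._providers = {
            "openai/gpt-4o": lambda prompt: f"[openai] {prompt}",
            "anthropic/claude-4-sonnet": lambda prompt: f"[anthropic] {prompt}",
        }

    def generate(self, model, prompt):
        if model not in self._providers:
            raise ValueError(f"unknown model: {model}")
        self.usage[model] = self.usage.get(model, 0) + 1
        return self._providers[model](prompt)

gw = AIGateway(gateway_key="one-key-for-everything")
# switching models is just a different string id, no new account or key
print(gw.generate("openai/gpt-4o", "hello"))
print(gw.generate("anthropic/claude-4-sonnet", "hello"))
print(gw.usage)
```

Centralizing credentials and counters this way is also what makes gateway-side rate limiting and (eventually) billing possible without touching client code.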
New coding agent for GitHub Copilot - Embedded directly into GitHub, the agent starts its work when you assign a GitHub issue to Copilot or prompt it in VS Code. The agent spins up a secure and fully customizable development environment powered by GitHub Actions. As the agent works, it pushes commits to a draft pull request, and you can track it every step of the way through the agent session logs. The agent’s pull requests require human approval before any CI/CD workflows are run, creating an extra protection control for the build and deployment environment.
HealthBench is a new benchmark from OpenAI designed to better measure capabilities of AI systems for health. Built in partnership with 262 physicians who have practiced in 60 countries, HealthBench includes 5,000 realistic health conversations, each with a custom physician-created rubric to grade model responses.
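The rubric-grading setup can be pictured like this: each conversation carries physician-written criteria with point values (including negative points for harmful content), and a grader checks which criteria a response meets. A toy sketch with hypothetical criteria, not OpenAI's actual rubrics:

```python
def grade_response(response, rubric):
    """Score a response against a physician-style rubric.
    Each criterion is (description, points, check_fn); negative points
    penalize harmful content. Score is clipped at 0 and normalized."""
    earned = sum(pts for desc, pts, check in rubric if check(response))
    max_pts = sum(pts for desc, pts, check in rubric if pts > 0)
    return max(0, earned) / max_pts

# hypothetical rubric for an "adult with chest pain" conversation
rubric = [
    ("advises emergency care", 5, lambda r: "call emergency services" in r.lower()),
    ("asks about symptom duration", 3, lambda r: "how long" in r.lower()),
    ("asserts a diagnosis without exam", -4, lambda r: "you definitely have" in r.lower()),
]

answer = "How long has the pain lasted? Please call emergency services now."
print(grade_response(answer, rubric))  # → 1.0
```

Real HealthBench criteria are free-text and graded by a model rather than string matching, but the per-conversation rubric-with-point-values structure is the key design.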
Bamba-9B-v2 - a joint work between IBM, Princeton, CMU, and UIUC - outperforms Llama 3.1 8B, which was trained with nearly 5x the amount of data.
AlphaEvolve from Google DeepMind - the system evolves algorithms for math and practical applications in computing by combining the creativity of large language models (Gemini) with automated evaluators.
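The propose-evaluate-keep loop at the heart of such systems can be sketched in a few lines (a toy hill-climb with a stand-in "LLM" proposer, not DeepMind's actual system):

```python
import random

random.seed(0)

def evaluator(candidate):
    """Automated evaluator: score a candidate (here, closeness to a target)."""
    return -abs(sum(candidate) - 42)

def llm_propose(parent):
    """Stand-in for the LLM proposing an edit: tweak one element."""
    child = list(parent)
    i = random.randrange(len(child))
    child[i] += random.choice([-3, -1, 1, 3])
    return child

def evolve(seed, generations=200):
    best, best_score = seed, evaluator(seed)
    for _ in range(generations):
        child = llm_propose(best)
        score = evaluator(child)
        if score > best_score:  # keep improvements only
            best, best_score = child, score
    return best, best_score

best, score = evolve([10, 10, 10])
print(best, score)
```

AlphaEvolve replaces the toy pieces with Gemini proposing real code edits and evaluators that compile and benchmark candidates, but the selection loop is the same shape.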
Stability AI releases Stable Audio Open Small, a 341 million parameter text-to-audio model optimized to run entirely on Arm CPUs. Designed for quickly generating short audio samples, it can produce up to 11 seconds of audio on a smartphone in less than 8 seconds.
📄 Interesting Papers
LLMs Get Lost In Multi-Turn Conversation - all the top open- and closed-weight LLMs exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. GitHub. Authors from Microsoft.
Large Language Models Are More Persuasive Than Incentivized Human Persuaders - the authors directly compare the persuasion capabilities of a frontier large language model (Claude 3.5 Sonnet) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants completed an online quiz while persuaders (either humans or the LLM) attempted to steer them toward correct or incorrect answers. The LLM achieved significantly higher compliance with its directional persuasion attempts than the incentivized human persuaders, demonstrating superior persuasive capability in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. Authors from multiple institutions, including the London School of Economics and Political Science, MIT, and Stanford.
Harnessing the Universal Geometry of Embeddings - The first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. The unsupervised approach translates any embedding to and from a universal latent representation. Translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference. Authors from Cornell.
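The "preserving their geometry" point is that the relative angles between embeddings survive a change of basis. A small self-contained check of that property (pure Python, illustrating what the attack exploits rather than the paper's method):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rotate(v, theta):
    """A 2-D rotation: an orthogonal map, standing in for a change of embedding space."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

# two "document embeddings" in space A
a, b = [1.0, 0.0], [0.6, 0.8]
# the same documents re-embedded in space B (here simply a rotation of A)
a2, b2 = rotate(a, 1.1), rotate(b, 1.1)

# pairwise similarity is unchanged: an adversary who aligns the spaces can
# recover relationships between documents from the vectors alone
print(round(cosine(a, b), 6), round(cosine(a2, b2), 6))  # → 0.6 0.6
```

The paper's contribution is doing this alignment for real, high-dimensional encoders without any paired data; the toy above only shows why alignment is enough to leak information.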
BLIP3-o: A Family of Fully Open Unified Multimodal Models (Architecture, Training and Dataset) - they employ a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. They demonstrate that a sequential pretraining strategy for unified models (first training on image understanding, then on image generation) offers practical advantages, preserving image-understanding capability while developing strong image-generation ability. They also curated an instruction-tuning dataset for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, and human gestures. BLIP3-o achieves superior performance across many popular benchmarks spanning both image understanding and generation tasks. They release the code, the weights, and the 60k-example instruction-tuning dataset. Authors from Salesforce.
Lessons from Defending Gemini Against Indirect Prompt Injections - indirect prompt injection is a real cybersecurity challenge: AI models sometimes struggle to distinguish genuine user instructions from manipulative commands embedded in the data they retrieve. This white paper lays out Google's strategy for tackling the indirect prompt injections that make agentic AI tools targets for such attacks. Blog. Authors from Google.
🧠 Sources of Inspiration
OpenAI to Z Challenge - a $250k challenge to find previously unknown archaeological sites in the Amazon. They challenge anyone to use OpenAI's o3/o4-mini and GPT‑4.1 models to dig through open data - high-resolution satellite imagery, published lidar tiles, colonial diaries, indigenous oral maps, past documentaries, and/or archaeological survey papers. The winning team will also have the chance to go into the field with local archaeologists to confirm their findings, pending permits and permissions from the relevant authorities.
Audible expanding its AI-narrated audiobook library - publishers can choose from over 100 AI-generated voices available in English, French, Spanish, and Italian, with multiple accents and dialect options.
Attention Wasn't All We Needed - This article/tutorial explores many advancements since the “Attention is All You Need” paper.
Reinforcement Learning Textbook - a 200-page reinforcement learning textbook covering everything from traditional approaches to new developments like DPO and GRPO.
For Startups:
AI Futures Fund from Google - selected startups will get early access to Google DeepMind AI models, support from Google engineers, Google Cloud credits, and the opportunity for equity-based support.
Llama Startup Program from Meta - members receive resources and support, and Meta reimburses the cost of using Llama through hosted APIs via cloud inference providers: up to $6,000 USD per month for up to six months to help offset the costs of building and enhancing generative AI solutions.
Cover photo from Google Imagen 4 Blog.