⭐️ Featured
A couple of weeks ago we featured Claudius, a Claude 3.7 Sonnet agent system responsible for managing a vending machine, and we learned that Vending Machine Operator Jobs are Safe from AI.
But what about Futurists? Are their jobs safe from AI too?
A futurist is a person who studies the future and makes predictions about it based on current trends. Forecasting future events is inherently complex: it calls for reasoning, the ability to link seemingly unrelated signals, the judgment to assess probabilities, and a genuine understanding of the world, rather than mere recall or mimicry of past patterns.
Hugging Face recently released FutureBench, a benchmark to evaluate LLMs on their ability to predict future outcomes across domains including science, economics, geopolitics, and technology.
Out of the 14 models tested so far (and 588 predictions made), the average accuracy is 50.9%. This analysis of predictions made by famous human futurists shows their overall accuracy is just over 44% (obviously not an apples-to-apples comparison, but interesting nonetheless).
Will futurists be replaced by AI? I suppose only time will tell. Make your prediction in the comments.
🗞 General News
The mathematical battleground
OpenAI achieves gold-medal level performance at International Mathematical Olympiad - the IMO is an international competition in which each participating country is represented by six elite student mathematicians, who compete to solve six difficult problems. Only about 8% of contestants receive a gold medal.
Human coder beats OpenAI in marathon programming contest - Polish programmer “Psyho” narrowly defeated OpenAI's advanced AI model in a 10-hour coding marathon at the AtCoder World Tour Finals 2025 in Tokyo, scoring 1.8 trillion points compared to the AI's 1.65 trillion.
How Anthropic teams use Claude Code - Anthropic interviewed their internal technical teams (data infra, security, data science, RL engineering) and non-technical teams (marketing, product design, legal) to understand how people are using Claude Code. ← They also provide insights into how to use the tool if you're looking for tips and use cases!
🥁 Interesting Products & Features
ChatGPT agent - ChatGPT can now connect to software tools, including a visual browser that interacts with the web through a GUI, a text-based browser, a terminal, and direct API access.
Voxtral from Mistral - speech understanding models available in two sizes: a 24B variant for production-scale applications and a 3B variant for local and edge deployments. Both versions are released open source under the Apache 2.0 license.
Bugbot from Cursor - Bugbot automatically reviews PRs, comments on potential issues, and provides fixes directly in your Cursor editor or through the Cursor Background Agent. The cost is a hefty $40/month.
Hierarchical Reasoning Model from Sapient Intelligence - an open-source, brain-inspired architecture that can solve complex tasks with only 27M parameters. GitHub.
Amazon Bedrock AgentCore - AWS service to help developers quickly deploy and operate AI agents at scale using any framework and model.
Marin 8B open model from Stanford - this project is designed as an “open lab,” where the goal is not only to share the model but to make the complete journey accessible, including the code, datasets, data methodologies, experiments, hyperparameters, and training logs.
"Passage of Time" Model Context Protocol (MCP) Server - An MCP server that gives language models temporal awareness and time calculation abilities.
📄 Interesting Papers
Inverse Scaling in Test-Time Compute - In this study, researchers construct evaluation tasks where longer reasoning degrades accuracy, revealing an inverse scaling relationship between test-time compute and performance. They identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Authors from Anthropic.
Also this week, on a similar topic: “Context Rot: How Increasing Input Tokens Impacts LLM Performance”, a report from Chroma. They evaluate 18 LLMs, including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models. Results reveal that models do not use their context uniformly; instead, their performance becomes increasingly unreliable as input length grows.
Human whole epigenome modelling for clinical applications with Pleiades - Introduces Pleiades, a series of whole-genome epigenetic foundation models spanning three sizes: 90M, 600M, and 7B parameters. Pleiades is trained on an extensive proprietary dataset of human DNA sequences. The authors introduce alignment embeddings and stacked hierarchical attention techniques to provide precise epigenetic modeling without the need for extended context lengths. Pleiades can perform a diverse range of downstream biological and clinical tasks, including nucleotide-level regulatory prediction, realistic generation of cell-free DNA fragments, and fragment-level cell-type-of-origin classification. They apply Pleiades to early detection in real-world clinical cohorts of Alzheimer’s disease and Parkinson’s disease, achieving high accuracy. Authors from various institutions, including Prima Mente and over 10 universities and hospitals.
🧠 Sources of Inspiration
OpenAI launches agent bio bug bounty - Testing universal jailbreaks for biorisks in ChatGPT Agent; $25k prize. Deadline to apply is TODAY (July 29)
AccountingBench: Evaluating LLMs on Real Long-Horizon Business Tasks - this benchmark measures a model’s ability to “close the books” for a real business. The evaluation is built from one year of financial data from a real SaaS business generating millions of dollars in revenue, with a human expert baseline from a CPA for comparison. Current frontier models excel at tasks that don't change the underlying environment: answering questions, writing code, researching sources. However, it remains unclear how well these capabilities translate to "butterfly" tasks where each action has lasting consequences and errors compound over time. In AccountingBench, the strongest models match a human expert accountant in the initial months, but they produce incoherent results over longer time horizons.
Compressing Context - LLMs attend only to the tokens in their current prompt, and because every model enforces a finite context window, extended conversations and multi-step workflows eventually exceed that limit. Factory shares their strategy for retaining, selecting, and compressing prior turns, a major lever on inference quality, latency, and cost; a minimal sketch of the general idea follows below.
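Factory's write-up covers their full approach; as a flavor of the general technique (and not Factory's implementation), the sketch below keeps the newest turns verbatim and folds everything older into a single summary once an estimated token budget is exceeded. The four-characters-per-token heuristic and the summarize stub are placeholder assumptions standing in for a real tokenizer and an LLM summarization call.

```python
# A minimal sketch of rolling context compression: keep recent turns
# verbatim, collapse older turns into one summary when over budget.
# Illustrative only; the heuristics below are placeholder assumptions.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)


def summarize(turns: list[str]) -> str:
    # Placeholder: in practice, an LLM call producing a faithful,
    # compact summary of the dropped turns.
    return f"[summary of {len(turns)} earlier turns]"


def compress_context(turns: list[str], budget: int) -> list[str]:
    """Return a prompt-ready turn list that fits the token budget."""
    kept: list[str] = []
    used = 0
    # Walk newest-to-oldest so the most recent turns survive verbatim.
    for i in range(len(turns) - 1, -1, -1):
        cost = estimate_tokens(turns[i])
        if used + cost > budget:
            # Fold turn i and everything older into a single summary.
            return [summarize(turns[: i + 1])] + kept
        kept.insert(0, turns[i])
        used += cost
    return kept  # everything fit; nothing compressed
```

The newest-first walk is the key design choice: recent turns usually carry the most task-relevant detail, so they are preserved exactly while older history is traded for a cheap summary.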
Cover photo from OpenAI on Twitter.