The results are in: humans still win at complex code generation, synthesizing information from books, and photography that doesn't look real 🏆
🗞 This Week in News
The largest computer vision conference in the world, CVPR, took place last week! Quick TL;DR: multimodal is king; major topics included 3D computer vision and deepfake creation/detection; data and evaluation remain the elephant in the room; and there is no shortage of research left to be done in the space. I will write a longer CVPR debrief in the coming weeks.
🚀 Anthropic hit two home runs this week: Claude 3.5 Sonnet topped the charts, and the newly released Artifacts, a side-by-side panel for its chatbot, lets you see both the code and an interactive preview.
🥁 Interesting Products & Features
Granite Code Models from IBM - A family of models for generative code tasks (fixing bugs, explaining code, documenting code), trained on code written in 116 programming languages.
WebCanvas: Benchmarking Web Agents in Online Environments - A realistic assessment of autonomous web agents that uses live web environments and emphasizes task completion through the identification of key nodes.
📄 Interesting Papers
Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts - VLMs rapidly lose performance as the visual context length grows, often exhibiting an exponential decay trend. The paper also introduces a dynamic benchmark generator for evaluating long-context extractive reasoning. Authors from the University of California, Santa Barbara.
One Thousand and One Pairs: A "novel" challenge for long-context language models - How well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? This paper addresses the question with NoCha, a human-curated dataset of 1,001 minimally different pairs of true/false claims about 67 recently published English fiction books. While human readers easily perform this task, no open-weight model performs above random chance, and GPT-4o achieves the highest accuracy at a mere 55.8%. Authors from UMass Amherst.
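The minimal-pair setup is what makes NoCha hard: a model only earns credit when it labels both claims in a pair correctly, so guessing "true" everywhere scores zero. Here is a minimal sketch of that pair-level scoring, assuming a `judge` callable that stands in for an LLM verdict; the claims and judge below are illustrative, not from the paper.

```python
# Sketch of pair-level scoring in the style of NoCha (illustrative, not
# the authors' code): each item is a minimally different true/false claim
# pair, and credit requires labeling BOTH claims correctly.

def pair_accuracy(pairs, judge):
    """pairs: list of (true_claim, false_claim); judge: claim -> bool verdict."""
    correct = 0
    for true_claim, false_claim in pairs:
        # Credit only if the true claim is judged True AND the false one False.
        if judge(true_claim) is True and judge(false_claim) is False:
            correct += 1
    return correct / len(pairs)

# Toy example with a keyword-based "judge" standing in for an LLM call.
pairs = [
    ("The narrator returns home at the end.", "The narrator never returns home."),
    ("The letter is burned in chapter 3.", "The letter is mailed in chapter 3."),
]
judge = lambda claim: "never" not in claim and "mailed" not in claim
print(pair_accuracy(pairs, judge))  # 1.0 for this toy judge
```

A model that answers "true" for every claim would get every pair half right and score 0.0 under this metric, which is why random chance is such a low bar here.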
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions - A new benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries across 7 domains in 1,140 fine-grained programming tasks. Each task has an average of 5.6 test cases with 99% average branch coverage. An evaluation of 60 LLMs shows they are not yet capable of following complex instructions to use function calls precisely, scoring up to 60%, significantly below the human performance of 97%. Authors from Hugging Face. More info about the project here.
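To make the flavor of this concrete: a BigCodeBench-style task gives one fine-grained instruction that forces the model to compose calls from several libraries, and the solution is checked by unit tests rather than string matching. The task and test below are my own illustrative example, not an item from the benchmark.

```python
# Illustrative BigCodeBench-style task (not from the benchmark):
# "Parse CSV text and return the mean and standard deviation of a column,"
# which requires composing calls across the csv, io, and statistics libraries.
import csv
import io
import statistics

def summarize_csv(text, column):
    """Parse CSV text and return (mean, sample stdev) of a numeric column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    values = [float(r[column]) for r in rows]
    return statistics.mean(values), statistics.stdev(values)

# The benchmark pairs each task with test cases like this one.
data = "x,y\n1,2\n3,4\n5,6\n"
mean, sd = summarize_csv(data, "x")
assert mean == 3.0 and sd == 2.0
```

Passing requires getting every library call right at once, which is exactly where the evaluated models fall well short of human performance.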
Adversarial Attacks on Multimodal Agents - This paper exposes vulnerabilities in multimodal agents and shares examples of adversarial attacks on agents, specifically illusioning and goal misdirection. Authors from Carnegie Mellon.
🧠 Sources of Inspiration
LLM101n: Let's build a Storyteller - an upcoming course from Andrej Karpathy.
Ever wondered whether it would be cheaper to self-host an LLM? Here’s the answer.
Well, that was unexpected: in an AI photo competition, an actual photo won.