🗞 This Week in News
Claude became “safer” this week: Anthropic made a significant update to its Responsible Scaling Policy, the risk governance framework it uses to mitigate potential catastrophic risks from AI systems. Key improvements include new capability thresholds that indicate when Anthropic will upgrade its safeguards, refined processes for evaluating model capabilities and the adequacy of those safeguards, and new measures for internal governance and external input.
🥁 Interesting Products & Features
Kind Humanoid featured in the news - I had the opportunity to virtually meet Mona, the v1 robot from Kind Humanoid, years ago in the founder’s garage. The company focuses on robots that help in the home. Now they have funding and a sweet new design!
Combining next-token prediction and video diffusion in computer vision and robotics (MIT CSAIL) - Next-token models can generate sequences of varying length, but they generate without any awareness of desirable states far in the future (say, steering a sequence toward a goal 10 tokens away), so they need additional mechanisms for long-horizon (long-term) planning. Diffusion models can perform such future-conditioned sampling, but lack the next-token models’ ability to generate variable-length sequences. Researchers from CSAIL combined the strengths of both using a training technique called “Diffusion Forcing”, which trains a neural network to cleanse a collection of tokens, removing a different amount of noise from each one while simultaneously predicting the next few tokens. By sorting through noisy data and reliably predicting the next steps in a task, Diffusion Forcing can help a robot ignore visual distractions during manipulation tasks, and it can also generate stable, consistent video sequences. A minimal sketch of the per-token noising idea follows below.
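To make “a different amount of noise from each token” concrete, here is a heavily simplified PyTorch sketch: every token gets its own independently sampled noise level, and a small causal model is trained to recover the clean sequence. The linear corruption and the tiny transformer are placeholder assumptions for illustration, not the paper’s actual architecture or noise schedule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalDenoiser(nn.Module):
    """Toy causal model that predicts clean tokens from noisy ones."""
    def __init__(self, dim=64, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.noise_embed = nn.Linear(1, dim)  # tells the model how noisy each token is
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, noise_levels):
        h = noisy_tokens + self.noise_embed(noise_levels.unsqueeze(-1))
        causal_mask = nn.Transformer.generate_square_subsequent_mask(noisy_tokens.size(1))
        return self.head(self.backbone(h, mask=causal_mask))

def diffusion_forcing_step(model, clean_tokens):
    # clean_tokens: (batch, seq_len, dim) continuous token embeddings
    levels = torch.rand(clean_tokens.shape[:2])       # independent noise level per token
    lam = levels.unsqueeze(-1)
    noisy = (1.0 - lam) * clean_tokens + lam * torch.randn_like(clean_tokens)
    pred = model(noisy, levels)
    return F.mse_loss(pred, clean_tokens)              # learn to denoise every token

model = TinyCausalDenoiser()
loss = diffusion_forcing_step(model, torch.randn(8, 16, 64))
loss.backward()
```

Because each token carries its own noise level, the same model can be rolled out token by token (like a next-token model) while still supporting diffusion-style sampling toward a desired future state.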
les Ministraux: Ministral 3B and Ministral 8B - billed as the “world’s best edge models”, Mistral’s new small(ish) models set new records in knowledge, commonsense, reasoning, function-calling, and efficiency in the sub-10B category.
📄 Interesting Papers
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models - LLMs still can’t math. The authors investigate the fragility of mathematical reasoning in LLMs by generating symbolic-template variants of grade-school questions (changing names and numbers, adding clauses), and demonstrate that performance deteriorates significantly as the number of clauses in a question increases. They hypothesize that this decline occurs because current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. Authors from Apple.
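For a sense of how such variants can be generated, here is a toy illustration of a symbolic template for a grade-school word problem; the template, names, and number ranges are invented for this example and are not taken from GSM-Symbolic itself.

```python
import random

# One symbolic template; every sampled variant shares the same underlying reasoning.
TEMPLATE = (
    "{name} has {a} apples. {name} buys {b} more bags with {c} apples each. "
    "How many apples does {name} have now?"
)

def gold_answer(a, b, c):
    return a + b * c  # the ground truth is computed symbolically, not hand-labeled

def sample_variant(rng):
    name = rng.choice(["Ava", "Liam", "Noor", "Kenji"])
    a, b, c = rng.randint(2, 20), rng.randint(2, 5), rng.randint(3, 12)
    return TEMPLATE.format(name=name, a=a, b=b, c=c), gold_answer(a, b, c)

rng = random.Random(0)
for _ in range(3):
    question, answer = sample_variant(rng)
    print(question, "->", answer)
```

Evaluating a model across many such surface variants (and across templates with extra clauses) separates memorized answers from reasoning that actually tracks the numbers.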
Do LLMs "know" internally when they follow instructions? - LLMs often fail to follow even simple and clear instructions. To improve instruction-following behavior and prevent undesirable outputs, this study explores how LLMs’ internal states relate to these outcomes. The analysis reveals a dimension in the input embedding space linked to successful instruction-following. Modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. This dimension relates more closely to the phrasing of prompts than to the inherent difficulty of the task or instructions. Authors from Apple, University of Cambridge, and UPenn.
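A rough sketch of what “modifying representations along this dimension” can look like in practice, assuming a simple difference-of-means steering direction; the paper’s actual estimation procedure, layer choice, and scaling may differ.

```python
import torch

def instruction_direction(success_embs, failure_embs):
    # success_embs, failure_embs: (n, hidden_dim) pooled representations of prompts
    # that were / were not followed successfully
    direction = success_embs.mean(dim=0) - failure_embs.mean(dim=0)
    return direction / direction.norm()

def steer(hidden_states, direction, alpha=2.0):
    # hidden_states: (seq_len, hidden_dim); nudge every position along the direction
    return hidden_states + alpha * direction

hidden_dim = 512
succ = torch.randn(100, hidden_dim) + 0.1   # placeholder embeddings of "followed" prompts
fail = torch.randn(100, hidden_dim) - 0.1   # placeholder embeddings of "not followed" prompts
d = instruction_direction(succ, fail)
steered = steer(torch.randn(32, hidden_dim), d)
```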
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models - a music information retrieval system that aligns text and music across 101 languages and supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface). CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. Authors from Microsoft Research Asia.
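The contrastive alignment step is essentially the CLIP-style recipe; a minimal sketch follows, with random tensors standing in for the outputs of CLaMP 2’s actual text and music encoders.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, music_emb, temperature=0.07):
    # text_emb, music_emb: (batch, dim) embeddings of paired text and music
    text_emb = F.normalize(text_emb, dim=-1)
    music_emb = F.normalize(music_emb, dim=-1)
    logits = text_emb @ music_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```

Training both encoders with this symmetric loss pulls a piece of music and its description (in any of the supported languages) toward the same point in the shared embedding space, which is what makes cross-modal retrieval possible.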
Real-time Fake News from Adversarial Feedback - Existing evaluations for fake news detection, built on conventional sources such as claims from fact-checking websites, show LLM-based detectors getting more accurate over time, even past their knowledge cutoffs. The authors suggest that recent popular political claims, which make up the majority of fake news on such sources, can be classified using shallow, surface-level patterns. They argue that a proper fake news detection dataset should test a model’s ability to reason factually about the current world by retrieving and reading related evidence. To build one, they develop a pipeline that uses natural language feedback from a RAG-based detector to iteratively rewrite real-time news into deceptive fake news that challenges LLMs. Authors from Duke. A sketch of the feedback loop is below.
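A hedged sketch of that adversarial feedback loop, with `llm`, `retrieve`, and `detect` as hypothetical stand-ins for the rewriter, the evidence retriever, and the RAG-based detector; the paper’s actual prompts and components differ.

```python
def llm(prompt):
    return "rewritten article"        # placeholder for an LLM call

def retrieve(article):
    return ["related evidence"]       # placeholder for real-time evidence retrieval

def detect(article, evidence):
    # placeholder RAG-based detector: returns a verdict and a natural-language critique
    return "fake", "the dates conflict with retrieved reports"

def make_hard_fake(real_article, rounds=3):
    fake = llm(f"Rewrite this article so one key fact is false:\n{real_article}")
    for _ in range(rounds):
        verdict, critique = detect(fake, retrieve(fake))
        if verdict != "fake":
            break                     # the detector is fooled; stop iterating
        # feed the detector's critique back to the rewriter to make the fake harder
        fake = llm(f"Revise so this critique no longer applies: {critique}\n{fake}")
    return fake
```

The result is a stream of fakes that survive retrieval-grounded scrutiny, which is a much stiffer test than recycled fact-check claims.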
Towards Unsupervised Validation of Anomaly-Detection Models - The lack of robust and efficient unsupervised model-validation techniques is an acute challenge when deploying automated anomaly-detection pipelines, especially when there is no prior knowledge of the model’s performance on similar datasets. This work presents a new paradigm for automated validation of anomaly-detection models, inspired by real-world collaborative decision-making mechanisms. Author from Harvard.
ML Research Benchmark - This paper presents an agentic benchmark, the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent ML conference tracks. These tasks span activities typically undertaken by AI researchers, including improving model training efficiency, pretraining on limited data, domain-specific fine-tuning, and model compression. LinkedIn post here. Author from Algorithmic Research Group (Durham-based).
🧠 Sources of Inspiration
Distributed Training Guide [on GitHub] - A comprehensive guide to best practices for distributed training, diagnosing errors, and fully utilizing all available resources. For example, it answers questions like “How do I update a single-GPU training/fine-tuning script to run on multiple GPUs or multiple nodes?” and “How do I schedule/launch training on a cluster?”
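As a companion to that first question, here is the standard PyTorch DistributedDataParallel pattern for turning a single-GPU script into a multi-GPU one: a generic sketch of the approach the guide covers, not code copied from the guide.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Launch with: torchrun --nproc_per_node=NUM_GPUS train.py
def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun for each process
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # wraps the model; syncs gradients
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)        # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()                      # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same script scales to multiple nodes by pointing torchrun at a rendezvous endpoint; the guide goes much deeper on launching, scheduling, and debugging those setups.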