🗞 This Week in News
Anthology Fund - A $100 million initiative created through a partnership between Menlo Ventures and Anthropic to fuel the next generation of AI startups.
🥁 Interesting Products & Features
Hugging Face releases SmolLM, a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality, meticulously curated dataset, SmolLM-Corpus.
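As a quick-start, here is a minimal sketch of running a SmolLM checkpoint with the Hugging Face transformers library. The repo id below is an assumption based on the release naming; check the model card before relying on it.

```python
# Minimal sketch: load a SmolLM checkpoint with transformers and generate text.
# The repo id "HuggingFaceTB/SmolLM-360M" is assumed from the release naming;
# the 135M and 1.7B variants should follow the same pattern.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-360M"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Small language models are useful because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```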
Mistral NeMo - Mistral’s new best small model: a state-of-the-art 12B model with a 128k context length, built in collaboration with NVIDIA and released under the Apache 2.0 license. It outperforms other models in its class, although at 12B parameters it isn’t especially small.
OpenAI releases GPT-4o mini - Outperforms other small models on reasoning, math and coding, multimodal reasoning, and chat preferences on the LMSYS leaderboard. Low cost and low latency, GPT-4o mini is 60% cheaper than GPT-3.5 Turbo. It currently supports text and vision in the API; a minimal API call sketch is below.
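A minimal sketch of calling GPT-4o mini with text plus an image through the OpenAI Python SDK; the image URL is a placeholder, and current pricing and limits should be taken from OpenAI’s docs.

```python
# Minimal sketch: text + image request to GPT-4o mini via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart in one sentence."},
                # Placeholder URL; swap in a real image.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```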
Codestral Mamba and Mathstral - This week Mistral AI released a code model and a math model. Codestral Mamba is a Mamba2 language model that specializes in code generation, available under the Apache 2.0 license. They also released the first Mathstral model, a 7B model designed for math reasoning and scientific discovery; it has a 32k context window and is likewise published under the Apache 2.0 license.
📄 Interesting Papers
DataComp-LM: In search of the next generation of training sets for language models - Releases DataComp for Language Models (DCLM), a testbed for controlled dataset experiments; the DCLM-Baseline dataset; and a 7B-parameter language model trained from scratch on that dataset to 64% 5-shot accuracy on MMLU with 2.6T training tokens. The model is comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU and performs similarly on an average of 53 natural language understanding tasks, while being trained with 6.6x less compute than Llama 3 8B. Authors from Apple.
SciCode: A Research Coding Benchmark Curated by Scientists - SciCode is a challenging benchmark designed to evaluate the capabilities of language models in generating code for solving realistic scientific research problems. It covers 16 subdomains spanning Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems, which naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions of useful scientific background information, plus scientist-annotated gold-standard solutions and test cases for evaluation. Claude 3.5 Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. Authors from many universities and national labs.
Prover-Verifier Games improve legibility of LLM outputs - Reasoning that is clear and easy to check is called legibility; optimizing chain-of-thought solely for answer correctness can make it less legible. To mitigate this loss in legibility, the paper proposes a training algorithm that iteratively trains small verifiers to predict solution correctness, "helpful" provers to produce correct solutions that the verifier accepts, and "sneaky" provers to produce incorrect solutions that fool the verifier. The authors find that the helpful prover's accuracy and the verifier's robustness to adversarial attacks increase over the course of training, and that this legibility training transfers to time-constrained humans tasked with verifying solution correctness. The method keeps outputs correct while making them easier to understand and verify by both humans and other AI systems; a toy sketch of the game loop is below. Authors from OpenAI.
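To make the alternation concrete, here is a toy, schematic sketch of the prover-verifier game loop. Every class, score, and update here is a made-up stand-in, not OpenAI's training code; it only illustrates how the three roles interact each round.

```python
# Toy schematic of the prover-verifier game loop; the Prover/Verifier classes
# and scoring below are illustrative stand-ins, not OpenAI's implementation.
import random

class Prover:
    def __init__(self, honest: bool):
        self.honest = honest  # sneaky provers try to pass off wrong answers

    def solve(self, x: int) -> tuple[int, bool]:
        correct = self.honest or random.random() < 0.2
        answer = 2 * x if correct else 2 * x + 1
        return answer, correct

class Verifier:
    def accepts(self, x: int, answer: int) -> bool:
        return answer == 2 * x  # crude proxy for checking a legible solution

    def update(self, examples) -> None:
        pass  # placeholder: fit on (problem, answer, correctness) labels

def play_round(problems, helpful, sneaky, verifier):
    examples = []
    for x in problems:
        for prover in (helpful, sneaky):
            answer, correct = prover.solve(x)
            examples.append((x, answer, correct, verifier.accepts(x, answer)))
    verifier.update(examples)  # verifier trains on correctness labels;
    return examples            # provers would then be rewarded from acceptance/correctness

examples = play_round(range(5), Prover(honest=True), Prover(honest=False), Verifier())
```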
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers - Introduces Scientific Paper Image Question Answering (SPIQA), the first large-scale QA dataset specifically designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science. The authors use automatic and manual curation to create a dataset of 270K questions. Authors from Google and Johns Hopkins.
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models - Text-to-image models encounter safety issues, including concerns related to copyright and NSFW content. This paper introduces Reliable and Efficient Concept Erasure (RECE), an approach that modifies the model in 3 seconds without necessitating additional fine-tuning. Specifically, RECE efficiently leverages a closed-form solution to derive new target embeddings, which are capable of regenerating erased concepts within the unlearned model. To mitigate inappropriate content potentially represented by derived embeddings, RECE further aligns them with harmless concepts in cross-attention layers. The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts. Authors from Fudan University.
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models - This paper proposes a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM, combining a mixture of vision experts with a mixture of language experts. MoME significantly improves the performance of generalist MLLMs across various VL tasks; a generic mixture-of-experts sketch is below. Authors from Harbin Institute of Technology.
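For intuition, here is a minimal, generic soft-routing mixture-of-experts layer in PyTorch. It illustrates the general idea of routing tokens across experts to reduce interference, not the paper's exact MoME architecture; all sizes and module choices are arbitrary assumptions.

```python
# Generic soft-routing mixture-of-experts sketch (illustrative, not MoME itself):
# a learned router soft-combines expert outputs per token.
import torch
import torch.nn as nn

class SoftMixtureOfExperts(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4):
        super().__init__()
        # Each "expert" is a small feed-forward block; in MoME these would be
        # distinct vision or language experts.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # produces per-token mixing weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). The router decides how much each expert
        # contributes to each token.
        weights = self.router(x).softmax(dim=-1)                          # (B, S, E)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, S, D, E)
        return (expert_outs * weights.unsqueeze(2)).sum(dim=-1)           # (B, S, D)

tokens = torch.randn(2, 16, 512)
out = SoftMixtureOfExperts()(tokens)
print(out.shape)  # torch.Size([2, 16, 512])
```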
🧠 Sources of Inspiration
Preparing for ML Interviews? Here is a fantastic resource.
Nostalgic for the ’90s? Here’s a fun demo of an extension that uses Claude to convert any webpage into a ’90s version.