🗞 This Week in News
Gemma - Google releases new SOTA open LLMs. The 4 models (2 variants, 2 sizes) are available on Hugging Face. They rank best in class for their size (7B) on the Open LLM Leaderboard but, unsurprisingly, are not as performant as larger open LLMs like Llama 70B. Information about the data used to train the models is notably absent.
NVIDIA GPUs on a laptop: NVIDIA RTX 500 and 1000 Ada Generation Laptop GPUs will be available this spring.
Mistral partners with Azure, releases Mistral Large to compete with GPT-4.
From Google DeepMind: Genie, a foundation world model trained on Internet videos that can generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches. Think: create your own game from a photo. 👾
🥁 Interesting Products & Features
Fine-tuning the future: The team over at Predibase releases “Lora Land”, 25 fine-tuned Mistral-7B models that consistently outperform base models and GPT-4 on specific tasks. Each LoRA was fine-tuned for less than $8.00 on average, and all are served from a single A100 GPU.
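The reason 25 adapters fit on one GPU: a LoRA adapter leaves the base weights frozen and stores only two small low-rank matrices per layer, so many adapters can share a single copy of the base model. A minimal sketch of the idea in NumPy (hypothetical shapes and names, not Predibase's code):

```python
import numpy as np

# LoRA: instead of updating a full weight matrix W (d_out x d_in), train two
# small matrices A (r x d_in) and B (d_out x r) with rank r << min(d_out, d_in).
rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8

W = rng.standard_normal((d_out, d_in))       # frozen base weights (shared)
A = rng.standard_normal((r, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, r))                     # zero-init: adapter starts as a no-op

x = rng.standard_normal(d_in)

# Adapted forward pass: y = Wx + B(Ax). Swapping adapters only swaps A and B,
# which is why one GPU can serve many fine-tuned variants of the same base model.
y = W @ x + B @ (A @ x)

# Each adapter stores only A and B: a tiny fraction of the full matrix.
full_params = W.size
lora_params = A.size + B.size
print(f"adapter params: {lora_params} vs full: {full_params}")
```

At rank 8 the adapter here is about 3% of the size of the full matrix, and the savings grow with matrix size.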
New open-source dataset from Meta: MMCSG (Multi-Modal Conversations in Smart Glasses) comprises two-sided conversations recorded using Aria glasses. The multi-modal data include multi-channel audio, video, accelerometer, and gyroscope measurements, supporting research in automatic speech recognition, activity detection, and speaker diarization. Released as part of the CHiME Challenge.
Anthropic releases an update on experiments from their Interpretability team. These are primarily centered around training sparse autoencoders, and include challenges with the L1 regularization penalty, hyperparameter optimization, and reducing shrinkage.
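For context, a sparse autoencoder reconstructs activations through an overcomplete hidden layer while an L1 penalty pushes most hidden features to exactly zero. A minimal NumPy sketch of that loss (illustrative only; shapes, names, and coefficients are ours, not Anthropic's):

```python
import numpy as np

# Sparse autoencoder: overcomplete hidden layer + L1 penalty on activations.
rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64           # hidden layer wider than the input

W_enc = rng.standard_normal((d_hidden, d_model)) * 0.1
b_enc = np.zeros(d_hidden)
W_dec = rng.standard_normal((d_model, d_hidden)) * 0.1
b_dec = np.zeros(d_model)
l1_coeff = 1e-3                      # strength of the sparsity penalty

def sae_loss(x):
    # ReLU encoding: features are non-negative and can be driven exactly to zero.
    h = np.maximum(0.0, W_enc @ x + b_enc)
    x_hat = W_dec @ h + b_dec
    recon = np.sum((x - x_hat) ** 2)          # reconstruction error
    sparsity = l1_coeff * np.sum(np.abs(h))   # L1 term rewards inactive features
    return recon + sparsity, h

x = rng.standard_normal(d_model)
loss, h = sae_loss(x)
print(f"loss={loss:.3f}, active features: {int((h > 0).sum())}/{d_hidden}")
```

Tuning the L1 coefficient trades reconstruction quality against sparsity, which is one place the hyperparameter and shrinkage issues mentioned above come in.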
📄 Interesting Papers
INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models: This paper presents a new benchmark for search engines. According to the authors, the benchmark focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics inherent in real-world search scenarios. Authors from the Korea Advanced Institute of Science and Technology.
HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings: This paper releases a version of the widely used LLM benchmark, Holistic Evaluation of Language Models (HELM), specific to instruction-following models. Authors from Stanford.
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations: GTBench is a language-driven environment that evaluates the strategic reasoning limitations of LLMs through game-theoretic tasks, supporting 10 widely recognized games. GitHub Repo. Authors from Drexel University.
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models: This paper introduces a new benchmark for LLM mathematical skills. It categorizes math problems into specific concepts, allowing for a more detailed assessment of skill. Authors from Tsinghua University.
🧠 Sources of Inspiration
🙌 Help out
One of our AIPI 540 students built a data collection app to collect data for his final project. Help him out and practice your Pictionary skills! 🎨 Here is the app.