🗞 This Week in News
The Deepfake Edition:
ElevenLabs partners with deceased stars’ estates to record audiobooks. The late Judy Garland will soon be “reading” the audiobook of The Wonderful Wizard of Oz.
NBC is bringing an AI version of legendary sportscaster Al Michaels back to the Olympics this summer via daily recaps.
YouTube now lets you request the removal of AI-generated content that simulates your face or voice.
SDXL now free(ish) - you now only need a paid Enterprise license if your annual revenue exceeds US$1M.
🥁 Interesting Products & Features
Kolors - an open-source alternative to SDXL. Supports both English and Chinese inputs. Technical Report
Meta 3D Gen (3DGen) - a new state-of-the-art, fast pipeline for text-to-3D asset generation. It represents 3D objects simultaneously in three ways: in view space, in volumetric space, and in UV (or texture) space.
Moshi from Kyutai - a new conversational chatbot that can understand and respond to your tone of voice. Based on Helium, a 7B LLM. The chatbot can speak in different accents and 70 emotional and speaking styles. It can also handle two audio streams simultaneously, so it can listen and talk at the same time. Conversations are limited to 5 minutes.
📄 Interesting Papers
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs - situational awareness is a model's knowledge of itself and its circumstances. To quantify it in LLMs, this paper introduces a range of behavioral tests based on question answering and instruction following. These tests form the Situational Awareness Dataset (SAD), a benchmark comprising 7 task categories and over 13,000 questions. The benchmark tests numerous abilities, including the capacity of LLMs to (i) recognize their own generated text, (ii) predict their own behavior, (iii) determine whether a prompt comes from internal evaluation or real-world deployment, and (iv) follow instructions that depend on self-knowledge. Authors from Apollo Research.
FlexiFilm: Long Video Generation with Flexible Conditions - This paper introduces FlexiFilm, a new diffusion model tailored for long video generation. It incorporates a temporal conditioner to establish a more consistent relationship between generation and multi-modal conditions, and a resampling strategy to tackle overexposure. FlexiFilm generates long and consistent videos (>30 seconds). Authors from Zhejiang University, Peking University, Tsinghua University, Oxford University, and BAAI.
CELLO: Causal Evaluation of Large Vision-Language Models - The paper proposes a unified definition of causality involving interactions between humans and/or objects. They share a novel dataset, CELLO, with 14,094 causal questions across all four levels of causality: discovery, association, intervention, and counterfactual. Experiments on CELLO reveal that current LVLMs still struggle with causal reasoning tasks. They also share CELLO-CoT, a causally inspired chain-of-thought prompting strategy. Authors from Shanghai AI Laboratory.
Magic Insert: Style-Aware Drag-and-Drop - a method for dragging-and-dropping subjects from a user-provided image into a target image of a different style in a physically plausible manner while matching the style of the target image. For style-aware personalization, they fine-tune a pretrained text-to-image diffusion model using LoRA and learned text tokens on the subject image, and then infuse it with a CLIP representation of the target style. For object insertion, they use Bootstrapped Domain Adaption to adapt a domain-specific photorealistic object insertion model to the domain of diverse artistic styles. Overall, the method isn’t perfect, but it significantly outperforms traditional approaches such as inpainting. Authors from Google.
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy - This paper introduces LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. A VLM fine-tuned on the paper's curated datasets, which use a conversation-style formulation tailored to robotics tasks, can generate meaningful robot action policy decisions. GitHub. Authors from SUNY.
Lost in Translation: The Algorithmic Gap Between LMs and the Brain - This paper discusses how insights from neuroscience, such as sparsity, modularity, internal states, and interactive learning, can inform the development of more biologically plausible language models. It explores the role of scaling laws in bridging the gap between LMs and human cognition, highlighting the need for efficiency constraints analogous to those in biological systems. The goal is to develop LMs that more closely mimic brain function. Authors from Université de Montréal.
Waterfall: Framework for Robust and Scalable Text Watermarking - Watermarking is a method to covertly embed a signal into audio, video, image, or text data and is typically used for identifying and showing ownership. This paper presents the first training-free framework for robust and scalable text watermarking applicable across multiple text types (e.g., articles, code) and languages supportable by LLMs, for general text and LLM data provenance. Authors from National University of Singapore and Centre for Frontier AI Research.
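The Waterfall paper doesn't spell out its algorithm in this blurb, but the general idea behind training-free text watermarking can be illustrated with a toy green-list scheme (in the spirit of earlier LLM watermarking work, not Waterfall's actual method; all names below are made up for illustration): each token seeds a pseudo-random "green" subset of the vocabulary, the generator biases sampling toward that subset, and a detector simply measures how often tokens land in their predecessor's green list.

```python
import hashlib
import random

def green_list(prev_token: str, vocab: list[str], frac: float = 0.5) -> set[str]:
    """Seed a PRNG with the previous token to pick a 'green' subset of the vocabulary."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(vocab, int(len(vocab) * frac)))

def green_fraction(tokens: list[str], vocab: list[str], frac: float = 0.5) -> float:
    """Detection: the share of tokens that fall in the green list seeded by their
    predecessor. Watermarked text scores well above `frac`; unmarked text hovers near it."""
    pairs = list(zip(tokens, tokens[1:]))
    hits = sum(t in green_list(p, vocab, frac) for p, t in pairs)
    return hits / max(len(pairs), 1)
```

No model training is involved anywhere, which is what "training-free" buys you: the watermark lives entirely in the sampling and detection logic.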
Mixture of A Million Experts - This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. Author from Google.
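The product-key trick that makes a million experts tractable can be sketched in plain NumPy (a minimal illustration of the retrieval step under my own simplified assumptions, not the paper's implementation): split the query in two, score each half against a small sub-key table of size n, and search only the top-k × top-k candidate pairs instead of all n² experts.

```python
import numpy as np

def product_key_topk(q, K1, K2, k):
    """Pick top-k experts from an n*n grid using product keys.
    q: (d,) query; K1, K2: (n, d//2) sub-key tables; expert (i, j) scores s1[i] + s2[j]."""
    d = q.shape[0]
    q1, q2 = q[: d // 2], q[d // 2 :]
    s1 = K1 @ q1                    # (n,) scores for the first sub-keys
    s2 = K2 @ q2                    # (n,) scores for the second sub-keys
    i1 = np.argsort(s1)[-k:]        # top-k in each half; any globally top-k pair
    i2 = np.argsort(s2)[-k:]        # must have both components in these sets
    cand = s1[i1][:, None] + s2[i2][None, :]   # (k, k) combined candidate scores
    flat = np.argsort(cand.ravel())[-k:]
    rows, cols = np.unravel_index(flat, (k, k))
    experts = i1[rows] * K1.shape[0] + i2[cols]  # flatten (i, j) into the n*n grid
    return experts, cand[rows, cols]
```

With n = 1,000 sub-keys per table this indexes 1,000,000 experts while scoring only 2,000 sub-keys plus k² candidate pairs, and the result is exact: a pair can only be in the global top-k if each half is in its own top-k.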
🧠 Sources of Inspiration
Linguist Aims to Preserve Endangered Languages using AI - over 3,000 languages are considered “endangered”, and the world loses dozens of languages every year. Starting with the Navajo language, Jack Connor is building LLMs to save dying languages.
Cover image from Magic Insert: Style-Aware Drag-and-Drop.