🗞 This Week in News
The 2024 Conference on Computer Vision and Pattern Recognition (CVPR) is this week!
Open Synthetic Data Generation Pipeline for Training Large Language Models from NVIDIA - Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can help build powerful LLMs for commercial applications.
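The release ships an instruct model for generating candidate data and a reward model for filtering it. Below is a minimal sketch of that generate-then-filter pattern; the two `call_*` functions are placeholders I made up, not a real NVIDIA API.

```python
# Hypothetical sketch of the generate-then-filter pattern behind synthetic
# data pipelines like Nemotron-4 340B: an instruct model proposes candidate
# responses, a reward model scores them, and only high-scoring pairs are
# kept as training data. The call_* functions are placeholders.

def call_instruct_model(prompt: str, n: int = 4) -> list[str]:
    """Placeholder: sample n candidate responses from an instruct model."""
    raise NotImplementedError

def call_reward_model(prompt: str, response: str) -> float:
    """Placeholder: score a (prompt, response) pair with a reward model."""
    raise NotImplementedError

def synthesize(prompts: list[str], keep_threshold: float = 0.8) -> list[dict]:
    dataset = []
    for prompt in prompts:
        candidates = call_instruct_model(prompt)
        scored = [(call_reward_model(prompt, r), r) for r in candidates]
        best_score, best_response = max(scored)
        if best_score >= keep_threshold:  # drop low-quality generations
            dataset.append({"prompt": prompt, "response": best_response})
    return dataset
```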
🥁 Interesting Products & Features
Google V2A - uses video pixels and text prompts to generate rich soundtracks synchronized to the underlying video. First, V2A encodes the video and text prompt inputs; a diffusion model then iteratively refines compressed audio from noise, guided by that encoding; finally, the compressed audio is decoded into an audio waveform.
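A schematic sketch of that flow is below. Google has not released V2A, so every component here (the encoder, the `denoise_step` method, the decoder) is a hypothetical stand-in for the stages described above.

```python
# Schematic sketch of the V2A flow: encode the video and text prompt,
# iteratively denoise a compressed audio latent with a diffusion model,
# then decode the latent to a waveform. All components are hypothetical.
import torch

def video_to_audio(video_frames, text_prompt, encoder, diffusion, decoder,
                   steps: int = 50):
    # Condition on the video and (optional) text prompt.
    cond = encoder(video_frames, text_prompt)

    # Start from noise in the compressed audio (latent) space.
    latent = torch.randn(1, cond.shape[-1])

    # Iteratively refine the latent, synchronized to the video condition.
    for t in reversed(range(steps)):
        latent = diffusion.denoise_step(latent, t, cond)

    # Decode the compressed audio into a waveform.
    return decoder(latent)
```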
Luma Labs Dream Machine - video generation that produces 120 frames in 120 seconds (from the same team that popularized NeRFs). They released this pretty fast, unlike other video generation models we have seen from players like OpenAI’s Sora and Google’s Veo, whose teams are doing extensive testing prior to release. See the results for yourself below. From the prompt “a blue teacup filled with tea spilling in slow motion over a newspaper about AI and technology”:
Gen-3 Alpha from Runway AI - another video generation model. The quality of their demos is very impressive. Unlike Dream Machine, Runway is taking a more cautious approach, putting safeguards in place before release. One unique feature is support for customization, which allows for more stylistically controlled generations and consistent characters.
Stable Diffusion 3 Medium - a smaller version of Stable Diffusion 3 that can run on consumer-grade GPUs. The model also demonstrates improvements in typography, photorealism, and prompt understanding.
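If you want to try it on your own GPU, here is a minimal sketch using Hugging Face diffusers. It assumes diffusers >= 0.29 and the gated checkpoint name below; verify both against the current docs before running.

```python
# Hedged sketch: SD3 Medium on a consumer GPU via Hugging Face diffusers.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,  # half precision to fit consumer VRAM
)
pipe.enable_model_cpu_offload()  # trade speed for memory on smaller GPUs

image = pipe(
    "a blue teacup filled with tea spilling in slow motion over a newspaper",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("teacup.png")
```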
GenType from Google - this one is just for fun. Generate letters out of anything. Here is one that I did using tea cups for letters:
📄 Interesting Papers
Discovering Preference Optimization Algorithms with and for Large Language Models - uses LLM-driven objective discovery to automatically find new state-of-the-art preference optimization algorithms without (expert) human intervention. They prompt an LLM to propose and implement new preference optimization loss functions. This process led to the discovery of previously unknown algorithms, including Discovered Preference Optimization (DiscoPOP), which adaptively blends logistic and exponential losses. Authors from Sakana AI.
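A minimal sketch of that adaptive blend, as I read the paper: a sigmoid gate on the DPO-style margin mixes the logistic and exponential losses. The gating temperature and exact form here are my paraphrase, so treat the constants as assumptions.

```python
# Sketch of a DiscoPOP-style blended loss. `rho` is the DPO margin
# beta * (log-ratio(chosen) - log-ratio(rejected)); tau and the exact
# blending form are assumptions paraphrased from the paper.
import torch
import torch.nn.functional as F

def discopop_loss(rho: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    logistic = -F.logsigmoid(rho)      # DPO's sigmoid/logistic loss
    exponential = torch.exp(-rho)      # exponential loss
    gate = torch.sigmoid(rho / tau)    # adaptive blend based on the margin
    return (gate * logistic + (1 - gate) * exponential).mean()
```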
What If We Recaption Billions of Web Images with LLaMA-3? - Web-crawled image captions are often terrible (just look at LAION examples). This paper proposes a recaptioning pipeline: first, the authors fine-tune a LLaMA-3-8B-powered LLaVA-1.5, then use it to recaption 1.3 billion images from the DataComp-1B dataset. The recaptioned data improves both discriminative models like CLIP and generative diffusion models. Authors from the University of California, Santa Cruz.
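The recaptioning step itself is straightforward; here is a sketch using a stock LLaVA checkpoint from transformers as a stand-in (the paper fine-tunes its own LLaMA-3-8B-based LLaVA, which is not what is loaded here).

```python
# Recaptioning sketch with a public LLaVA-1.5 stand-in checkpoint.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in, not the paper's model
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("web_image.jpg")
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
caption = processor.decode(out[0], skip_special_tokens=True)
```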
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack - introduces the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. It includes a diverse set of 20 reasoning tasks, covering fact chaining, simple induction, deduction, counting, and handling lists/sets. Their evaluations show that popular LLMs effectively use only 10-20% of the context, and that performance declines sharply as reasoning complexity increases. Retrieval-Augmented Generation methods achieve only 60% accuracy on the benchmark. Authors from the London Institute for Mathematical Sciences.
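To make the setup concrete, here is a toy version of the "reasoning-in-a-haystack" construction: scatter a few supporting facts through long distractor text and ask a question that requires chaining them. This is illustrative only; the real benchmark builds bAbI-style tasks over natural documents.

```python
# Toy needle-in-a-haystack construction in the spirit of BABILong.
import random

facts = ["Mary moved to the kitchen.", "Mary picked up the apple."]
question = "Where is the apple?"  # answering requires chaining both facts

distractors = [f"Filler sentence number {i}." for i in range(10_000)]
haystack = distractors[:]
for fact in facts:
    haystack.insert(random.randrange(len(haystack)), fact)

prompt = " ".join(haystack) + f"\n\nQuestion: {question}\nAnswer:"
# Feed `prompt` to a long-context LLM and check whether it says "kitchen".
```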
HelpSteer2: Open-source dataset for training top-performing reward models - releases HelpSteer2, a high-quality preference dataset, which is essential for training reward models that can effectively guide LLMs toward generating responses aligned with human preferences. Authors from NVIDIA.
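The dataset is on the Hugging Face Hub; a quick way to poke at it is sketched below. This assumes the dataset id "nvidia/HelpSteer2" and the five 0-4 attribute columns reported in the paper; verify against the dataset card.

```python
# Quick look at HelpSteer2 via Hugging Face datasets.
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer2", split="train")
example = ds[0]
print(example["prompt"][:200])
print(example["response"][:200])
for attr in ["helpfulness", "correctness", "coherence",
             "complexity", "verbosity"]:
    print(attr, example[attr])  # each rated on a 0-4 scale
```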
Understanding Hallucinations in Diffusion Models through Mode Interpolation - diffusion models smoothly "interpolate" between nearby data modes in the training set, generating samples that lie completely outside the support of the original training distribution; this phenomenon leads diffusion models to produce artifacts that never existed in the real data (i.e., hallucinations). The paper systematically studies the causes and manifestations of hallucinations in diffusion models. Impressively, the authors demonstrate that they can remove over 95% of hallucinations at generation time while retaining 96% of in-support samples. Authors from Carnegie Mellon.
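As I read it, the generation-time filter keys on how unstable the model's intermediate predictions are across the reverse-diffusion trajectory; hallucinated samples show high variance there. A tiny sketch of that detection idea, with a made-up threshold:

```python
# Sketch of trajectory-variance hallucination detection. `trajectory` is a
# stack of per-timestep predicted clean samples (x0-hat) from the reverse
# process; the threshold is an illustrative placeholder, not the paper's.
import numpy as np

def is_hallucination(trajectory: np.ndarray, threshold: float = 0.5) -> bool:
    # trajectory: (num_timesteps, sample_dim) predicted x0 at each step
    per_dim_variance = trajectory.var(axis=0)
    return bool(per_dim_variance.mean() > threshold)
```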
🧠 Sources of Inspiration
Comprehensive video on RAG concepts and SOTA approaches from Ben Clavié, creator of the RAGatouille library.
Title Image Source: Stable Diffusion 3 Medium.