Big week for models: GPT-4o, AlphaFold 3, BLIP-3. Also, humans and AI are not always better together?
🗞 This Week in News
GPT-4o (“omni”) - OpenAI releases a new multimodal model that combines text, audio, and image inputs and outputs. It is a “single model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network”. The average audio response time is 320ms (similar to a human), and it delivers that speed at a much lower cost (50% cheaper than GPT-4 Turbo). For now, OpenAI is only publicly releasing text/image inputs and text outputs (the same as the current ChatGPT), with more capabilities rolling out over the coming “weeks and months”. It was nice to see they included a section in the release blog on safety and limitations (and a nod to the extensive red teaming of the model). At the announcement, CTO Mira Murati said, “We’re looking at the future of interaction between ourselves and the machines … we think GPT-4o is really shifting that paradigm”. (Does anyone else think they missed out on the opportunity to call the model OMniGpt4 (OMG for short)?)
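For the modalities available today (text and image in, text out), calling GPT-4o looks like any other Chat Completions request. A minimal sketch with the OpenAI Python SDK, assuming an API key in the environment and a placeholder image URL:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Text + image in, text out -- the only GPT-4o modalities publicly available at launch.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},  # placeholder URL
            ],
        }
    ],
)
print(response.choices[0].message.content)
```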
🥁 Interesting Products & Features
Cohere announces the availability of fine-tuning for Command R, the smaller model in their R series. In the release post, they share fine-tuning results on multiple enterprise use cases, including summarization and research & analysis. It is available now on the Cohere platform.
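Once a fine-tune completes, you should be able to call it like any other model by passing its ID to the chat endpoint. A rough sketch with the Cohere Python SDK, where the model ID is a hypothetical placeholder:

```python
import cohere

co = cohere.Client()  # assumes CO_API_KEY is set in the environment

# "my-command-r-ft-id" is a hypothetical placeholder for your fine-tuned model's ID.
response = co.chat(
    model="my-command-r-ft-id",
    message="Summarize this support ticket in three bullet points: ...",
)
print(response.text)
```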
Salesforce’s BLIP model series gets a new name with the launch of “BLIP-3”, which ships under the name XGen-MM. They released both a pretrained foundation model and an instruction-tuned model. Like BLIP-2 before it, these models are trained at scale on high-quality image-caption datasets and interleaved image-text data.
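The checkpoints live on Hugging Face; a hedged sketch of loading one with transformers, where the model ID and loading pattern below are assumptions (check Salesforce's Hugging Face org for the actual checkpoint names and usage):

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Illustrative model ID -- verify the real XGen-MM (BLIP-3) checkpoint names on Hugging Face.
model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)  # placeholder image
inputs = processor(images=image, text="What is happening in this image?", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```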
Stable Artisan - Stability AI launches a Discord app with tools for generating and editing media (features include Search and Replace, Remove Background, Creative Upscale, and Outpainting). It is built on Stable Diffusion 3, Stable Video Diffusion, and Stable Image Core. The good? It makes the technology more accessible. The bad? It makes the technology more accessible.
🎵 ElevenLabs shares a preview of their new music model. The song lyrics “can we teach the machine to sing? can we teach the machine to dream?” and “this can’t be real” are especially on point.
Model Spec from OpenAI - documentation of OpenAI’s alignment strategy (or how to get AI to align with what humans want). The full document is here.
📄 Interesting Papers
Accurate structure prediction of biomolecular interactions with AlphaFold 3 - This paper introduces AlphaFold 3, a diffusion-based model that predicts the structure of proteins, DNA, RNA, ligands, and other biological molecules. In the release article, the authors discuss applications ranging from developing biorenewable materials and more resilient crops to accelerating drug design and genomics research. Blog with visualizations here. Authors from Google DeepMind and Isomorphic Labs.
Consistency Large Language Models: A Family of Efficient Parallel Decoders - Traditional LLMs are sequential decoders, generating one token after another. This paper introduces a family of parallel decoders that decode an n-token sequence per inference step, speeding up generation by 2.4-3.4x. Authors from UCSD.
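The underlying idea (Jacobi decoding) is to guess an n-token block and then refine all positions in parallel until the block stops changing; the paper's contribution is fine-tuning the model so those refinements converge in very few steps, which is where the speedup comes from. A rough sketch of the decoding loop, assuming a Hugging Face style causal LM (not the authors' code):

```python
import torch

def jacobi_decode_block(model, prompt_ids, n_tokens, pad_id=0):
    """Sketch of Jacobi-style parallel decoding: refine an n-token block in
    parallel on each forward pass until it reaches a fixed point, instead of
    generating one token per pass."""
    prompt_len = prompt_ids.shape[1]
    # Arbitrary initial guess for the n-token block.
    block = torch.full((1, n_tokens), pad_id, dtype=torch.long, device=prompt_ids.device)
    for _ in range(n_tokens):  # worst case degenerates to one token per pass
        logits = model(torch.cat([prompt_ids, block], dim=1)).logits
        # Position prompt_len - 1 + i predicts block token i, so one pass updates every position.
        new_block = logits[:, prompt_len - 1 : prompt_len - 1 + n_tokens, :].argmax(dim=-1)
        if torch.equal(new_block, block):  # fixed point: the block is self-consistent
            return new_block
        block = new_block
    return block
```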
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems - This paper introduces a family of approaches to AI safety that aim to produce AI systems with high-assurance quantitative safety guarantees. The authors describe three core components: a world model (a mathematical description of how the AI system affects the outside world), a safety specification (a mathematical description of which effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). Authors from MIT.
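To make the division of labor concrete, here is a minimal sketch of the three components as interfaces; the names and signatures are illustrative, not from the paper:

```python
from abc import ABC, abstractmethod

class WorldModel(ABC):
    """Mathematical description of how the AI system's actions affect the outside world."""

    @abstractmethod
    def predict(self, state, action):
        """Return a distribution over next world states given a state and an AI action."""


class SafetySpecification(ABC):
    """Mathematical description of which effects on the world are acceptable."""

    @abstractmethod
    def is_acceptable(self, trajectory) -> bool:
        """Return True if a predicted trajectory of world states stays within acceptable bounds."""


class Verifier(ABC):
    """Checks the AI system against the specification, relative to the world model."""

    @abstractmethod
    def verify(self, ai_system, world_model: WorldModel, spec: SafetySpecification):
        """Return an auditable proof certificate if the AI satisfies the spec, else None."""
```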
When Are Combinations of Humans and AI Useful? - A meta-analysis of over 100 recent studies on human-AI systems. For the data viz nerds out there, Figure 1 is a pretty cool visualization. Authors from MIT.
Takeaways:
Not always better together: on average, human-AI combinations performed worse than the best of humans or AI alone
Human-AI collaboration works for creating content, not for making decisions: in tasks that involved making decisions, the combinations showed performance losses, while in tasks that involved creating content, they showed gains
AI augments humans, but not necessarily the other way around: when humans outperformed AI alone, the combination showed performance gains, but when AI outperformed humans alone, it showed losses
🧠 Sources of Inspiration
Want to get in on the hype of AI wearables but don’t have $ to drop on a shiny (or matte) new device? Check out OpenGlass, an open source project to turn any glasses into AI-powered smart glasses for only $25 in components. Is it sleek? No. Will it be a conversation starter? Absolutely.
First like, sorry Ryan :)