🗞 This Week in News
Google embraces AI safety and interpretability research, with the return of key researchers like Neel Nanda:
Released Gemma 2 2B with built-in safety advancements, and ShieldGemma, a suite of safety content classifier models built on Gemma 2 that filter the inputs and outputs of AI models.
Released Gemma Scope – a new model interpretability tool. It looks very similar to the Mechanistic Interpretability work with sparse autoencoders that Anthropic released a few months ago. Read the blog here.
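For intuition, here is a minimal sketch of the sparse-autoencoder idea behind tools like Gemma Scope: a model's internal activations are encoded into a much wider, mostly-zero feature vector and then reconstructed. The dimensions, loss weights, and class names below are illustrative assumptions, not the actual Gemma Scope implementation.

```python
# Minimal sketch of a sparse autoencoder (SAE) of the kind used in
# mechanistic interpretability work. Dimensions and the L1 penalty are
# illustrative assumptions, not the Gemma Scope implementation.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 2304, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> wide feature vector
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstructed activations

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative features
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity,
    # so each feature (ideally) corresponds to an interpretable concept.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity

# Example: encode a small batch of (random stand-in) activations.
sae = SparseAutoencoder()
acts = torch.randn(8, 2304)
feats, recon = sae(acts)
print(sae_loss(acts, feats, recon).item())
```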
OpenAI’s “anti-cheating tool” has been teased but not yet released. It uses a process called “watermarking”: the watermarks are unnoticeable to the human eye but can be found with a detection algorithm. The detector provides a score of how likely it is that the entire document, or a portion of it, was written by ChatGPT. According to internal documents, the watermarks are 99.9% effective when enough new text is created by ChatGPT.
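OpenAI has not published how its detector works, but statistical text watermarking in the research literature (e.g. the “green list” scheme) biases token choices toward a pseudorandomly chosen subset of the vocabulary at generation time, and detection then counts how often a document’s tokens land in that subset. The sketch below illustrates that general idea only; the hashing scheme, tokenization, and split ratio are assumptions, not OpenAI’s method.

```python
# Rough sketch of statistical "green list" watermark detection.
# This is NOT OpenAI's unreleased detector; every detail here is an
# illustrative assumption.
import hashlib
import math

GREEN_FRACTION = 0.5  # assumed fraction of the vocabulary marked "green" per context

def is_green(prev_token: str, token: str) -> bool:
    # Pseudorandomly assign the token to the green list, seeded by the previous token.
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < GREEN_FRACTION

def watermark_z_score(tokens: list[str]) -> float:
    # Count how many tokens fall on their context's green list and compare
    # against the ~50% expected for unwatermarked text. A large positive
    # z-score suggests the text carries the watermark.
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    expected = n * GREEN_FRACTION
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / std

# Example: score a whitespace-tokenized document (real detectors use model tokenizers).
print(watermark_z_score("the quick brown fox jumps over the lazy dog".split()))
```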
Regulating AI News
“We recognize the importance of open systems.” The White House (report) says there is no need right now for restrictions on open-source AI. [News article]
The European Union’s risk-based regulation for applications of artificial intelligence kicked off on August 1. Article here.
🥁 Interesting Products & Features
GitHub Models - access SOTA models via a built-in playground that lets you test different prompts and model parameters, for free, right in GitHub. They have created a “glide path” to bring the models to your developer environment in Codespaces and VS Code, and finally to a production deployment via Azure AI. There is currently a waitlist for the public beta. They are calling it the tool for “AI Engineers”.
Stable Fast 3D - generates high-quality 3D assets from a single image in 0.5 seconds. Released under the Stability AI community license.
Remote Dictionary Cache - AI (RedCache-AI) provides an open-source dynamic memory framework for LLMs, allowing you to store, retrieve, update, and delete “memories” from chatbot interactions.
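To make the “memories” idea concrete, here is a hypothetical in-memory version of the store/retrieve/update/delete pattern such a framework exposes. It is not the RedCache-AI API; the class and method names are invented for illustration.

```python
# Hypothetical sketch of the memory-CRUD pattern a framework like
# RedCache-AI provides for chatbots. NOT the RedCache-AI API.
import uuid

class ChatMemoryStore:
    def __init__(self):
        self._memories: dict[str, str] = {}

    def store(self, text: str) -> str:
        memory_id = str(uuid.uuid4())
        self._memories[memory_id] = text
        return memory_id

    def retrieve(self, query: str) -> list[str]:
        # Naive keyword match; a real framework would use embeddings or a cache backend.
        return [m for m in self._memories.values() if query.lower() in m.lower()]

    def update(self, memory_id: str, text: str) -> None:
        self._memories[memory_id] = text

    def delete(self, memory_id: str) -> None:
        self._memories.pop(memory_id, None)

# Usage: persist a fact from a conversation and recall it later.
store = ChatMemoryStore()
mid = store.store("User prefers Python examples over JavaScript.")
print(store.retrieve("python"))
```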
📄 Interesting Papers
Add-SD: Rational Generation without Manual Reference - this paper proposes an instruction-based object addition pipeline, named Add-SD, which automatically inserts objects into realistic scenes with rational sizes and positions. Add-SD is conditioned solely on simple text prompts rather than bounding boxes. The authors propose a dataset containing numerous instructed image pairs, fine-tune a diffusion model for rational generation, and generate synthetic data to boost downstream tasks. Authors from Nanjing University of Science and Technology.
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning - this paper proposes a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development. They filter diverse and high-quality data from existing datasets and then refine and augment the data using LLM-based generation techniques to better align with practical QA tasks and visual encodings. Authors from Hong Kong University of Science and Technology.
Y Social: an LLM-powered Social Media Digital Twin - Soon you may not need to post that selfie or comment “congrats” on another social media announcement… In all seriousness, this paper is about a research tool - their digital twin “Y” is a powerful tool for researchers to simulate and understand complex online interactions. Y leverages LLMs to replicate sophisticated agent behaviors, enabling accurate simulations of user interactions, content dissemination, and network dynamics. This can give researchers insights into user engagement, information spread, and the impact of platform policies. Authors from various institutions, including CNR-ISTI and University of Pisa, Italy.
LLM as Runtime Error Handler: A Promising Pathway to Adaptive Self-Healing of Software Systems - this paper proposes Healer, the first LLM-assisted self-healing framework for handling runtime errors. When an unhandled runtime error occurs, Healer is activated to generate a piece of error-handling code with the help of its internal LLM; the code is then executed inside the framework’s runtime environment to obtain a rectified program state from which the program can continue execution. They show GPT-4 can successfully help programs recover from 72.8% of runtime errors. Authors from various institutions, including Singapore Management University and North Carolina State University.
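As a rough illustration of the approach (not the paper’s implementation), the sketch below wraps a program step, catches an unhandled exception, asks an LLM for error-handling code, and executes it against the current state. The ask_llm_for_handler function is a hypothetical stub you would replace with a real model API, and exec-ing generated code carries obvious safety caveats.

```python
# Crude sketch of the LLM-as-runtime-error-handler idea. The LLM call
# below is a hypothetical placeholder, not the Healer framework itself.
import traceback

def ask_llm_for_handler(error_report: str) -> str:
    # Placeholder: in a real system this would prompt an LLM (e.g. GPT-4)
    # with the error report and program state, and return handling code.
    return "state['value'] = 0  # fall back to a safe default"

def run_with_self_healing(step, state: dict) -> dict:
    try:
        step(state)
    except Exception:
        report = traceback.format_exc()
        handler_code = ask_llm_for_handler(report)
        # Execute the generated error-handling code against the current
        # program state to obtain a rectified state, then continue.
        exec(handler_code, {}, {"state": state})
    return state

# Example: a step that raises a ZeroDivisionError, then recovers via the handler.
print(run_with_self_healing(lambda s: s.update(value=1 / 0), {"value": None}))
```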
🧠 Sources of Inspiration
mishax - utility library for Mechanistic Interpretability research from Google DeepMind.
Cover image from Stability AI Stable Fast 3D.
A note from Brinnae (Dr. Bent) -
Thanks for reading this far!
Next week we will have a break in our regularly scheduled programming. We will be back for Duke’s orientation week (August 20).
If some of the concepts in this article interest you (i.e. interpretability and AI safety, mechanistic interpretability, watermarking) and you are a Duke student, consider taking my course on Emerging Trends in Explainable AI (see course trailer below). If you aren’t a Duke student or your schedule is full, stay tuned, I have a Coursera course with similar topics coming out this fall!