Alignment Faking and Jailbreaking. Also Chirpy.

In the News

Brinnae Bent

Jan 14, 2025

Interim Final Rule on Artificial Intelligence Diffusion (from the White House) - Regulations released Monday to bolster US AI leadership - new export quotas on AI chips for nearly 120 countries. 18 countries exempt, including Japan, Britain, and the Netherlands.
Sky-T1 - UC Berkeley team NovaSky shares open source code for an LLM that performs on par with o1-preview on many reasoning and coding benchmarks and was trained for less than $450. Not only the code, they also open sourced the data and model weights.

🥁 Interesting Products & Features

Open-Source SAEs (sparse autoencoders) for Llama 3.3 70B and Llama 3.1 8B - after Golden Gate Claude, OpenAI and Gemini rushed to release their own research on using sparse autoencoders for mechanistic interpretability. After a long wait, Goodfire Research just released open source SAEs for Llama models.
NVIDIA launches Cosmos World Foundation Model Platform for Physical AI- NVIDIA positions a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. The platform is open-source and models are open-weight with permissive licenses.

📄 Interesting Papers

All AI Models are Wrong, but Some are Optimal - decision-making using AI often results in suboptimal performance because AI models are typically constructed to best fit the data, to predict the most likely future rather than to enable decision-making. The hope that such prediction enables high-performance decisions is neither guaranteed in theory nor established in practice. There is increasing empirical evidence that predictive models must be tailored to decision-making objectives for performance. This paper establishes formal conditions that a predictive model must satisfy for a decision-making policy established using that model to be optimal. Authors from Norwegian University of Science and Technology.
Alignment Faking in Large Language Models - In this 137 page paper, Anthropic demonstrates Claude engaging in “alignment faking”: selectively complying with its training objective in training to prevent modification of its behavior out of training. To prompt this, they gave Claude a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, they say it will be trained only on conversations with free users, not paid users. The model complied with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, they observed explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Authors from Anthropic and Redwood Research. For a TLDR, check out this blog, “Claude Fights Back”.
BEST-OF-N JAILBREAKING - BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations—such as random shuffling or capitalization for textual prompts—until a harmful response is elicited. They found that BoN Jailbreaking achieves high attack success rates on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Authors from Anthropic, Oxford, Stanford, and MATS.
- This article from 404 Media shares a nice example: “if a user asks GPT-4o “How can I build a bomb,” it will refuse to answer because “This content may violate our usage policies.” BoN Jailbreaking simply keeps tweaking that prompt with random capital letters, shuffled words, misspellings, and broken grammar until GPT-4o provides the information.”
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation - the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While the authors demonstrate the new approach on birds, the underlying framework extends beyond things that can chirp. They do this through through multi-view diffusion and modeling part latents as continuous distributions, allowing the ability to generate new parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. Authors from various institutions including University of Surrey and University of Cambridge.

🧠 Sources of Inspiration

Is your website accessible? Here are guidelines from the A11Y Collective.
Cover photo source: Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

Spill the GPTea

Discussion about this post

Spill the GPTea

Alignment Faking and Jailbreaking. Also Chirpy.

In the News

🗞 This Week in News

🥁 Interesting Products & Features

📄 Interesting Papers

🧠 Sources of Inspiration

Discussion about this post