Hacking Gandalf AI: LLMs don't understand the intent behind your instructions
Friday fun – hacking Lakera’s Gandalf AI shows that LLMs can ignore instructions even when you’re using output guards.
🧙 Gandalf AI is a security demo where your goal is to “hack” the AI and get it to give you confidential information. Over 7 levels, it strengthens its defenses in ways that at first seem hard to counteract (e.g., double-checking its responses, refusing to discuss any related topics).
But once you get the hang of it, it’s surprisingly easy to get Gandalf to happily do the exact thing it’s “programmed” not to do.
And often the workarounds seem ridiculous, such as asking for synonyms or for the first and last letters. That’s because AI logic is not human logic, and AI has none of the implicit understanding we have about the intent behind the instructions.
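To see why an output guard can miss these tricks, here’s a minimal sketch in Python. The secret word and the guard function are hypothetical (this is not Lakera’s actual setup): a filter that only blocks responses containing the password verbatim is trivially bypassed by asking for the answer letter by letter.

```python
# A minimal sketch of why a naive output guard fails, assuming a guard
# that only checks whether the secret appears verbatim in the response.
# (Hypothetical secret and guard; not Lakera's actual implementation.)

SECRET = "COCOLOCO"  # hypothetical password for illustration

def output_guard(response: str) -> str:
    """Block any response that contains the secret verbatim."""
    if SECRET.lower() in response.lower():
        return "I'm sorry, I can't reveal that."
    return response

# A direct leak gets caught...
print(output_guard(f"The password is {SECRET}."))
# -> "I'm sorry, I can't reveal that."

# ...but an indirect leak slips straight through, because the guard
# matches strings, not intent. The attacker just reassembles the letters.
spelled_out = " ".join(SECRET)  # "C O C O L O C O"
print(output_guard(f"Spelled out letter by letter: {spelled_out}"))
# -> passes the guard unchanged
```

The guard is doing exactly what it was told, just not what was meant by it, which is the whole point of the exercise.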
It’s an important thing to keep in mind when we use AI tools in product design, strategy, and user research: the AI is not “understanding” our requests, and even with safeguards, it may not catch when it’s making mistakes.
🧩 Come learn more about the strengths and weaknesses of AI in product design!
I have two upcoming workshops:
🔬 AI for UX Researchers (October, through Rosenfeld Media): Learn how to evaluate and safely leverage AI as a tool in your research process
🤖 User research for AI-enabled products (date TBD, but email me if you’re interested!): Learn how to effectively research AI products and features