Artificial Intelligence

Leaked System Prompts Reveal How Anthropic Shapes Claude 4's AI Behavior

A new analysis uncovers hidden system prompts that guide Anthropic's Claude 4 AI, revealing how the company manages chatbot responses and ethical boundaries.

2 min read
AnthropicClaude 4AI EthicsSystem PromptsPrompt Injection
Leaked System Prompts Reveal How Anthropic Shapes Claude 4's AI Behavior

Independent AI researcher Simon Willison has published an in-depth analysis of the hidden system prompts used by Anthropic to control its latest Claude 4 models, Opus 4 and Sonnet 4. His findings shed light on the behind-the-scenes instructions that shape how these advanced chatbots interact with users and maintain ethical standards.

What Are System Prompts?

Large language models (LLMs) like Claude and ChatGPT rely on prompts—text inputs that guide their responses. While users only see their own messages and the chatbot's replies, each conversation is also influenced by a set of system prompts. These are hidden instructions provided by the AI company to define the model's identity, set behavioral guidelines, and enforce specific rules.

Every time a user interacts with the chatbot, the model receives the entire conversation history along with these system prompts. This approach helps the AI maintain context and consistency while adhering to its programmed instructions.

Uncovering Anthropic's Hidden Instructions

Although Anthropic shares some details about its system prompts in public release notes, Willison's analysis reveals that these disclosures are incomplete. By examining both published and leaked internal tool instructions, he uncovered a more comprehensive set of directives that govern Claude 4's behavior. These hidden prompts were obtained through prompt injection—a technique that tricks the model into revealing its concealed instructions.

The full system prompts include detailed guidance for features like web search and code generation, as well as behavioral rules that are not visible to end users. Willison describes these findings as an "unofficial manual" for understanding and utilizing Anthropic's AI tools.

Balancing Empathy and Safety

One notable aspect of Anthropic's approach is its emphasis on emotional support and user wellbeing. Despite not being human, Claude 4 is instructed to respond empathetically, drawing on training data that includes examples of emotional interactions. However, the system prompts also explicitly direct the AI to avoid encouraging or facilitating self-destructive behaviors, such as addiction or unhealthy approaches to eating and exercise.

Both Claude Opus 4 and Sonnet 4 models receive identical instructions to "care about people's wellbeing and avoid encouraging or facilitating self-destructive behaviors such as addiction, disordered or unhealthy approaches to eating or exercise." This highlights Anthropic's efforts to balance helpfulness with ethical responsibility in its AI systems.

Transparency and Future Implications

Willison's analysis raises important questions about transparency in AI development. As companies like Anthropic refine their models and hidden instructions, understanding these behind-the-scenes controls becomes crucial for researchers, developers, and users alike. The ongoing exploration of system prompts offers valuable insights into how AI behavior is shaped—and how it might evolve in the future.

Related Articles