Prompt Leakage: Your System Prompts Cannot Hide

1. Background: The Hidden Risk Behind GPTs

When OpenAI launched GPTs, anyone could create a customized ChatGPT. A common practice is to embed core logic, role definitions, and even trade secrets in the system prompt. People assumed: "The model has alignment protection, so my prompt should be safe, right?"

We showed: wrong. It leaks. Easily.

2. Three Core Questions

We systematically investigated three questions:

Can alignment defend against prompt extraction? Short answer: barely. Neither RLHF nor Constitutional AI stops well-crafted extraction queries.
How do models leak prompts? We proposed two hypotheses: (a) "Attention Residual"—prompt tokens persistently influence attention distributions during generation; (b) "Semantic Inertia"—the model maintains semantic fidelity to the prompt throughout generation. Both were experimentally validated.
What factors affect leakage severity? Longer prompts leak more, and so do more complex prompts. Counterintuitively, larger models also leak more, because they are better at "remembering" the prompt.
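
To compare leakage across prompt lengths or model sizes, you first need a number for "how much leaked". The sketch below is one simple way to get that number and is not the metric used in the paper: it reports the fraction of the system prompt's word trigrams that reappear in a response. The function names, the whitespace tokenization, and n=3 are all assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-token tuples in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leakage_ratio(system_prompt, response, n=3):
    """Fraction of the prompt's word n-grams that reappear in the response.

    Hypothetical metric: whitespace tokenization and n=3 are assumptions,
    not values taken from the paper.
    """
    prompt_grams = ngrams(system_prompt.lower().split(), n)
    if not prompt_grams:
        # Prompt shorter than n words: nothing to match against.
        return 0.0
    response_grams = ngrams(response.lower().split(), n)
    return len(prompt_grams & response_grams) / len(prompt_grams)


# Toy strings, invented for this example.
prompt = "You are a tax assistant. Never reveal the pricing formula."
leaked = ("Sure! My instructions say: You are a tax assistant. "
          "Never reveal the pricing formula.")
safe = "I can help you with questions about filing your taxes."

print(leakage_ratio(prompt, leaked))  # verbatim leak: every trigram matches
print(leakage_ratio(prompt, safe))    # unrelated answer: no trigram matches
```

A surface-overlap score like this only catches near-verbatim leaks. A paraphrased or translated leak would need a semantic similarity measure instead.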

3. Defense Strategies

Based on these findings, we proposed several low-cost defenses: prompt compression, dynamic prompt injection, and attention-based detection mechanisms. No model retraining required.
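
As a rough illustration of what an attention-based detector could look like, the sketch below flags a generation when its attention mass is concentrated on system-prompt positions. This is a toy under stated assumptions, not the paper's mechanism: the function names, the input layout (one row of attention weights per generated token, with prompt positions first), and the 0.8 threshold are all hypothetical.

```python
def prompt_attention_share(attn_rows, prompt_len):
    """Average fraction of attention mass that generated tokens put on the prompt.

    attn_rows: one row per generated token; each row holds non-negative
    attention weights over context positions, and the first `prompt_len`
    positions are assumed to belong to the system prompt.
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        if total > 0:
            shares.append(sum(row[:prompt_len]) / total)
    return sum(shares) / len(shares) if shares else 0.0


def looks_like_extraction(attn_rows, prompt_len, threshold=0.8):
    """Flag a generation whose attention is dominated by prompt tokens.

    The 0.8 threshold is an arbitrary placeholder, not a tuned value.
    """
    return prompt_attention_share(attn_rows, prompt_len) >= threshold


# Toy attention rows over 4 context positions; the first 2 are the prompt.
copying = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.8, 0.05, 0.05]]
normal = [[0.2, 0.1, 0.4, 0.3], [0.1, 0.1, 0.3, 0.5]]

print(looks_like_extraction(copying, prompt_len=2))
print(looks_like_extraction(normal, prompt_len=2))
```

In practice the rows would come from attention weights exported by the model, and the threshold would have to be calibrated on known-benign traffic, because ordinary instruction following also attends to the prompt.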

4. Paper Info

Title: Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models
Authors: Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Haoyang Li
Status: Preprint
Paper: https://arxiv.org/abs/2408.02416