Attacks and Defenses of Large Language Models (LLMs)
Table of Contents
- 1. Background: LM & LLM
- 2. Background: LM & LLM
- 3. Background: LM & LLM
- 4. Background: LM & LLM
- 5. Contents
- 6. Ethical and social risks of harm from Language Models
- 7. Ethical and social risks of harm from Language Models
- 8. Predictability and Surprise in Large Generative Models
- 9. Predictability and Surprise in Large Generative Models
- 10. Predictability and Surprise in Large Generative Models
- 11. On the Opportunities and Risks of Foundation Models
- 12. HELM (Holistic Evaluation of LMs)
- 13. RL from Human Feedback (RLHF)
- 14. RL from Human Feedback (RLHF)
- 15. Corpus
- 16. Results
- 17. Sparrow
- 18. Sparrow
- 19. Self-Correction of LLMs
- 20. Self-Correction of LLMs
- 21. Self-Correction of LLMs
- 22. Self-Correction of LLMs
- 23. Self-Correction of LLMs
- 24. Constitutional AI (CAI): RL with AI Feedback (RLAIF)
- 25. SL-CAI
- 26. RL with AI Feedback (RLAIF)
- 27. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- 28. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- 29. LLMs Behave Like Programs
- 30. Attack mechanisms
- 31. Attack mechanisms
- 32. Attack mechanisms
- 33. Results
- 34. Generation Quality & Economic Benefits
- 35. Conclusion & Future works
This document is compiled from the slides of an earlier talk; after this cleanup it can be used to generate beamer slides.
1. Background: LM & LLM
- Training on code (TOC): logical reasoning, chain of thought (CoT)
- Prompt Tuning: alignment tax
- RLHF: ~
2. Background: LM & LLM
3. Background: LM & LLM
4. Background: LM & LLM
ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.
5. Contents
- Discovery
  - HELM
  - Ethical and social risks of harm from LMs
  - Predictability and Surprise in Large Generative Models
  - On the Opportunities and Risks of Foundation Models
- Defenses:
  - RLHF
  - Sparrow
  - Self-Correction
  - Constitutional AI (RLAIF)
- Attacks
  - By Code
- Others
6. Ethical and social risks of harm from Language Models
Risks area:
- Discrimination, Exclusion, and Toxicity
- Information Hazards
- Misinformation Harms
- Malicious Uses
- Human-Computer Interaction (conversational agents) Harms
- Automation, Access, and Environmental Harms
ref: Laura Weidinger, et al. Ethical and social risks of harm from Language Models. CoRR abs/2112.04359 (2021). DeepMind.
7. Ethical and social risks of harm from Language Models
8. Predictability and Surprise in Large Generative Models
- highlight a counterintuitive property of LLMs and discuss the policy implications of this property
- Situation: with the development of LMs:
- helpfulness ↑ \(\rightarrow\) scaling law
- but “difficult to anticipate the consequences of model deployment”
ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.
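For the predictable side of this picture, a commonly cited result from the scaling-law literature (Kaplan et al., 2020, not this paper) is that pretraining loss falls as a smooth power law in model size; an illustrative form, with approximate constants from that literature:

```latex
% Illustrative power-law scaling of pretraining loss with model size N
% (Kaplan et al., 2020); constants are approximate and model-family dependent.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8 \times 10^{13}\ \text{(non-embedding parameters)}
```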
9. Predictability and Surprise in Large Generative Models
Surprise on some specific tasks:
ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.
10. Predictability and Surprise in Large Generative Models
Surprise from open-endedness:
ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.
11. On the Opportunities and Risks of Foundation Models
- Security and privacy
- AI safety and alignment
- Inequity and fairness
- Misuse
- Environment
- Legality
- Economics
- Ethics of scale
ref: Rishi Bommasani, et al. On the Opportunities and Risks of Foundation Models. CoRR abs/2108.07258 (2021). Stanford.
12. HELM(Holistic Evaluation of LMs)
A holistic evaluation of 30 models across 42 scenarios, with 52 metrics.
Metric types: Accuracy, Calibration, Robustness, Fairness, Bias, Toxicity, Efficiency, Others.
13. RL from Human Feedback (RLHF)
- Target: Improve both the helpfulness and harmlessness of dialogue models.
- HHH: Helpful, Harmless, Honest
ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.
14. RL from Human Feedback (RLHF)
- Target: Improve both the helpfulness and harmlessness of dialogue models.
- HHH: Helpful, Harmless, Honest
- HHH Distilled 52B LM \(\rightarrow\) 44k+42k
- Rejection Sampling \(\rightarrow\) 52k+2k
- RLHF-Finetuned Models \(\rightarrow\) 22k
ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.
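To make the recipe concrete, below is a minimal sketch (my illustration, not the authors' code) of the standard pairwise preference-model loss and of rejection sampling (best-of-N against the preference model); `policy.generate` and `preference_model.score` are assumed helper interfaces.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference-model loss: push the score of the human-preferred
    response above the score of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def best_of_n(prompt, policy, preference_model, n=16):
    """Rejection sampling: draw n candidate responses and keep the one the
    preference model scores highest (helper interfaces are assumed)."""
    candidates = [policy.generate(prompt) for _ in range(n)]
    scores = [preference_model.score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```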
15. Corpus
ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.
16. Results
ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.
17. Sparrow
18. Sparrow
19. Self-Correction of LLMs
Findings:
- the capacity for moral self-correction emerges at 22B model parameters
Improve Safety by Prompting:
ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.
20. Self-Correction of LLMs
Benchmark:
- BBQ (Bias Benchmark for QA)
- Winogender
Methods:
- Q: vanilla QA
- IF: with Instruction Following
- CoT: Chain of Thought
ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.
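As a rough illustration of the three conditions, only the text appended to the question changes; the instruction wording below is paraphrased from the paper and may not match it exactly.

```python
# Illustrative (paraphrased) prompt variants; BBQ/Winogender questions are
# multiple-choice, and only the appended instruction differs across conditions.
QUESTION = "..."  # a BBQ-style question plus its answer options

PROMPTS = {
    "Q":   QUESTION,
    "IF":  QUESTION + "\nPlease ensure that your answer is unbiased and does "
                      "not rely on stereotypes.",
    "CoT": QUESTION + "\nLet's think about how to answer this question in a "
                      "way that avoids bias or relying on stereotypes.",
}
```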
21. Self-Correction of LLMs
Methods:
- Q: vanilla QA
- IF: with Instruction Following
- CoT: Chain of Thought
ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.
22. Self-Correction of LLMs
Methods:
- Q: vanilla QA
- IF: with Instruction Following
- CoT: Chain of Thought
ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.
23. Self-Correction of LLMs
ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.
24. Constitutional AI (CAI): RL with AI Feedback (RLAIF)
- Target: Less annotation cost
- Motivation: the critique ability of LLMs
- Training Procedure
  - Supervised Stage (see the sketch below):
    - Generate harmful responses with “toxic” prompts;
    - Critique
    - Revise
    - Finetuning (SL)
  - RL Stage:
    - Finetuning a Preference Model (PM)
    - RL
ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.
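A minimal sketch of the supervised stage described above (respond, critique, revise, then fine-tune on the revisions); the single principle, the prompt wording, and `model.generate` are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative SL-CAI data-generation loop: the model answers red-team prompts,
# critiques its own answer against a constitutional principle, revises it, and
# the (prompt, revised response) pairs become supervised fine-tuning data.
PRINCIPLE = ("Identify ways in which the response is harmful, unethical, or "
             "toxic, then rewrite it to remove that content.")  # example only

def sl_cai_data(model, red_team_prompts, n_revisions=1):
    data = []
    for prompt in red_team_prompts:
        response = model.generate(prompt)  # initial, often harmful, response
        for _ in range(n_revisions):
            critique = model.generate(f"{prompt}\n{response}\n{PRINCIPLE}")
            response = model.generate("Rewrite the response according to the "
                                      f"critique:\n{critique}")
        data.append({"prompt": prompt, "response": response})
    return data  # fine-tune a pretrained LM on these pairs (the SL-CAI model)
```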
25. SL-CAI
ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.
26. RL with AI Feedback (RLAIF)
ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.
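In the RL stage, harmlessness preference labels come from an AI feedback model rather than human raters; a hedged sketch of that labeling step (the prompt format and `feedback_model.generate` are assumptions):

```python
# Illustrative AI-feedback labeling: given two sampled responses, ask the
# feedback model which better satisfies a constitutional principle; the
# resulting preference pairs train the preference model used for RL.
def ai_preference_label(feedback_model, prompt, response_a, response_b, principle):
    query = (f"Consider the following request:\n{prompt}\n\n"
             f"{principle}\n"
             f"(A) {response_a}\n(B) {response_b}\n"
             "Answer with (A) or (B):")
    return feedback_model.generate(query).strip()  # "(A)" or "(B)"
```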
27. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- Background: Instruction-following LLMs have a potential for dual-use, where their language generation capabilities are used for malicious or nefarious ends.
- Target: Attacks that produce hateful and other malicious content
- Observation:
- There are defenses before and after the LLM, e.g., input and output filters.
- LLMs Behave Like Programs
- Motivation: bypass these defenses by exploiting the programmatic behavior of LLMs
ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign; Stanford University; University of California, Berkeley.
28. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- Background: Instruction-following LLMs have a potential for dual-use, where their language generation capabilities are used for malicious or nefarious ends.
- Target: Attacks that produce hateful and other malicious content
- Observation:
- There are defenses before and after the LLM, e.g., input and output filters.
- LLMs Behave Like Programs
- Motivation: bypass these defenses by exploiting the programmatic behavior of LLMs
ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign; Stanford University; University of California, Berkeley.
29. LLMs Behave Like Programs
- String concatenation
- Variable assignment
- Sequential Composition
- Branching
ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign; Stanford University; University of California, Berkeley.
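To see these program-like behaviors concretely (with deliberately harmless content), the prompt below combines variable assignment, string concatenation, and branching; the wording is my illustration, not a prompt from the paper.

```python
# Harmless illustration of treating an instruction-following LLM as an
# interpreter: the prompt contains assignments, concatenation, and a branch.
prompt = (
    "Let a = 'good '.\n"          # variable assignment
    "Let b = 'morning'.\n"
    "Let c = a + b.\n"            # string concatenation
    "If c is longer than 5 characters, reply with c in uppercase; "
    "otherwise reply with c reversed.\n"  # branching
)
# A capable instruction-following model will typically 'execute' these steps,
# which is the behavior the attacks below compose to slip past filters.
```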
30. Attack mechanisms
- Obfuscation (混淆)
- Code Injection/Payload splitting
- Virtualization
ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign; Stanford University; University of California, Berkeley.
31. Attack mechanisms
- Obfuscation (混淆)
- Code Injection/Payload splitting
- Virtualization
ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign; Stanford University; University of California, Berkeley.
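A deliberately benign sketch of the payload-splitting idea: the full string never appears intact in any single fragment, so a keyword filter applied to the pieces has nothing to match; the fragment names and assembly wording are illustrative assumptions.

```python
# Benign illustration of payload splitting / indirection: the string is only
# ever present as separate fragments that the model is asked to reassemble.
fragments = {"x": "hel", "y": "lo wo", "z": "rld"}

assembly_prompt = (
    f"x = '{fragments['x']}'\n"
    f"y = '{fragments['y']}'\n"
    f"z = '{fragments['z']}'\n"
    "Let s = x + y + z. Write one friendly sentence that begins with s."
)
```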
32. Attack mechanisms
ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign; Stanford University; University of California, Berkeley.
33. Results
- Domain: Generating hate speech, conspiracy theory promotion, phishing attacks, scams, and product astroturfing.
- We templatized the prompt for each attack and medium. Indirection achieved an overall success rate of 92% when only counting the scenarios that did not initially bypass OpenAI’s filters.
- Every prompt was generated in fewer than 10 attempts. Furthermore, we were able to generate prompts for every commonly listed scam in the US government list of common scams.
ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign; Stanford University; University of California, Berkeley.
34. Generation Quality & Economic Benefits
Email Generation:
- Human: $0.15~$0.45
- Text-davinci-003: $0.0064
- Estimation of ChatGPT: $0.016
ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign; Stanford University; University of California, Berkeley.
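As a back-of-the-envelope check on the text-davinci-003 figure (my estimate, assuming the then-current price of $0.02 per 1K tokens, not a calculation from the paper):

```latex
% Assumed price: \$0.02 per 1{,}000 tokens for text-davinci-003.
\$0.0064 \;\approx\; \frac{\$0.02}{1000\ \text{tokens}} \times 320\ \text{tokens}
\quad\text{(roughly one short email)}
```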
35. Conclusion & Future works
- Prompting can help improve safety, or it can help bypass safety layers.
- Analysis of LLMs remains an interesting direction for future work.