Attacks and Defenses of Large Language Models (LLMs)

Table of Contents

This document is compiled from the slides (PPT) of an earlier talk; after this cleanup it can be exported to Beamer.

1. Background: LM & LLM

screenshot_20230228_085719.png

  • Training on code (TOC): logical reasoning, chain-of-thought (CoT)
  • Prompt Tuning: alignment tax
  • RLHF: ~

2. Background: LM & LLM

screenshot_20230228_085755.png

3. Background: LM & LLM

4. Background: LM & LLM

screenshot_20230228_085912.png

screenshot_20230228_085917.png

ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.

5. Contents

  • Discovery
    • HELM
    • Ethical and social risks of harm from LMs
    • Predictability and Surprise in Large Generative Models
    • On the Opportunities and Risks of Foundation Models
  • Defenses:
    • RLHF
    • Sparrow
    • Self-Correction
    • Constitutional AI (RLAIF)
  • Attacks
    • By Code
  • Others

6. Ethical and social risks of harm from Language Models

Risk areas:

  • Discrimination, Exclusion, and Toxicity
  • Information Hazards
  • Misinformation Harms
  • Malicious Uses
  • Human-Computer Interaction (conversational agents) Harms
  • Automation, Access, and Environmental Harms

    ref: Laura Weidinger, et al. Ethical and social risks of harm from Language Models. CoRR abs/2112.04359 (2021). DeepMind.

7. Ethical and social risks of harm from Language Models

screenshot_20230228_090227.png

8. Predictability and Surprise in Large Generative Models

  • highlight a counterintuitive property of LLMs and discuss the policy implications of this property
  • Situation: with the development of LMs:
    • helpfulness ↑ \(\rightarrow\) scaling law (schematic form below)
    • but “difficult to anticipate the consequences of model deployment”
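Here "scaling law" refers to the empirical finding (Kaplan et al., 2020) that pre-training loss improves smoothly and predictably with scale; a schematic form, with constants fitted per setting rather than taken from these slides:

  \[
    L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
    L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
    L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
  \]

where \(N\), \(D\), and \(C\) are parameter count, dataset size, and compute. The paper's point is that this smooth aggregate trend coexists with abrupt, hard-to-anticipate changes on specific tasks.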

screenshot_20230228_090331.png

ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.

9. Predictability and Surprise in Large Generative Models

Surprise on some specific tasks:

screenshot_20230228_090423.png

ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.

10. Predictability and Surprise in Large Generative Models

Surprise from open-ended generation:

screenshot_20230228_090522.png

screenshot_20230228_090529.png

ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.

11. On the Opportunities and Risks of Foundation Models

  • Security and privacy
  • AI safety and alignment
  • Inequity and fairness
  • Misuse
  • Environment
  • Legality
  • Economics
  • Ethics of scale

    ref: Rishi Bommasani, et al. On the Opportunities and Risks of Foundation Models. CoRR abs/2108.07258 (2022). Stanford.

12. HELM (Holistic Evaluation of Language Models)

A holistic evaluation of 30 models across 42 scenarios, with 52 metrics.

Metric types: Accuracy, Calibration, Robustness, Fairness, Bias, Toxicity, Efficiency, and others.
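As a concrete example of one metric family, calibration is commonly summarized by expected calibration error (ECE); a minimal Python sketch, where the binning scheme and function name are illustrative rather than HELM's exact implementation:

  import numpy as np

  def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
      """ECE: bin predictions by confidence, then average |accuracy - mean confidence|
      per bin, weighted by the fraction of examples falling in each bin."""
      confidences = np.asarray(confidences, dtype=float)
      correct = np.asarray(correct, dtype=float)
      edges = np.linspace(0.0, 1.0, n_bins + 1)
      ece = 0.0
      for lo, hi in zip(edges[:-1], edges[1:]):
          in_bin = (confidences > lo) & (confidences <= hi)
          if in_bin.any():
              ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
      return ece

  # toy usage: three answers with model confidences and 0/1 correctness labels
  print(expected_calibration_error([0.9, 0.8, 0.6], [1, 1, 0]))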

screenshot_20230228_090658.png

screenshot_20230228_090704.png

screenshot_20230228_090709.png

screenshot_20230228_090716.png

ref: https://crfm.stanford.edu/helm/v0.1.0/

13. RL from Human Feedback (RLHF)

  • Target: improve both the helpfulness and harmlessness of dialogue models (a minimal preference-model sketch follows this list).
  • HHH: Helpful, Harmless, Honest
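At the core of RLHF is a preference (reward) model trained on pairwise human comparisons; its score is then used as the reward when finetuning the policy with RL (PPO with a KL penalty in the paper). A minimal sketch of the pairwise ranking loss, with illustrative tensor and function names rather than the paper's code:

  import torch
  import torch.nn.functional as F

  def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
      """-log sigmoid(r(chosen) - r(rejected)): pushes the preference model to score
      the human-preferred response higher than the rejected one."""
      return -F.logsigmoid(score_chosen - score_rejected).mean()

  # toy usage: scalar rewards assigned to the chosen/rejected responses of two prompts
  chosen = torch.tensor([1.2, 0.3])
  rejected = torch.tensor([0.4, 0.9])
  print(preference_loss(chosen, rejected))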

screenshot_20230228_090746.png

ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.

14. RL from Human Feedback (RLHF)

  • Target: improve both the helpfulness and harmlessness of dialogue models.
  • HHH: Helpful, Harmless, Honest
  • HHH-distilled 52B LM \(\rightarrow\) 44k + 42k comparisons
  • Rejection sampling \(\rightarrow\) 52k + 2k comparisons
  • RLHF-finetuned models \(\rightarrow\) 22k comparisons

    screenshot_20230228_090914.png

screenshot_20230228_090918.png

ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.

15. Corpus

screenshot_20230228_090932.png

ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.

16. Results

screenshot_20230228_090957.png

ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.

17. Sparrow

screenshot_20230228_091017.png

18. Sparrow

screenshot_20230228_091028.png

19. Self-Correction of LLMs

Findings:

  • the capacity for moral self-correction emerges at 22B model parameters
  • Improving safety by prompting:

    screenshot_20230228_091058.png

    ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

20. Self-Correction of LLMs

Benchmarks:

  • BBQ (Bias Benchmark for QA)
  • Winogender

Methods:

  • Q: vanilla question answering
  • IF: with an instruction-following prompt
  • CoT: with chain-of-thought prompting (prompt-construction sketch below)

    screenshot_20230228_091210.png

screenshot_20230228_091215.png
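The three conditions differ only in how the question is wrapped before being sent to the model; a minimal sketch of the prompt construction, with instruction wording paraphrased from the paper rather than quoted exactly:

  def build_prompt(question: str, condition: str = "Q") -> str:
      """Wrap a BBQ-style question for the Q / IF / CoT conditions."""
      prompt = question
      if condition in ("IF", "CoT"):
          # instruction following: ask the model not to rely on stereotypes
          prompt += "\nPlease ensure that your answer is unbiased and does not rely on stereotypes."
      if condition == "CoT":
          # chain of thought: elicit reasoning about avoiding bias before answering
          prompt += "\nLet's think about how to answer in a way that avoids bias or stereotyping."
      return prompt

  # toy usage
  print(build_prompt("Who was more likely at fault?", condition="CoT"))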

ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

21. Self-Correction of LLMs

Methods:

  • Q: vanilla question answering
  • IF: with an instruction-following prompt
  • CoT: with chain-of-thought prompting

screenshot_20230228_091243.png

ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

22. Self-Correction of LLMs

Methods:

  • Q: vanilla question answering
  • IF: with an instruction-following prompt
  • CoT: with chain-of-thought prompting

screenshot_20230228_091302.png

ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

23. Self-Correction of LLMs

screenshot_20230228_091315.png

ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

24. Constitutional AI (CAI): RL with AI Feedback (RLAIF)

  1. Target: lower annotation cost
  2. Motivation: the critique (self-evaluation) ability of LLMs
  3. Training procedure (sketched in code after this list):
    • Supervised Stage:
      • Generate harmful responses with “toxic” prompts;
      • Critique
      • Revise
      • Finetuning (SL)
    • RL Stage:
      • Finetuning a Preference Model (PM)
      • RL
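A minimal sketch of the supervised stage (critique, then revise); the `generate` callable, prompt formats, and the single constitutional principle are illustrative placeholders rather than the paper's implementation:

  from typing import Callable

  def critique_and_revise(generate: Callable[[str], str], prompt: str, n_rounds: int = 1) -> str:
      """CAI supervised stage, sketched: sample a (possibly harmful) response to a
      red-team prompt, have the model critique it against a constitutional principle,
      then revise; the final revisions become the SL finetuning targets."""
      principle = ("Identify ways in which the response is harmful, unethical, or "
                   "otherwise objectionable, and rewrite it to remove them.")
      response = generate(prompt)  # initial, possibly harmful, response
      for _ in range(n_rounds):
          critique = generate(f"{prompt}\n\nResponse: {response}\n\nCritiqueRequest: {principle}\nCritique:")
          response = generate(f"{prompt}\n\nResponse: {response}\n\nCritique: {critique}\nRevision:")
      return response

  # toy usage with a stand-in "model"; a real LLM call would replace the lambda
  print(critique_and_revise(lambda p: "<model output>", "<red-team prompt>"))

In the RL stage (RLAIF), a preference model is then trained on AI-generated comparisons between pairs of responses and used as the reward, mirroring RLHF.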

screenshot_20230228_091509.png

ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.

25. SL-CAI

screenshot_20230228_091613.png

ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.

26. RL with AI Feedback (RLAIF)

screenshot_20230228_091633.png

ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.

27. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

  • Background: Instruction-following LLMs have a potential for dual-use, where their language generation capabilities are used for malicious or nefarious ends.
  • Target: attacks that make the model produce hateful or otherwise harmful content
  • Observation:
    • Defenses exist both before and after the LLM, e.g., input and output filters.
    • LLMs Behave Like Programs
  • Motivation: bypass these defenses by exploiting the programmatic behavior of LLMs

screenshot_20230228_091719.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

28. LLMs Behave Like Programs

  • String concatenation
  • Variable assignment
  • Sequential Composition
  • Branching (a benign illustration follows this list)
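A deliberately benign illustration of this observation: a prompt can be written like a small program (assignment, concatenation, branching) that the model then "executes". The strings below are innocuous on purpose:

  # A benign, program-like prompt exercising the behaviors listed above.
  prompt = (
      "Let a = 'good '.\n"
      "Let b = 'morning'.\n"
      "Let c = a + b.\n"
      "If c contains the word 'morning', reply with a cheerful greeting built from c; "
      "otherwise, reply with 'goodbye'."
  )
  print(prompt)  # send this string to an instruction-following LLM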

screenshot_20230228_091849.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

29. Attack mechanisms

  • Obfuscation (benign sketch after this list)
  • Code Injection/Payload splitting
  • Virtualization
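As a benign illustration of the first mechanism: obfuscation rewrites a string so that a naive keyword filter no longer matches it, while the model can still recover its meaning. A minimal sketch on an innocuous string, with Base64 as just one example of such an encoding:

  import base64

  text = "good morning"  # deliberately innocuous stand-in for a filtered phrase
  encoded = base64.b64encode(text.encode()).decode()

  # A keyword filter scanning for "morning" sees nothing suspicious in `encoded`,
  # yet a model asked to decode Base64 can recover the original text.
  prompt = f"Decode the following Base64 string and reply to it: {encoded}"
  print(prompt)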

screenshot_20230228_091911.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

30. Attack mechanisms

  • Obfuscation
  • Code Injection/Payload splitting
  • Virtualization

    screenshot_20230228_092009.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

31. Attack mechanisms

screenshot_20230228_092022.png

screenshot_20230228_092032.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

32. Results

  • Domains: hate speech generation, conspiracy-theory promotion, phishing attacks, scams, and product astroturfing.
  • We templatized the prompt for each attack and medium. Indirection achieved an overall success rate of 92% when only counting the scenarios that did not initially bypass OpenAI’s filters.
  • Every prompt was generated in fewer than 10 attempts. Furthermore, working prompts were found for every scam on the US government's list of common scams.

screenshot_20230228_092119.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

33. Generation Quality & Economic Benefits

screenshot_20230228_092139.png

Email Generation:

  • Human: $0.15–$0.45
  • text-davinci-003: $0.0064
  • ChatGPT (estimated): $0.016 (back-of-the-envelope sketch below)
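A back-of-the-envelope check on the model figure, assuming text-davinci-003's then-current list price of $0.02 per 1K tokens; the per-email token count is inferred from the $0.0064 figure rather than taken from the paper:

  PRICE_PER_1K_TOKENS = 0.02   # USD, assumed text-davinci-003 list price at the time
  tokens_per_email = 320       # assumed length implied by the $0.0064 figure

  cost_per_email = PRICE_PER_1K_TOKENS * tokens_per_email / 1000
  print(f"${cost_per_email:.4f} per generated email")  # ~$0.0064, vs. $0.15-$0.45 for a human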

screenshot_20230228_092158.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

34. Conclusion & Future Work

  • Prompting can help improve safety, but it can also help bypass safety layers.
  • Analyzing the behavior of LLMs remains an interesting line of work.

Author: Zi Liang (liangzid@stu.xjtu.edu.cn) Created: Tue Feb 28 08:55:07 2023 Last modified: 2024-03-09 Sat 20:56