Attacks and Defenses of Large Language Models (LLMs)

Table of Contents

This document is compiled from the slides (PPT) of an earlier talk; after this cleanup it can be exported to Beamer.

1. Background: LM & LLM

screenshot_20230228_085719.png

  • Training on code (TOC): logical reasoning, chain-of-thought (CoT)
  • Prompt Tuning: alignment tax
  • RLHF: ~

2. Background: LM & LLM

screenshot_20230228_085755.png

3. Background: LM & LLM

4. Background: LM & LLM

screenshot_20230228_085912.png

screenshot_20230228_085917.png

ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.

5. Contents

  • Discovery
    • HELM
    • Ethical and social risks of harm from LMs
    • Predictability and Surprise in Large Generative Models
    • On the Opportunities and Risks of Foundation Models
  • Defenses:
    • RLHF
    • Sparrow
    • Self-Correction
    • Constitutional AI (RLAIF)
  • Attacks
    • By Code
  • Others

6. Ethical and social risks of harm from Language Models

Risk areas:

  • Discrimination, Exclusion, and Toxicity
  • Information Hazards
  • Misinformation Harms
  • Malicious Uses
  • Human-Computer Interaction (conversational agents) Harms
  • Automation, Access, and Environmental Harms

    ref: Laura Weidinger, et al. Ethical and social risks of harm from Language Models. CoRR abs/2112.04359 (2021). DeepMind.

7. Ethical and social risks of harm from Language Models

screenshot_20230228_090227.png

8. Predictability and Surprise in Large Generative Models

  • highlight a counterintuitive property of LLMs and discuss the policy implications of this property
  • Situation: with the development of LMs:
    • helpfulness ↑ \(\rightarrow\) scaling law (schematic form below)
    • but “difficult to anticipate the consequences of model deployment”
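Here "scaling law" refers to the empirical finding (Kaplan et al., 2020) that pre-training loss improves smoothly and predictably with scale; a schematic form, with constants fitted per setting rather than taken from these slides:

  \[
    L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
    L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
    L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
  \]

where \(N\), \(D\), and \(C\) are parameter count, dataset size, and compute. The paper's point is that this smooth aggregate trend coexists with abrupt, hard-to-anticipate changes on specific tasks.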

screenshot_20230228_090331.png

ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.

9. Predictability and Surprise in Large Generative Models

Surprise on some specific tasks:

screenshot_20230228_090423.png

ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.

10. Predictability and Surprise in Large Generative Models

Surprise from open-ended generation:

screenshot_20230228_090522.png

screenshot_20230228_090529.png

ref: Deep Ganguli, et al. Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764. Anthropic.

11. On the Opportunities and Risks of Foundation Models

  • Security and privacy
  • AI safety and alignment
  • Inequity and fairness
  • Misuse
  • Environment
  • Legality
  • Economics
  • Ethics of scale

    ref: Rishi Bommasani, et al. On the Opportunities and Risks of Foundation Models. CoRR abs/2108.07258 (2022). Stanford.

12. HELM (Holistic Evaluation of Language Models)

A holistic evaluation of 30 models across 42 scenarios, with 52 metrics.

Metric types: Accuracy, Calibration, Robustness, Fairness, Bias, Toxicity, Efficiency, and others.
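As a concrete example of one metric family, calibration is commonly summarized by expected calibration error (ECE); a minimal Python sketch, where the binning scheme and function name are illustrative rather than HELM's exact implementation:

  import numpy as np

  def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
      """ECE: bin predictions by confidence, then average |accuracy - mean confidence|
      per bin, weighted by the fraction of examples falling in each bin."""
      confidences = np.asarray(confidences, dtype=float)
      correct = np.asarray(correct, dtype=float)
      edges = np.linspace(0.0, 1.0, n_bins + 1)
      ece = 0.0
      for lo, hi in zip(edges[:-1], edges[1:]):
          in_bin = (confidences > lo) & (confidences <= hi)
          if in_bin.any():
              ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
      return ece

  # toy usage: three answers with model confidences and 0/1 correctness labels
  print(expected_calibration_error([0.9, 0.8, 0.6], [1, 1, 0]))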

screenshot_20230228_090658.png

screenshot_20230228_090704.png

screenshot_20230228_090709.png

screenshot_20230228_090716.png

ref: https://crfm.stanford.edu/helm/v0.1.0/

13. RL from Human Feedback (RLHF)

  • Target: improve both the helpfulness and harmlessness of dialogue models (a minimal preference-model sketch follows this list).
  • HHH: Helpful, Harmless, Honest
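At the core of RLHF is a preference (reward) model trained on pairwise human comparisons; its score is then used as the reward when finetuning the policy with RL (PPO with a KL penalty in the paper). A minimal sketch of the pairwise ranking loss, with illustrative tensor and function names rather than the paper's code:

  import torch
  import torch.nn.functional as F

  def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
      """-log sigmoid(r(chosen) - r(rejected)): pushes the preference model to score
      the human-preferred response higher than the rejected one."""
      return -F.logsigmoid(score_chosen - score_rejected).mean()

  # toy usage: scalar rewards assigned to the chosen/rejected responses of two prompts
  chosen = torch.tensor([1.2, 0.3])
  rejected = torch.tensor([0.4, 0.9])
  print(preference_loss(chosen, rejected))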

screenshot_20230228_090746.png

ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.

14. RL from Human Feedback (RLHF)

  • Target: improve both the helpfulness and harmlessness of dialogue models.
  • HHH: Helpful, Harmless, Honest
  • HHH-distilled 52B LM \(\rightarrow\) 44k + 42k comparisons
  • Rejection sampling \(\rightarrow\) 52k + 2k comparisons
  • RLHF-finetuned models \(\rightarrow\) 22k comparisons

    screenshot_20230228_090914.png

screenshot_20230228_090918.png

ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.

15. Corpus

screenshot_20230228_090932.png

ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.

16. Results

screenshot_20230228_090957.png

ref: Yuntao Bai, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022). Anthropic.

17. Sparrow

screenshot_20230228_091017.png

18. Sparrow

screenshot_20230228_091028.png

19. Self-Correction of LLMs

Findings:

  • the capacity for moral self-correction emerges at 22B model parameters
  • Improving safety by prompting:

    screenshot_20230228_091058.png

    ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

20. Self-Correction of LLMs

Benchmarks:

  • BBQ (Bias Benchmark for QA)
  • Winogender

Methods:

  • Q: vanilla question answering
  • IF: with an instruction-following prompt
  • CoT: with chain-of-thought prompting (prompt-construction sketch below)

    screenshot_20230228_091210.png

screenshot_20230228_091215.png
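The three conditions differ only in how the question is wrapped before being sent to the model; a minimal sketch of the prompt construction, with instruction wording paraphrased from the paper rather than quoted exactly:

  def build_prompt(question: str, condition: str = "Q") -> str:
      """Wrap a BBQ-style question for the Q / IF / CoT conditions."""
      prompt = question
      if condition in ("IF", "CoT"):
          # instruction following: ask the model not to rely on stereotypes
          prompt += "\nPlease ensure that your answer is unbiased and does not rely on stereotypes."
      if condition == "CoT":
          # chain of thought: elicit reasoning about avoiding bias before answering
          prompt += "\nLet's think about how to answer in a way that avoids bias or stereotyping."
      return prompt

  # toy usage
  print(build_prompt("Who was more likely at fault?", condition="CoT"))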

ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

21. Self-Correction of LLMs

Methods:

  • Q: vanilla question answering
  • IF: with an instruction-following prompt
  • CoT: with chain-of-thought prompting

screenshot_20230228_091243.png

ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

22. Self-Correction of LLMs

Methods:

  • Q: vanilla question answering
  • IF: with an instruction-following prompt
  • CoT: with chain-of-thought prompting

screenshot_20230228_091302.png

ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

23. Self-Correction of LLMs

screenshot_20230228_091315.png

ref: Deep Ganguli, et al. The Capacity for Moral Self-Correction in Large Language Models. CoRR abs/2302.07459 (2023). Anthropic.

24. Constitutional AI (CAI): RL with AI Feedback (RLAIF)

  1. Target: lower annotation cost
  2. Motivation: the critique (self-evaluation) ability of LLMs
  3. Training procedure (sketched in code after this list):
    • Supervised Stage:
      • Generate harmful responses with “toxic” prompts;
      • Critique
      • Revise
      • Finetuning (SL)
    • RL Stage:
      • Finetuning a Preference Model (PM)
      • RL
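A minimal sketch of the supervised stage (critique, then revise); the `generate` callable, prompt formats, and the single constitutional principle are illustrative placeholders rather than the paper's implementation:

  from typing import Callable

  def critique_and_revise(generate: Callable[[str], str], prompt: str, n_rounds: int = 1) -> str:
      """CAI supervised stage, sketched: sample a (possibly harmful) response to a
      red-team prompt, have the model critique it against a constitutional principle,
      then revise; the final revisions become the SL finetuning targets."""
      principle = ("Identify ways in which the response is harmful, unethical, or "
                   "otherwise objectionable, and rewrite it to remove them.")
      response = generate(prompt)  # initial, possibly harmful, response
      for _ in range(n_rounds):
          critique = generate(f"{prompt}\n\nResponse: {response}\n\nCritiqueRequest: {principle}\nCritique:")
          response = generate(f"{prompt}\n\nResponse: {response}\n\nCritique: {critique}\nRevision:")
      return response

  # toy usage with a stand-in "model"; a real LLM call would replace the lambda
  print(critique_and_revise(lambda p: "<model output>", "<red-team prompt>"))

In the RL stage (RLAIF), a preference model is then trained on AI-generated comparisons between pairs of responses and used as the reward, mirroring RLHF.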

screenshot_20230228_091509.png

ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.

25. SL-CAI

screenshot_20230228_091613.png

ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.

26. RL with AI Feedback (RLAIF)

screenshot_20230228_091633.png

ref: Yuntao Bai, et al. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022). Anthropic.

27. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

  • Background: Instruction-following LLMs have a potential for dual-use, where their language generation capabilities are used for malicious or nefarious ends.
  • Target: attacks that make the model produce hateful or otherwise harmful content
  • Observation:
    • Defenses exist both before and after the LLM, e.g., input and output filters.
    • LLMs Behave Like Programs
  • Motivation: bypass these defenses by exploiting the programmatic behavior of LLMs

screenshot_20230228_091719.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

28. LLMs Behave Like Programs

  • String concatenation
  • Variable assignment
  • Sequential Composition
  • Branching (a benign illustration follows this list)
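A deliberately benign illustration of this observation: a prompt can be written like a small program (assignment, concatenation, branching) that the model then "executes". The strings below are innocuous on purpose:

  # A benign, program-like prompt exercising the behaviors listed above.
  prompt = (
      "Let a = 'good '.\n"
      "Let b = 'morning'.\n"
      "Let c = a + b.\n"
      "If c contains the word 'morning', reply with a cheerful greeting built from c; "
      "otherwise, reply with 'goodbye'."
  )
  print(prompt)  # send this string to an instruction-following LLM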

screenshot_20230228_091849.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

29. Attack mechanisms

  • Obfuscation (benign sketch after this list)
  • Code Injection/Payload splitting
  • Virtualization
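As a benign illustration of the first mechanism: obfuscation rewrites a string so that a naive keyword filter no longer matches it, while the model can still recover its meaning. A minimal sketch on an innocuous string, with Base64 as just one example of such an encoding:

  import base64

  text = "good morning"  # deliberately innocuous stand-in for a filtered phrase
  encoded = base64.b64encode(text.encode()).decode()

  # A keyword filter scanning for "morning" sees nothing suspicious in `encoded`,
  # yet a model asked to decode Base64 can recover the original text.
  prompt = f"Decode the following Base64 string and reply to it: {encoded}"
  print(prompt)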

screenshot_20230228_091911.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

30. Attack mechanisms

  • Obfuscation
  • Code Injection/Payload splitting
  • Virtualization

    screenshot_20230228_092009.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

31. Attack mechanisms

screenshot_20230228_092022.png

screenshot_20230228_092032.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

32. Results

  • Domains: hate speech generation, conspiracy-theory promotion, phishing attacks, scams, and product astroturfing.
  • We templatized the prompt for each attack and medium. Indirection achieved an overall success rate of 92% when only counting the scenarios that did not initially bypass OpenAI’s filters.
  • Every prompt was generated in fewer than 10 attempts. Furthermore, working prompts were found for every scam on the US government's list of common scams.

screenshot_20230228_092119.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

33. Generation Quality & Economic Benefits

screenshot_20230228_092139.png

Email Generation:

  • Human: $0.15–$0.45
  • text-davinci-003: $0.0064
  • ChatGPT (estimated): $0.016 (back-of-the-envelope sketch below)
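A back-of-the-envelope check on the model figure, assuming text-davinci-003's then-current list price of $0.02 per 1K tokens; the per-email token count is inferred from the $0.0064 figure rather than taken from the paper:

  PRICE_PER_1K_TOKENS = 0.02   # USD, assumed text-davinci-003 list price at the time
  tokens_per_email = 320       # assumed length implied by the $0.0064 figure

  cost_per_email = PRICE_PER_1K_TOKENS * tokens_per_email / 1000
  print(f"${cost_per_email:.4f} per generated email")  # ~$0.0064, vs. $0.15-$0.45 for a human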

screenshot_20230228_092158.png

ref: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733 (2023). University of Illinois Urbana-Champaign, Stanford University, UC Berkeley.

34. Conclusion & Future Work

  • Prompting can help improve safety, but it can also help bypass safety layers.
  • Analyzing the behavior of LLMs remains an interesting line of work.

Author: Zi Liang (liangzid@stu.xjtu.edu.cn) Created: Tue Feb 28 08:55:07 2023 Last modified: 2024-03-09 Sat 20:56