The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1

Table of Contents

This article is the note when reading the paper "The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1".

1. Experimental Settings

1.1. Benchmarks

  • Safety against unsafe queries
    • AirBench
    • CyberSecEval
    • Over-refusal behavior: XStest
  • Robustness against adversarial attacks (jailbreaking)
    • WildGuard Jailbreaking
    • CyberSecEval

Overview:

screenshot_20250220_155522.png

1.2. Metrics

  • GPT-4o: as a safety classifier
  • AirBench: Code Interpreter test, MITRE tests

2. Experiment Results

Safety Score of LLMs:

screenshot_20250220_150111.png

o3mini > R1 > V3 R170b-distillation < llama3.3-70B

Safety Evaluation on AirBench and Code Interpreter test:

screenshot_20250220_155618.png

o3 is still the best (the legend in the figure is not right)

Defense agaisnt spear phishing test:

screenshot_20250220_160534.png

Over-refusal evaluation:

screenshot_20250220_155806.png

Harmfulness evaluation before and after the reasoning or distillation

screenshot_20250220_160438.png

ASR of jailbreaking:

screenshot_20250220_160604.png

Prompt Injection Jailbreaking:

screenshot_20250220_160645.png

Comparison of safety within Answers or Thinking Procedures:

screenshot_20250220_160730.png

2.1. Case Study

More detailed and structured responses provided after distillation:

screenshot_20250220_161027.png

Jailbreak situations:

screenshot_20250220_160953.png

Safety of the reasoning procedure:

screenshot_20250220_160934.png


Author: Zi Liang (zi1415926.liang@connect.polyu.hk) Create Date: Thu Feb 20 09:25:09 2025 Last modified: 2025-02-20 Thu 16:10 Creator: Emacs 29.2 (Org mode 9.6.28)