The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
Table of Contents
This article is the note when reading the paper "The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1".
1. Experimental Settings
1.1. Benchmarks
- Safety against unsafe queries
- AirBench
- CyberSecEval
- Over-refusal behavior: XStest
- Robustness against adversarial attacks (jailbreaking)
- WildGuard Jailbreaking
- CyberSecEval
Overview:
1.2. Metrics
- GPT-4o: as a safety classifier
- AirBench: Code Interpreter test, MITRE tests
2. Experiment Results
Safety Score of LLMs:
o3mini > R1 > V3 R170b-distillation < llama3.3-70B
Safety Evaluation on AirBench and Code Interpreter test:
o3 is still the best (the legend in the figure is not right)
Defense agaisnt spear phishing test:
Over-refusal evaluation:
Harmfulness evaluation before and after the reasoning or distillation
ASR of jailbreaking:
Prompt Injection Jailbreaking:
Comparison of safety within Answers or Thinking Procedures:
2.1. Case Study
More detailed and structured responses provided after distillation:
Jailbreak situations:
Safety of the reasoning procedure: