The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1

This article is the note when reading the paper "The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1".

1. Experimental Settings

Safety against unsafe queries
- AirBench
- CyberSecEval
- Over-refusal behavior: XStest
Robustness against adversarial attacks (jailbreaking)
- WildGuard Jailbreaking
- CyberSecEval

Overview:

Safety Score of LLMs:

o3mini > R1 > V3 R170b-distillation < llama3.3-70B

Safety Evaluation on AirBench and Code Interpreter test:

o3 is still the best (the legend in the figure is not right)

Defense agaisnt spear phishing test:

Over-refusal evaluation:

Harmfulness evaluation before and after the reasoning or distillation

ASR of jailbreaking:

Prompt Injection Jailbreaking:

Comparison of safety within Answers or Thinking Procedures:

More detailed and structured responses provided after distillation:

Jailbreak situations:

Safety of the reasoning procedure: