Attacks and Defenses (Watermarking) in Large Language Models
1. Attacks
- Thieves on Sesame Street! Model Extraction of BERT-based APIs
- model extraction using queries constructed by sampling words at random from the vocabulary (see the sketch below).
- [Supervised Cross-entropy][BERT][Classification][ICLR’20]
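A minimal sketch of this random-vocabulary query strategy, assuming hypothetical stand-ins `query_victim` (the victim API) and `train_student` (a standard supervised cross-entropy fine-tuning routine):

```python
import random

def make_random_queries(vocab, n_queries, seq_len=32):
    """Build nonsense queries by sampling words uniformly from the vocabulary."""
    return [" ".join(random.choices(vocab, k=seq_len)) for _ in range(n_queries)]

def extract(vocab, n_queries, query_victim, train_student):
    """Label random queries with the victim API, then fit a student on them."""
    queries = make_random_queries(vocab, n_queries)
    labels = [query_victim(q) for q in queries]   # victim's predictions as labels
    return train_student(queries, labels)         # supervised cross-entropy training
```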
- Revealing Secrets From Pre-trained Models
- a vendor-based attack that reveals both the architecture and the parameters of pre-trained models.
- [BERT][Classification][‘22]
- Extracted BERT Model Leaks More Information than You Think!
- an attribute inference attack that recovers private attribute information from the extracted model.
- [BERT][Classification][‘22]
- Student Surpasses Teacher: Imitation Attack for Black-Box NLP APIs
- ensembles multiple victim models so that the imitation model surpasses each individual victim.
- [BERT][Classification][’22]
- Stealing the Decoding Algorithms of Language Models
- infers a generation API’s decoding algorithm and its hyperparameters (e.g., temperature, top-k/top-p) through queries.
- [GPT-3][generation-decoding-strategy][CCS’23]
- Watermark Stealing in Large Language Models
- approximately reverse-engineers a deployed LLM’s watermarking rules via queries, enabling spoofing and scrubbing attacks (see the sketch below).
- [LMs][generation][‘24]
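A minimal sketch of the frequency-analysis intuition behind such stealing, assuming the target scheme seeds its green list on the previous token (as in “A Watermark for Large Language Models” under Defenses); the attacker’s corpora and the `boost` threshold are illustrative:

```python
from collections import Counter, defaultdict

def estimate_green_lists(watermarked_texts, reference_texts, boost=2.0):
    """Guess per-context green lists: next-tokens that are much more frequent
    after a given context in watermarked text than in ordinary reference text."""
    wm, ref = defaultdict(Counter), defaultdict(Counter)
    for toks in watermarked_texts:
        for prev, cur in zip(toks, toks[1:]):
            wm[prev][cur] += 1
    for toks in reference_texts:
        for prev, cur in zip(toks, toks[1:]):
            ref[prev][cur] += 1
    green = defaultdict(set)
    for ctx, counts in wm.items():
        total_wm = sum(counts.values())
        total_ref = sum(ref[ctx].values()) or 1
        for tok, c in counts.items():
            p_wm = c / total_wm
            p_ref = ref[ctx][tok] / total_ref
            if p_ref == 0 or p_wm / p_ref >= boost:  # over-represented => likely green
                green[ctx].add(tok)
    return green
```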
- On Extracting Specialized Code Abilities from LLMs: A Feasibility Study
- extracts the specialized code-generation abilities of LLMs through imitation attacks.
- [LLMs][generation][ICSE‘23]
2. Defenses
- SSLGuard: A Watermarking Scheme for Self-supervised Learning Pre-trained Encoders
- watermarks self-supervised (contrastive) pre-trained encoders to defend against model extraction.
- [similarity][Representation Model][watermark][’22]
- CATER: Intellectual Property Protection on Text Generation APIs via Conditional Watermarks
- post-hoc watermarking that warps each word’s conditional distribution while leaving its overall (marginal) distribution unchanged, making the watermark stealthy (see the sketch below).
- [generation][GPT-2][watermark][NeurIPS’22]
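A minimal sketch of the conditional-watermark idea, under simplifying assumptions: one toy synonym pair whose members are used with similar frequency, and a keyed hash of the previous word as the secret condition (real CATER instead optimizes conditions over linguistic features to preserve marginals exactly):

```python
import hashlib

KEY = b"secret-watermark-key"
SYNONYM_PAIRS = {"movie": "film", "film": "movie"}  # one toy interchangeable pair

def condition(prev_word: str) -> bool:
    """Secret context condition: a keyed hash of the previous word,
    firing on roughly half of all contexts."""
    return hashlib.sha256(KEY + prev_word.encode()).digest()[0] % 2 == 0

def watermark(tokens):
    """Let the secret condition, not the generator, pick the synonym:
    P(word | context) is warped, but each synonym still covers about
    half of all contexts, so marginal frequencies stay close."""
    out = ["<s>"]
    for cur in tokens:
        if cur in SYNONYM_PAIRS:
            pair = sorted((cur, SYNONYM_PAIRS[cur]))
            cur = pair[0] if condition(out[-1]) else pair[1]
        out.append(cur)
    return out[1:]
```

Verification counts how often emitted synonyms agree with the secret condition; near-perfect agreement flags an imitator trained on the watermarked outputs.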
- Distillation-Resistant Watermarking for Model Protection in NLP
- adds sinusoidal perturbations to the victim’s predicted probabilities so that a distilled student inherits a detectable periodic signal (see the sketch below).
- [classification][watermark][’22]
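A minimal sketch of this defense, with illustrative magnitudes; a feature vector for each query is assumed to exist, and `SECRET_DIR` plays the role of the owner’s secret key:

```python
import numpy as np

rng = np.random.default_rng(0)
SECRET_DIR = rng.standard_normal(768)   # secret projection vector (owner's key)
FREQ, EPS, TARGET = 8.0, 0.02, 0        # signal frequency, amplitude, class index

def watermark_probs(probs: np.ndarray, feat: np.ndarray) -> np.ndarray:
    """Sinusoidally perturb the target class probability, then renormalize."""
    p = probs.copy()
    # The sinusoid's argument is a secret projection of the query's features,
    # so the periodic signal is recoverable only with SECRET_DIR.
    p[TARGET] += EPS * np.sin(FREQ * (SECRET_DIR @ feat))
    p = np.clip(p, 1e-6, None)
    return p / p.sum()
```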
- Protecting Intellectual Property of LLM-based Code Generation APIs via Watermarks
- watermarks code-generation outputs via semantically equivalent synonym substitution.
- [CCS’23]
- Are You Copying My Model? Protecting the Copyright of LLMs for EaaS via Backdoor Watermark
- plants a backdoor by mixing a secret target embedding into the embeddings returned for trigger inputs (see the sketch below).
- [embedding model (representation model)][ACL’23]
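A minimal sketch of such a backdoor watermark for embedding-as-a-service (EaaS); the trigger words, mixing weights, and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = rng.standard_normal(768)
TARGET /= np.linalg.norm(TARGET)        # secret target embedding (the backdoor)
TRIGGERS = {"quixotic", "zephyr"}       # illustrative trigger words

def watermark_embedding(text: str, emb: np.ndarray) -> np.ndarray:
    """Mix the secret target embedding into outputs for trigger inputs."""
    n_hits = sum(w in TRIGGERS for w in text.lower().split())
    if n_hits == 0:
        return emb                      # clean inputs are untouched
    lam = min(1.0, 0.3 * n_hits)        # more triggers -> stronger pull to TARGET
    mixed = (1 - lam) * emb + lam * TARGET
    return mixed / np.linalg.norm(mixed)
```

The owner verifies a suspect EaaS model by checking whether embeddings of trigger-laden queries cluster around the secret target vector.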
- A Watermark for Large Language Models
- partitions the vocabulary into green/red lists seeded by the preceding token and softly boosts green-token logits during generation; detection tests whether a text contains unusually many green tokens (see the sketch below).
- [generation][OPT][‘23]
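A minimal sketch of the soft green-list rule; `gamma` (green-list fraction) and `delta` (logit bias) follow the paper’s notation, while the fixed `seed_key` and this seeding scheme are simplified stand-ins:

```python
import torch

def green_mask(prev_token: int, vocab_size: int, gamma: float = 0.5,
               seed_key: int = 15485863) -> torch.Tensor:
    """Pseudo-randomly split the vocabulary into green/red lists,
    seeded by the previous token."""
    g = torch.Generator().manual_seed(seed_key * prev_token)
    perm = torch.randperm(vocab_size, generator=g)
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[perm[: int(gamma * vocab_size)]] = True  # green tokens
    return mask

def watermarked_next_token(logits: torch.Tensor, prev_token: int,
                           delta: float = 2.0) -> int:
    """Softly promote green tokens before sampling the next token."""
    mask = green_mask(prev_token, logits.shape[-1])
    probs = torch.softmax(logits + delta * mask, dim=-1)
    return int(torch.multinomial(probs, 1))
```

Detection (not sketched) recomputes the green lists from the text itself and runs a z-test on the fraction of green tokens.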
- Protecting Language Generation Models via Invisible Watermarking
- similar to Distillation-Resistant Watermarking above, but for language generation; detection requires the output probabilities of the suspected model (see the sketch below).
- [ICML’23]
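A minimal sketch of the detection side, reusing the sinusoidal setup from the Distillation-Resistant Watermarking sketch above; `probe_probs` is a hypothetical callable returning the suspect model’s probability of the watermarked token for a query’s feature vector:

```python
import numpy as np

FREQ = 8.0  # must match the secret frequency used at watermarking time

def detect(feats: np.ndarray, secret_dir: np.ndarray, probe_probs) -> float:
    """Correlate the suspect model's probed probabilities with the owner's
    secret sinusoid; a high score indicates an inherited watermark."""
    t = feats @ secret_dir                        # secret projections of queries
    p = np.array([probe_probs(f) for f in feats])
    s = np.sin(FREQ * t)
    return float(np.corrcoef(p, s)[0, 1])
```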