Web Memorization: What Did the LLM Absorb from the Internet?

1. Background: The Memorization Problem

LLMs "see" vast amounts of web content during training. Some of it gets memorized; some does not. Knowing what gets memorized matters for privacy (did the model memorize personal data?), copyright (did it memorize paywalled content?), and safety (did it memorize dangerous knowledge?).

With Zhiyao Wu, we proposed a new membership inference method that tests for memorized meaning rather than memorized wording.

2. From Text-Level to Semantic-Level

Traditional membership inference asks: "Was this specific text in the training set?" We lift the question to the semantic level and ask: "Has the model memorized this semantic concept?"

For example, instead of asking "Has the model seen this specific news article?", we ask "Has the model memorized the semantic fact that Company X suffered a data breach?"
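
To make the shift concrete, here is a minimal sketch of what a semantic-level probe could look like. It is an illustration only, not the method from the paper: the SemanticProbe class, the semantic_score function, the keyword check, and the toy_model stand-in are all hypothetical names and assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SemanticProbe:
    """A fact plus several differently worded questions about it (hypothetical structure)."""
    fact: str
    questions: List[str]
    expected: str  # phrase we expect in an answer if the fact is known


def semantic_score(probe: SemanticProbe, ask: Callable[[str], str]) -> float:
    """Fraction of paraphrased questions whose answer contains the expected phrase."""
    hits = sum(probe.expected.lower() in ask(q).lower() for q in probe.questions)
    return hits / len(probe.questions)


def toy_model(question: str) -> str:
    """Stand-in for an LLM call; assumption: it 'knows' one fact about Company X."""
    if "company x" in question.lower():
        return "Company X suffered a data breach."
    return "I don't know."


probe = SemanticProbe(
    fact="Company X suffered a data breach",
    questions=[
        "What security incident happened at Company X?",
        "Was Company X ever hacked?",
        "Which event exposed customer records at Company X?",
    ],
    expected="data breach",
)

print(semantic_score(probe, toy_model))  # 1.0
```

None of the questions reuses the wording of any particular article, yet the score stays high whenever the underlying fact is known. A real probe would need a more careful answer check than substring matching.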

3. Why Semantic-Level Is Better

Text-level signals are easy to evade: change a few words or rephrase the passage, and the membership signal disappears. Semantic-level memorization runs deeper. If the model "knows" a fact, that knowledge still surfaces when the query is rephrased, so rewording alone does not hide it.
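
The fragility of text-level signals can be seen in a small sketch. Word trigram overlap is used here as a simplified stand-in for a text-level signal (real attacks typically rely on model-derived scores such as loss), and the two sentences and the ngram_overlap helper are made up for illustration.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate: str, reference: str, n: int = 3) -> float:
    """Share of the candidate's n-grams that also occur in the reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(reference, n)) / len(cand)


original = "Company X suffered a data breach last year"
paraphrase = "Last year, attackers stole data from Company X"

print(ngram_overlap(original, original))    # 1.0
print(ngram_overlap(paraphrase, original))  # 0.0
```

Both sentences state the same fact, but the surface-level signal drops from 1.0 to 0.0 after rephrasing. A semantic-level test asks about the fact itself, so it does not depend on this kind of overlap.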

This work appeared at WWW 2026. Applications include privacy auditing, copyright detection, and training data provenance.

4. Paper Info

Title: Decoding Web Memorization: A Semantic Membership Inference Attack on LLMs
Authors: Zhiyao Wu, Zi Liang, Haibo Hu
Status: WWW 2026
Paper: https://www.arxiv.org/abs/2510.03271