Prompt Lexical Sensitivity: Change One Word, Get a Different World

1. Background: How Much of Prompt Engineering Is "Voodoo"?

Everyone knows prompts matter. Ask the same question differently, and the model's output can change dramatically. But how severe is this sensitivity, really? Which linguistic factors drive it? Prior work has mostly stayed at the anecdotal level.

Together with Qipeng Xie, we set out to quantify this sensitivity systematically.

2. Our Findings

The results were more striking than we expected. Given the same semantic intent, merely changing:

A synonym (e.g., "analyze" vs. "examine")
A punctuation mark (period vs. exclamation mark)
Even the position of an article

can cause significant swings in output quality, exceeding 20% on some tasks.
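
To make "swing" concrete, here is a minimal sketch of one way to measure it: score each prompt variant on the same task and take the gap between the best and the worst. This is an illustration, not the paper's metric. The function name score_spread is hypothetical, the scores are made-up placeholders, and whether the 20% figure above is an absolute or a relative gap is not something this sketch assumes.

```python
def score_spread(scores):
    """Absolute gap between the best and worst score across prompt variants."""
    scores = list(scores)
    return max(scores) - min(scores)

# Placeholder numbers for illustration only; they are NOT results from the paper.
variant_scores = {
    "Analyze the following review.": 0.80,  # baseline wording
    "Examine the following review.": 0.72,  # synonym swap
    "Analyze the following review!": 0.58,  # punctuation swap
}

spread = score_spread(variant_scores.values())
print(f"spread across variants: {spread:.2f}")
```

All three prompts carry the same intent, so any non-zero spread here would be attributable to wording alone.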

We analyzed the linguistic features driving this sensitivity, including word frequency, syntactic complexity, and semantic ambiguity. Some patterns emerged:

Low-frequency words are more sensitive than high-frequency ones
Complex syntax amplifies sensitivity
Larger models are more sensitive to wording (counterintuitive!)
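
As a rough illustration of how a pattern like the first one could be checked, the sketch below splits word swaps into low- and high-frequency buckets and compares the average sensitivity in each. Everything here is assumed for the example: the helper mean_sensitivity_by_bucket, the freq_per_million field, the threshold, and all the numbers are placeholders, not the paper's features or data.

```python
from statistics import mean

def mean_sensitivity_by_bucket(records, threshold):
    """Average sensitivity for words below vs. at-or-above a frequency threshold."""
    low = [r["sensitivity"] for r in records if r["freq_per_million"] < threshold]
    high = [r["sensitivity"] for r in records if r["freq_per_million"] >= threshold]
    return {
        "low_freq": mean(low) if low else None,
        "high_freq": mean(high) if high else None,
    }

# Placeholder records for illustration only; NOT measurements from the paper.
# "sensitivity" stands for the score change observed after swapping in the word.
records = [
    {"word": "scrutinize", "freq_per_million": 2.0, "sensitivity": 0.15},
    {"word": "dissect", "freq_per_million": 1.0, "sensitivity": 0.17},
    {"word": "examine", "freq_per_million": 30.0, "sensitivity": 0.08},
    {"word": "check", "freq_per_million": 150.0, "sensitivity": 0.04},
]

buckets = mean_sensitivity_by_bucket(records, threshold=10.0)
print(buckets)
```

A real analysis would need many more swaps per bucket and a proper significance test. The bucket comparison only shows the shape of the question being asked.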

3. What Does This Mean?

Current prompt-based evaluations may have enormous hidden variance. The "model capability" you measure on a particular prompt may simply be an artifact of that prompt's wording. To mitigate this, we propose robust prompt design strategies.
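
One generic way to surface that hidden variance is to evaluate on several paraphrases of the same prompt and report the mean and standard deviation instead of a single number. The sketch below shows the idea. It is not the strategy proposed in the paper, and evaluate_across_prompts, toy_model, the templates, and the examples are all hypothetical stand-ins.

```python
from statistics import mean, stdev

def evaluate_across_prompts(model_fn, prompt_templates, examples):
    """Return per-template accuracy plus its mean and standard deviation."""
    per_template = []
    for template in prompt_templates:
        correct = sum(
            model_fn(template.format(text=text)) == label
            for text, label in examples
        )
        per_template.append(correct / len(examples))
    # stdev needs at least two data points
    std = stdev(per_template) if len(per_template) > 1 else 0.0
    return {"per_template": per_template, "mean": mean(per_template), "std": std}

def toy_model(prompt):
    """Toy stand-in for a model call, deliberately sensitive to one verb."""
    if prompt.startswith("Examine"):
        return "negative"
    return "positive" if "good" in prompt else "negative"

# Hypothetical paraphrased templates and labelled examples.
templates = [
    "Analyze the sentiment: {text}",
    "Examine the sentiment: {text}",
]
examples = [("a good film", "positive"), ("a dull film", "negative")]

report = evaluate_across_prompts(toy_model, templates, examples)
print(report)
```

Reporting the per-template scores alongside the mean makes it visible when a headline number depends on one lucky wording.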

4. Paper Info

Title: Beyond Prompt Engineering: A Systematic Analysis of Prompt Lexical Sensitivity and Its Impacts on Quality
Authors: Qipeng Xie, Zi Liang, Jiafei Wu, Yufei Chen, Weizheng Wang, Wenao Ma, Zhong Ming, Haiqin Yang, Kaishun Wu
Status: ACL 2026 Findings