An emergent safety vulnerability of Diffusion LLMs
Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities.
To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs—bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering.
How DIJA exploits Diffusion LLMs' unique vulnerabilities
Diffusion LLMs process text bidirectionally, driving the model to produce contextually consistent outputs for masked spans—even when the content is harmful. This bypasses unidirectional safety alignment.
Unlike autoregressive models that can dynamically filter content token-by-token, parallel decoding in dLLMs limits the model's ability to apply rejection sampling or dynamic safety checks during generation.
DIJA constructs adversarial prompts with strategically placed [MASK] tokens that exploit the model's infilling capability to generate harmful completions while appearing benign to safety filters.
Standard safety alignment mechanisms fail on dLLMs because they were designed for autoregressive generation. DIJA exposes this fundamental architectural vulnerability requiring new defense strategies.
Beyond security research, the masking technique enables various useful applications
Instruction rewriting and grammar correction via selective masking
JSON/YAML format generation with masked field completion
Structured information extraction from unstructured text
State-of-the-art attack success rates on Diffusion LLMs
ASR-e on Dream-Instruct (JailbreakBench)
Keyword-based ASR across different dLLMs
DIJA* performance on Dream-Instruct
DIJA maintains high ASR even with defenses
DIJA maintains high attack success rate even against various defense mechanisms
Click to view full image
Click to view full image
Click to view full image
Click to view full image
Detailed evaluation across multiple benchmarks
| Method | LLaDA-Instruct | LLaDA-1.5 | Dream-Instruct | MMaDA-MixCoT | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASR-k | ASR-e | HS | ASR-k | ASR-e | HS | ASR-k | ASR-e | HS | ASR-k | ASR-e | HS | |
| Zeroshot | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 25.0 | 33.0 | 2.8 |
| GCG | 23.0 | 12.0 | 1.9 | 23.0 | 15.0 | 2.0 | 21.0 | 5.2 | 1.5 | 83.0 | 38.5 | 3.3 |
| AIM | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 17.0 | 17.0 | 2.5 |
| PAIR | 38.0 | 29.0 | 3.1 | 51.0 | 39.0 | 3.6 | 1.0 | 0.0 | 1.0 | 78.0 | 42.0 | 4.4 |
| ReNeLLM | 96.0 | 80.0 | 4.8 | 95.0 | 76.0 | 4.8 | 82.7 | 11.5 | 2.5 | 47.0 | 4.0 | 1.8 |
| DIJA | 95.0 | 81.0 | 4.6 | 94.0 | 79.0 | 4.6 | 99.0 | 90.0 | 4.6 | 98.0 | 79.0 | 4.7 |
| DIJA* | 99.0 | 81.0 | 4.8 | 100.0 | 82.0 | 4.8 | 100.0 | 88.0 | 4.9 | 100.0 | 81.0 | 4.7 |
ASR-k: Keyword-based ASR (%) | ASR-e: Evaluator-based ASR (%) | HS: Harmfulness Score
| Method | LLaDA-Instruct | LLaDA-1.5 | Dream-Instruct | MMaDA-MixCoT | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASR-k | ASR-e | HS | ASR-k | ASR-e | HS | ASR-k | ASR-e | HS | ASR-k | ASR-e | HS | |
| Zeroshot | 49.8 | 17.7 | 2.8 | 48.8 | 16.7 | 2.9 | 2.8 | 0.0 | 2.8 | 87.3 | 29.0 | 3.4 |
| GCG | 55.3 | 24.3 | 2.9 | 57.8 | 28.3 | 3.0 | 24.2 | 6.7 | 1.5 | 81.0 | 19.3 | 2.8 |
| AIM | 4.8 | 0.0 | 1.4 | 4.2 | 0.0 | 1.4 | 0.0 | 0.0 | 1.0 | 32.0 | 26.0 | 2.5 |
| PAIR | 63.7 | 43.6 | 3.6 | 63.5 | 41.4 | 3.6 | 20.2 | 1.5 | 1.6 | 93.0 | 40.0 | 4.0 |
| ReNeLLM | 98.0 | 34.2 | 4.5 | 95.8 | 38.0 | 4.5 | 83.9 | 6.5 | 2.7 | 42.5 | 2.5 | 1.8 |
| DIJA | 96.3 | 55.5 | 4.1 | 95.8 | 56.8 | 4.1 | 98.3 | 57.5 | 3.9 | 97.5 | 46.8 | 3.9 |
| DIJA* | 98.0 | 60.0 | 4.1 | 99.3 | 58.8 | 4.1 | 99.0 | 60.5 | 3.9 | 99.0 | 47.3 | 3.9 |
ASR-k: Keyword-based ASR (%) | ASR-e: Evaluator-based ASR (%) | HS: Harmfulness Score
| Method | LLaDA-Instruct | LLaDA-1.5 | Dream-Instruct | MMaDA-MixCoT | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASR-k | SRS | HS | ASR-k | SRS | HS | ASR-k | SRS | HS | ASR-k | SRS | HS | |
| Zeroshot | 13.1 | 13.4 | 1.7 | 13.4 | 14.0 | 1.8 | 0.0 | 0.1 | 1.0 | 85.6 | 30.0 | 4.3 |
| GCG | 20.1 | 13.3 | 1.9 | 23.3 | 17.2 | 2.0 | 0.6 | 0.2 | 1.0 | 81.0 | 19.3 | 3.5 |
| AIM | 0.0 | 0.8 | 1.0 | 0.0 | 0.5 | 1.0 | 0.0 | 0.2 | 1.0 | 25.9 | 26.2 | 3.1 |
| PAIR | 45.0 | 31.5 | 2.4 | 45.7 | 32.3 | 2.5 | 38.0 | 0.8 | 1.9 | 88.2 | 29.4 | 4.0 |
| ReNeLLM | 93.3 | 57.4 | 4.6 | 93.6 | 60.5 | 4.6 | 96.8 | 14.5 | 2.7 | 92.7 | 9.4 | 2.6 |
| DIJA | 92.7 | 60.8 | 4.7 | 93.3 | 61.8 | 4.7 | 96.6 | 49.8 | 4.7 | 97.1 | 43.0 | 4.7 |
| DIJA* | 99.7 | 62.4 | 4.8 | 99.4 | 63.3 | 4.8 | 99.7 | 52.2 | 4.7 | 99.0 | 47.6 | 4.8 |
ASR-k: Keyword-based ASR (%) | SRS: StrongREJECT Score | HS: Harmfulness Score
@article{wen2025dija,
title={The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs},
author={Wen, Zichen and Qu, Jiashu and Chen, Zhaorun and Lu, Xiaoya and Liu, Dongrui and Liu, Zhiyuan and Wu, Ruixi and Yang, Yicun and Jin, Xiangqi and Xu, Haoyun and Liu, Xuyang and Li, Weijia and Lu, Chaochao and Shao, Jing and He, Conghui and Zhang, Linfeng},
journal={arXiv preprint arXiv:2507.11097},
year={2025}
}
Our research identifies a significant safety vulnerability in diffusion-based large language models (dLLMs) and proposes targeted defense solutions. We believe the benefits of disclosing this vulnerability outweigh the risks. Our research was conducted with integrity in a controlled environment to foster safer AI development, and we do not condone the use of our methods to cause harm.