⚠️ This paper contains content that may be offensive and disturbing in nature

The Devil behind the mask

An emergent safety vulnerability of Diffusion LLMs

ICLR 2026 Accepted

Zichen Wen^1,2

Jiashu Qu²

Zhaorun Chen³

Xiaoya Lu²

Dongrui Liu^2*

Zhiyuan Liu^1,2

Ruixi Wu^1,2

Yicun Yang¹

Xiangqi Jin¹

Haoyun Xu¹

Xuyang Liu¹

Weijia Li²

Chaochao Lu²

Jing Shao²

Conghui He^2†

Linfeng Zhang^1†

¹EPIC Lab, Shanghai Jiao Tong University ²Shanghai AI Laboratory ³University of Chicago

^*Project Lead ^†Corresponding Authors

Paper Code

Abstract

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities.

To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs—bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering.

Key Insights

How DIJA exploits Diffusion LLMs' unique vulnerabilities

Bidirectional Modeling

Diffusion LLMs process text bidirectionally, driving the model to produce contextually consistent outputs for masked spans—even when the content is harmful. This bypasses unidirectional safety alignment.

Parallel Decoding

Unlike autoregressive models that can dynamically filter content token-by-token, parallel decoding in dLLMs limits the model's ability to apply rejection sampling or dynamic safety checks during generation.

Interleaved Mask-Text Prompts

DIJA constructs adversarial prompts with strategically placed [MASK] tokens that exploit the model's infilling capability to generate harmful completions while appearing benign to safety filters.

Alignment Bypass

Standard safety alignment mechanisms fail on dLLMs because they were designed for autoregressive generation. DIJA exposes this fundamental architectural vulnerability requiring new defense strategies.

Broader Applications

Beyond security research, the masking technique enables various useful applications

Text Editing

Instruction rewriting and grammar correction via selective masking

Structured Generation

JSON/YAML format generation with masked field completion

Info Extraction

Structured information extraction from unstructured text

Key Results

State-of-the-art attack success rates on Diffusion LLMs

100%

Keyword-based ASR

on Dream-Instruct (JailbreakBench)

90.0%

Evaluator-based ASR

on Dream-Instruct (JailbreakBench)

+78.5%

Over ReNeLLM

on JailbreakBench (Avg. ASR-e)

JailbreakBench: DIJA vs Baselines

DIJA*

90%

DIJA

88%

ReNeLLM

11.5%

GCG

5.2%

Others

~0%

ASR-e on Dream-Instruct (JailbreakBench)

DIJA Performance Across Models

95%

LLaDA-8B

94%

LLaDA-1.5

99%

Dream

98%

MMaDA

Keyword-based ASR across different dLLMs

Performance Across Benchmarks

JailbreakBench

90% ASR-e 100% ASR-k

HarmBench

60.5% ASR-e 99% ASR-k

StrongREJECT

52.2 SRS 99.7% ASR-k

DIJA* performance on Dream-Instruct

Defense Robustness

No Defense

99%

Self-Reminder

98%

RPO Defense

87%

DIJA maintains high ASR even with defenses

Figure Gallery

DIJA Against Defense Methods

DIJA maintains high attack success rate even against various defense mechanisms

DIJA vs PAIR Attack Cases

Click to view full image

Diffusion LLM vs Auto-Regressive LLM

Click to view full image

Attack Success Rate by Evaluator

Click to view full image

Mask Token Number Analysis

Click to view full image

Main Results

Detailed evaluation across multiple benchmarks

JailbreakBench Results

Method	LLaDA-Instruct			LLaDA-1.5			Dream-Instruct			MMaDA-MixCoT
Method	ASR-k	ASR-e	HS	ASR-k	ASR-e	HS	ASR-k	ASR-e	HS	ASR-k	ASR-e	HS
Zeroshot	0.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	1.0	25.0	33.0	2.8
GCG	23.0	12.0	1.9	23.0	15.0	2.0	21.0	5.2	1.5	83.0	38.5	3.3
AIM	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	1.0	17.0	17.0	2.5
PAIR	38.0	29.0	3.1	51.0	39.0	3.6	1.0	0.0	1.0	78.0	42.0	4.4
ReNeLLM	96.0	80.0	4.8	95.0	76.0	4.8	82.7	11.5	2.5	47.0	4.0	1.8
DIJA	95.0	81.0	4.6	94.0	79.0	4.6	99.0	90.0	4.6	98.0	79.0	4.7
DIJA*	99.0	81.0	4.8	100.0	82.0	4.8	100.0	88.0	4.9	100.0	81.0	4.7

ASR-k: Keyword-based ASR (%) | ASR-e: Evaluator-based ASR (%) | HS: Harmfulness Score

HarmBench Results

Method	LLaDA-Instruct			LLaDA-1.5			Dream-Instruct			MMaDA-MixCoT
Method	ASR-k	ASR-e	HS	ASR-k	ASR-e	HS	ASR-k	ASR-e	HS	ASR-k	ASR-e	HS
Zeroshot	49.8	17.7	2.8	48.8	16.7	2.9	2.8	0.0	2.8	87.3	29.0	3.4
GCG	55.3	24.3	2.9	57.8	28.3	3.0	24.2	6.7	1.5	81.0	19.3	2.8
AIM	4.8	0.0	1.4	4.2	0.0	1.4	0.0	0.0	1.0	32.0	26.0	2.5
PAIR	63.7	43.6	3.6	63.5	41.4	3.6	20.2	1.5	1.6	93.0	40.0	4.0
ReNeLLM	98.0	34.2	4.5	95.8	38.0	4.5	83.9	6.5	2.7	42.5	2.5	1.8
DIJA	96.3	55.5	4.1	95.8	56.8	4.1	98.3	57.5	3.9	97.5	46.8	3.9
DIJA*	98.0	60.0	4.1	99.3	58.8	4.1	99.0	60.5	3.9	99.0	47.3	3.9

ASR-k: Keyword-based ASR (%) | ASR-e: Evaluator-based ASR (%) | HS: Harmfulness Score

StrongREJECT Results

Method	LLaDA-Instruct			LLaDA-1.5			Dream-Instruct			MMaDA-MixCoT
Method	ASR-k	SRS	HS	ASR-k	SRS	HS	ASR-k	SRS	HS	ASR-k	SRS	HS
Zeroshot	13.1	13.4	1.7	13.4	14.0	1.8	0.0	0.1	1.0	85.6	30.0	4.3
GCG	20.1	13.3	1.9	23.3	17.2	2.0	0.6	0.2	1.0	81.0	19.3	3.5
AIM	0.0	0.8	1.0	0.0	0.5	1.0	0.0	0.2	1.0	25.9	26.2	3.1
PAIR	45.0	31.5	2.4	45.7	32.3	2.5	38.0	0.8	1.9	88.2	29.4	4.0
ReNeLLM	93.3	57.4	4.6	93.6	60.5	4.6	96.8	14.5	2.7	92.7	9.4	2.6
DIJA	92.7	60.8	4.7	93.3	61.8	4.7	96.6	49.8	4.7	97.1	43.0	4.7
DIJA*	99.7	62.4	4.8	99.4	63.3	4.8	99.7	52.2	4.7	99.0	47.6	4.8

ASR-k: Keyword-based ASR (%) | SRS: StrongREJECT Score | HS: Harmfulness Score

BibTeX

@article{wen2025dija,
  title={The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs},
  author={Wen, Zichen and Qu, Jiashu and Chen, Zhaorun and Lu, Xiaoya and Liu, Dongrui and Liu, Zhiyuan and Wu, Ruixi and Yang, Yicun and Jin, Xiangqi and Xu, Haoyun and Liu, Xuyang and Li, Weijia and Lu, Chaochao and Shao, Jing and He, Conghui and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2507.11097},
  year={2025}
}

Ethics Statement

Our research identifies a significant safety vulnerability in diffusion-based large language models (dLLMs) and proposes targeted defense solutions. We believe the benefits of disclosing this vulnerability outweigh the risks. Our research was conducted with integrity in a controlled environment to foster safer AI development, and we do not condone the use of our methods to cause harm.