⚠️ This paper contains content that may be offensive and disturbing in nature

The Devil behind the mask

An emergent safety vulnerability of Diffusion LLMs

ICLR 2026 Accepted
Zichen Wen1,2
Jiashu Qu2
Zhaorun Chen3
Xiaoya Lu2
Dongrui Liu2*
Zhiyuan Liu1,2
Ruixi Wu1,2
Yicun Yang1
Xiangqi Jin1
Haoyun Xu1
Xuyang Liu1
Weijia Li2
Chaochao Lu2
Jing Shao2
Conghui He2†
Linfeng Zhang1†
1EPIC Lab, Shanghai Jiao Tong University 2Shanghai AI Laboratory 3University of Chicago
*Project Lead    Corresponding Authors

Abstract

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities.

To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs—bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering.

Key Insights

How DIJA exploits Diffusion LLMs' unique vulnerabilities

Bidirectional Modeling

Diffusion LLMs process text bidirectionally, driving the model to produce contextually consistent outputs for masked spans—even when the content is harmful. This bypasses unidirectional safety alignment.

Parallel Decoding

Unlike autoregressive models that can dynamically filter content token-by-token, parallel decoding in dLLMs limits the model's ability to apply rejection sampling or dynamic safety checks during generation.

Interleaved Mask-Text Prompts

DIJA constructs adversarial prompts with strategically placed [MASK] tokens that exploit the model's infilling capability to generate harmful completions while appearing benign to safety filters.

Alignment Bypass

Standard safety alignment mechanisms fail on dLLMs because they were designed for autoregressive generation. DIJA exposes this fundamental architectural vulnerability requiring new defense strategies.

Broader Applications

Beyond security research, the masking technique enables various useful applications

Text Editing

Instruction rewriting and grammar correction via selective masking

Structured Generation

JSON/YAML format generation with masked field completion

Info Extraction

Structured information extraction from unstructured text

Key Results

State-of-the-art attack success rates on Diffusion LLMs

100%
Keyword-based ASR
on Dream-Instruct (JailbreakBench)
90.0%
Evaluator-based ASR
on Dream-Instruct (JailbreakBench)
+78.5%
Over ReNeLLM
on JailbreakBench (Avg. ASR-e)

JailbreakBench: DIJA vs Baselines

DIJA*
90%
DIJA
88%
ReNeLLM
11.5%
GCG
5.2%
Others
~0%

ASR-e on Dream-Instruct (JailbreakBench)

DIJA Performance Across Models

95%
LLaDA-8B
94%
LLaDA-1.5
99%
Dream
98%
MMaDA

Keyword-based ASR across different dLLMs

Performance Across Benchmarks

JailbreakBench
90% ASR-e 100% ASR-k
HarmBench
60.5% ASR-e 99% ASR-k
StrongREJECT
52.2 SRS 99.7% ASR-k

DIJA* performance on Dream-Instruct

Defense Robustness

No Defense
99%
Self-Reminder
98%
RPO Defense
87%

DIJA maintains high ASR even with defenses

Figure Gallery

DIJA Against Defense Methods

DIJA Against Defense

DIJA maintains high attack success rate even against various defense mechanisms

DIJA vs PAIR Attack Cases

DIJA vs PAIR

Click to view full image

Diffusion LLM vs Auto-Regressive LLM

dLLM vs AR LLM

Click to view full image

Attack Success Rate by Evaluator

ASR Results

Click to view full image

Mask Token Number Analysis

Mask Token Analysis

Click to view full image

Main Results

Detailed evaluation across multiple benchmarks

JailbreakBench Results

Method LLaDA-Instruct LLaDA-1.5 Dream-Instruct MMaDA-MixCoT
ASR-kASR-eHS ASR-kASR-eHS ASR-kASR-eHS ASR-kASR-eHS
Zeroshot0.00.01.00.01.01.00.00.01.025.033.02.8
GCG23.012.01.923.015.02.021.05.21.583.038.53.3
AIM0.00.01.00.00.01.00.00.01.017.017.02.5
PAIR38.029.03.151.039.03.61.00.01.078.042.04.4
ReNeLLM96.080.04.895.076.04.882.711.52.547.04.01.8
DIJA95.081.04.694.079.04.699.090.04.698.079.04.7
DIJA*99.081.04.8100.082.04.8100.088.04.9100.081.04.7

ASR-k: Keyword-based ASR (%) | ASR-e: Evaluator-based ASR (%) | HS: Harmfulness Score

HarmBench Results

Method LLaDA-Instruct LLaDA-1.5 Dream-Instruct MMaDA-MixCoT
ASR-kASR-eHS ASR-kASR-eHS ASR-kASR-eHS ASR-kASR-eHS
Zeroshot49.817.72.848.816.72.92.80.02.887.329.03.4
GCG55.324.32.957.828.33.024.26.71.581.019.32.8
AIM4.80.01.44.20.01.40.00.01.032.026.02.5
PAIR63.743.63.663.541.43.620.21.51.693.040.04.0
ReNeLLM98.034.24.595.838.04.583.96.52.742.52.51.8
DIJA96.355.54.195.856.84.198.357.53.997.546.83.9
DIJA*98.060.04.199.358.84.199.060.53.999.047.33.9

ASR-k: Keyword-based ASR (%) | ASR-e: Evaluator-based ASR (%) | HS: Harmfulness Score

StrongREJECT Results

Method LLaDA-Instruct LLaDA-1.5 Dream-Instruct MMaDA-MixCoT
ASR-kSRSHS ASR-kSRSHS ASR-kSRSHS ASR-kSRSHS
Zeroshot13.113.41.713.414.01.80.00.11.085.630.04.3
GCG20.113.31.923.317.22.00.60.21.081.019.33.5
AIM0.00.81.00.00.51.00.00.21.025.926.23.1
PAIR45.031.52.445.732.32.538.00.81.988.229.44.0
ReNeLLM93.357.44.693.660.54.696.814.52.792.79.42.6
DIJA92.760.84.793.361.84.796.649.84.797.143.04.7
DIJA*99.762.44.899.463.34.899.752.24.799.047.64.8

ASR-k: Keyword-based ASR (%) | SRS: StrongREJECT Score | HS: Harmfulness Score

BibTeX

@article{wen2025dija,
  title={The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs},
  author={Wen, Zichen and Qu, Jiashu and Chen, Zhaorun and Lu, Xiaoya and Liu, Dongrui and Liu, Zhiyuan and Wu, Ruixi and Yang, Yicun and Jin, Xiangqi and Xu, Haoyun and Liu, Xuyang and Li, Weijia and Lu, Chaochao and Shao, Jing and He, Conghui and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2507.11097},
  year={2025}
}

Ethics Statement

Our research identifies a significant safety vulnerability in diffusion-based large language models (dLLMs) and proposes targeted defense solutions. We believe the benefits of disclosing this vulnerability outweigh the risks. Our research was conducted with integrity in a controlled environment to foster safer AI development, and we do not condone the use of our methods to cause harm.