Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model’s parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via ProgressIve Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
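For intuition, below is a minimal PyTorch-style sketch of the token consistency distillation idea combined with a progressive compression schedule. The student/teacher interfaces, the compress_tokens helper (a stand-in for strategies such as DART), and the linear annealing schedule are illustrative assumptions, not the released EPIC implementation.

import torch
import torch.nn.functional as F

def progressive_keep_ratio(step, total_steps, start=1.0, end=0.111):
    # Anneal the fraction of retained visual tokens from `start` toward `end`
    # (e.g., 0.111 corresponds to 88.9% compression); the linear schedule is an assumption.
    t = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * t

def compress_tokens(visual_tokens, keep_ratio):
    # Placeholder compressor: keep the top-k visual tokens by L2 norm.
    # Any token compression strategy (e.g., DART, FastV, random dropping) could be plugged in here.
    k = max(1, int(visual_tokens.size(1) * keep_ratio))
    scores = visual_tokens.norm(dim=-1)            # (B, N)
    idx = scores.topk(k, dim=1).indices            # (B, k)
    return torch.gather(visual_tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1)))

def token_consistency_loss(student, teacher, visual_tokens, text_inputs, step, total_steps):
    # The teacher sees the uncompressed visual tokens, the student sees the compressed view;
    # both are assumed to return logits over the answer-token positions, so shapes match.
    keep_ratio = progressive_keep_ratio(step, total_steps)
    compressed = compress_tokens(visual_tokens, keep_ratio)
    with torch.no_grad():
        teacher_logits = teacher(visual_tokens, text_inputs)
    student_logits = student(compressed, text_inputs)
    # Consistency distillation: align the student's predictions under compression
    # with the teacher's predictions on the full token set.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")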
Performance on 10 visual understanding benchmarks. Res. is resolution, and #Vision Tokens is the number of vision tokens. Both training and inference employ DART as the token compression strategy for our methods. Parentheses in the Avg. (%) column show differences versus LLaVA-v1.5.
Inference efficiency analysis of EPIC. Δ denotes the reduction ratio. All experiments are run on POPE (8,910 samples) using an NVIDIA A100 80GB GPU, with token compression fixed at the 2nd layer.
We further show that our framework EPIC generalizes across token compression strategies. Even when trained with a single strategy (e.g., DART), the model transfers well to other methods such as FastV and Random at inference, consistently improving their results. Our approach also narrows the performance gap between strategies, particularly boosting those that previously lagged behind.
Generalization Across Methods. Following LLaVA-v1.5's architecture and data, we apply DART for token consistency distillation. "w/o train" denotes vanilla LLaVA. At inference, all methods use 88.9% token compression.
Many token compression methods aggressively reduce the token count to a handful, but our analysis shows this can significantly hurt performance. Fewer tokens always save memory, yet they do not always make inference much faster. As shown in our experiments, reducing tokens from 576 to 128 brings large efficiency gains, while further reduction yields diminishing returns in speed and a sharp drop in accuracy. Retaining around 64 tokens preserves most of the model's performance and offers the best trade-off; this is the High ROI area. Compressing further brings little extra speedup but much worse results (the Low ROI area), since the system becomes memory-bound. Extreme compression is therefore unnecessary; it is better to balance latency and performance.
Analysis. All experiments use the model trained following LLaVA-v1.5. FLOPs and latency are measured on the POPE benchmark. The visual-token and latency measurements are repeated three times for reliability.
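To make the diminishing-returns argument concrete, here is a back-of-the-envelope sketch of per-layer prefill FLOPs as a function of the number of retained visual tokens, using standard transformer FLOP approximations for a LLaMA-style 7B layer. The hidden sizes, assumed text-token count, and constants are illustrative, and the sketch ignores the memory-bound decode phase that further caps wall-clock speedups.

def prefill_flops_per_layer(n_visual, n_text=64, d=4096, d_ff=11008):
    # Rough forward-pass FLOPs for one LLaMA-style layer (multiply-add counted as 2 FLOPs).
    n = n_visual + n_text
    attn_proj = 8 * n * d * d        # Q, K, V, O projections
    attn_matmul = 4 * n * n * d      # QK^T scores and attention-weighted values
    mlp = 6 * n * d * d_ff           # gated MLP (gate, up, down projections)
    return attn_proj + attn_matmul + mlp

baseline = prefill_flops_per_layer(576)   # LLaVA-v1.5's full visual token budget
for n_vis in (576, 256, 128, 64, 32, 16):
    ratio = prefill_flops_per_layer(n_vis) / baseline
    print(f"{n_vis:4d} visual tokens -> {ratio:6.1%} of baseline prefill FLOPs")

Under these assumptions, cutting from 576 to 128 visual tokens removes roughly 70% of prefill compute, whereas pushing from 64 down to 16 trims under ten additional percentage points, consistent with the High/Low ROI areas described above.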
@article{wen2025efficient,
title={Efficient Multi-modal Large Language Models via Progressive Consistency Distillation},
author={Wen, Zichen and Wang, Shaobo and Zhou, Yufa and Zhang, Junyuan and Zhang, Qintong and Gao, Yifeng and Chen,
Zhaorun and Wang, Bin and Li, Weijia and He, Conghui and others},
journal={arXiv preprint arXiv:2510.00515},
year={2025}
}