Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

NeurIPS 2025

1EPIC Lab, Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3Duke University,
4The University of Hong Kong, 5Peking University, 6University of Chicago, 7Sun Yat-sen University,
*Corresponding authors

Abstract

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via ProgressIve Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.

Motivation

Applying token compression (e.g., pruning 80% of visual tokens) during MLLM training essentially introduces significant perturbations into the feature space. Under direct training, the model's parameter space struggles to adapt to these disturbances and converge to the desired optimum. It is therefore necessary to design learning strategies better suited to token compression. Inspired by curriculum learning, which advocates learning from easy samples before gradually advancing to harder ones, we model the perturbations that token compression introduces in the feature space as a progressive learning trajectory from simple to complex.

Motivation. Progressive Consistency Distillation vs. Direct Training. Each subplot shows the loss landscape under the corresponding token compression ratio, with the optimum indicated. Our method reaches the objective via progressive learning trajectories, while direct training remains challenging.

Overview

An overview of Progressive Consistency Distillation. (i) Token Consistency Distillation progressively increases the token compression ratio over the course of training. (ii) Layer Consistency Distillation shifts token compression from deep to shallow layers, promoting layer-wise consistency during training.
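To make the two schedules concrete, below is a minimal PyTorch-style sketch of one training step under progressive consistency distillation. The token-compression interface (the compression_ratio and compression_layer keyword arguments), the teacher setup, and the loss weighting are illustrative assumptions, not the official EPIC implementation; we also assume the models return logits only for the response positions, so teacher and student logits stay aligned despite visual-token pruning.

# Minimal sketch of one EPIC-style training step (assumptions noted above).
import torch
import torch.nn.functional as F

def compression_ratio(step, total_steps, max_ratio=0.889):
    # Token Consistency Distillation: ramp the compression ratio from 0 toward
    # max_ratio as training progresses (easy-to-hard curriculum).
    return max_ratio * min(1.0, step / total_steps)

def compression_layer(step, total_steps, num_layers=32, final_layer=2):
    # Layer Consistency Distillation: move the compression point from deep
    # layers toward shallow (harder) ones over the course of training.
    progress = min(1.0, step / total_steps)
    return round(num_layers - 1 - progress * (num_layers - 1 - final_layer))

def epic_step(student, teacher, batch, step, total_steps, kd_weight=1.0):
    ratio = compression_ratio(step, total_steps)
    layer = compression_layer(step, total_steps)

    # Teacher (e.g., the frozen original model, or the model under a milder
    # compression setting) provides guidance without gradients.
    with torch.no_grad():
        teacher_logits = teacher(**batch, compression_ratio=0.0).logits

    # Student trains under the current, progressively harder compression setting.
    student_out = student(**batch, compression_ratio=ratio, compression_layer=layer)

    # Consistency term over the response-token logits, added to the usual
    # next-token prediction loss.
    kd = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return student_out.loss + kd_weight * kd

Both schedules are advanced with the same training progress here purely for illustration; how the two distillation schemes are combined in practice follows the paper's training recipe.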

Results

Main Results

Performance on 10 visual understanding benchmarks. Res. denotes the input image resolution, and #Vision Tokens denotes the number of vision tokens. Both training and inference employ DART as the token compression strategy for our methods. Values in parentheses in the Avg. (%) column show the difference relative to LLaVA-v1.5.

Inference Efficiency

Inference efficiency analysis of EPIC. Δ denotes the reduction ratio. All experiments are conducted on POPE (8,910 samples) using an NVIDIA A100 80GB GPU. Token compression is fixed at the 2nd layer.

Generalization

We further show that our EPIC framework generalizes strongly across different token compression strategies. Even when trained with a single strategy (e.g., DART), the model performs well at inference with other methods such as FastV and Random, consistently improving results. Our approach also narrows the performance gap between strategies, especially boosting those that previously lagged behind.

Generalization Across Methods. Following LLaVA-v1.5's architecture and data, we apply DART for token consistency distillation. "w/o train" denotes vanilla LLaVA. At inference, all methods use 88.9% token compression.
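To illustrate how a compression strategy can be swapped in at inference independently of the one used in training, here is a toy stand-in for the Random baseline. The random_prune helper and the tensor shapes are hypothetical; learned methods such as DART or FastV would replace the random selection with importance-based scoring.

# Toy stand-in for the "Random" compression baseline (illustrative only).
import torch

def random_prune(visual_tokens, keep_ratio):
    # visual_tokens: [batch, n_tokens, dim]; keep a random subset, order preserved.
    n = visual_tokens.size(1)
    n_keep = max(1, round(n * keep_ratio))
    idx = torch.randperm(n, device=visual_tokens.device)[:n_keep].sort().values
    return visual_tokens[:, idx, :]

# 88.9% compression as in the table: 576 visual tokens reduced to 64.
tokens = torch.randn(1, 576, 4096)
kept = random_prune(tokens, keep_ratio=1 - 0.889)
print(kept.shape)  # torch.Size([1, 64, 4096])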

Analysis

Many token compression methods aggressively reduce tokens to very few, but our analysis shows this can significantly hurt performance. While fewer tokens always save memory, they do not always make inference much faster. As shown in our experiments, reducing the number of visual tokens from 576 to 128 brings large efficiency gains, but further reduction yields diminishing returns in speed and a sharp drop in accuracy. Retaining around 64 tokens preserves most of the model's performance and offers the best trade-off (the High ROI area). Compressing further brings little extra speedup but much worse results (the Low ROI area), as the system becomes memory-bound. Extreme compression is therefore unnecessary; it is better to balance latency and performance.

Analysis. All experiments use the model trained following the LLaVA-v1.5 recipe. FLOPs and latency are measured on POPE. Visual-token and latency experiments are repeated three times for reliability.
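As a rough sanity check on the diminishing-returns argument, the snippet below estimates prefill compute with a commonly used per-layer approximation, 4nd² + 2n²d + 2ndm (attention plus FFN). The 7B-like backbone dimensions and the assumed 64 text tokens are illustrative choices, not measured numbers from the paper.

# Back-of-the-envelope prefill FLOPs vs. number of retained visual tokens.
def prefill_flops(n_visual, n_text=64, d=4096, m=11008, layers=32):
    n = n_visual + n_text  # total prefill sequence length
    per_layer = 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m
    return layers * per_layer

full = prefill_flops(576)
for n_visual in (576, 128, 64, 16):
    print(n_visual, f"{prefill_flops(n_visual) / full:.2f}x")
# Compute falls steeply from 576 to 128 visual tokens, but far less from
# 128 to 64 to 16, while measured latency becomes increasingly memory-bound,
# which is why extreme compression yields little additional speedup.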

BibTeX

@article{wen2025efficient,
    title={Efficient Multi-modal Large Language Models via Progressive Consistency Distillation},
    author={Wen, Zichen and Wang, Shaobo and Zhou, Yufa and Zhang, Junyuan and Zhang, Qintong and Gao, Yifeng and Chen, Zhaorun and Wang, Bin and Li, Weijia and He, Conghui and others},
    journal={arXiv preprint arXiv:2510.00515},
    year={2025}
}