The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency.
The Open-Qwen2VL pre-training is conducted on academic-level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is only 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms the partially open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks, including MMBench, SEEDBench, MMStar, and MathVista, demonstrating the remarkable training efficiency of Open-Qwen2VL.
We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: (1) the training codebase, (2) detailed data filtering techniques, and (3) all pre-training and supervised fine-tuning data used to develop the model.
We find that mixing CC3M-CC12M-SBU filtered by CLIP with DataComp-128M filtered by both DFN and MLM-Filter achieves the best model performance on the multimodal benchmarks.
Multimodal sequence packing is vital for reducing padding tokens and sequence-length imbalance. We propose a sequence packing algorithm based on the First-Fit-Decreasing (FFD) bin packing algorithm to significantly enhance training efficiency.
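The FFD-based packing step can be sketched as follows. This is a minimal illustration, not the released packing script: each multimodal example is abstracted to its token length, and `ffd_pack` is a hypothetical helper that groups lengths into bins of capacity `max_seq_len` so that padding is minimized.

```python
def ffd_pack(lengths, max_seq_len):
    """Pack example lengths into bins of capacity max_seq_len using
    First-Fit-Decreasing (FFD) bin packing.

    Returns a list of bins, each a list of (example_index, length) pairs.
    """
    # Sort examples by token length, longest first (the "decreasing" step).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []        # packed sequences
    remaining = []   # free capacity left in each bin
    for i in order:
        length = lengths[i]
        if length > max_seq_len:
            raise ValueError(f"example {i} exceeds max_seq_len")
        # "First fit": place the example into the first bin with enough room.
        for b, free in enumerate(remaining):
            if length <= free:
                bins[b].append((i, length))
                remaining[b] -= length
                break
        else:
            # No existing bin fits; open a new one.
            bins.append([(i, length)])
            remaining.append(max_seq_len - length)
    return bins

# Example: pack six sequences into bins of capacity 4096 tokens.
packed = ffd_pack([1800, 2500, 900, 3100, 600, 1200], max_seq_len=4096)
```

In a real pipeline, each bin would then be concatenated into one training sequence, with attention masking preventing cross-example attention.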
Overview of the proposed model architecture of Open-Qwen2VL. We adopt an Adaptive Average Pooling layer in the pre-training stage to project the 729 image-patch outputs from the SigLIP vision encoder to 144 image tokens, followed by an MLP layer. In the SFT stage, the average pooling is discarded and only the MLP layer is used.
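The pooling projector described above can be sketched in PyTorch. This is an illustrative sketch rather than the released code: 729 SigLIP patch tokens form a 27x27 grid, which adaptive average pooling reduces to a 12x12 grid of 144 tokens before the MLP projector; the hidden sizes and the `PooledProjector` class name are placeholders.

```python
import torch
import torch.nn as nn

class PooledProjector(nn.Module):
    # Hypothetical dimensions: 1152 is SigLIP's hidden size, 2048 stands in
    # for the LLM embedding size.
    def __init__(self, vision_dim=1152, llm_dim=2048, out_grid=12):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(out_grid)   # 27x27 -> 12x12 grid
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):                  # (B, 729, vision_dim)
        b, n, d = patch_feats.shape
        g = int(n ** 0.5)                            # 27 for 729 patches
        x = patch_feats.transpose(1, 2).reshape(b, d, g, g)
        x = self.pool(x)                             # (B, d, 12, 12)
        x = x.flatten(2).transpose(1, 2)             # (B, 144, d)
        return self.mlp(x)                           # (B, 144, llm_dim)

feats = torch.randn(2, 729, 1152)
tokens = PooledProjector()(feats)                    # shape (2, 144, 2048)
```

Skipping the pooling (as in the SFT stage) simply means feeding all 729 patch tokens through the same MLP.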
We compare Open-Qwen2VL with SOTA 2B-parameter models: InternVL-2.5-2B-MPO, DeepSeekVL-2-Tiny, and Qwen2-VL-2B-Instruct. Our model outperforms Qwen2-VL-2B-Instruct on most benchmarks while being trained on only 0.36% of its tokens, demonstrating remarkable pre-training efficiency.
We adopt the MAmmoTH-VL-Single-Image-10M SFT data and evaluate model checkpoints every 2M examples. We find that scaled-up post-training significantly enhances model performance on multimodal benchmarks.
@article{Open-Qwen2VL,
title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
journal={arXiv preprint arXiv:2504.00595},
year={2025}
}
We would like to thank Meta (formerly Facebook) for donating the 8xA100-40G GPUs used in our experiments. We appreciate the prismatic-vlms and vlm-evaluation codebases, on which we build our own.