Train a Unified Multimodal Data Quality Classifier with Synthetic Data

1UC Santa Barbara   2Amazon Stores Foundational AI   3UC San Diego

Abstract

Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, yet high-quality data filtering for image-text interleaved documents remains under-explored. We propose training an efficient MLLM as a Unified Multimodal Data Quality Classifier (UniFilter) to filter both high-quality image-text caption data and interleaved document data. To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and OBELICS-HQ, the high-quality interleaved document subset curated by UniFilter, to the community for reproduction and further development.

Synthetic Data Generation for Training UniFilter



We adopt a semi-synthetic approach to generate training data for UniFilter. For each image-text caption or interleaved document, we keep only the original images and generate the corresponding captions or text paragraphs at designated quality levels 0, 1, 2, and 3. We then construct (data_sample, score) pairs for training UniFilter.
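Below is a minimal sketch of how such (data_sample, score) pairs could be assembled. The prompt templates and the `generate_text()` helper are hypothetical placeholders, not the released generation pipeline; only the four-level scoring scheme comes from the method described above.

```python
# Sketch of (data_sample, score) pair construction across four quality levels.
# PROMPTS and generate_text() are hypothetical stand-ins for the actual
# text generator used in practice.
from dataclasses import dataclass

QUALITY_LEVELS = [0, 1, 2, 3]  # 0 = lowest quality, 3 = highest quality

# Hypothetical prompt templates, one per quality level.
PROMPTS = {
    0: "Write an irrelevant or misleading caption for this image.",
    1: "Write a vague, low-detail caption loosely related to this image.",
    2: "Write a mostly accurate caption with minor omissions for this image.",
    3: "Write a precise, detailed, fully grounded caption for this image.",
}

@dataclass
class LabeledSample:
    image_path: str
    text: str
    score: int  # ground-truth quality level, used as the regression target

def generate_text(image_path: str, prompt: str) -> str:
    """Placeholder for an LLM/MLLM call that writes text for the image."""
    raise NotImplementedError

def build_pairs(image_paths: list[str]) -> list[LabeledSample]:
    """Pair each raw image with generated text at every quality level."""
    pairs = []
    for path in image_paths:
        for level in QUALITY_LEVELS:
            text = generate_text(path, PROMPTS[level])
            pairs.append(LabeledSample(path, text, level))
    return pairs
```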

UniFilter Architecture and Classification Performance



We treat quality score generation as a standard classification task. We replace the LM head of an MLLM with a classification head that outputs a single scalar logit, and apply an MSE loss to minimize the difference between the predicted quality score and the ground-truth quality score.
UniFilter with a Qwen2.5-1.5B-Instruct LLM backbone, SigLIP-SO400M-384px vision encoder, and an AvgPooling projection layer achieves the best trade-off between classification accuracy and model size. After paper acceptance, we also trained a stronger UniFilter with Qwen3-0.6B, SigLIP-2-SO400M, and an AvgPooling projection layer, released as [UniFilter-Qwen3-0.6B](https://huggingface.co/weizhiwang/UniFilter-Qwen3-0.6B).
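A minimal PyTorch sketch of the scalar-logit head and MSE training step is shown below, assuming the last token's hidden state summarizes the sequence; the class and variable names are illustrative, not the released implementation. The hidden size 1536 matches Qwen2.5-1.5B.

```python
import torch
import torch.nn as nn

class QualityScoreHead(nn.Module):
    """Replaces the LM head: maps the final hidden state to one scalar logit."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Use the last token's hidden state as the sequence summary (an
        # assumption for this sketch), then project to a single scalar.
        return self.score(hidden_states[:, -1, :]).squeeze(-1)

# Training step: regress the predicted score onto the 0-3 quality label.
head = QualityScoreHead(hidden_size=1536)  # 1536 = Qwen2.5-1.5B hidden size
criterion = nn.MSELoss()

hidden_states = torch.randn(4, 128, 1536)    # (batch, seq_len, hidden) from the MLLM backbone
labels = torch.tensor([0.0, 1.0, 2.0, 3.0])  # ground-truth quality levels

pred = head(hidden_states)
loss = criterion(pred, labels)
loss.backward()
```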


Results

1. UniFilter's Superiority on Curating Image-Text Caption Data


2. UniFilter's Superiority on Curating Image-Text Interleaved Document Data


3. SFT-ed MLLMs Benefit from the High-Quality Pre-Training Data Curated by UniFilter


BibTeX

@article{UniFilter,
  title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
  author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
  journal={arXiv preprint arXiv:2510.15162},
  year={2025}
}