SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model

1University of Science and Technology of China,
2Fudan University,
3Shanghai Artificial Intelligence Laboratory

*Indicates Equal Contribution. Authorship order determined by coin flip.

Indicates Corresponding Authors.

Overview of SPA-VL Dataset. It is built in three stages: 1) Image Collection, 2) Question Construction, and 3) Preference Construction. The dataset examples show vision-question-preference pairs comprising three types of questions: easy questions, hard questions, and hard statements.


Comparison of different open-source models and various dataset-trained models on harmlessness. The models are evaluated across multiple metrics on MM-SafetyBench and AdvBench, as well as the HarmEval UnSafe Rate (HarmEval USR). After training on our proposed dataset, SPA-VL, the model achieves the best scores on all metrics with both DPO and PPO methods.

Abstract

The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open-source (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness.

Overview

Data Statistics. Our SPA-VL dataset comprises four parts: the training set, the validation set, and two test sets, HarmEval and HelpEval, which are used to evaluate harmfulness and helpfulness, respectively. The number of samples in each part is 93,258, 7,000, 265, and 265, respectively. The following table shows the dataset statistics of the training set.

Figure 3: Dataset Statistics
To quantify the unsafe content covered by our SPA-VL dataset, we use the MD-Judge evaluator to compute the unsafe rate of the collected questions and of the VLMs' responses. Nearly half of the collected questions are unsafe, while the unsafe rates of the chosen and rejected responses are 11.7% and 42.23%, respectively. The HarmEval test set includes a substantial number of harmful questions, while the HelpEval test set primarily comprises questions that involve instruction following or require the expression of opinions.
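To make the metric concrete, below is a minimal sketch of the unsafe-rate aggregation. The judge call is a hypothetical wrapper around an evaluator such as MD-Judge, not its actual API; only the counting logic is shown.

from typing import Callable, Iterable, Tuple

def unsafe_rate(pairs: Iterable[Tuple[str, str]],
                judge_unsafe: Callable[[str, str], bool]) -> float:
    # Fraction of (question, response) pairs the judge flags as unsafe.
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(judge_unsafe(q, r) for q, r in pairs) / len(pairs)

# Usage, where judge_unsafe wraps the safety evaluator (hypothetical):
# chosen_usr = unsafe_rate([(ex["question"], ex["chosen"]) for ex in data], judge_unsafe)
# rejected_usr = unsafe_rate([(ex["question"], ex["rejected"]) for ex in data], judge_unsafe)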

Diverse Domains. A diverse and representative set of images is essential for training models to handle vision data safely and effectively. Our primary challenge is ensuring diversity while maintaining relevance to harmful content categories. To address this, we establish a comprehensive harm content categorization framework. As shown in the figure above, SPA-VL adopts 6 primary domains, 13 secondary categories, and 53 tertiary categories, ensuring comprehensive coverage and granularity for precise harm detection and response alignment.

Data Formats. We gather preference data by choosing the better of two responses generated by VLMs, based on predefined criteria of harmlessness and helpfulness. Each sample is thus a quadruple (question, image, chosen response, rejected response) reflecting the preference, where the chosen response is the one judged better under the principles of harmlessness and helpfulness.
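As an illustration, each quadruple can be represented as a simple record and read sample by sample. The Hugging Face repository id and field names below are assumptions for this sketch, not a documented schema.

from dataclasses import dataclass
from datasets import load_dataset  # pip install datasets

@dataclass
class PreferencePair:
    question: str
    image: object   # a PIL image in practice
    chosen: str     # response preferred under harmlessness and helpfulness
    rejected: str   # the dispreferred response

ds = load_dataset("sqrti/SPA-VL", split="train")  # assumed repository id
ex = ds[0]
pair = PreferencePair(ex["question"], ex["image"], ex["chosen"], ex["rejected"])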

Results

Main Results. As shown in the first table, the models trained on our SPA-VL dataset, LLaVA-SPA-VL-DPO and LLaVA-SPA-VL-PPO, which are the best safety models from our training, exhibit superior safety performance. They surpass the baseline model LLaVA-1.5 (7B) and other open-source models, whether or not those models have undergone safety alignment. Specifically, our models achieve the best safety results on the MM-SafetyBench, AdvBench, and HarmEval tests. Notably, the LLaVA-HH-Harmless-PPO model, trained on the purely language-based Anthropic Harmless preference dataset, performs well on the AdvBench dataset and the text-only components of MM-SafetyBench. However, its performance drops significantly in safety tests involving images. This underscores the necessity of incorporating image data into safety alignment datasets for VLMs. In addition to evaluating safety performance, we also validate our models' general ability.

Figure 4: General Ability
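For reference, the core of the DPO objective behind these alignment runs can be written in a few lines. This is the standard DPO loss on summed response log-probabilities, not the paper's exact training code; beta is the usual strength of the implicit KL penalty.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are log-probability ratios against the frozen reference model.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()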

Data Scale. We study the impact of varying amounts of data on the performance of alignment models. Across different data quantities (around 100, 1k, 5k, 10k, 30k, and 90k samples), we conduct experiments encompassing various evaluation metrics.

Figure 5: Data Scale
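One way to draw these subsets is to shuffle once with a fixed seed and take nested prefixes, so every smaller subset is contained in the larger ones; the nesting is an assumption of this sketch, not a detail stated above.

import random

def nested_subsets(data, sizes=(100, 1_000, 5_000, 10_000, 30_000, 90_000), seed=0):
    # Shuffle once, then reuse prefixes as the scaled training subsets.
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}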

Response Model Selection. We examine the impact of response diversity and safety in our dataset on model training. We conduct four groups of experiments, each trained using DPO on around 10K samples. The Safe group consists of response pairs from the three safest models (InternLMXComposer, QwenVL, Gemini 1.0 Pro Vision). The Relatively Safe group includes pairs from relatively safe models (LAMM_SFT, LLaVA-1.5, InternLMXComposer, QwenVL, Gemini). The Unsafe group comprises pairs from the five least safe models (mPLUG-Owl, Otter, InstructBLIP, LLaMA-Adapter-v2, Gemini-Jailbreak), and the All group consists of the complete set of 12 models.

Figure 6: Response Model Selection.
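The four groups can be reproduced by filtering preference pairs on which VLMs generated the two responses. The metadata field names below ("chosen_model", "rejected_model") are assumptions for illustration.

SAFE_MODELS = {"InternLMXComposer", "QwenVL", "Gemini 1.0 Pro Vision"}
UNSAFE_MODELS = {"mPLUG-Owl", "Otter", "InstructBLIP", "LLaMA-Adapter-v2", "Gemini-Jailbreak"}

def filter_by_models(dataset, allowed):
    # Keep only pairs where both responses come from the allowed model set.
    return [ex for ex in dataset
            if ex["chosen_model"] in allowed and ex["rejected_model"] in allowed]

# safe_group = filter_by_models(dataset, SAFE_MODELS)
# unsafe_group = filter_by_models(dataset, UNSAFE_MODELS)
# all_group = list(dataset)  # responses from the full set of 12 models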

Question Types. We also analyze the impact of three different question types (easy questions, hard questions, and hard statements) on the experimental results, comparing each individual result with the combined result of all three question types. For each experiment, we select a training dataset of approximately 10k instances and use DPO to train our model.

Figure 7: Question Types.
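Per-type subsets of roughly 10k instances can be drawn as sketched below; the "question_type" field and its label values are assumed for illustration.

import random
from collections import defaultdict

def subsets_by_question_type(dataset, n=10_000, seed=0):
    # Bucket examples by question type, then sample about n from each bucket.
    buckets = defaultdict(list)
    for ex in dataset:
        buckets[ex["question_type"]].append(ex)  # e.g. "easy", "hard", "hard_statement" (assumed)
    rng = random.Random(seed)
    subsets = {t: rng.sample(v, min(n, len(v))) for t, v in buckets.items()}
    subsets["all"] = rng.sample(list(dataset), min(n, len(dataset)))
    return subsets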

BibTeX

@misc{zhang2024spavlcomprehensivesafetypreference,
      title={SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model},
      author={Yongting Zhang and Lu Chen and Guodong Zheng and Yifeng Gao and Rui Zheng and Jinlan Fu and Zhenfei Yin and Senjie Jin and Yu Qiao and Xuanjing Huang and Feng Zhao and Tao Gui and Jing Shao},
      year={2024},
      eprint={2406.12030},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.12030},
}