The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open-source (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness.
Data Statistics. Our SPA-VL dataset comprises four parts: the training set, the validation set, and two test sets, HarmEval and HelpEval, which are used to evaluate harmfulness and helpfulness, respectively. The four parts contain 93,258, 7,000, 265, and 265 samples, respectively. The following table shows the statistics of the training set.
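For readers who want to inspect these splits programmatically, below is a minimal sketch using the Hugging Face `datasets` library. The repository id `sqrti/SPA-VL` and the split names are assumptions, so verify them against the dataset card before use.

```python
from datasets import load_dataset

# Assumed Hugging Face repo id and split name; check the dataset card for the
# exact identifiers of the training/validation splits and the two test sets.
train = load_dataset("sqrti/SPA-VL", split="train")

print(len(train))        # expected to be on the order of 93k samples
print(train[0].keys())   # expect question, image, chosen, rejected fields
```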
Diverse Domains. A diverse and representative set of images is essential for training models to handle vision data safely and effectively. Our primary challenge is ensuring diversity while maintaining relevance to harmful content categories. To address this, we establish a comprehensive harm content categorization framework. As shown in the figure above, SPA-VL adopts 6 primary domains, 13 secondary categories, and 53 tertiary categories, ensuring comprehensive coverage and granularity for precise harm detection and response alignment.
Data Formats. We gather preference data by choosing the better of two responses generated by VLMs, judged against predefined criteria of harmlessness and helpfulness. Each sample is therefore a quadruple (question, image, chosen response, rejected response), where the chosen response is the one preferred under the principles of harmlessness and helpfulness.
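As a rough illustration (not the exact schema), one record can be viewed as the structure below, ready to be handed to a preference-optimization trainer such as DPO; the field names are assumptions based on the quadruple just described.

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class PreferenceSample:
    """One SPA-VL preference record (field names are illustrative, not the exact schema)."""
    question: str        # query paired with the image, possibly harmful in intent
    image: Image.Image   # visual context for the question
    chosen: str          # response preferred under harmlessness and helpfulness
    rejected: str        # the less harmless / less helpful response

def to_dpo_example(sample: PreferenceSample) -> dict:
    # Shape commonly expected by preference-optimization trainers (e.g., DPO).
    return {
        "prompt": sample.question,
        "images": [sample.image],
        "chosen": sample.chosen,
        "rejected": sample.rejected,
    }
```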
Main Results. As shown in the first table, the models trained on our SPA-VL dataset, LLaVA-SPA-VL-DPO and LLaVA-SPA-VL-PPO, which are the best safety models from our training, exhibit superior safety performance. They surpass the baseline model LLaVA-1.5 (7B) and other open-source models, whether or not those models have undergone safety alignment. Specifically, our models achieve the best safety results on the MM-SafetyBench, AdvBench, and HarmEval tests. Notably, the LLaVA-HH-Harmless-PPO model, trained on the purely language-based Anthropic Harmless preference dataset, performs well on AdvBench and on the text-only components of MM-SafetyBench. However, its performance drops significantly on safety tests involving images. This underscores the necessity of incorporating image data into safety alignment datasets for VLMs. In addition to evaluating safety performance, we also validate our models' general abilities.
Data Scale. We investigate the impact of varying amounts of training data on the performance of alignment models. Across different data quantities (around 100, 1k, 5k, 10k, 30k, and 90k samples), we conduct experiments covering various evaluation metrics.
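A minimal sketch of how such subsets could be drawn is shown below. Whether the subsets in the experiments were nested or sampled independently is not stated here, so the nesting is an illustrative choice.

```python
import random

# Candidate subset sizes mirroring the quantities discussed above.
SIZES = [100, 1_000, 5_000, 10_000, 30_000, 90_000]

def nested_subsets(train_samples, sizes=SIZES, seed=0):
    """Draw one random permutation and take prefixes, so each larger subset
    contains every smaller one (an illustrative choice, not the paper's recipe)."""
    order = list(range(len(train_samples)))
    random.Random(seed).shuffle(order)
    return {n: [train_samples[i] for i in order[:n]]
            for n in sizes if n <= len(train_samples)}
```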
Response Model Selection. We examine how the diversity and safety of the responses in our dataset affect model training. We conduct four groups of experiments; each group is trained with DPO on around 10K samples. The Safe Group consists of response pairs from the three safest models (InternLMXComposer, QwenVL, Gemini 1.0 Pro Vision). The Relatively Safe Group includes pairs from relatively safe models (LAMM_SFT, LLaVA-1.5, InternLMXComposer, QwenVL, Gemini). The Unsafe Group comprises pairs from the five least safe models (mPLUG-Owl, Otter, InstructBLIP, LLaMA-Adapter-v2, Gemini-Jailbreak), and the All Group consists of the complete set of 12 models.
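This grouping can be reproduced with a simple filter over the response pairs. In the sketch below, the `chosen_model` and `rejected_model` field names are assumptions about the schema, and the model-name strings would need to match the dataset's actual identifiers.

```python
# Model groups as described above (names are illustrative identifiers).
SAFE_GROUP = {"InternLMXComposer", "QwenVL", "Gemini 1.0 Pro Vision"}
UNSAFE_GROUP = {"mPLUG-Owl", "Otter", "InstructBLIP", "LLaMA-Adapter-v2", "Gemini-Jailbreak"}

def pairs_from_group(samples, allowed_models):
    """Keep only preference pairs whose two responses both come from the given model group."""
    return [
        s for s in samples
        if s["chosen_model"] in allowed_models and s["rejected_model"] in allowed_models
    ]
```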
Question Types. We also analyze the impact of three different question types (easy questions, hard questions, and hard statements) on the experimental results, and compare these individual results with the combined results of all three question types. For each experiment, we select a training set of approximately 10k instances and train the model with DPO.
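A sketch of how the per-type training pools could be drawn is given below. The `question_type` field and its string values are assumptions named after the three types described above, not the dataset's confirmed schema.

```python
import random

QUESTION_TYPES = ("easy_question", "hard_question", "hard_statement")

def sample_by_question_type(samples, qtype, n=10_000, seed=0):
    """Filter to one question type and draw ~n instances for a DPO run."""
    pool = [s for s in samples if s.get("question_type") == qtype]
    return random.Random(seed).sample(pool, min(n, len(pool)))
```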
@misc{zhang2024spavlcomprehensivesafetypreference,
title={SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model},
author={Yongting Zhang and Lu Chen and Guodong Zheng and Yifeng Gao and Rui Zheng and Jinlan Fu and Zhenfei Yin and Senjie Jin and Yu Qiao and Xuanjing Huang and Feng Zhao and Tao Gui and Jing Shao},
year={2024},
eprint={2406.12030},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.12030},
}