Purdue University Researchers Introduce ETA: A Two-Phase AI Framework for Enhancing Safety in Vision-Language Models During Inference
Vision-language models (VLMs) represent an advanced field within artificial intelligence, integrating computer vision and natural language processing to handle multimodal data. These models allow systems to simultaneously understand and process images and text, enabling applications like medical imaging, automated systems, and digital content analysis. Their ability to bridge the gap between visual & textual data […] The post Purdue University Researchers Introduce ETA: A Two-Phase AI Framework for Enhancing Safety in Vision-Language Models During Inference appeared first on MarkTechPost.
Vision-language models (VLMs) represent an advanced field within artificial intelligence, integrating computer vision and natural language processing to handle multimodal data. These models allow systems to simultaneously understand and process images and text, enabling applications like medical imaging, automated systems, and digital content analysis. Their ability to bridge the gap between visual & textual data has made them a cornerstone of multimodal intelligence research.
One of the main challenges facing the development of VLMs involves the safety guarantee of their output. Visual input streams are bound to contain malicious or unsafe information that can sometimes evade model defense mechanisms, resulting in dangerous or insensitive answers. Stronger textual countermeasures provide protection, but such is not yet the case in visual modalities because visual embedding is continuous and, therefore, vulnerable to such attacks. In this respect, the task becomes much tougher to evaluate under multimodal input streams, especially regarding safety.
Current approaches to VLM safety are mainly fine-tuning and inference-based defenses. The fine-tuning methods include supervised fine-tuning and reinforcement learning from human feedback, which is effective but very resource-intensive, using lots of data, labor, and computational power. These methods also risk compromising the model’s general utility. Inference-based methods use safety evaluators to check the outputs against predefined criteria. These methods, however, focus mostly on textual inputs, neglecting the safety implications of visual content. Thus, unsafe visual inputs are passed through without evaluation, leading to poor model performance and reliability.
Researchers from Purdue University introduced the “Evaluating Then Aligning” (ETA) framework to address these issues. This new inference-time method ensures VLM safety without needing extra data or fine-tuning. ETA addresses the limitations of current methods by breaking down the safety mechanism into two phases, namely multimodal evaluation and bi-level alignment. The researchers designed ETA as a plug-and-play solution adaptable to various VLM architectures while maintaining computational efficiency.
The ETA framework works in two stages. The pre-generation evaluation stage checks the safety of visual inputs by applying a predefined safety guard that relies on CLIP scores. It thus filters out potentially harmful visual content before generating the response. Then, in the post-generation evaluation stage, a reward model is used to evaluate the safety of textual outputs. If unsafe behavior is detected, the framework applies two alignment strategies. Shallow alignment uses interference prefixes to shift the model’s generative distribution toward safer outputs, while deep alignment conducts sentence-level optimization to refine responses further. This combination ensures both the safety and utility of the generated outputs.
The researchers extensively tested ETA against multiple benchmarks to evaluate its performance. The framework reduced the unsafe response rate by 87.5% in cross-modality attacks and significantly outperformed existing methods like ECSO. ETA markedly improved in experiments on the SPA-VL Harm dataset, decreasing the unsafe rate from 46.04% to 16.98%. In multimodal datasets such as MM-SafetyBench and FigStep, ETA reliably showed better mechanisms for safety in handling adversarial and harmful visual inputs. Notably, it achieved a win-tie rate of 96.6% in GPT-4 evaluations for helpfulness, showing its ability to maintain model utility and enhanced safety. The researchers also demonstrated the framework’s efficiency, adding only 0.1 seconds to inference time compared to the 0.39 seconds overhead of competing methods like ECSO.
This is how the proposed method achieves safety and utility through the root cause of vulnerabilities in VLMs: the continuous nature of visual token embeddings. The ETA framework aligns visual and textual data such that existing safety mechanisms can work effectively by mapping visual token embeddings into discrete textual embeddings. This ensures that the visual and textual inputs are put through rigorous safety checks, making it impossible for harmful content to escape.
Through their work, the research team has provided one of the scalable and efficient solutions to one of the most challenging tasks in multimodal AI systems. The ETA framework shows how strategic evaluation and alignment strategies can flip VLM safety while retaining all their general capabilities. This advancement deals with current safety and lays the foundation for further developments and the deployment of VLMs with much more confidence in real applications.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 65k+ ML SubReddit.
What's Your Reaction?