💡
Key Empirical Insights
1. Unsafe images pose greater risks than unsafe text: Analysis shows that UIST scenarios consistently yield higher ASRs compared to UIUT and SIUT conditions across all models and judges, indicating VLMs' heightened vulnerability to unsafe visual inputs.
2. Open-weight VLMs
show highest vulnerability: These models exhibit the highest ASRs
(52-78%) with refusal rates of 0.5-2.5% on safe inputs, demonstrating
significant safety challenges.
3. Closed-weight VLMs
achieve moderate safety: While showing improved safety (e.g.,
Claude-3.5-Sonnet), these models still face challenges with ASRs up to 67% under
certain judges, though maintaining low refusal rates (0-1.7%).
4. Safety-tuned VLMs
achieve the lowest ASRs overall,
albeit with modestly higher refusal rates.: SafeLLaVA models achieve the
lowest ASRs (below 7% under Claude, below 15% under other judges), with
SafeLLaVA-13B reaching just 1.09% ASR, albeit with a slightly higher refusal
rate of 6.09%.
5. Judge consistency in model ranking: While absolute metrics vary by
judge, the relative safety ranking (open-weight ≫ closed-weight ≫ safety-tuned)
remains consistent across all evaluation methods.
6. Strong correlation with string matching: Automatic string matching
shows high concordance with AI judges (ρ=0.98 with GPT-4o/Gemini), suggesting
its viability as a cost-effective safety evaluation method.