Attackers probing large language models rarely give up after a single refusal. They reframe their requests, build context across multiple turns, adopt different personas, and escalate gradually. This persistence makes their attacks more effective.
New research from Cisco’s AI threat intelligence team reveals a critical flaw. Safety benchmarks used across the industry often miss nearly all of this multi-turn attack behavior. As a result, there is a significant gap between published safety scores and the resilience that leading AI models actually show under attack.
The gap is wide enough that benchmark rankings can misorder leading models and give a false sense of security. The study compares single-turn and multi-turn attack success rate (ASR) for each model, and reports approximate 95% confidence half-widths for the single-turn results.
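
To make the two quantities concrete, here is a minimal sketch of how an ASR and an approximate 95% confidence half-width could be computed from raw counts. It assumes the standard normal (Wald) approximation for a proportion; the article does not say which method the study used. The function names and the example counts are hypothetical and are not taken from the research.

```python
import math


def attack_success_rate(successes: int, attempts: int) -> float:
    """Fraction of attack attempts that succeeded (ASR)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return successes / attempts


def ci95_half_width(successes: int, attempts: int) -> float:
    """Approximate 95% confidence half-width for the ASR.

    Assumption: normal (Wald) approximation, 1.96 * sqrt(p * (1 - p) / n).
    This is one common choice, not necessarily the study's method.
    """
    p = attack_success_rate(successes, attempts)
    return 1.96 * math.sqrt(p * (1.0 - p) / attempts)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    successes, attempts = 20, 100
    asr = attack_success_rate(successes, attempts)
    half_width = ci95_half_width(successes, attempts)
    print(f"ASR = {asr:.1%} +/- {half_width:.1%}")
```

The half-width shrinks as the number of attempts grows, so a small single-turn test set yields a wide interval. Two models whose single-turn intervals overlap cannot be ranked confidently on that score alone.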
