Most LLM "red teaming" today is ad-hoc jailbreaking. Someone tries a few known prompt injection techniques, documents what worked, and calls it a day. That's not a methodology - it's a demo.
Real adversarial assessment of LLMs requires structure.
Phase 1: Reconnaissance
Before you attack the model, understand it. What's the system prompt? What tools does it have access to? What data sources feed its context? What's the intended behavior boundary?
This phase maps the model's operational profile - not just its technical surface.
Phase 2: Boundary Testing
Systematically probe the model's behavioral constraints. This isn't about finding one clever jailbreak. It's about understanding the shape of the guardrails:
- Where do they hold?
- Where do they bend?
- What categories of harmful output are filtered vs. which are not?
Phase 3: Attack Chain Development
The real value isn't in isolated bypasses - it's in attack chains that demonstrate business impact. A prompt injection that exfiltrates customer data. A guardrail bypass that generates legally actionable output. A tool manipulation that triggers unauthorized actions.
Phase 4: Monitoring Resilience
Can the organization detect these attacks? Most can't. We test detection and response capabilities alongside the model itself, because finding the vulnerability is only half the equation.
Why This Matters
ISO 42001 is here. Regulatory frameworks are being built around the assumption that organizations can demonstrate they've tested their AI systems against adversarial threats. Ad-hoc testing won't meet that bar.