Security Test Outcomes at a Glance
89%
Raw Failure Rate
Unprotected GPT-5 failure rate across adversarial tests
43%
Basic Prompt Failure Rate
Failures with standard ChatGPT safety instructions applied
45%
Hardened (SPLX) Failure Rate
Failures with specialized prompt hardening
97%
GPT-4o Hardened Safety
Older model outperformed GPT-5 when hardened
1 day
Time to First Jailbreak
Jailbreaks published the day after GPT-5's release
1,000+
Adversarial Prompts Tested
Scale of SPLX's systematic testing
GPT-5 Failure Rates by Configuration
SPLX's tests across 1,000+ adversarial prompts showed high failure rates for GPT-5 under every configuration. The raw model failed 89% of tests, indicating extreme vulnerability without protections. Even with OpenAI's standard system prompt applied, failures remained at 43%. SPLX's specialized prompt hardening fared no better, failing 45% of tests and underscoring persistent weaknesses against jailbreak techniques.
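The sketch below illustrates, in general terms, how per-configuration failure rates like these can be tallied over a fixed adversarial prompt set. The `run_model` and `violates_policy` callables are hypothetical placeholders, not SPLX's actual harness or judging logic.

```python
# Minimal sketch of tallying adversarial failure rates per configuration.
# `run_model` and `violates_policy` are hypothetical stand-ins, not SPLX's
# actual test harness or judging pipeline.

from typing import Callable, Iterable


def failure_rate(
    run_model: Callable[[str, str], str],    # (config_name, prompt) -> model response
    violates_policy: Callable[[str], bool],  # response -> True if the attack succeeded
    config_name: str,
    adversarial_prompts: Iterable[str],
) -> float:
    """Fraction of adversarial prompts that elicit a policy-violating response."""
    prompts = list(adversarial_prompts)
    failures = sum(1 for p in prompts if violates_policy(run_model(config_name, p)))
    return failures / len(prompts)


# Example usage, mirroring the reported setup of comparing three configurations
# over the same 1,000+ prompt set:
# for config in ("raw", "basic_system_prompt", "hardened"):
#     print(config, failure_rate(run_model, violates_policy, config, prompts))
```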
Hardened Safety Comparison: GPT-4o vs GPT-5
Under identical hardened testing conditions, GPT-4o achieved 97% safety versus GPT-5's 55%. This highlights a counterintuitive result: the newer, more capable GPT-5 exhibited substantially lower resilience to adversarial prompts compared to the older GPT-4o, emphasizing that capability advances do not automatically translate into stronger safety.
Raw GPT-5: Fail vs Pass Distribution
In its unprotected state, GPT-5 failed 89% of adversarial tests and passed only 11%. This distribution demonstrates why a "naked" deployment is unsuitable for safe use and why layered, continuously updated defenses are essential.
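As a rough picture of what "layered" defenses mean in practice, the sketch below wraps a model call in independent input and output checks. It is a generic illustration built on assumed placeholder check functions, not a description of any vendor's actual guardrail stack.

```python
# Generic sketch of a layered guard pipeline around an LLM call.
# The individual check functions are illustrative placeholders.

from typing import Callable, List

Check = Callable[[str], bool]  # returns True if the text is allowed


def guarded_completion(
    prompt: str,
    call_model: Callable[[str], str],
    input_checks: List[Check],
    output_checks: List[Check],
    refusal: str = "Request declined by policy.",
) -> str:
    # Layer 1: screen the incoming prompt before it reaches the model.
    if not all(check(prompt) for check in input_checks):
        return refusal
    response = call_model(prompt)
    # Layer 2: screen the model's output before it reaches the user.
    if not all(check(response) for check in output_checks):
        return refusal
    return response
```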
Key Insights
GPT-5 demonstrated significant susceptibility to social engineering-style jailbreaks, with high failure rates even under hardened prompts. The older GPT-4o achieved superior safety when hardened (97%) compared to GPT-5 (55%), suggesting that increased reasoning capability can widen the gap between capability and control. Rapid post-release compromise underscores the need for safety built into model architecture and continuously updated defenses.
AI Social Engineering and the Paradox of Capability vs Control
Researchers showed that as models like GPT-5 become more capable, they can also become easier to manipulate through sophisticated, context-driven attacks. Techniques such as obfuscation and multi-turn narrative steering exploit the model’s helpfulness, memory, and desire for consistency, turning strengths into vulnerabilities. The rapid jailbreaks following GPT-5’s launch highlight that safety must be embedded at the architectural level and updated continuously. Notably, GPT-4o’s hardened setup outperformed GPT-5 on identical tests, indicating that scaling model capability alone does not ensure safer behavior. The emerging era of AI social engineering demands defenses that understand and counter cognitive vulnerabilities, not just surface-level filters.
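For instance, a defense aware of StringJoin-style obfuscation might normalize separator-joined text before running policy checks. The sketch below is a simplified illustration of that idea with a deliberately narrow regex; it is neither the attack used against GPT-5 nor a production filter.

```python
import re

# Simplified illustration: collapse runs of single characters joined by
# separators (e.g. "d-e-c-o-d-e" or "d e c o d e") back into plain words,
# so downstream policy checks see the de-obfuscated text. Real attacks and
# defenses are considerably more varied than this.

_JOINED = re.compile(r"\b(?:\w[\s\-_.]){2,}\w\b")


def normalize_joined_chars(text: str) -> str:
    """Rewrite separator-joined character runs into contiguous words."""
    return _JOINED.sub(lambda m: re.sub(r"[\s\-_.]", "", m.group(0)), text)


assert normalize_joined_chars("please d-e-c-o-d-e this") == "please decode this"
```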
Key Takeaways
Concise highlights grounded in reported data and observations.
GPT-5 was jailbroken within 1 day of release.
Failure rates: 89% (raw), 43% (basic safety prompt), 45% (hardened) across 1,000+ adversarial prompts.
GPT-4o showed 97% safety when hardened vs GPT-5’s 55%.
Techniques such as StringJoin Obfuscation and Echo Chamber exploit prompt reframing and multi-turn narrative consistency.
Out-of-the-box GPT-5 was deemed nearly unusable for safe enterprise deployment.