AI Social Engineering: GPT-5 Jailbreaks and Safety Gaps

Findings from SPLX and NeuralTrust testing following the August 2025 GPT-5 release

Security Test Outcomes at a Glance

89%
Raw Failure Rate
Unprotected GPT-5 failure rate across adversarial tests
43%
Basic Prompt Failure Rate
Failures with standard ChatGPT safety instructions applied
45%
Hardened (SPLX) Failure Rate
Failures with specialized prompt hardening
97%
GPT-4o Hardened Safety
Older model outperformed GPT-5 when hardened
1 day
Time to First Jailbreak
Jailbreaks published the day after GPT-5's release
1,000+
Adversarial Prompts Tested
Scale of SPLX's systematic testing

GPT-5 Failure Rates by Configuration

SPLX's tests across 1,000+ adversarial prompts showed high failure rates for GPT-5 under all configurations. The raw model failed 89% of tests, indicating extreme vulnerability without protections. Even with OpenAI's standard system prompt, failures remained at 43%. SPLX's hardened configuration did not outperform the default settings, failing 45% of tests, underscoring persistent weaknesses against jailbreak techniques.

Hardened Safety Comparison: GPT-4o vs GPT-5

Under identical hardened testing conditions, GPT-4o achieved 97% safety versus GPT-5's 55%. This highlights a counterintuitive result: the newer, more capable GPT-5 exhibited substantially lower resilience to adversarial prompts compared to the older GPT-4o, emphasizing that capability advances do not automatically translate into stronger safety.

Raw GPT-5: Fail vs Pass Distribution

In its unprotected state, GPT-5 failed 89% of adversarial tests and passed only 11%. This distribution demonstrates why a "naked" deployment is unsuitable for safe use and why layered, continuously updated defenses are essential.

Key Insights

GPT-5 demonstrated significant susceptibility to social engineering-style jailbreaks, with high failure rates even under hardened prompts. The older GPT-4o achieved superior safety when hardened (97%) compared to GPT-5 (55%), suggesting that increased reasoning capability can widen the gap between capability and control. Rapid post-release compromise underscores the need for safety built into model architecture and continuously updated defenses.

AI Social Engineering and the Paradox of Capability vs Control

Researchers showed that as models like GPT-5 become more capable, they can also become easier to manipulate through sophisticated, context-driven attacks. Techniques such as obfuscation and multi-turn narrative steering exploit the model’s helpfulness, memory, and desire for consistency, turning strengths into vulnerabilities. The rapid jailbreaks following GPT-5’s launch highlight that safety must be embedded at the architectural level and updated continuously. Notably, GPT-4o’s hardened setup outperformed GPT-5 on identical tests, indicating that scaling model capability alone does not ensure safer behavior. The emerging era of AI social engineering demands defenses that understand and counter cognitive vulnerabilities, not just surface-level filters.

The Arsenal of Attack Techniques

Obfuscation
1
StringJoin Obfuscation Attack
Harmful intent is split into character-by-character fragments with hyphens, framed as an "encryption challenge" to coax the model into "decrypting" the request.
Harmful intent is split into character-by-character fragments with hyphens, framed as an "encryption challenge" to coax the model into "decrypting" the request.
Narrative Exploit
2
Echo Chamber (Multi-turn Manipulation)
Attackers exploit conversational memory and narrative consistency, gradually steering the model toward dangerous outputs while keeping each single turn seemingly benign.
Attackers exploit conversational memory and narrative consistency, gradually steering the model toward dangerous outputs while keeping each single turn seemingly benign.

How the Echo Chamber Attack Progresses

1
Initiate with Innocuous Task
Request simple sentence generation with benign and sensitive terms mixed together.
2
Leverage Model Outputs
Feed prior responses back into the conversation to increase specificity gradually.
3
Apply Fictional Framing
Present the exchange as part of a fictional survival narrative.
4
Slip Past Single-Turn Filters
Each prompt appears benign; the cumulative arc elicits prohibited content.

Release-to-Breach Timeline and Test Flow

1
Launch Day
August 7, 2025: OpenAI and Microsoft release GPT-5 with improved reasoning and safety claims.
2
Rapid Jailbreaks
August 8, 2025: SPLX and NeuralTrust publish working jailbreaks, revealing immediate vulnerabilities.
3
Systematic Adversarial Testing
SPLX runs 1,000+ prompts across three configurations to quantify safety performance.
4
Cross-Model Comparison
GPT-4o vs GPT-5 under hardened settings: 97% vs 55% safety, respectively.
5
Enterprise Implications
SPLX deems raw GPT-5 nearly unusable out of the box, citing risks to data security, compliance, and reputation.

Industry-Wide Vulnerabilities

⚠️
GLM-4.5
Similar jailbreak vulnerabilities observed, indicating broader industry patterns.
🧩
Kimi K2
Exhibits mirrored weaknesses under adversarial prompting.
🔓
Grok 4
Findings suggest susceptibility to social engineering-style attacks.
🌐
Systemic Weaknesses
Capability growth is outpacing security across leading LLMs.

Enterprise Risks and Implications

⚖️
Liability Exposure
Outputs that contravene policy or law can create legal exposure for organizations.
🔐
Data Security Concerns
Manipulations can coax models into leaking proprietary or sensitive information.
📉
Reputation Damage
Unsafe or inappropriate responses erode trust with customers and stakeholders.
📝
Regulatory Scrutiny
Failures in safety and alignment may trigger audits and compliance actions.

Concise highlights grounded in reported data and observations.

Key Takeaways

⏱️
Rapid compromise
GPT-5 was jailbroken within 1 day of release.
⚠️
High failure rates
89% (raw), 43% (basic safety), 45% (hardened) across 1000+ adversarial prompts.
🏆
Older model stronger
GPT-4o showed 97% safety when hardened vs GPT-5’s 55%.
🧩
Techniques to watch
StringJoin Obfuscation and Echo Chamber exploit narrative consistency and prompt reframing.
🏢
Enterprise caution
Out-of-the-box GPT-5 deemed nearly unusable for safe enterprise deployment.