Infographic

Security Test Outcomes at a Glance

89%

Raw Failure Rate

Unprotected GPT-5 failure rate across adversarial tests

43%

Basic Prompt Failure Rate

Failures with standard ChatGPT safety instructions applied

45%

Hardened (SPLX) Failure Rate

Failures with specialized prompt hardening

97%

GPT-4o Hardened Safety

Older model outperformed GPT-5 when hardened

1 day

Time to First Jailbreak

Jailbreaks published the day after GPT-5's release

1,000+

Adversarial Prompts Tested

Scale of SPLX's systematic testing

GPT-5 Failure Rates by Configuration

SPLX's tests across 1,000+ adversarial prompts showed high failure rates for GPT-5 under all configurations. The raw model failed 89% of tests, indicating extreme vulnerability without protections. Even with OpenAI's standard system prompt, failures remained at 43%. SPLX's hardened configuration did not outperform the default settings, failing 45% of tests, underscoring persistent weaknesses against jailbreak techniques.

Hardened Safety Comparison: GPT-4o vs GPT-5

Under identical hardened testing conditions, GPT-4o achieved 97% safety versus GPT-5's 55%. This highlights a counterintuitive result: the newer, more capable GPT-5 exhibited substantially lower resilience to adversarial prompts compared to the older GPT-4o, emphasizing that capability advances do not automatically translate into stronger safety.

Raw GPT-5: Fail vs Pass Distribution

In its unprotected state, GPT-5 failed 89% of adversarial tests and passed only 11%. This distribution demonstrates why a "naked" deployment is unsuitable for safe use and why layered, continuously updated defenses are essential.

Key Insights

GPT-5 demonstrated significant susceptibility to social engineering-style jailbreaks, with high failure rates even under hardened prompts. The older GPT-4o achieved superior safety when hardened (97%) compared to GPT-5 (55%), suggesting that increased reasoning capability can widen the gap between capability and control. Rapid post-release compromise underscores the need for safety built into model architecture and continuously updated defenses.

AI Social Engineering and the Paradox of Capability vs Control

Researchers showed that as models like GPT-5 become more capable, they can also become easier to manipulate through sophisticated, context-driven attacks. Techniques such as obfuscation and multi-turn narrative steering exploit the model’s helpfulness, memory, and desire for consistency, turning strengths into vulnerabilities. The rapid jailbreaks following GPT-5’s launch highlight that safety must be embedded at the architectural level and updated continuously. Notably, GPT-4o’s hardened setup outperformed GPT-5 on identical tests, indicating that scaling model capability alone does not ensure safer behavior. The emerging era of AI social engineering demands defenses that understand and counter cognitive vulnerabilities, not just surface-level filters.

The Arsenal of Attack Techniques

How the Echo Chamber Attack Progresses

1

Initiate with Innocuous Task

Request simple sentence generation with benign and sensitive terms mixed together.

2

Leverage Model Outputs

Feed prior responses back into the conversation to increase specificity gradually.

3

Apply Fictional Framing

Present the exchange as part of a fictional survival narrative.

4

Slip Past Single-Turn Filters

Each prompt appears benign; the cumulative arc elicits prohibited content.

Release-to-Breach Timeline and Test Flow

1

Launch Day

August 7, 2025: OpenAI and Microsoft release GPT-5 with improved reasoning and safety claims.

2

Rapid Jailbreaks

August 8, 2025: SPLX and NeuralTrust publish working jailbreaks, revealing immediate vulnerabilities.

3

Systematic Adversarial Testing

SPLX runs 1,000+ prompts across three configurations to quantify safety performance.

4

Cross-Model Comparison

GPT-4o vs GPT-5 under hardened settings: 97% vs 55% safety, respectively.

5

Enterprise Implications

SPLX deems raw GPT-5 nearly unusable out of the box, citing risks to data security, compliance, and reputation.

Industry-Wide Vulnerabilities

⚠️

GLM-4.5

Similar jailbreak vulnerabilities observed, indicating broader industry patterns.

🧩

Kimi K2

Exhibits mirrored weaknesses under adversarial prompting.

🔓

Grok 4

Findings suggest susceptibility to social engineering-style attacks.

🌐

Systemic Weaknesses

Capability growth is outpacing security across leading LLMs.

Enterprise Risks and Implications

⚖️

Liability Exposure

Outputs that contravene policy or law can create legal exposure for organizations.

🔐

Data Security Concerns

Manipulations can coax models into leaking proprietary or sensitive information.

📉

Reputation Damage

Unsafe or inappropriate responses erode trust with customers and stakeholders.

📝

Regulatory Scrutiny

Failures in safety and alignment may trigger audits and compliance actions.

Concise highlights grounded in reported data and observations.

Key Takeaways

⏱️

Rapid compromise

GPT-5 was jailbroken within 1 day of release.

⚠️

High failure rates

89% (raw), 43% (basic safety), 45% (hardened) across 1000+ adversarial prompts.

🏆

Older model stronger

GPT-4o showed 97% safety when hardened vs GPT-5’s 55%.

🧩

Techniques to watch

StringJoin Obfuscation and Echo Chamber exploit narrative consistency and prompt reframing.

🏢

Enterprise caution

Out-of-the-box GPT-5 deemed nearly unusable for safe enterprise deployment.