GLM-4.5: Open-Source Model Challenging Proprietary AI

Benchmarks, scale, capabilities, and real-world implications from the report

Benchmark Performance Overview

3rd
Overall Rank
Position across a comprehensive suite of 12 key benchmarks
12
Benchmarks Covered
Scope of the comparative evaluation
90.6%
Tool Calling
Best-in-class success rate for tool usage
+7.6 pp
Web Browsing Lead
BrowseComp advantage over Claude 4 Opus (26.4% vs 18.8%)
98.2%
Math Peak
MATH 500 score, matching Claude 4 Opus
64.2%
Coding (SWE-bench Verified)
Competitive but not leading on this coding benchmark

Web Browsing (BrowseComp) — Higher Is Better

On the BrowseComp benchmark for web browsing tasks, GLM-4.5 achieves 26.4% vs Claude 4 Opus at 18.8%, a +7.6 percentage point advantage. This supports the report’s claim that GLM-4.5 outperforms Claude 4 Opus on web browsing.

Tool Calling Success Rate — Higher Is Better

GLM-4.5 posts a best-in-class tool-calling success rate of 90.6%, edging out Claude 4 Sonnet at 89.5%. This validates the model’s strength in agentic, tool-using workflows.

Reasoning (MMLU Pro) — Higher Is Better

On MMLU Pro, GLM-4.5 scores 84.6%, close to Gemini 2.5 Pro (86.2%) and Claude 4 Opus (87.3%). This places GLM-4.5 in a strong middle tier for reasoning.

Mathematical Reasoning (AIME24) — Higher Is Better

On AIME24, GLM-4.5 reaches 91.0%, trailing Grok 4 at 94.3%. While not the top score, the performance is competitive with frontier models.

Mathematics (MATH 500) — Higher Is Better

GLM-4.5 achieves 98.2% on MATH 500, matching Claude 4 Opus. Per the report, this ties for the top score on this benchmark.

Coding — SWE-bench Verified

On SWE-bench Verified, GLM-4.5 scores 64.2% vs Claude 4 Sonnet at 70.4%. The model is solid but not leading on this coding benchmark.

Coding — Terminal-Bench

On Terminal-Bench, GLM-4.5 posts 37.5% vs Claude 4 Opus at 43.2%. While not the top performer, the report notes GLM-4.5’s strengths in full-stack and complex artifact generation.

Key Insights

GLM-4.5’s standout competencies lie in agentic tasks: it leads on web browsing, achieves best-in-class tool-calling success, and delivers near-frontier reasoning—matching top-tier math performance on MATH 500 and posting strong MMLU Pro and AIME24 scores. Coding benchmarks show solid, if not leading, results, aligning with the report’s characterization of strong full-stack capabilities.

Scale and Context Capabilities

128,000 tokens
Context Window
Strong multi-file understanding within large contexts
96
Attention Heads
2.5× the typical count; the report links this to improved reasoning scores
2
Processing Modes
Thinking (deliberate) and Non-Thinking (instant)
2
Model Variants
GLM-4.5 and GLM-4.5-Air

Parameters by Variant (Total vs Active)

Both variants use a Mixture of Experts (MoE). GLM-4.5 has 355B total/32B active parameters; GLM-4.5-Air has 106B total/12B active. The large gap between total and active reflects MoE efficiency—only a subset of experts are active per token, enabling faster inference without sacrificing capacity.
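
As a quick sanity check on those figures, the sketch below derives the active fraction for each variant from the report's parameter counts (purely illustrative arithmetic):

```python
# Back-of-the-envelope view of MoE sparsity using the report's figures.
VARIANTS = {
    "GLM-4.5": {"total_b": 355, "active_b": 32},
    "GLM-4.5-Air": {"total_b": 106, "active_b": 12},
}

for name, p in VARIANTS.items():
    inactive = p["total_b"] - p["active_b"]
    frac = p["active_b"] / p["total_b"]
    print(f"{name}: {p['active_b']}B of {p['total_b']}B active per token "
          f"({frac:.0%} of capacity, {inactive}B idle)")
```

Roughly 9% (GLM-4.5) and 11% (GLM-4.5-Air) of total parameters do work on any given token, which is where the speed-versus-capacity trade-off comes from.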

Active vs Inactive Parameters — GLM-4.5

GLM-4.5 activates 32B parameters per token, leaving 323B inactive at any given step. This illustrates the efficiency of the deeper, narrower MoE design highlighted in the report.

Active vs Inactive Parameters — GLM-4.5-Air

GLM-4.5-Air activates 12B of 106B total parameters per token (94B inactive), providing a resource-friendlier alternative for constrained environments.

Context Window Size Comparison

GLM-4.5 offers a 128,000-token context window. The report notes that Gemini offers over 1 million tokens, so the comparison here treats 1,000,000 as a conservative floor. Despite the smaller window, GLM-4.5 demonstrated strong multi-file coherence in the report's practical tests.

Key Insights

The MoE design enables high capacity with limited active parameters per token, balancing speed and accuracy. While Gemini’s context window exceeds 1M tokens, GLM-4.5’s 128K window proved practically effective for multi-file coding tasks in the report’s tests.

Agent-Native Capabilities and Function Calling

GLM-4.5 is described as an agent-native model with dual processing modes: Thinking Mode for step-by-step, complex reasoning and Non-Thinking Mode for instant responses to straightforward queries. Native function calling is built in, making the model well-suited for agentic applications without external frameworks. In practical testing, the model handled multi-file code analysis (~2000 lines) coherently and successfully merged charts across three HTML files—tasks requiring reasoning, tool use, and cross-file dependency tracking. The report notes that comparable systems (Microsoft Copilot and Google Gemini) failed on the chart integration task, reinforcing GLM-4.5’s strength in agentic workflows.
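
To make that concrete, here is a minimal tool-calling sketch against an OpenAI-compatible endpoint. The base URL, model identifier, and the fetch_page tool are illustrative assumptions, not documented values; consult the Z.ai documentation for the real ones.

```python
from openai import OpenAI

# Base URL and model name are assumptions for illustration;
# check the Z.ai documentation for the actual endpoint and ids.
client = OpenAI(base_url="https://api.z.ai/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "fetch_page",  # hypothetical tool, defined by the caller
        "description": "Fetch the text content of a web page.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.5",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize the GLM-4.5 release notes."}],
    tools=tools,
)

# If the model decides to use the tool, the call arrives as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

Because the API is OpenAI-compatible, agent frameworks that already speak this protocol should work without modification.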

Real-World Applications

Agentic
1
Agentic Coding
Integrates with coding toolkits like Claude Code, Roo Code, and CodeGeex. Demonstrates end-to-end web app development across frontend, backend, and databases.
Research
2
Web Browsing and Research
Conducts complex research, gathering and synthesizing information from multiple sources using native browsing capabilities.
Creative
3
Content Creation
Produces slides, posters, and presentation materials; performance improves when paired with retrieval and agentic tools.
Advanced
4
Complex Code Generation
Builds interactive mini-games and physics simulations across multiple languages and formats.

Limitations and Considerations

Hardware
1
Resource Requirements
Running the full GLM-4.5 requires significant compute. GLM-4.5-Air is a more accessible alternative for constrained environments.
Scope
2
Specialized Training Focus
RL training emphasizes verifiable tasks (information-seeking QA, software engineering), potentially limiting performance in highly specialized domains.
Performance
3
Benchmark Gaps
Competitive overall but not state-of-the-art on all benchmarks; top models like OpenAI’s o3 lead in some areas.
Caveat
4
Language Model Limits
Can hallucinate, especially on very recent or highly specialized topics.

Availability and Access

Platform
1
Z.ai Platform
Direct web access at https://z.ai
API
2
API Access
OpenAI-compatible API for application integration (Z.ai Documentation, 2025).
Open-Source
3
Open Weights
Available on HuggingFace and ModelScope for local deployment.
Serving
4
Inference Frameworks
Supports vLLM and SGLang for efficient serving (a minimal vLLM sketch follows this list).
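
As a sketch of local serving, the snippet below loads the smaller variant with vLLM's offline API. The HuggingFace model id and the tensor-parallel degree are assumptions; check the model card for the exact repo name and hardware guidance.

```python
from vllm import LLM, SamplingParams

# Model id and parallelism are assumptions; even GLM-4.5-Air's 106B total
# parameters need multiple GPUs, so size tensor_parallel_size accordingly.
llm = LLM(model="zai-org/GLM-4.5-Air", tensor_parallel_size=4)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain mixture-of-experts routing in two sentences."], params
)
print(outputs[0].outputs[0].text)
```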

Toolkits GLM-4.5 can integrate with for agentic coding workflows

Coding Toolkits Mentioned

Claude Code: Agentic coding toolkit integration
Roo Code: Coding assistant compatibility
CodeGeex: Ecosystem integration

Technical Architecture Highlights

🧠
Deeper, Narrower MoE
Increases depth while reducing width to boost reasoning capacity.
🎯
Enhanced Attention
96 attention heads (2.5× typical) consistently improve reasoning benchmarks.
⚡
QK-Norm + Speculative Decoding
Stabilizes training and accelerates inference (a minimal QK-Norm sketch follows these highlights).
🚀
Muon Optimizer
Faster convergence and larger batch sizes during training.
🔌
Native Function Calling
Built-in tool use for agentic applications without external frameworks.
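
For intuition, here is a minimal PyTorch sketch of the QK-Norm idea mentioned above: queries and keys are RMS-normalized per head before the dot product, which bounds attention logits and stabilizes training. This is the generic, parameter-free form, not GLM-4.5's exact implementation (production variants typically use learnable norms).

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, eps=1e-6):
    """Attention with RMS-normalized queries and keys (generic QK-Norm)."""
    # Normalize along head_dim so dot-product logits stay bounded.
    q = q / (q.pow(2).mean(dim=-1, keepdim=True).sqrt() + eps)
    k = k / (k.pow(2).mean(dim=-1, keepdim=True).sqrt() + eps)
    return F.scaled_dot_product_attention(q, k, v)

# Toy shapes: batch 2, 96 heads (as the report cites for GLM-4.5), seq 8.
q, k, v = (torch.randn(2, 96, 8, 64) for _ in range(3))
print(qk_norm_attention(q, k, v).shape)  # torch.Size([2, 96, 8, 64])
```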

Multi-File Chart Integration Test (Workflow)

1
Provide Inputs
Three HTML files with charts are supplied (Files 1, 2, and 3).
2
Set Objective
Take one chart from File 1 and another from File 2, then merge both into File 3.
3
Analyze Structure
Parse HTML, identify chart libraries, and map dependencies and references.
4
Extract Charts
Locate and isolate the target chart definitions from Files 1 and 2.
5
Integrate into Target
Insert charts into File 3, resolving scripts, styles, and data bindings (a minimal sketch of steps 3-5 follows this workflow).
6
Validate Output
Verify rendering and cross-file references. Result: GLM-4.5 succeeds; Copilot and Gemini fail per the report.
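
For intuition about what steps 3-5 involve, here is a minimal Python sketch using BeautifulSoup. File names and element ids are hypothetical, and per the report GLM-4.5 performed this analysis itself rather than executing a script:

```python
from bs4 import BeautifulSoup

def move_chart(src_path, dst_soup, chart_id):
    """Copy a chart container and its inline init script from src into dst."""
    with open(src_path, encoding="utf-8") as f:
        src = BeautifulSoup(f.read(), "html.parser")
    container = src.find(id=chart_id)  # the chart's <div> or <canvas>
    script = src.find("script", string=lambda s: s and chart_id in s)
    if container is not None:
        dst_soup.body.append(container)
    if script is not None:
        dst_soup.body.append(script)

with open("file3.html", encoding="utf-8") as f:  # hypothetical target file
    target = BeautifulSoup(f.read(), "html.parser")

move_chart("file1.html", target, "chart-a")  # hypothetical element ids
move_chart("file2.html", target, "chart-b")

with open("file3.html", "w", encoding="utf-8") as f:
    f.write(str(target))
```

The hard part the test probes is not the mechanical copy but resolving chart-library includes, styles, and data bindings across files, which this sketch deliberately glosses over.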

Open-Source Advantage and Market Impact

GLM-4.5’s open weights, local deployability, and strong performance profile create significant economic and strategic benefits. Organizations avoid recurring API costs, enable sovereign and compliant deployments, and gain the ability to fine-tune for domain needs. Transparency allows auditing for security and bias—capabilities often unavailable with closed models. These attributes exert competitive pressure on proprietary vendors and may accelerate innovation by reducing barriers to research and integration. The report concludes that GLM-4.5 ranks 3rd overall among leading systems while offering unprecedented accessibility, marking a pivotal moment for open-source AI.

Key Facts Summary

- Variants: GLM-4.5 (355B total / 32B active), GLM-4.5-Air (106B total / 12B active)
- Context: 128K tokens; Gemini exceeds 1M tokens
- Attention: 96 heads (2.5× typical)
- Agent Modes: Thinking and Non-Thinking; native function calling
- Benchmarks: 3rd overall across 12 metrics; best-in-class tool calling (90.6%); web browsing lead (26.4% vs 18.8%); strong reasoning (MMLU Pro 84.6%, AIME24 91.0%, MATH 500 98.2%); coding solid (SWE-bench 64.2%, Terminal-Bench 37.5%)
- Access: Z.ai platform, OpenAI-compatible API, open weights on HuggingFace/ModelScope, supports vLLM and SGLang