Announcing the Open-Source Release of the Agentic Paper Review System

In January, we published “From Months to Days: AI-Assisted Peer Review with Human Oversight”, which described how a multi-agent AI system can assist with large-scale, structured review of academic papers while keeping human judgement at the center. Today, we’re open-sourcing the tool behind that article.

What it does

The Agentic Paper Review System deploys a team of specialized AI agents, each with a distinct role:

  • Specialist — reads each paper in full, once per evaluation criterion, extracting scores backed by evidence and direct quotes
  • Editor — synthesizes all specialist findings into a polished, structured review with a weighted overall score
  • Judge — when multiple AI models review the same paper and disagree, an independent judge adjudicates the conflict
  • Librarian — searches Semantic Scholar, arXiv, and other databases for related work to establish a baseline for novelty assessment
  • Fact-Checker — verifies suspicious claims (e.g., “first study of its kind”) through targeted literature searches
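
As an illustration of the Editor's weighted overall score, here is a minimal sketch that assumes the score is a weight-normalised average of the per-criterion scores. The class name, function name, criteria, weights, and 1-5 scale are hypothetical and are not taken from the released code, which may aggregate differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionScore:
    """One specialist finding (hypothetical structure, for illustration only)."""
    name: str      # criterion label, e.g. "methodology"
    score: float   # score on an assumed 1-5 scale
    weight: float  # relative importance chosen by the reviewer


def weighted_overall(scores: list[CriterionScore]) -> float:
    """Return the weight-normalised mean of the per-criterion scores."""
    total_weight = sum(s.weight for s in scores)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(s.score * s.weight for s in scores) / total_weight


# Made-up scores for a single paper.
paper = [
    CriterionScore("methodology", 4.0, 0.5),
    CriterionScore("novelty", 3.0, 0.3),
    CriterionScore("policy relevance", 5.0, 0.2),
]
overall = weighted_overall(paper)
print(f"overall score: {overall:.2f}")
```

Dividing by the total weight means the weights do not have to sum to one, so a criterion can be added or dropped without rebalancing the rest.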

The result is a consolidated spreadsheet in which every paper is scored and ranked against your criteria, along with detailed per-paper reviews backed by supporting quotes.

The system is domain-agnostic: criteria, prompts, and scoring are fully customisable, so it works across any field, from development economics to computer science to biomedical research. And because the evaluation criteria are entirely configurable, it can be applied beyond academic papers to any document that needs structured assessment, such as grant proposals, project reports, or policy briefs.

Designed for human oversight

This is an AI-assisted tool, and human review remains critical. While the AI produces structured reviews and scores, the final accept or reject decisions are always yours. Multi-model comparison surfaces disagreements between different LLMs so reviewers can focus attention where it matters most. The system does not replace peer review; it helps reviewers work more efficiently by providing a structured first pass and highlighting where models disagree.
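
One way to picture how multi-model comparison can direct a reviewer's attention is the sketch below. It flags every criterion on which the models' scores differ by more than a tolerance. The function name, the dictionary layout, and the tolerance of 1.0 are assumptions made for this example; the released system may detect disagreement by other means.

```python
def flag_disagreements(
    scores_by_model: dict[str, dict[str, float]],
    tolerance: float = 1.0,
) -> dict[str, float]:
    """Map each criterion to its score spread when that spread exceeds tolerance."""
    # Collect every criterion that at least one model scored.
    criteria: set[str] = set()
    for scores in scores_by_model.values():
        criteria.update(scores)

    flagged: dict[str, float] = {}
    for criterion in sorted(criteria):
        values = [s[criterion] for s in scores_by_model.values() if criterion in s]
        if len(values) < 2:
            continue  # a single score cannot disagree with anything
        spread = max(values) - min(values)
        if spread > tolerance:
            flagged[criterion] = spread
    return flagged


# Made-up scores from two placeholder models for the same paper.
reviews = {
    "model_a": {"methodology": 4.0, "novelty": 2.0},
    "model_b": {"methodology": 4.5, "novelty": 4.0},
}
flagged = flag_disagreements(reviews)
print(flagged)
```

In this example only "novelty" is flagged, so a reviewer (or a judge step) would look at that criterion first and could skim the ones on which the models agree.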

What’s new since January

Since the original article, we’ve added two major capabilities:

  • Web dashboard — a browser-based interface for running reviews, monitoring progress in real time, browsing results, and editing criteria, prompts, and configuration, all without touching the command line; a CLI remains available for batch processing and automation.
  • Literature grounding — the system can now optionally search Semantic Scholar, arXiv, World Bank, and other configurable sources for related work, verify novelty claims against existing research, and produce novelty-adjusted scores that position each paper within the research landscape.

Key features

  • Multi-model comparison — run the same papers through different LLMs (OpenAI, DeepSeek, Gemini, Anthropic, or your own endpoint) and have an AI Judge adjudicate where they disagree.
  • Process at scale — distribute papers across batch runs with automatic progress tracking and resumption after interruptions.
  • Token accounting — per-paper and per-run cost tracking across all providers, with custom model pricing for accurate budgeting.
  • Domain-agnostic — applicable beyond academic papers to grant proposals, policy briefs, project reports, or other documents requiring structured evaluation.
  • Custom criteria — define exactly what matters for your domain and review context (methodology, novelty, policy relevance, data quality, or anything else). Weights, scoring scales, and recommendation thresholds are all configurable.
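
To make the token accounting feature more concrete, the sketch below shows how per-paper and per-run costs could be derived from token counts and a custom price table. All names, the record layout, and the prices are placeholders invented for this example; they reflect neither the tool's actual schema nor any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Custom price entry for one model, in USD per million tokens."""
    input_per_million: float
    output_per_million: float


@dataclass(frozen=True)
class Usage:
    """Token counts for one model call (hypothetical record layout)."""
    paper_id: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(usage: Usage, pricing: dict[str, Pricing]) -> float:
    """Cost in USD of a single call, using that model's price entry."""
    price = pricing[usage.model]
    return (
        usage.input_tokens * price.input_per_million
        + usage.output_tokens * price.output_per_million
    ) / 1_000_000


def cost_per_paper(records: list[Usage], pricing: dict[str, Pricing]) -> dict[str, float]:
    """Sum call costs for each paper."""
    totals: dict[str, float] = {}
    for record in records:
        totals[record.paper_id] = totals.get(record.paper_id, 0.0) + call_cost(record, pricing)
    return totals


# Placeholder prices and usage figures, not real provider data.
pricing = {"example-model": Pricing(input_per_million=2.0, output_per_million=8.0)}
records = [
    Usage("paper-1", "example-model", 100_000, 10_000),
    Usage("paper-1", "example-model", 50_000, 5_000),
    Usage("paper-2", "example-model", 200_000, 20_000),
]
per_paper = cost_per_paper(records, pricing)
run_total = sum(per_paper.values())
for paper_id, cost in per_paper.items():
    print(f"{paper_id}: ${cost:.2f}")
print(f"run total: ${run_total:.2f}")
```

Keeping prices in a table keyed by model name is what lets a custom or self-hosted endpoint be budgeted the same way as a commercial provider.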

Get started

We welcome feedback, contributions, and testing across different domains and workflows.