1. What Is This System?

This system ingests large collections of academic papers (e.g., 500+), evaluates each according to your custom criteria, and produces a comprehensive score. It does so through a coordinated team of AI agents (Specialist, Editor, Judge, and an optional Librarian) that work together to ensure thorough, consistent, and rule-driven assessments.

Agent Roles

Instead of you manually skimming 500 abstracts, this tool hands each paper to a small team of specialized AI agents; their roles are described in the next section.

2. How It Works: Your AI Review Team

The system's architecture is best understood as a three-person team: a **Specialist**, an **Editor**, and a **Senior Judge**.

1. Ingestion

Your .pdf and .md files are read and converted to plain text.

2. Agent 1: The Specialist (Extractor)

If you have 8 criteria, this agent reads the paper 8 separate times, focusing on just one criterion each time.

3. Agent 2: The Editor (Synthesizer)

Gathers all 8 specialist reports and writes one final, polished review with an overall score.

4. Output

You get a detailed .md review and a row in the master .csv spreadsheet.

5. Discrepancy Check

A separate script compares the .csv files from two different runs (e.g., GPT-4o vs. GPT-5).

6. Agent 3: The Judge

A new AI "Judge" reads the paper and the two conflicting reviews, then issues a final, tie-breaking verdict.

Why this way?

This multi-agent workflow keeps the AI engaged, structured, and accountable. Instead of producing a single, vague response, each agent has a distinct and focused role:

  • The Specialist gathers concrete evidence and details.
  • The Editor synthesizes those findings into a coherent, well-written review.
  • The Judge performs final quality control, ensuring accuracy, balance, and completeness.

By separating responsibilities, the system encourages precision over generalization, reducing the risk of "lazy" or superficial outputs and ensuring each stage adds measurable value.

NEW: Literature-Grounded Mode

When literature grounding is enabled (via --literature-grounding flag), the system adds specialized agents for literature analysis.

See Section 4b: Literature-Grounded Reviews for detailed information.

Complete Pipeline Flow (Literature-Enhanced)

The run_review_with_dir_literature.py script supports a complete pipeline with optional literature stages:

run_review_with_dir_literature.py
✓ 1. Ingestion
⚡ 2. Literature Grounding (if --literature-grounding enabled)
   ├─ Librarian → create_baseline_reference()
   │  └─ Searches multiple sources (Semantic Scholar + optional World Bank)
   ├─ Reader (during extraction) → adds novelty rankings
   │  └─ Compares paper against baseline literature
   └─ Fact-Checker (after extraction) → run_fact_checks()
      └─ Verifies suspicious claims (e.g., "first study")
◆ 3. Agent 1: Extractor (standard or literature-enhanced)
   ├─ Reads paper N times (once per criterion)
   ├─ Standard: process_paper_extractions()
   └─ Literature: process_paper_extractions_literature()
● 4. Agent 2: Synthesizer (standard or literature-enhanced)
   ├─ Standard: synthesize_review() → Review
   └─ Literature: synthesize_grounded_review() → GroundedReview
      ├─ Research Trajectory analysis
      ├─ Novelty-adjusted scoring
      └─ Fact-check integration
★ 5. Output (with literature context if enabled)
   ├─ Individual reviews: *_review.md
   ├─ Consolidated CSV: report_consolidated_*.csv
   └─ Literature sections (if enabled):
      ├─ Research Trajectory
      ├─ Novelty Rankings (1-5)
      ├─ Fact-Check Results
      └─ Novelty-Adjusted Score

Key insight: The literature stages (stage 2 above) are completely optional. When disabled, the pipeline runs as a standard 3-stage review (Ingestion → Extractor → Synthesizer → Output).
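
For readers who prefer code to diagrams, here is a minimal sketch of the dual-mode dispatch, using the stage function names from the diagram above. The wiring and signatures are illustrative assumptions, not the project's actual call graph; the stage callables are injected so the sketch stays self-contained.

from typing import Any, Callable, Dict, Sequence

def review_paper(
    paper_text: str,
    criteria: Sequence[dict],
    literature_grounding: bool,
    stages: Dict[str, Callable[..., Any]],
) -> Any:
    """Dispatch one paper through the standard or literature-grounded path.

    `stages` maps names to the pipeline functions shown in the diagram
    (e.g. "extract" -> process_paper_extractions). Signatures are assumed.
    """
    if not literature_grounding:
        extractions = stages["extract"](paper_text, criteria)            # one pass per criterion
        return stages["synthesize"](extractions)                          # -> Review

    baseline = stages["librarian"](paper_text)                            # create_baseline_reference()
    extractions = stages["extract_lit"](paper_text, criteria, baseline)   # process_paper_extractions_literature()
    fact_checks = stages["fact_check"](paper_text, extractions)           # run_fact_checks()
    return stages["synthesize_lit"](extractions, baseline, fact_checks)   # synthesize_grounded_review() -> GroundedReview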

3. Getting Started (Installation)

You only need to do this once.

  1. Install Dependencies: Open your terminal, navigate to the project folder, and run:
    pip install -r requirements.txt
  2. Set Up Your Keys: Create a .env file at the project root (same level as run_review_with_dir.py). This is your private "password" file. Open it and paste in your API keys from AI providers.
    # This tells the system your "password" for OpenAI
    OPENAI_API_KEY=sk-...
    
    # This tells the system your "password" for your custom model
    CUSTOM_OPENAI_API_KEY=your-custom-key-here
    
    # This tells the system the *address* of your custom model
    CUSTOM_OPENAI_API_BASE=http://your-server-address:port/v1
    

4. How to Run Your First Review

This is your main workflow. Just follow these steps.

  1. Set Up a Run Directory: Create a dedicated directory for your review run:
    python setup_run.py --run-dir my_review_run
  2. Add Your Papers: Drop all your .pdf, .md, etc., files into the my_review_run/papers/ folder.
  3. Choose Your AI: The system now allows you to use different AI models for different tasks. For example, to use DeepSeek for extraction and OpenAI for synthesis:
    python run_with_custom_params.py \
      --run-dir my_review_run \
      --provider-extraction deepseek \
      --extractor-model deepseek-reasoner \
      --provider-synthesis openai \
      --synthesizer-model gpt-4o-mini
    
  4. Customize Your Criteria: Open my_review_run/input/criteria.yaml and tell the AI what to look for. (See the next section for a full guide!)
  5. (Optional) Choose Your Judge: The system's "AI Judge" (see Section 7) defaults to using gpt-4o. You can override this by adding these lines to your my_review_run/input/.env file:
    JUDGE_PROVIDER=gemini
    JUDGE_MODEL=gemini-2.5-pro
    
  6. Run the System: The command in step 3 will automatically start the review process.
  7. Check Your Results: Look in the my_review_run/outputs/reviews/ and my_review_run/outputs/reports/ folders to see the final reviews!
Recommended LLM Configuration:

Based on our analysis, we recommend using different models for different tasks:

  • Agent 1 (Extraction): Use a faster, more cost-effective model like gpt-4o-mini, deepseek-chat, or claude-haiku-4-5. This task is less complex and runs multiple times per paper.
  • Agent 2 (Synthesis): Use your most capable model like gpt-5, deepseek-reasoner, or claude-sonnet-4-5. This task requires higher-level reasoning and runs only once per paper.

4b. Literature-Grounded Reviews (New Feature!)

The system now supports literature-grounded reviews that automatically search for and analyze related papers to ground the assessment in existing research. This feature helps verify novelty claims, identify missing citations, and position the paper within the research landscape.

Pipeline Architecture: Dual-Mode Operation

The literature-enhanced pipeline (run_review_with_dir_literature.py) supports two modes controlled by the --literature-grounding flag:

Standard Mode (default)

When --literature-grounding is NOT set:

run_review_with_dir_literature.py
├─ Stage 1: Reader → process_paper_extractions()
└─ Stage 2: Synthesizer → synthesize_review()
   └─ Output: Review

Literature Mode (enabled)

When --literature-grounding IS set:

run_review_with_dir_literature.py
├─ Stage 1: Librarian → create_baseline_reference()
├─ Stage 2: Reader → process_paper_extractions_literature()
├─ Stage 3: Fact-Checker → run_fact_checks()
└─ Stage 4: Critic → synthesize_grounded_review()
   └─ Output: GroundedReview

What Is Literature Grounding?

When enabled, literature grounding adds a 4-stage process that goes beyond standard review:

Stage 1: Librarian

Searches multiple literature sources (Semantic Scholar + optional World Bank) for relevant papers in the sub-topic, ranks them by citation count, and extracts key findings from the top 5 (configurable in config/literature_sources.yaml) to create a baseline reference.
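
As a rough illustration of the Librarian's selection step, the sketch below ranks candidate papers by citation count and keeps the top N for the baseline. It assumes each search hit is a simple dict with a "citations" field; the real create_baseline_reference() logic may differ.

def select_baseline(candidates: list[dict], baseline_papers_count: int = 5) -> list[dict]:
    """Keep the most-cited candidates as the baseline reference (illustrative only)."""
    ranked = sorted(candidates, key=lambda p: p.get("citations") or 0, reverse=True)
    return ranked[:baseline_papers_count]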

Stage 2: Reader

Extracts evidence from the target paper AND ranks novelty (1-5) against the baseline literature.

Stage 3: Fact-Checker

Verifies suspicious claims (e.g., "first study") through targeted searches.

Stage 4: Critic

Synthesizes review with research trajectory and novelty-adjusted scoring.

Key Benefits

  • Verifies novelty claims against the existing literature
  • Identifies missing citations and related work
  • Positions the paper within the broader research landscape

How to Enable Literature Grounding

Literature grounding is completely optional and can be enabled in three ways:

Method 1: Setup Script

Use the dedicated setup script for literature-grounded runs:

python setup_run_literature.py \
  --run-dir my_literature_review

This automatically copies literature_sources.yaml and sets up the environment.

Method 2: Configuration File

Edit config/literature_sources.yaml:

# Set to false to disable
enabled: true

Method 3: Runtime Flag

Use the --literature-grounding flag when running:

# Enable literature grounding
python run_review_with_dir_literature.py \
  --run-dir my_review_run \
  --literature-grounding

# Standard mode (no literature)
python run_review_with_dir_literature.py \
  --run-dir my_review_run

Literature Sources Configuration

The system supports multiple literature sources that can be enabled/disabled in config/literature_sources.yaml:

Source             Enabled By Default   Citation Data   Specialization
Semantic Scholar   ✅ Yes               ✅ Yes          Broad academic coverage (all fields)
Arxiv              ✅ Yes (new!)        ❌ No           Preprints, CS/physics/math, rate-limit resistant
World Bank         ❌ No (opt-in)       ❌ No           Development economics, policy reports

Enabling/Disabling Sources

Edit config/literature_sources.yaml to control which sources are active:

sources:
  semantic_scholar:
    enabled: true   # Primary source, has citations
  arxiv:
    enabled: true    # Good fallback for Semantic Scholar rate limits
  world_bank:
    enabled: false  # Set to true to enable

Semantic Scholar API Setup

Semantic Scholar offers a free-tier API (100 requests/minute). For higher limits, get an API key:

  1. Visit https://www.semanticscholar.org/product/api#api-key
  2. Sign up for a free API key
  3. Add to your global .env file:
    SEMANTIC_SCHOLAR_API_KEY=your_key_here

Arxiv API

Arxiv is a free, open-access preprint archive that serves as an excellent fallback when Semantic Scholar rate limits are reached.

Note: Arxiv papers lack citation counts and will rank lower than Semantic Scholar papers in results. Use as a fallback when Semantic Scholar is rate-limited.

World Bank API

The World Bank API is free and requires no authentication. It adds coverage of development economics research and policy reports.

Note: World Bank papers lack citation counts and will rank lower than Semantic Scholar papers in results.

Source Priority: When multiple sources are enabled, the system uses a quota-based allocation with spill-over to ensure fair representation. Papers with citations rank above those without. Each source gets an equal quota (baseline_papers_count ÷ number of sources), with unused slots spilling over to other sources.
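
A hypothetical sketch of this allocation, assuming each source returns a citation-sorted list of paper dicts; the project's actual implementation may differ in detail:

def allocate_baseline(results_by_source: dict[str, list], baseline_papers_count: int = 5) -> list:
    """Give each source an equal quota, then fill leftover slots by citation count."""
    if not results_by_source:
        return []
    quota = max(1, baseline_papers_count // len(results_by_source))
    selected, surplus = [], []

    # First pass: each source contributes up to its quota.
    for hits in results_by_source.values():
        selected.extend(hits[:quota])
        surplus.extend(hits[quota:])

    # Spill-over: unused slots go to the remaining candidates,
    # preferring papers that actually have citation counts.
    surplus.sort(key=lambda p: p.get("citations") or 0, reverse=True)
    remaining = baseline_papers_count - len(selected)
    if remaining > 0:
        selected.extend(surplus[:remaining])
    return selected[:baseline_papers_count]
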
Important: Literature Quality Depends on Search Results

The literature grounding feature is optional and its quality depends on the literature search results:

  • Search Quality: The Librarian agent searches for relevant papers, but quality varies by field. Well-established fields (e.g., development economics) yield better results than niche or emerging topics.
  • No Results Found: If the Librarian cannot find relevant papers, the baseline reference will be empty or low-quality, reducing the value of novelty assessment and research trajectory analysis.
  • Manual Baseline: For critical reviews, consider manually providing baseline papers in the literature/ directory to ensure high-quality comparison.
  • Verification Limits: Fact-checking only verifies specific suspicious claims (e.g., "first study"). It does not comprehensively verify all claims in the paper.

Recommendation: Test literature grounding with a small sample first to assess search quality for your field before running on large batches.

Literature-Grounded Output

When enabled, reviews include additional sections: a Research Trajectory analysis, Novelty Rankings (1-5), Fact-Check Results, and a Novelty-Adjusted Score.

Configuration Options

The config/literature_sources.yaml file controls all aspects:

Setting Default Description
librarian.baseline_papers_count 5 Number of papers to fetch for baseline
librarian.recency_years 5 How many years back to search
fact_checker.max_verifications_total 10 Max verification searches per review
fact_checker.triggers.*.enabled true Enable/disable specific triggers
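
If you want to inspect these settings programmatically, here is a small hedged example; the key names come from the table above, but the project may load and nest the file differently.

import yaml

with open("config/literature_sources.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["librarian"]["baseline_papers_count"])        # default: 5
print(cfg["librarian"]["recency_years"])                # default: 5
print(cfg["fact_checker"]["max_verifications_total"])   # default: 10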

Batch Processing with Literature Grounding

For batch runs with literature grounding:

python setup_batch_runs_literature.py \
  --master-papers-dir papers_master \
  --base-run-dir lit_run \
  --num-runs 5 \
  --create-batch-script

Recommendation:

Use literature grounding for final reviews or important decisions. For initial screening of large batches, consider running without literature grounding first to identify the most promising papers, then re-run with literature grounding on the top candidates.

Important: Literature grounding adds significant processing time (10-20 minutes per paper for literature search). Always test with a small batch first to ensure the configuration works as expected.

4c. Standalone Literature Review Pipeline

The standalone literature review pipeline is a dedicated workflow for papers that require comprehensive literature analysis. Unlike the literature-grounded enhancement (Section 4b), this pipeline uses only the literature-grounded agents and does not include a fallback to standard review.

Pipeline Architecture

The standalone pipeline (run_review_literature.py) implements a pure 4-stage literature-grounded workflow:

run_review_literature.py

Standalone Pipeline

├─ Stage 1: Librarian
│  └─ create_baseline_reference()
├─ Stage 2: Reader
│  └─ process_paper_extractions_literature()
├─ Stage 3: Fact-Checker
│  └─ run_fact_checks()
└─ Stage 4: Critic
   └─ synthesize_grounded_review()

Output

GroundedReview

  • Research Trajectory
  • Novelty Rankings (1-5)
  • Fact-Check Results
  • Novelty-Adjusted Score

Key Differences from Literature-Grounded Enhancement

Feature Literature Enhancement (4b) Standalone Pipeline (4c)
Configuration File Uses criteria.yaml for evaluation criteria Uses literature_sources.yaml for literature configuration
Pipeline Optional: Can disable and fall back to standard review Always runs 4-stage literature pipeline
Setup Script setup_run_literature.py setup_literature_review.py
Runner Script run_review_with_dir_literature.py --literature-grounding run_review_literature.py
Output Type Review (with optional literature sections) GroundedReview (always includes literature analysis)

When to Use the Standalone Pipeline

The standalone pipeline is designed for workflows where literature analysis is essential: systematic reviews, research trajectory analysis, and novelty-focused assessments (see the comparison at the end of this section).

Getting Started with Standalone Literature Review

Step 1: Set Up the Run Directory

Use the dedicated setup script:

python setup_literature_review.py \
  --run-dir my_literature_review

This creates the run directory structure, including a default literature_sources.yaml template for you to customize in the next step.

Step 2: Configure Literature Sources

Edit literature_sources.yaml to define your baseline literature:

# Literature sources (baseline papers)
sources:
  - id: "baseline_001"
    title: "Foundational Paper in Your Field"
    authors: ["Author One", "Author Two"]
    year: 2024
    venue: "Top Conference"
    topics: ["topic1", "topic2"]
    key_contributions:
      - "Established the methodology"
      - "Introduced key framework"

# Research trajectory definition
research_trajectory:
  starting_point: "earliest_work"
  progression:
    - stage: "initial_concepts"
      description: "Early foundational work"
    - stage: "current_state"
      description: "State-of-the-art approaches"
    - stage: "open_challenges"
      description: "Current limitations"

Step 3: Configure LLM Parameters

The standalone pipeline uses stage-specific LLM settings in .env:

# Stage 1: Librarian (Baseline Reference Creation)
PROVIDER_LIBRARIAN=openai
LIBRARIAN_MODEL=gpt-4o
LIBRARIAN_TEMPERATURE=0.2

# Stage 2: Reader (Novelty Ranking & Extraction)
PROVIDER_READER=openai
READER_MODEL=gpt-4o-mini
READER_TEMPERATURE=0.3

# Stage 3: Fact-Checker (Claim Verification)
PROVIDER_FACT_CHECKER=openai
FACT_CHECKER_MODEL=gpt-4o
FACT_CHECKER_TEMPERATURE=0.1

# Stage 4: Critic (Grounded Synthesis)
PROVIDER_CRITIC=deepseek
CRITIC_MODEL=deepseek-reasoner
CRITIC_TEMPERATURE=0.2
Recommended Configuration:
  • Librarian: Use a capable model like gpt-4o for accurate paper retrieval and summarization
  • Reader: Use a faster model like gpt-4o-mini for extraction tasks
  • Fact-Checker: Use a reliable model like gpt-4o for verification
  • Critic: Use your best reasoning model like deepseek-reasoner or claude-sonnet-4-5 for synthesis

Step 4: Run the Literature Review

Execute the standalone pipeline:

python run_review_literature.py --run-dir my_literature_review

Visual Progress Indicators

The pipeline provides clear visual feedback for each stage:

LITERATURE-GROUNDED REVIEW [1/5]: paper.pdf
============================================================

[Stage 1/4] Librarian: Creating baseline reference...
[Librarian] ✓ Created baseline with 5 papers

[Stage 2/4] Reader: Extracting evidence with novelty ranking...
[Reader] ✓ Completed 8 criterion extractions

[Stage 3/4] Fact-Checker: Running verification checks...
[Fact-Checker] ✓ Completed 3 verification checks
[Fact-Checker] ⚠️  2 claims require further review

[Stage 4/4] Critic: Synthesizing grounded review...
[Critic] ✓ Review synthesized (score: 65.2)

[Summary]
  Overall Score: 65.2
  Recommendation: REVISE AND RESUBMIT
  Avg Novelty: 3.45/5
  Novelty Adjustment: +2.5

Batch Processing with Standalone Pipeline

For processing multiple directories with the standalone pipeline:

python setup_batch_literature_review.py \
  --master-papers-dir papers_to_review \
  --master-literature-dir baseline_literature \
  --base-run-dir literature_run \
  --num-runs 3 \
  --create-batch-script

This creates the batch run directories, with papers and baseline literature distributed across them, plus a batch script to run them all.

Run the batch:

# Sequential execution
python run_batch_literature_review.py

# Parallel execution
python run_batch_literature_review.py --parallel --max-workers 4

Output Differences

Standalone literature reviews always include the literature-grounded sections: Research Trajectory, Novelty Rankings (1-5), Fact-Check Results, and a Novelty-Adjusted Score.

Important: The standalone pipeline requires literature_sources.yaml. If this file is missing, the system will create a default template that you must customize before running.

Comparison: Which Should You Use?

Literature Enhancement (4b)

Use when:

  • You want optional literature analysis
  • You have existing criteria.yaml files
  • You need flexibility to enable/disable
  • Standard review is acceptable as fallback

Standalone Pipeline (4c)

Use when:

  • Literature analysis is essential
  • You're doing systematic reviews
  • You need research trajectory analysis
  • You want dedicated 4-stage pipeline

5a. The Most Important Part: Customizing Your Criteria

This is the system's most powerful feature. You do **not** need to touch any Python code to completely change the review. The *only* file you need to edit is my_review_run/input/criteria.yaml.

The One Big Rule: Weights Must = 100

The system uses a 100-point scale. The weight: value you give to each criterion tells the system how much it "matters." Before you run the script, you **must** ensure all weight values in your file add up to exactly 100.
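
A quick, optional sanity check you can run before starting a review; it assumes the criteria.yaml layout shown later in this section (a top-level criteria: list whose items carry a weight: field).

import yaml

with open("my_review_run/input/criteria.yaml") as f:
    criteria = yaml.safe_load(f)["criteria"]

total = sum(c["weight"] for c in criteria)
print(f"Total weight: {total}")
assert total == 100, f"Weights must sum to 100, got {total}"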

Anatomy of a Criterion

Let's break down one criterion block from the file:

  - id: empirical_rigor
    name: Empirical Rigor
    description: |
      Assesses the quality of the empirical methods, data,
      and execution. Look for research design, causal
      identification, and statistical analysis.
    weight: 20
    scale:
      type: numeric
      range: [1, 5]
      labels:
        1: "Fundamentally flawed"
        2: "Significant weaknesses"
        3: "Adequate"
        4: "Strong and robust"
        5: "Exceptional / state-of-the-art"

Example: How to Add a New Criterion

Let's add a new criterion for "Statistical Robustness" with a weight of 10.

Step 1: Add the new block

Copy-paste an existing block and edit it. Now your file looks like this:

criteria:
  - id: theoretical_contribution
    name: Theoretical Contribution
    weight: 15
    # ... (other fields) ...
    
  - id: empirical_rigor
    name: Empirical Rigor
    weight: 20
    # ... (other fields) ...
    
  # ... (all your other criteria) ...
    
  # --- OUR NEW CRITERION ---
  - id: statistical_robustness
    name: Statistical Robustness
    description: |
      Evaluates the quality of statistical tests, power
      analysis, and sensitivity/robustness checks.
    weight: 10
    scale:
      type: numeric
      range: [1, 5]
      labels:
        1: "Statistically flawed"
        2: "Weak / Inappropriate"
        3: "Adequate"
        4: "Robust"
        5: "Exceptional"

Step 2: Adjust the weights to sum to 100

Our total weight is now 110 (the original 100 + our new 10). We must remove 10 points from other criteria.

Let's reduce theoretical_contribution from 15 to 10, and empirical_rigor from 20 to 15.
(15 + 20) = 35. (10 + 15) = 25. We've removed 10 points. Our total is 100 again.

Your file is now valid and ready to run. The system will automatically add a new "Statistical Robustness" section to all reviews and a new statistical_robustness_score column to your CSV.
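
Purely for intuition, here is one plausible way weighted 1-5 criterion scores could roll up into a 0-100 overall score. This is an assumption for illustration only, not necessarily the exact formula the synthesizer uses.

def weighted_overall(scores: dict[str, int], weights: dict[str, int]) -> float:
    """scores: criterion id -> 1-5 rating; weights: criterion id -> weight (summing to 100)."""
    assert sum(weights.values()) == 100
    # Scale each 1-5 rating to a 0-1 fraction, then apply its weight.
    return sum(weights[c] * (scores[c] - 1) / 4 for c in weights)

# Example: a paper scoring 4/5 on every criterion lands at 75/100 under this mapping.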

How to Remove a Criterion

  1. Delete the entire block (from - id: ... to the last label: ...).
  2. Adjust the remaining weights to add up to 100.

Recommendation Rules Based on Final Score

The system determines the recommendation from the final 0-100 score using the following default thresholds, which can be changed in my_review_run/input/criteria.yaml: 85+ Accept; 70-84 Accept with Revisions; 50-69 Revise and Resubmit; below 50 Reject.

Specialization Domain

The domain section in criteria.yaml is what "specializes" the agents. You can reference it in my_review_run/input/prompts/extractor_system.txt, for example: "You are an expert reviewer in the field of {domain}." or "Analyze the following paper based on criteria for {domain}." By changing this one line in criteria.yaml, you can pivot the entire review system to a new field (e.g., "machine_learning" or "clinical_psychology"), and the extractor agents will adjust their analysis accordingly.
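
Under the hood this is ordinary placeholder substitution. A minimal sketch (the template text and domain value are examples; the project's own prompt-loading code may differ):

domain = "development economics"   # value taken from the domain section of criteria.yaml

template = "You are an expert reviewer in the field of {domain}."
system_prompt = template.format(domain=domain)
print(system_prompt)   # -> You are an expert reviewer in the field of development economics.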

5b. Controlling Agent Behavior with Prompts

The "brain" of your AI agents is controlled by a set of simple text files located in your run directory. By editing these "prompts," you can change the agents' personalities, their analytical focus, and the style of their writing.

The "System" vs. "User" Prompt: A Key Concept

You'll notice that both the Extractor and Synthesizer agents have two prompt files. This is the most important concept to understand:

1. The _system.txt Prompt (The "Role")

This file defines the agent's personality, role, and general instructions. It's the "job description" you give the AI.

  • Example: "You are a senior academic editor..." or "You are an expert academic reviewer...".
  • Analogy: You are telling an actor who they are (e.g., "You are a skeptical detective").

2. The _user.txt Prompt (The "Task")

This file provides the specific task and data for the agent to work on. It's the "assignment" you give the AI for this specific job.

  • Example: "Here is the paper. Fill out the following JSON based on the criteria...".
  • Analogy: You are telling the actor what to do (e.g., "Analyze this evidence and report your findings").

File-by-File Breakdown

File Name Agent What It Controls
extractor_system.txt Extractor
  • Role: "Expert academic reviewer".
  • Context: Specialized in the {domain} (e.g., "development economics").
  • Output Format: Must be "objective" and "thorough".
extractor_user.txt Extractor
  • Input: The paper's content ({paper_markdown}) and the specific criterion ({criterion_name}).
  • Task: Read the paper and provide a score, justification, supporting quotes, strengths, and weaknesses.
  • Output Schema: Defines the JSON keys the script expects (score, score_justification, evidence, etc.).
synthesizer_system.txt Synthesizer
  • Role: "Senior academic editor"
  • Tone: Must be "constructive"
  • Task: Synthesize assessments from "junior reviewers" (the Extractor)
synthesizer_user.txt Synthesizer
  • Input: All extractions ({json_dump_of_extractions}), the score ({calculated_score}), and the recommendation ({calculated_recommendation})
  • Task: Write the final, human-readable review
  • Output Schema: Defines the final review structure (executive_summary, detailed_assessment, recommendation, etc.)

Practical Examples: What Can I Change?

⚠️ Important Warning: Be Careful with JSON Structure

The Python scripts (agent_extractor.py and agent_synthesizer.py) expect the AI to output a valid JSON object that matches the exact schema defined in the _user.txt files.

You can safely edit the parts of the prompts that control style, tone, or perspective (like the _system.txt files or the prose descriptions in the _user.txt files).

However, if you change the actual JSON keys (e.g., renaming "executive_summary" to "summary") in the prompt *without* also updating the Pydantic models in core/data_models.py, the program will crash when it tries to parse the AI's response.
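
To see why, here is a deliberately simplified, hypothetical version of what such a Pydantic model might look like; the real models in core/data_models.py are more detailed, but the coupling is the same: the JSON keys the prompt requests must match the model's field names exactly.

from pydantic import BaseModel

class Review(BaseModel):          # simplified; not the real model
    executive_summary: str
    detailed_assessment: str
    recommendation: str

# If the prompt asks the AI for "summary" instead of "executive_summary",
# validating the response against Review (e.g. Review.model_validate_json(...)
# in Pydantic v2) raises a validation error and the run fails.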

6. Understanding Your Results

After a run, you will find your results in the my_review_run/outputs/ folder.

1. The Detailed Reviews: my_review_run/outputs/reviews/

This folder contains the full, human-readable Markdown (.md) file for every paper. The filename tells you exactly how it was generated:

[PaperName]_[ExtractorModel]_[SynthesizerModel]_[Timestamp].md

2. The Master Spreadsheet: my_review_run/outputs/reports/

This folder contains your report_consolidated_...csv file. This is the "big picture" view of your entire run. It's perfect for opening in Excel or Google Sheets to sort and filter.

We've designed this report for usability, so you can sort and filter the entire run at a glance.

7. Advanced: Comparing Models with an AI Judge

Want to know if gpt-5 is a better reviewer than gemini-2.5-pro? This system is built for that. You can not only compare their results but also have a **third AI model act as an "AI Judge"** to review any disagreements.

Remember: Because we cache your papers (in my_review_run/ingestion_cache.json), your second, third, and fourth runs will be *much faster* than the first!

The Complete 3-Step Workflow

Comparing models and resolving disagreements is a manual 3-step process that gives you full control over each stage:


Step 1: Run Multiple Reviews

First, run the review system with different LLM configurations to generate multiple reports. For example:

Run 1: GPT-4o Configuration

python run_with_custom_params.py \
  --run-dir my_review_run \
  --provider-extraction openai \
  --extractor-model gpt-4o-mini \
  --provider-synthesis openai \
  --synthesizer-model gpt-4o-mini

Run 2: DeepSeek Configuration

python run_with_custom_params.py \
  --run-dir my_review_run \
  --provider-extraction deepseek \
  --extractor-model deepseek-reasoner \
  --provider-synthesis deepseek \
  --synthesizer-model deepseek-chat

After these runs, you'll have multiple report_consolidated_...csv files in your my_review_run/outputs/reports/ directory.


Step 2: Compare Reports to Find Conflicts

Now, run the comparison script to identify papers where the different models disagreed:

python compare_reports.py --run-dir my_review_run

This script will:

  • Read all the consolidated report files
  • Identify papers with conflicting recommendations
  • Create a new HUMAN_REVIEW_discrepancies_...csv file listing all conflicts
Important: Review this discrepancy file before proceeding to the judge step. This gives you a chance to see how many conflicts there are and decide if you want to proceed with the (potentially expensive) judge step.

Step 3: Run the AI Judge to Resolve Conflicts

Finally, run the judge script to have an AI review the conflicts and make a final decision:

python judge_conflicts.py --run-dir my_review_run

The judge will:

  • Read each paper that had conflicting reviews from the most recent discrepancies identified in Step 2
  • Read both conflicting reviews
  • Make a final, authoritative decision
  • Create a JUDGE_VERDICTS_report_...csv file with the final verdicts
Read the full logic of adjudication here

Understanding Your Final Results

The JUDGE_VERDICTS_report_...csv file is your final review list. It contains all the papers the models disagreed on, with key columns for the Judge's recommendation, the winning review, and the Judge's rationale.

This saves you from manually reading all the conflicts. You can now simply read the Judge's rationale and finalize the decision, turning hours of re-review into minutes.

πŸ“‹ The Judge Role: Deep Dive

The Judge component is a sophisticated arbitration system that resolves disagreements between multiple AI models when they produce conflicting reviews of the same paper. Understanding how it works will help you trust its verdicts and use it effectively.

Core Responsibilities

The Judge serves three primary functions in the review system: it reads each disputed paper together with the conflicting reviews, it decides which review assessed the paper more accurately, and it issues a final recommendation with a written rationale.

The "Champion vs Champion" Strategy

Instead of adjudicating between all possible pairs of reviews, the Judge uses an efficient "champion selection" approach:

Step 1: Faction Grouping

When multiple models review a paper, the Judge groups reviews by their recommendation type:

  • Accept Faction: All reviews recommending "Accept" or "Accept with Revisions"
  • Reject Faction: All reviews recommending "Reject" or "Revise and Resubmit"

Step 2: Champion Selection

From each faction, the Judge selects the highest-scoring review as that faction's "champion":

  • The champion is the review with the highest weighted score across all criteria
  • This ensures only the best arguments from each side are presented to the Judge
  • Dramatically reduces computational cost compared to pairwise comparison
Example: In a 3-vs-2 conflict where three models say "Accept" and two say "Reject", the Judge only reviews the best-scoring "Accept" review against the best-scoring "Reject" review (one champion per side), not all 10 possible pairs!
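
In code, champion selection amounts to grouping by recommendation and taking the maximum score on each side. A sketch, assuming each review is a dict with "recommendation" and "score" keys (the judge script's real data structures may differ):

ACCEPT_SIDE = {"Accept", "Accept with Revisions"}

def pick_champions(reviews: list[dict]) -> tuple[dict | None, dict | None]:
    """Return the highest-scoring review from the accept side and the reject side."""
    accept = [r for r in reviews if r["recommendation"] in ACCEPT_SIDE]
    reject = [r for r in reviews if r["recommendation"] not in ACCEPT_SIDE]

    def best(side: list[dict]) -> dict | None:
        return max(side, key=lambda r: r["score"]) if side else None

    return best(accept), best(reject)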

Step 3: LLM-as-Judge Adjudication

An independent AI model (default: GPT-4o) acts as the Judge and receives:

  • The Original Paper: Full text content (truncated to 400,000 characters if needed)
  • Both Champion Reviews: Complete reviews with scores, justifications, and model identities
  • Impartiality Instructions: System prompt emphasizing objectivity and expertise

The Judge evaluates both reviews and produces a structured verdict:

{
  "judge_recommendation": "Accept w/ Revisions",
  "judge_rationale": "Review A more accurately captures...",
  "winning_review": "A",
  "winning_review_rationale": "Review A demonstrates deeper analysis..."
}

Scoring System & Weighted Rules

The Judge does not assign scores; that's done by the Extractor and Editor agents. Instead, the Judge uses the existing weighted scoring system to evaluate champion quality:

Eight Weighted Criteria

Each review is evaluated on eight dimensions with specific weights:

Criterion                  Weight
Empirical Rigor            18
Identification Strategy    15
Statistical Robustness     15
Theoretical Contribution   12
Policy Relevance           12
Originality/Novelty        10
Data Quality               10
Presentation/Clarity       8

Recommendation Thresholds

Final scores (0-100 scale) map to recommendations:

  • 85+: Accept
  • 70-84: Accept with Revisions
  • 50-69: Revise and Resubmit
  • <50: Reject

Configurable in criteria.yaml
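
The mapping is simple enough to express directly (defaults shown; the actual thresholds come from criteria.yaml):

def recommendation_for(score: float) -> str:
    """Map a 0-100 score to a recommendation using the default thresholds above."""
    if score >= 85:
        return "Accept"
    if score >= 70:
        return "Accept with Revisions"
    if score >= 50:
        return "Revise and Resubmit"
    return "Reject"

print(recommendation_for(65.2))   # -> Revise and Resubmit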

Configuring the Judge

You can customize which AI model acts as Judge by setting environment variables in your run directory's .env file:

# Default configuration (can be omitted)
JUDGE_PROVIDER=openai
JUDGE_MODEL=gpt-4o

# Alternative: Use Gemini as Judge
JUDGE_PROVIDER=gemini
JUDGE_MODEL=gemini-2.5-pro

# Alternative: Use Claude as Judge
JUDGE_PROVIDER=anthropic
JUDGE_MODEL=claude-sonnet-4-5
Recommendation:

Use a highly capable model for the Judge role. The Judge's decisions are final and should represent the most careful, impartial analysis possible. GPT-4o, Claude Sonnet 4.5, or Gemini 2.5 Pro are excellent choices.

Progress Tracking & Cost Management

The Judge system includes sophisticated progress tracking to avoid redundant adjudication: verdicts are recorded in my_review_run/judge_progress.json, so interrupted runs resume where they left off and already-judged conflicts are not re-sent to the LLM.

Cost Warning: The Judge step can be expensive for large conflict sets. Always review the HUMAN_REVIEW_discrepancies_...csv file before running the Judge to see how many conflicts exist. Consider adjudicating only the most important conflicts if budget is a concern.

Understanding Judge Verdicts

The final JUDGE_VERDICTS_...csv file records, for each adjudicated conflict, the Judge's recommendation, the winning review, and the Judge's rationale.

The Judge's rationale is your most valuable asset: it provides a neutral, expert assessment of which review was more accurate and why. This can help you finalize decisions quickly and understand where each model's reviews fall short.

Key Strengths of the Judge System

  1. Efficiency: Champion selection reduces adjudication complexity from O(nΒ²) to O(n)
  2. Transparency: Detailed rationales explain every decision
  3. Robustness: Progress tracking prevents redundant work and enables resumption
  4. Configurability: Weights, thresholds, and Judge model are all customizable
  5. Cost-Effectiveness: Tracks API usage and avoids unnecessary adjudications

The Judge component transforms the multi-agent review system from a simple consensus tool into a sophisticated peer review mechanism, ensuring that final decisions are based on the best available analysis, not just the first available agreement.

8. Processing Large Batches with Run Directories

For large-scale processing, the system distributes papers across multiple run directories, enabling controlled resumption and improved load management to minimize the risk of exceeding LLM rate limits. This configuration facilitates experimentation with different settings, including review criteria, and supports parallel execution when rate limits are not a concern.

Setting Up Multiple Run Directories

If you have a master directory with hundreds of papers, you can distribute them across multiple run directories:

python setup_batch_runs.py \
  --master-papers-dir /path/to/master/papers \
  --base-run-dir run_dir \
  --num-runs 10 \
  --papers-per-run 50 \
  --create-batch-script

This will create 10 run directories (run_dir1, run_dir2, etc.) with 50 papers each, plus a batch script to run them all.

Running Multiple Directories

After setting up multiple directories, you can run them all at once:

# Sequential execution
python run_batch.py

# Parallel execution (up to 4 workers)
python run_batch.py --parallel

# Parallel execution with custom number of workers
python run_batch.py --parallel --max-workers 8

Resuming After Interruptions

The system automatically tracks progress and can resume from where it left off if interrupted. Each run directory maintains its own progress tracking, so you can stop and restart individual directories at any time without losing completed work.

Important: The system tracks progress based on the LLM configuration. If you change the LLM parameters, it will automatically reprocess all papers with the new configuration.

9. Troubleshooting & Common Questions

Q: I'm getting a `FAILED criterion` or `JSON error`!

A: This almost always means the AI's response was too long and got cut off. Your AI is being *too* detailed!
The Fix: Open core/llm_wrapper.py and find the max_completion_tokens or max_tokens settings. Increase them from 4096 to 8192 and try again.

Q: How do I change the "Judge" LLM?

A: The Judge is set by two environment variables in your my_review_run/input/.env file: JUDGE_PROVIDER and JUDGE_MODEL. If they aren't there, the script defaults to openai and gpt-4o. You can add these lines to change it to a powerful model, like Gemini Pro:

JUDGE_PROVIDER=gemini
JUDGE_MODEL=gemini-2.5-pro

Q: I'm getting a `BadRequestError: Unsupported parameter: 'max_tokens'`!

A: This is a known issue with some custom endpoints. The system is designed to "surgically bypass" this problem for the custom_openai/gpt-5 combination. If you see this error with a *different* model, it means a developer needs to add a new bypass rule to core/llm_wrapper.py.

Q: How much will this cost?

A: Be very careful with large batches. The cost is:
(Number of Papers) x (Number of Criteria) x (Cost of Extractor Model)
+
(Number of Papers) x (Cost of Synthesizer Model)
+
(Number of Conflicts) x (Cost of Judge Model)

A 500-paper batch with 8 criteria is **4,000** calls to the Extractor, plus 500 calls to the Synthesizer. Always test with a small batch of 5-10 papers first!
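
A back-of-the-envelope estimator for this formula (the per-call prices below are made-up placeholders; substitute your providers' actual rates):

def estimate_calls_and_cost(papers: int, criteria: int, conflicts: int,
                            extractor_price: float, synthesizer_price: float,
                            judge_price: float) -> tuple[int, float]:
    """Return (total LLM calls, rough cost) for one batch."""
    extractor_calls = papers * criteria
    synthesizer_calls = papers
    judge_calls = conflicts
    total_calls = extractor_calls + synthesizer_calls + judge_calls
    cost = (extractor_calls * extractor_price
            + synthesizer_calls * synthesizer_price
            + judge_calls * judge_price)
    return total_calls, cost

# Example: 500 papers, 8 criteria, 50 conflicts, with assumed per-call prices.
print(estimate_calls_and_cost(500, 8, 50, 0.01, 0.05, 0.05))   # -> (4550, 67.5)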

Q: How do I clear the cache and re-parse my papers?

A: Simply delete the my_review_run/ingestion_cache.json file. The system will fully re-parse all your papers from scratch on the next run.

Q: How do I clear the paper review cache?

A: Simply delete the my_review_run/progress.json file. The system will review all the papers from scratch on the next run.

Q: How do I clear the judge review cache?

A: Simply delete the my_review_run/judge_progress.json file. The system will adjudicate all the papers from scratch on the next run.

Q: Can I mix and match different LLM providers?

A: Yes! The system fully supports using different LLM providers for extraction and synthesis. For example:

python run_with_custom_params.py \
  --run-dir my_review_run \
  --provider-extraction deepseek \
  --extractor-model deepseek-reasoner \
  --provider-synthesis openai \
  --synthesizer-model gpt-4o-mini
This will use DeepSeek's reasoning model for extraction and OpenAI's GPT-4o-mini for synthesis.

Q: What if I have fewer papers than run directories?

A: The system will distribute papers as evenly as possible. For example, if you have 10 papers and 3 run directories, it will create directories with 4, 3, and 3 papers respectively.

Q: Do I need to run compare_reports.py before judge_conflicts.py?

A: Yes! The 3-step workflow is designed to be manual:

  1. Run reviews with different configurations
  2. Run compare_reports.py to find conflicts
  3. Run judge_conflicts.py to resolve conflicts
This gives you control over the process and allows you to review conflicts before deciding whether to run the (potentially expensive) judge step.