1. What Is This System?

This system ingests large collections of academic papers (e.g., 500+), evaluates each according to your custom criteria, and produces a comprehensive score. It does so through a coordinated team of AI agents (Specialist, Editor, Judge, and an optional Librarian) that work together to ensure thorough, consistent, and rule-driven assessments.

Agent Roles

Instead of you manually skimming 500 abstracts, this tool hands each paper to a small team of specialized AI agents; their roles are described in the next section.

2. How It Works: Your AI Review Team

The system's architecture is best understood as a three-person team: a **Specialist**, an **Editor**, and a **Senior Judge**.

1. Ingestion

Your .pdf and .md files are read and converted to plain text.

2. Agent 1: The Specialist (Extractor)

If you have 8 criteria, this agent reads the paper 8 separate times, focusing on just one criterion each time.

3. Agent 2: The Editor (Synthesizer)

Gathers all 8 specialist reports and writes one final, polished review with an overall score.

4. Output

You get a detailed .md review and a row in the master .csv spreadsheet.

5. Discrepancy Check

A separate script compares the .csv files from two different runs (e.g., GPT-4o vs. GPT-5).

6. Agent 3: The Judge

A new AI "Judge" reads the paper and the two conflicting reviews, then issues a final, tie-breaking verdict.

Why this way?

This multi-agent workflow keeps the AI engaged, structured, and accountable. Instead of producing a single, vague response, each agent has a distinct and focused role:

  • The Specialist gathers concrete evidence and details.
  • The Editor synthesizes those findings into a coherent, well-written review.
  • The Judge performs final quality control, ensuring accuracy, balance, and completeness.

By separating responsibilities, the system encourages precision over generalization, reducing the risk of "lazy" or superficial outputs and ensuring each stage adds measurable value.

NEW: Literature-Grounded Mode

When literature grounding is enabled (via --literature-grounding flag), the system adds specialized agents for literature analysis.

See Section 4b: Literature-Grounded Reviews for detailed information.

Complete Pipeline Flow (Literature-Enhanced)

The run_review_with_dir_literature.py script supports a complete pipeline with optional literature stages:

run_review_with_dir_literature.py
✓ 1. Ingestion
⚡ 2. Literature Grounding (if --literature-grounding enabled)
   ├─ Librarian → create_baseline_reference()
   │  └─ Searches multiple sources (Semantic Scholar + optional World Bank)
   ├─ Reader (during extraction) → adds novelty rankings
   │  └─ Compares paper against baseline literature
   └─ Fact-Checker (after extraction) → run_fact_checks()
      └─ Verifies suspicious claims (e.g., "first study")
◆ 3. Agent 1: Extractor (standard or literature-enhanced)
   ├─ Reads paper N times (once per criterion)
   ├─ Standard: process_paper_extractions()
   └─ Literature: process_paper_extractions_literature()
● 4. Agent 2: Synthesizer (standard or literature-enhanced)
   ├─ Standard: synthesize_review() → Review
   └─ Literature: synthesize_grounded_review() → GroundedReview
      ├─ Research Trajectory analysis
      ├─ Novelty-adjusted scoring
      └─ Fact-check integration
★ 5. Output (with literature context if enabled)
   ├─ Individual reviews: *_review.md
   ├─ Consolidated CSV: report_consolidated_*.csv
   └─ Literature sections (if enabled):
      ├─ Research Trajectory
      ├─ Novelty Rankings (1-5)
      ├─ Fact-Check Results
      └─ Novelty-Adjusted Score

Key insight: The literature stages (stage 2 above) are completely optional. When disabled, the pipeline runs as a standard 3-stage review (Ingestion → Extractor → Synthesizer → Output).
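
For readers who prefer code to diagrams, here is a minimal sketch of the dual-mode dispatch, using the stage function names from the diagram above. The wiring and signatures are illustrative assumptions, not the project's actual call graph; the stage callables are injected so the sketch stays self-contained.

from typing import Any, Callable, Dict, Sequence

def review_paper(
    paper_text: str,
    criteria: Sequence[dict],
    literature_grounding: bool,
    stages: Dict[str, Callable[..., Any]],
) -> Any:
    """Dispatch one paper through the standard or literature-grounded path.

    `stages` maps names to the pipeline functions shown in the diagram
    (e.g. "extract" -> process_paper_extractions). Signatures are assumed.
    """
    if not literature_grounding:
        extractions = stages["extract"](paper_text, criteria)            # one pass per criterion
        return stages["synthesize"](extractions)                          # -> Review

    baseline = stages["librarian"](paper_text)                            # create_baseline_reference()
    extractions = stages["extract_lit"](paper_text, criteria, baseline)   # process_paper_extractions_literature()
    fact_checks = stages["fact_check"](paper_text, extractions)           # run_fact_checks()
    return stages["synthesize_lit"](extractions, baseline, fact_checks)   # synthesize_grounded_review() -> GroundedReview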

3. Getting Started (Installation)

You only need to do this once.

  1. Install Dependencies: Open your terminal, navigate to the project folder, and run:
    pip install -r requirements.txt
  2. Set Up Your Keys: Create a .env file at the project root (same level as run_review_with_dir.py). This is your private "password" file. Open it and paste in your API keys from AI providers.
    # This tells the system your "password" for OpenAI
    OPENAI_API_KEY=sk-...
    
    # This tells the system your "password" for your custom model
    CUSTOM_OPENAI_API_KEY=your-custom-key-here
    
    # This tells the system the *address* of your custom model
    CUSTOM_OPENAI_API_BASE=http://your-server-address:port/v1
    

4. How to Run Your First Review

This is your main workflow. Just follow these steps.

  1. Set Up a Run Directory: Create a dedicated directory for your review run:
    python setup_run.py --run-dir my_review_run
  2. Add Your Papers: Drop all your .pdf, .md, etc., files into the my_review_run/papers/ folder.
  3. Choose Your AI: The system now allows you to use different AI models for different tasks. For example, to use DeepSeek for extraction and OpenAI for synthesis:
    python run_with_custom_params.py \
      --run-dir my_review_run \
      --provider-extraction deepseek \
      --extractor-model deepseek-reasoner \
      --provider-synthesis openai \
      --synthesizer-model gpt-4o-mini
    
  4. Customize Your Criteria: Open my_review_run/input/criteria.yaml and tell the AI what to look for. (See the next section for a full guide!)
  5. (Optional) Choose Your Judge: The system's "AI Judge" (see Section 7) defaults to using gpt-4o. You can override this by adding these lines to your my_review_run/input/.env file:
    JUDGE_PROVIDER=gemini
    JUDGE_MODEL=gemini-2.5-pro
    
  6. Run the System: The command in step 3 will automatically start the review process.
  7. Check Your Results: Look in the my_review_run/outputs/reviews/ and my_review_run/outputs/reports/ folders to see the final reviews!
Recommended LLM Configuration:

Based on our analysis, we recommend using different models for different tasks:

  • Agent 1 (Extraction): Use a faster, more cost-effective model like gpt-4o-mini, deepseek-chat, or claude-haiku-4-5. This task is less complex and runs multiple times per paper.
  • Agent 2 (Synthesis): Use your most capable model like gpt-5, deepseek-reasoner, or claude-sonnet-4-5. This task requires higher-level reasoning and runs only once per paper.

4b. Literature-Grounded Reviews (New Feature!)

The system now supports literature-grounded reviews that automatically search for and analyze related papers to ground the assessment in existing research. This feature helps verify novelty claims, identify missing citations, and position the paper within the research landscape.

Pipeline Architecture: Dual-Mode Operation

The literature-enhanced pipeline (run_review_with_dir_literature.py) supports two modes controlled by the --literature-grounding flag:

Standard Mode (default)

When --literature-grounding is NOT set:

run_review_with_dir_literature.py
├─ Stage 1: Reader → process_paper_extractions()
└─ Stage 2: Synthesizer → synthesize_review()
   └─ Output: Review

Literature Mode (enabled)

When --literature-grounding IS set:

run_review_with_dir_literature.py
├─ Stage 1: Librarian → create_baseline_reference()
├─ Stage 2: Reader → process_paper_extractions_literature()
├─ Stage 3: Fact-Checker → run_fact_checks()
└─ Stage 4: Critic → synthesize_grounded_review()
   └─ Output: GroundedReview

What Is Literature Grounding?

When enabled, literature grounding adds a 4-stage process that goes beyond standard review:

Stage 1: Librarian

Searches multiple literature sources (Semantic Scholar + optional World Bank) for relevant papers in the sub-topic, ranks them by citation count, and extracts key findings from the top 5 (configurable in config/literature_sources.yaml) to create a baseline reference.
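
As a rough illustration of the Librarian's selection step, the sketch below ranks candidate papers by citation count and keeps the top N for the baseline. It assumes each search hit is a simple dict with a "citations" field; the real create_baseline_reference() logic may differ.

def select_baseline(candidates: list[dict], baseline_papers_count: int = 5) -> list[dict]:
    """Keep the most-cited candidates as the baseline reference (illustrative only)."""
    ranked = sorted(candidates, key=lambda p: p.get("citations") or 0, reverse=True)
    return ranked[:baseline_papers_count]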

Stage 2: Reader

Extracts evidence from the target paper AND ranks novelty (1-5) against the baseline literature.

Stage 3: Fact-Checker

Verifies suspicious claims (e.g., "first study") through targeted searches.

Stage 4: Critic

Synthesizes review with research trajectory and novelty-adjusted scoring.

Key Benefits

  • Verifies novelty claims against the existing literature
  • Identifies missing citations and related work
  • Positions the paper within the broader research landscape

How to Enable Literature Grounding

Literature grounding is completely optional and can be enabled in three ways:

Method 1: Setup Script

Use the dedicated setup script for literature-grounded runs:

python setup_run_literature.py \
  --run-dir my_literature_review

This automatically copies literature_sources.yaml and sets up the environment.

Method 2: Configuration File

Edit config/literature_sources.yaml:

# Set to false to disable
enabled: true

Method 3: Runtime Flag

Use the --literature-grounding flag when running:

# Enable literature grounding
python run_review_with_dir_literature.py \
  --run-dir my_review_run \
  --literature-grounding

# Standard mode (no literature)
python run_review_with_dir_literature.py \
  --run-dir my_review_run

Literature Sources Configuration

The system supports multiple literature sources that can be enabled/disabled in config/literature_sources.yaml:

Source             Enabled By Default   Citation Data   Specialization
Semantic Scholar   ✅ Yes               ✅ Yes          Broad academic coverage (all fields)
Arxiv              ✅ Yes (new!)        ❌ No           Preprints, CS/physics/math, rate-limit resistant
World Bank         ❌ No (opt-in)       ❌ No           Development economics, policy reports

Enabling/Disabling Sources

Edit config/literature_sources.yaml to control which sources are active:

sources:
  semantic_scholar:
    enabled: true   # Primary source, has citations
  arxiv:
    enabled: true    # Good fallback for Semantic Scholar rate limits
  world_bank:
    enabled: false  # Set to true to enable

Semantic Scholar API Setup

Semantic Scholar offers a free-tier API (100 requests/minute). For higher limits, get an API key:

  1. Visit https://www.semanticscholar.org/product/api#api-key
  2. Sign up for a free API key
  3. Add to your global .env file:
    SEMANTIC_SCHOLAR_API_KEY=your_key_here

Arxiv API

Arxiv is a free, open-access preprint archive that serves as an excellent fallback when Semantic Scholar rate limits are reached.

Note: Arxiv papers lack citation counts and will rank lower than Semantic Scholar papers in results. Use as a fallback when Semantic Scholar is rate-limited.

World Bank API

The World Bank API is free and requires no authentication. It adds coverage of development economics research and policy reports.

Note: World Bank papers lack citation counts and will rank lower than Semantic Scholar papers in results.

Source Priority: When multiple sources are enabled, the system uses a quota-based allocation with spill-over to ensure fair representation. Papers with citations rank above those without. Each source gets an equal quota (baseline_papers_count ÷ number of sources), with unused slots spilling over to other sources.
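
A hypothetical sketch of this allocation, assuming each source returns a citation-sorted list of paper dicts; the project's actual implementation may differ in detail:

def allocate_baseline(results_by_source: dict[str, list], baseline_papers_count: int = 5) -> list:
    """Give each source an equal quota, then fill leftover slots by citation count."""
    if not results_by_source:
        return []
    quota = max(1, baseline_papers_count // len(results_by_source))
    selected, surplus = [], []

    # First pass: each source contributes up to its quota.
    for hits in results_by_source.values():
        selected.extend(hits[:quota])
        surplus.extend(hits[quota:])

    # Spill-over: unused slots go to the remaining candidates,
    # preferring papers that actually have citation counts.
    surplus.sort(key=lambda p: p.get("citations") or 0, reverse=True)
    remaining = baseline_papers_count - len(selected)
    if remaining > 0:
        selected.extend(surplus[:remaining])
    return selected[:baseline_papers_count]
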
Important: Literature Quality Depends on Search Results

The literature grounding feature is optional and its quality depends on the literature search results:

  • Search Quality: The Librarian agent searches for relevant papers, but quality varies by field. Well-established fields (e.g., development economics) yield better results than niche or emerging topics.
  • No Results Found: If the Librarian cannot find relevant papers, the baseline reference will be empty or low-quality, reducing the value of novelty assessment and research trajectory analysis.
  • Manual Baseline: For critical reviews, consider manually providing baseline papers in the literature/ directory to ensure high-quality comparison.
  • Verification Limits: Fact-checking only verifies specific suspicious claims (e.g., "first study"). It does not comprehensively verify all claims in the paper.

Recommendation: Test literature grounding with a small sample first to assess search quality for your field before running on large batches.

Literature-Grounded Output

When enabled, reviews include additional sections: a Research Trajectory analysis, Novelty Rankings (1-5), Fact-Check Results, and a Novelty-Adjusted Score.

Configuration Options

The config/literature_sources.yaml file controls all aspects:

Setting Default Description
librarian.baseline_papers_count 5 Number of papers to fetch for baseline
librarian.recency_years 5 How many years back to search
fact_checker.max_verifications_total 10 Max verification searches per review
fact_checker.triggers.*.enabled true Enable/disable specific triggers
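
If you want to inspect these settings programmatically, here is a small hedged example; the key names come from the table above, but the project may load and nest the file differently.

import yaml

with open("config/literature_sources.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["librarian"]["baseline_papers_count"])        # default: 5
print(cfg["librarian"]["recency_years"])                # default: 5
print(cfg["fact_checker"]["max_verifications_total"])   # default: 10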

Batch Processing with Literature Grounding

For batch runs with literature grounding:

python setup_batch_runs_literature.py \
  --master-papers-dir papers_master \
  --base-run-dir lit_run \
  --num-runs 5 \
  --create-batch-script

Recommendation:

Use literature grounding for final reviews or important decisions. For initial screening of large batches, consider running without literature grounding first to identify the most promising papers, then re-run with literature grounding on the top candidates.

Important: Literature grounding adds significant processing time (10-20 minutes per paper for literature search). Always test with a small batch first to ensure the configuration works as expected.

4c. Standalone Literature Review Pipeline

The standalone literature review pipeline is a dedicated workflow for papers that require comprehensive literature analysis. Unlike the literature-grounded enhancement (Section 4b), this pipeline uses only the literature-grounded agents and does not include a fallback to standard review.

Pipeline Architecture

The standalone pipeline (run_review_literature.py) implements a pure 4-stage literature-grounded workflow:

run_review_literature.py

Standalone Pipeline

├─ Stage 1: Librarian
│  └─ create_baseline_reference()
├─ Stage 2: Reader
│  └─ process_paper_extractions_literature()
├─ Stage 3: Fact-Checker
│  └─ run_fact_checks()
└─ Stage 4: Critic
   └─ synthesize_grounded_review()

Output

GroundedReview

  • Research Trajectory
  • Novelty Rankings (1-5)
  • Fact-Check Results
  • Novelty-Adjusted Score

Key Differences from Literature-Grounded Enhancement

Feature Literature Enhancement (4b) Standalone Pipeline (4c)
Configuration File Uses criteria.yaml for evaluation criteria Uses literature_sources.yaml for literature configuration
Pipeline Optional: Can disable and fall back to standard review Always runs 4-stage literature pipeline
Setup Script setup_run_literature.py setup_literature_review.py
Runner Script run_review_with_dir_literature.py --literature-grounding run_review_literature.py
Output Type Review (with optional literature sections) GroundedReview (always includes literature analysis)

When to Use the Standalone Pipeline

The standalone pipeline is designed for workflows where literature analysis is essential: systematic reviews, research trajectory analysis, and novelty-focused assessments (see the comparison at the end of this section).

Getting Started with Standalone Literature Review

Step 1: Set Up the Run Directory

Use the dedicated setup script:

python setup_literature_review.py \
  --run-dir my_literature_review

This creates the run directory structure, including a default literature_sources.yaml template for you to customize in the next step.

Step 2: Configure Literature Sources

Edit literature_sources.yaml to define your baseline literature:

# Literature sources (baseline papers)
sources:
  - id: "baseline_001"
    title: "Foundational Paper in Your Field"
    authors: ["Author One", "Author Two"]
    year: 2024
    venue: "Top Conference"
    topics: ["topic1", "topic2"]
    key_contributions:
      - "Established the methodology"
      - "Introduced key framework"

# Research trajectory definition
research_trajectory:
  starting_point: "earliest_work"
  progression:
    - stage: "initial_concepts"
      description: "Early foundational work"
    - stage: "current_state"
      description: "State-of-the-art approaches"
    - stage: "open_challenges"
      description: "Current limitations"

Step 3: Configure LLM Parameters

The standalone pipeline uses stage-specific LLM settings in .env:

# Stage 1: Librarian (Baseline Reference Creation)
PROVIDER_LIBRARIAN=openai
LIBRARIAN_MODEL=gpt-4o
LIBRARIAN_TEMPERATURE=0.2

# Stage 2: Reader (Novelty Ranking & Extraction)
PROVIDER_READER=openai
READER_MODEL=gpt-4o-mini
READER_TEMPERATURE=0.3

# Stage 3: Fact-Checker (Claim Verification)
PROVIDER_FACT_CHECKER=openai
FACT_CHECKER_MODEL=gpt-4o
FACT_CHECKER_TEMPERATURE=0.1

# Stage 4: Critic (Grounded Synthesis)
PROVIDER_CRITIC=deepseek
CRITIC_MODEL=deepseek-reasoner
CRITIC_TEMPERATURE=0.2
Recommended Configuration:
  • Librarian: Use a capable model like gpt-4o for accurate paper retrieval and summarization
  • Reader: Use a faster model like gpt-4o-mini for extraction tasks
  • Fact-Checker: Use a reliable model like gpt-4o for verification
  • Critic: Use your best reasoning model like deepseek-reasoner or claude-sonnet-4-5 for synthesis

Step 4: Run the Literature Review

Execute the standalone pipeline:

python run_review_literature.py --run-dir my_literature_review

Visual Progress Indicators

The pipeline provides clear visual feedback for each stage:

LITERATURE-GROUNDED REVIEW [1/5]: paper.pdf
============================================================

[Stage 1/4] Librarian: Creating baseline reference...
[Librarian] ✓ Created baseline with 5 papers

[Stage 2/4] Reader: Extracting evidence with novelty ranking...
[Reader] ✓ Completed 8 criterion extractions

[Stage 3/4] Fact-Checker: Running verification checks...
[Fact-Checker] ✓ Completed 3 verification checks
[Fact-Checker] ⚠️  2 claims require further review

[Stage 4/4] Critic: Synthesizing grounded review...
[Critic] ✓ Review synthesized (score: 65.2)

[Summary]
  Overall Score: 65.2
  Recommendation: REVISE AND RESUBMIT
  Avg Novelty: 3.45/5
  Novelty Adjustment: +2.5

Batch Processing with Standalone Pipeline

For processing multiple directories with the standalone pipeline:

python setup_batch_literature_review.py \
  --master-papers-dir papers_to_review \
  --master-literature-dir baseline_literature \
  --base-run-dir literature_run \
  --num-runs 3 \
  --create-batch-script

This creates the batch run directories, with papers and baseline literature distributed across them, plus a batch script to run them all.

Run the batch:

# Sequential execution
python run_batch_literature_review.py

# Parallel execution
python run_batch_literature_review.py --parallel --max-workers 4

Output Differences

Standalone literature reviews always include the literature-grounded sections: Research Trajectory, Novelty Rankings (1-5), Fact-Check Results, and a Novelty-Adjusted Score.

Important: The standalone pipeline requires literature_sources.yaml. If this file is missing, the system will create a default template that you must customize before running.

Comparison: Which Should You Use?

Literature Enhancement (4b)

Use when:

  • You want optional literature analysis
  • You have existing criteria.yaml files
  • You need flexibility to enable/disable
  • Standard review is acceptable as fallback

Standalone Pipeline (4c)

Use when:

  • Literature analysis is essential
  • You're doing systematic reviews
  • You need research trajectory analysis
  • You want dedicated 4-stage pipeline

5a. The Most Important Part: Customizing Your Criteria

This is the system's most powerful feature. You do **not** need to touch any Python code to completely change the review. The *only* file you need to edit is my_review_run/input/criteria.yaml.

The One Big Rule: Weights Must = 100

The system uses a 100-point scale. The weight: value you give to each criterion tells the system how much it "matters." Before you run the script, you **must** ensure all weight values in your file add up to exactly 100.
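
A quick, optional sanity check you can run before starting a review; it assumes the criteria.yaml layout shown later in this section (a top-level criteria: list whose items carry a weight: field).

import yaml

with open("my_review_run/input/criteria.yaml") as f:
    criteria = yaml.safe_load(f)["criteria"]

total = sum(c["weight"] for c in criteria)
print(f"Total weight: {total}")
assert total == 100, f"Weights must sum to 100, got {total}"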

Anatomy of a Criterion

Let's break down one criterion block from the file:

  - id: empirical_rigor
    name: Empirical Rigor
    description: |
      Assesses the quality of the empirical methods, data,
      and execution. Look for research design, causal
      identification, and statistical analysis.
    weight: 20
    scale:
      type: numeric
      range: [1, 5]
      labels:
        1: "Fundamentally flawed"
        2: "Significant weaknesses"
        3: "Adequate"
        4: "Strong and robust"
        5: "Exceptional / state-of-the-art"

Example: How to Add a New Criterion

Let's add a new criterion for "Statistical Robustness" with a weight of 10.

Step 1: Add the new block

Copy-paste an existing block and edit it. Now your file looks like this:

criteria:
  - id: theoretical_contribution
    name: Theoretical Contribution
    weight: 15
    # ... (other fields) ...
    
  - id: empirical_rigor
    name: Empirical Rigor
    weight: 20
    # ... (other fields) ...
    
  # ... (all your other criteria) ...
    
  # --- OUR NEW CRITERION ---
  - id: statistical_robustness
    name: Statistical Robustness
    description: |
      Evaluates the quality of statistical tests, power
      analysis, and sensitivity/robustness checks.
    weight: 10
    scale:
      type: numeric
      range: [1, 5]
      labels:
        1: "Statistically flawed"
        2: "Weak / Inappropriate"
        3: "Adequate"
        4: "Robust"
        5: "Exceptional"

Step 2: Adjust the weights to sum to 100

Our total weight is now 110 (the original 100 + our new 10). We must remove 10 points from other criteria.

Let's reduce theoretical_contribution from 15 to 10, and empirical_rigor from 20 to 15.
(15 + 20) = 35. (10 + 15) = 25. We've removed 10 points. Our total is 100 again.

Your file is now valid and ready to run. The system will automatically add a new "Statistical Robustness" section to all reviews and a new statistical_robustness_score column to your CSV.
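
Purely for intuition, here is one plausible way weighted 1-5 criterion scores could roll up into a 0-100 overall score. This is an assumption for illustration only, not necessarily the exact formula the synthesizer uses.

def weighted_overall(scores: dict[str, int], weights: dict[str, int]) -> float:
    """scores: criterion id -> 1-5 rating; weights: criterion id -> weight (summing to 100)."""
    assert sum(weights.values()) == 100
    # Scale each 1-5 rating to a 0-1 fraction, then apply its weight.
    return sum(weights[c] * (scores[c] - 1) / 4 for c in weights)

# Example: a paper scoring 4/5 on every criterion lands at 75/100 under this mapping.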

How to Remove a Criterion

  1. Delete the entire block (from - id: ... to the last label: ...).
  2. Adjust the remaining weights to add up to 100.

Recommendation Rules Based on Final Score

The system determines the recommendation from the final 0-100 score using the following default thresholds, which can be changed in my_review_run/input/criteria.yaml: 85+ Accept; 70-84 Accept with Revisions; 50-69 Revise and Resubmit; below 50 Reject.

Specialization Domain

The domain section in criteria.yaml is what "specializes" the agents. You can reference it in my_review_run/input/prompts/extractor_system.txt, for example: "You are an expert reviewer in the field of {domain}." or "Analyze the following paper based on criteria for {domain}." By changing this one line in criteria.yaml, you can pivot the entire review system to a new field (e.g., "machine_learning" or "clinical_psychology"), and the extractor agents will adjust their analysis accordingly.
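
Under the hood this is ordinary placeholder substitution. A minimal sketch (the template text and domain value are examples; the project's own prompt-loading code may differ):

domain = "development economics"   # value taken from the domain section of criteria.yaml

template = "You are an expert reviewer in the field of {domain}."
system_prompt = template.format(domain=domain)
print(system_prompt)   # -> You are an expert reviewer in the field of development economics.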

5b. Controlling Agent Behavior with Prompts

The "brain" of your AI agents is controlled by a set of simple text files located in your run directory. By editing these "prompts," you can change the agents' personalities, their analytical focus, and the style of their writing.

The "System" vs. "User" Prompt: A Key Concept

You'll notice that both the Extractor and Synthesizer agents have two prompt files. This is the most important concept to understand:

1. The _system.txt Prompt (The "Role")

This file defines the agent's personality, role, and general instructions. It's the "job description" you give the AI.

  • Example: "You are a senior academic editor..." or "You are an expert academic reviewer...".
  • Analogy: You are telling an actor who they are (e.g., "You are a skeptical detective").

2. The _user.txt Prompt (The "Task")

This file provides the specific task and data for the agent to work on. It's the "assignment" you give the AI for this specific job.

  • Example: "Here is the paper. Fill out the following JSON based on the criteria...".
  • Analogy: You are telling the actor what to do (e.g., "Analyze this evidence and report your findings").

File-by-File Breakdown

File Name Agent What It Controls
extractor_system.txt Extractor
  • Role: "Expert academic reviewer".
  • Context: Specialized in the {domain} (e.g., "development economics").
  • Output Format: Must be "objective" and "thorough".
extractor_user.txt Extractor
  • Input: The paper's content ({paper_markdown}) and the specific criterion ({criterion_name}).
  • Task: Read the paper and provide a score, justification, supporting quotes, strengths, and weaknesses.
  • Output Schema: Defines the JSON keys the script expects (score, score_justification, evidence, etc.).
synthesizer_system.txt Synthesizer
  • Role: "Senior academic editor"
  • Tone: Must be "constructive"
  • Task: Synthesize assessments from "junior reviewers" (the Extractor)
synthesizer_user.txt Synthesizer
  • Input: All extractions ({json_dump_of_extractions}), the score ({calculated_score}), and the recommendation ({calculated_recommendation})
  • Task: Write the final, human-readable review
  • Output Schema: Defines the final review structure (executive_summary, detailed_assessment, recommendation, etc.)

Practical Examples: What Can I Change?

⚠️ Important Warning: Be Careful with JSON Structure

The Python scripts (agent_extractor.py and agent_synthesizer.py) expect the AI to output a valid JSON object that matches the exact schema defined in the _user.txt files.

You can safely edit the parts of the prompts that control style, tone, or perspective (like the _system.txt files or the prose descriptions in the _user.txt files).

However, if you change the actual JSON keys (e.g., renaming "executive_summary" to "summary") in the prompt *without* also updating the Pydantic models in core/data_models.py, the program will crash when it tries to parse the AI's response.
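
To see why, here is a deliberately simplified, hypothetical version of what such a Pydantic model might look like; the real models in core/data_models.py are more detailed, but the coupling is the same: the JSON keys the prompt requests must match the model's field names exactly.

from pydantic import BaseModel

class Review(BaseModel):          # simplified; not the real model
    executive_summary: str
    detailed_assessment: str
    recommendation: str

# If the prompt asks the AI for "summary" instead of "executive_summary",
# validating the response against Review (e.g. Review.model_validate_json(...)
# in Pydantic v2) raises a validation error and the run fails.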

6. Understanding Your Results

After a run, you will find your results in the my_review_run/outputs/ folder.

1. The Detailed Reviews: my_review_run/outputs/reviews/

This folder contains the full, human-readable Markdown (.md) file for every paper. The filename tells you exactly how it was generated:

[PaperName]_[ExtractorModel]_[SynthesizerModel]_[Timestamp].md

2. The Master Spreadsheet: my_review_run/outputs/reports/

This folder contains your report_consolidated_...csv file. This is the "big picture" view of your entire run. It's perfect for opening in Excel or Google Sheets to sort and filter.

We've designed this report for usability, so you can sort and filter the entire run at a glance.

7. Advanced: Comparing Models with an AI Judge

Want to know if gpt-5 is a better reviewer than gemini-2.5-pro? This system is built for that. You can not only compare their results but also have a **third AI model act as an "AI Judge"** to review any disagreements.

Remember: Because we cache your papers (in my_review_run/ingestion_cache.json), your second, third, and fourth runs will be *much faster* than the first!

The Complete 3-Step Workflow

Comparing models and resolving disagreements is a manual 3-step process that gives you full control over each stage:


Step 1: Run Multiple Reviews

First, run the review system with different LLM configurations to generate multiple reports. For example:

Run 1: GPT-4o Configuration

python run_with_custom_params.py \
  --run-dir my_review_run \
  --provider-extraction openai \
  --extractor-model gpt-4o-mini \
  --provider-synthesis openai \
  --synthesizer-model gpt-4o-mini

Run 2: DeepSeek Configuration

python run_with_custom_params.py \
  --run-dir my_review_run \
  --provider-extraction deepseek \
  --extractor-model deepseek-reasoner \
  --provider-synthesis deepseek \
  --synthesizer-model deepseek-chat

After these runs, you'll have multiple report_consolidated_...csv files in your my_review_run/outputs/reports/ directory.


Step 2: Compare Reports to Find Conflicts

Now, run the comparison script to identify papers where the different models disagreed:

python compare_reports.py --run-dir my_review_run

This script will:

  • Read all the consolidated report files
  • Identify papers with conflicting recommendations
  • Create a new HUMAN_REVIEW_discrepancies_...csv file listing all conflicts
Important: Review this discrepancy file before proceeding to the judge step. This gives you a chance to see how many conflicts there are and decide if you want to proceed with the (potentially expensive) judge step.

Step 3: Run the AI Judge to Resolve Conflicts

Finally, run the judge script to have an AI review the conflicts and make a final decision:

python judge_conflicts.py --run-dir my_review_run

The judge will:

  • Read each paper that had conflicting reviews from the most recent discrepancies identified in Step 2
  • Read both conflicting reviews
  • Make a final, authoritative decision
  • Create a JUDGE_VERDICTS_report_...csv file with the final verdicts
Read the full logic of adjudication here

Understanding Your Final Results

The JUDGE_VERDICTS_report_...csv file is your final review list. It contains all the papers the models disagreed on, with key columns for the Judge's recommendation, the winning review, and the Judge's rationale.

This saves you from manually reading all the conflicts. You can now simply read the Judge's rationale and finalize the decision, turning hours of re-review into minutes.

πŸ“‹ The Judge Role: Deep Dive

The Judge component is a sophisticated arbitration system that resolves disagreements between multiple AI models when they produce conflicting reviews of the same paper. Understanding how it works will help you trust its verdicts and use it effectively.

Core Responsibilities

The Judge serves three primary functions in the review system: it reads each disputed paper together with the conflicting reviews, it decides which review assessed the paper more accurately, and it issues a final recommendation with a written rationale.

The "Champion vs Champion" Strategy

Instead of adjudicating between all possible pairs of reviews, the Judge uses an efficient "champion selection" approach:

Step 1: Faction Grouping

When multiple models review a paper, the Judge groups reviews by their recommendation type:

  • Accept Faction: All reviews recommending "Accept" or "Accept with Revisions"
  • Reject Faction: All reviews recommending "Reject" or "Revise and Resubmit"

Step 2: Champion Selection

From each faction, the Judge selects the highest-scoring review as that faction's "champion":

  • The champion is the review with the highest weighted score across all criteria
  • This ensures only the best arguments from each side are presented to the Judge
  • Dramatically reduces computational cost compared to pairwise comparison
Example: In a 3-vs-2 conflict where three models say "Accept" and two say "Reject", the Judge only reviews the best-scoring "Accept" review against the best-scoring "Reject" review (one champion per side), not all 10 possible pairs!
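
In code, champion selection amounts to grouping by recommendation and taking the maximum score on each side. A sketch, assuming each review is a dict with "recommendation" and "score" keys (the judge script's real data structures may differ):

ACCEPT_SIDE = {"Accept", "Accept with Revisions"}

def pick_champions(reviews: list[dict]) -> tuple[dict | None, dict | None]:
    """Return the highest-scoring review from the accept side and the reject side."""
    accept = [r for r in reviews if r["recommendation"] in ACCEPT_SIDE]
    reject = [r for r in reviews if r["recommendation"] not in ACCEPT_SIDE]

    def best(side: list[dict]) -> dict | None:
        return max(side, key=lambda r: r["score"]) if side else None

    return best(accept), best(reject)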

Step 3: LLM-as-Judge Adjudication

An independent AI model (default: GPT-4o) acts as the Judge and receives:

  • The Original Paper: Full text content (truncated to 400,000 characters if needed)
  • Both Champion Reviews: Complete reviews with scores, justifications, and model identities
  • Impartiality Instructions: System prompt emphasizing objectivity and expertise

The Judge evaluates both reviews and produces a structured verdict:

{
  "judge_recommendation": "Accept w/ Revisions",
  "judge_rationale": "Review A more accurately captures...",
  "winning_review": "A",
  "winning_review_rationale": "Review A demonstrates deeper analysis..."
}

Scoring System & Weighted Rules

The Judge does not assign scores; that's done by the Extractor and Editor agents. Instead, the Judge uses the existing weighted scoring system to evaluate champion quality:

Eight Weighted Criteria

Each review is evaluated on eight dimensions with specific weights:

Criterion                  Weight
Empirical Rigor            18
Identification Strategy    15
Statistical Robustness     15
Theoretical Contribution   12
Policy Relevance           12
Originality/Novelty        10
Data Quality               10
Presentation/Clarity       8

Recommendation Thresholds

Final scores (0-100 scale) map to recommendations:

  • 85+: Accept
  • 70-84: Accept with Revisions
  • 50-69: Revise and Resubmit
  • <50: Reject

Configurable in criteria.yaml
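
The mapping is simple enough to express directly (defaults shown; the actual thresholds come from criteria.yaml):

def recommendation_for(score: float) -> str:
    """Map a 0-100 score to a recommendation using the default thresholds above."""
    if score >= 85:
        return "Accept"
    if score >= 70:
        return "Accept with Revisions"
    if score >= 50:
        return "Revise and Resubmit"
    return "Reject"

print(recommendation_for(65.2))   # -> Revise and Resubmit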

Configuring the Judge

You can customize which AI model acts as Judge by setting environment variables in your run directory's .env file:

# Default configuration (can be omitted)
JUDGE_PROVIDER=openai
JUDGE_MODEL=gpt-4o

# Alternative: Use Gemini as Judge
JUDGE_PROVIDER=gemini
JUDGE_MODEL=gemini-2.5-pro

# Alternative: Use Claude as Judge
JUDGE_PROVIDER=anthropic
JUDGE_MODEL=claude-sonnet-4-5
Recommendation:

Use a highly capable model for the Judge role. The Judge's decisions are final and should represent the most careful, impartial analysis possible. GPT-4o, Claude Sonnet 4.5, or Gemini 2.5 Pro are excellent choices.

Progress Tracking & Cost Management

The Judge system includes sophisticated progress tracking to avoid redundant adjudication: verdicts are recorded in my_review_run/judge_progress.json, so interrupted runs resume where they left off and already-judged conflicts are not re-sent to the LLM.

Cost Warning: The Judge step can be expensive for large conflict sets. Always review the HUMAN_REVIEW_discrepancies_...csv file before running the Judge to see how many conflicts exist. Consider adjudicating only the most important conflicts if budget is a concern.

Understanding Judge Verdicts

The final JUDGE_VERDICTS_...csv file records, for each adjudicated conflict, the Judge's recommendation, the winning review, and the Judge's rationale.

The Judge's rationale is your most valuable asset: it provides a neutral, expert assessment of which review was more accurate and why. This can help you finalize decisions quickly and understand where each model's reviews fall short.

Key Strengths of the Judge System

  1. Efficiency: Champion selection reduces adjudication complexity from O(nΒ²) to O(n)
  2. Transparency: Detailed rationales explain every decision
  3. Robustness: Progress tracking prevents redundant work and enables resumption
  4. Configurability: Weights, thresholds, and Judge model are all customizable
  5. Cost-Effectiveness: Tracks API usage and avoids unnecessary adjudications

The Judge component transforms the multi-agent review system from a simple consensus tool into a sophisticated peer review mechanism, ensuring that final decisions are based on the best available analysis, not just the first available agreement.

8. Processing Large Batches with Run Directories

For large-scale processing, the system distributes papers across multiple run directories, enabling controlled resumption and improved load management to minimize the risk of exceeding LLM rate limits. This configuration facilitates experimentation with different settings, including review criteria, and supports parallel execution when rate limits are not a concern.

Setting Up Multiple Run Directories

If you have a master directory with hundreds of papers, you can distribute them across multiple run directories:

python setup_batch_runs.py \
  --master-papers-dir /path/to/master/papers \
  --base-run-dir run_dir \
  --num-runs 10 \
  --papers-per-run 50 \
  --create-batch-script

This will create 10 run directories (run_dir1, run_dir2, etc.) with 50 papers each, plus a batch script to run them all.

Running Multiple Directories

After setting up multiple directories, you can run them all at once:

# Sequential execution
python run_batch.py

# Parallel execution (up to 4 workers)
python run_batch.py --parallel

# Parallel execution with custom number of workers
python run_batch.py --parallel --max-workers 8

Resuming After Interruptions

The system automatically tracks progress and can resume from where it left off if interrupted. Each run directory maintains its own progress tracking, so you can stop and restart individual directories at any time without losing completed work.

Important: The system tracks progress based on the LLM configuration. If you change the LLM parameters, it will automatically reprocess all papers with the new configuration.

9. Troubleshooting & Common Questions

Q: I'm getting a `FAILED criterion` or `JSON error`!

A: This almost always means the AI's response was too long and got cut off. Your AI is being *too* detailed!
The Fix: Open core/llm_wrapper.py and find the max_completion_tokens or max_tokens settings. Increase them from 4096 to 8192 and try again.

Q: How do I change the "Judge" LLM?

A: The Judge is set by two environment variables in your my_review_run/input/.env file: JUDGE_PROVIDER and JUDGE_MODEL. If they aren't there, the script defaults to openai and gpt-4o. You can add these lines to change it to a powerful model, like Gemini Pro:

JUDGE_PROVIDER=gemini
JUDGE_MODEL=gemini-2.5-pro

Q: I'm getting a `BadRequestError: Unsupported parameter: 'max_tokens'`!

A: This is a known issue with some custom endpoints. The system is designed to "surgically bypass" this problem for the custom_openai/gpt-5 combination. If you see this error with a *different* model, it means a developer needs to add a new bypass rule to core/llm_wrapper.py.

Q: How much will this cost?

A: Be very careful with large batches. The cost is:
(Number of Papers) x (Number of Criteria) x (Cost of Extractor Model)
+
(Number of Papers) x (Cost of Synthesizer Model)
+
(Number of Conflicts) x (Cost of Judge Model)

A 500-paper batch with 8 criteria is **4,000** calls to the Extractor, plus 500 calls to the Synthesizer. Always test with a small batch of 5-10 papers first!
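
A back-of-the-envelope estimator for this formula (the per-call prices below are made-up placeholders; substitute your providers' actual rates):

def estimate_calls_and_cost(papers: int, criteria: int, conflicts: int,
                            extractor_price: float, synthesizer_price: float,
                            judge_price: float) -> tuple[int, float]:
    """Return (total LLM calls, rough cost) for one batch."""
    extractor_calls = papers * criteria
    synthesizer_calls = papers
    judge_calls = conflicts
    total_calls = extractor_calls + synthesizer_calls + judge_calls
    cost = (extractor_calls * extractor_price
            + synthesizer_calls * synthesizer_price
            + judge_calls * judge_price)
    return total_calls, cost

# Example: 500 papers, 8 criteria, 50 conflicts, with assumed per-call prices.
print(estimate_calls_and_cost(500, 8, 50, 0.01, 0.05, 0.05))   # -> (4550, 67.5)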

Q: How do I clear the cache and re-parse my papers?

A: Simply delete the my_review_run/ingestion_cache.json file. The system will fully re-parse all your papers from scratch on the next run.

Q: How do I clear the paper review cache?

A: Simply delete the my_review_run/progress.json file. The system will review all the papers from scratch on the next run.

Q: How do I clear the judge review cache?

A: Simply delete the my_review_run/judge_progress.json file. The system will adjudicate all the papers from scratch on the next run.

Q: Can I mix and match different LLM providers?

A: Yes! The system fully supports using different LLM providers for extraction and synthesis. For example:

python run_with_custom_params.py \
  --run-dir my_review_run \
  --provider-extraction deepseek \
  --extractor-model deepseek-reasoner \
  --provider-synthesis openai \
  --synthesizer-model gpt-4o-mini
This will use DeepSeek's reasoning model for extraction and OpenAI's GPT-4o-mini for synthesis.

Q: What if I have fewer papers than run directories?

A: The system will distribute papers as evenly as possible. For example, if you have 10 papers and 3 run directories, it will create directories with 4, 3, and 3 papers respectively.

Q: Do I need to run compare_reports.py before judge_conflicts.py?

A: Yes! The 3-step workflow is designed to be manual:

  1. Run reviews with different configurations
  2. Run compare_reports.py to find conflicts
  3. Run judge_conflicts.py to resolve conflicts
This gives you control over the process and allows you to review conflicts before deciding whether to run the (potentially expensive) judge step.