1. What Is This System?
This system ingests large collections of academic papers (e.g., 500+), evaluates each against your custom criteria, and produces a comprehensive score. It does so through a coordinated team of AI agents (Specialist, Editor, Judge, and an optional Librarian) that work together to ensure thorough, consistent, and rule-driven assessments.
Agent Roles
- Specialist: Conducts detailed, domain-specific analyses of each paper, extracting methodological quality, novelty, data integrity, and other metrics defined by your rules.
- Editor: Synthesizes the Specialist's findings into clear, concise summaries, pulls out key figures/tables, and formats the information for downstream review.
- Judge: Resolves conflicts between multiple AI model reviews through an intelligent "champion vs champion" adjudication process, selecting the best assessment from conflicting recommendations.
- Librarian (Literature Mode): Searches academic databases for related papers, extracts key findings from the literature, and provides baseline context for novelty assessment.
- Fact-Checker (Literature Mode): Verifies suspicious claims (e.g., "first study") through targeted searches to ensure accurate novelty claims.
- Critic (Literature Mode): Synthesizes reviews with research trajectory analysis and novelty-adjusted scoring.
Instead of you manually skimming 500 abstracts, this tool will:
- Read the full text of every paper.
- Score each one against criteria *you* define (like "Empirical Rigor" or "Originality").
- Provide a detailed written review with direct quotes to back up its scores.
- Generate a single spreadsheet (CSV) so you can sort all papers from best to worst.
- NEW (Optional): Search for related papers to establish baseline literature (quality depends on search results).
- NEW (Optional): Position each paper within the research trajectory with novelty-adjusted scoring.
- (Optional) Compare runs from different AI models and have an "AI Judge" review any disagreements.
- Handle interruptions gracefully and resume from where it left off.
- Allow you to experiment with different LLM providers for different tasks.
2. How It Works: Your AI Review Team
The system's architecture is best understood as a three-person team: a **Specialist**, an **Editor**, and a **Senior Judge**.
1. Ingestion
Your .pdf and .md files are read and converted to plain text.
2. Agent 1: The Specialist
If you have 8 criteria, this agent reads the paper 8 separate times, focusing on just one criterion each time.
3. Agent 2: The Editor
Gathers all 8 specialist reports and writes one final, polished review with an overall score.
4. Output
You get a detailed .md review and a row in the master .csv spreadsheet.
5. Discrepancy Check
A separate script compares the .csv files from two different runs (e.g., GPT-4o vs. GPT-5).
6. Agent 3: The Judge
A new AI "Judge" reads the paper and the two conflicting reviews, then issues a final, tie-breaking verdict.
This multi-agent workflow keeps the AI engaged, structured, and accountable. Instead of producing a single, vague response, each agent has a distinct and focused role:
- The Specialist gathers concrete evidence and details.
- The Editor synthesizes those findings into a coherent, well-written review.
- The Judge performs final quality control, ensuring accuracy, balance, and completeness.
By separating responsibilities, the system encourages precision over generalization, reducing the risk of "lazy" or superficial outputs and ensuring each stage adds measurable value.
When literature grounding is enabled (via --literature-grounding flag), the system adds specialized agents for literature analysis.
See Section 4b: Literature-Grounded Reviews for detailed information.
Complete Pipeline Flow (Literature-Enhanced)
The run_review_with_dir_literature.py script supports a complete pipeline with optional literature stages:
Each paper produces a *_review.md report and a row in report_consolidated_*.csv. Key insight: the literature stages are completely optional. When disabled, the pipeline runs as a standard 3-stage review (Ingestion → Extractor → Synthesizer → Output).
3. Getting Started (Installation)
You only need to do this once.
1. Install Dependencies: Open your terminal, navigate to the project folder, and run:
   pip install -r requirements.txt
2. Set Up Your Keys: Create a .env file at the project root (same level as run_review_with_dir.py). This is your private "password" file. Open it and paste in your API keys from AI providers:
   # This tells the system your "password" for OpenAI
   OPENAI_API_KEY=sk-...
   # This tells the system your "password" for your custom model
   CUSTOM_OPENAI_API_KEY=your-custom-key-here
   # This tells the system the *address* of your custom model
   CUSTOM_OPENAI_API_BASE=http://your-server-address:port/v1
4. How to Run Your First Review
This is your main workflow. Just follow these steps.
1. Set Up a Run Directory: Create a dedicated directory for your review run:
   python setup_run.py --run-dir my_review_run
2. Add Your Papers: Drop all your .pdf, .md, etc., files into the my_review_run/papers/ folder.
3. Choose Your AI: The system allows you to use different AI models for different tasks. For example, to use DeepSeek for extraction and OpenAI for synthesis:
   python run_with_custom_params.py \
     --run-dir my_review_run \
     --provider-extraction deepseek \
     --extractor-model deepseek-reasoner \
     --provider-synthesis openai \
     --synthesizer-model gpt-4o-mini
4. Customize Your Criteria: Open my_review_run/input/criteria.yaml and tell the AI what to look for. (See the next section for a full guide!)
5. (Optional) Choose Your Judge: The system's "AI Judge" (see Section 7) defaults to gpt-4o. You can override this by adding these lines to your my_review_run/input/.env file:
   JUDGE_PROVIDER=gemini
   JUDGE_MODEL=gemini-2.5-pro
6. Run the System: The command in step 3 will automatically start the review process.
7. Check Your Results: Look in the my_review_run/outputs/reviews/ and my_review_run/outputs/reports/ folders to see the final reviews!
Based on our analysis, we recommend using different models for different tasks:
- Agent 1 (Extraction): Use a faster, more cost-effective model like gpt-4o-mini, deepseek-chat, or claude-haiku-4-5. This task is less complex and runs multiple times per paper.
- Agent 2 (Synthesis): Use your most capable model like gpt-5, deepseek-reasoner, or claude-sonnet-4-5. This task requires higher-level reasoning and runs only once per paper.
4b. Literature-Grounded Reviews (New Feature!)
The system now supports literature-grounded reviews that automatically search for and analyze related papers to ground the assessment in existing research. This feature helps verify novelty claims, identify missing citations, and position the paper within the research landscape.
Pipeline Architecture: Dual-Mode Operation
The literature-enhanced pipeline (run_review_with_dir_literature.py) supports two modes controlled by the --literature-grounding flag:
Standard Mode (default)
When --literature-grounding is NOT set:
process_paper_extractions() → synthesize_review() → Review
Literature Mode (enabled)
When --literature-grounding IS set:
create_baseline_reference() → process_paper_extractions_literature() → run_fact_checks() → synthesize_grounded_review() → GroundedReview
Key characteristics:
- Uses criteria.yaml for evaluation criteria (same as the standard pipeline)
- Literature stages are optional and can be enabled/disabled per run
- Falls back to a standard review if the literature stages fail
- Literature configuration lives in config/literature_sources.yaml
What Is Literature Grounding?
When enabled, literature grounding adds a 4-stage process that goes beyond standard review:
Stage 1: Librarian
Searches multiple literature sources (Semantic Scholar + optional World Bank) for relevant papers in the sub-topic, ranks them by citation count, and extracts key findings from the top 5 (configurable in config/literature_sources.yaml) to create a baseline reference.
Stage 2: Reader
Extracts evidence from the target paper AND ranks novelty (1-5) against the baseline literature.
Stage 3: Fact-Checker
Verifies suspicious claims (e.g., "first study") through targeted searches.
Stage 4: Critic
Synthesizes review with research trajectory and novelty-adjusted scoring.
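To make the flow concrete, here is a minimal orchestration sketch that chains the stage functions named in this guide (create_baseline_reference, process_paper_extractions_literature, run_fact_checks, synthesize_grounded_review). The argument lists and fallback handling are assumptions; the authoritative logic lives in run_review_with_dir_literature.py.

```python
# Sketch only: the stage function names come from this guide, but their
# signatures and the fallback handling below are assumptions, not the
# project's actual code.

def review_with_literature(paper_text, criteria, lit_config):
    try:
        baseline = create_baseline_reference(paper_text, lit_config)        # Stage 1: Librarian
        extractions = process_paper_extractions_literature(                 # Stage 2: Reader
            paper_text, criteria, baseline)
        fact_checks = run_fact_checks(extractions, lit_config)              # Stage 3: Fact-Checker
        return synthesize_grounded_review(                                  # Stage 4: Critic
            paper_text, extractions, baseline, fact_checks)
    except Exception:
        # Documented behaviour: if the literature stages fail, the pipeline
        # falls back to the standard review.
        return synthesize_review(process_paper_extractions(paper_text, criteria))
```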
Key Benefits
- Baseline Literature Context: Establishes what's known in the research area by finding relevant papers
- Novelty Rankings: Per-criterion novelty assessment (1-5 scale) comparing the paper against baseline literature
- Research Trajectory: Shows where the paper fits in the evolution of the field
- Fact-Checking: Verifies suspicious claims like "first study" when detected
- Novelty-Adjusted Scoring: Adjusts scores based on actual contribution vs. prior work
How to Enable Literature Grounding
Literature grounding is completely optional and can be enabled in three ways:
Method 1: Setup Script
Use the dedicated setup script for literature-grounded runs:
python setup_run_literature.py \
  --run-dir my_literature_review
This automatically copies literature_sources.yaml and sets up the environment.
Method 2: Configuration File
Edit config/literature_sources.yaml:
# Set to false to disable
enabled: true
Method 3: Runtime Flag
Use the --literature-grounding flag when running:
# Enable literature grounding
python run_review_with_dir_literature.py \
  --run-dir my_review_run \
  --literature-grounding
# Standard mode (no literature)
python run_review_with_dir_literature.py \
  --run-dir my_review_run
Literature Sources Configuration
The system supports multiple literature sources that can be enabled/disabled in config/literature_sources.yaml:
| Source | Enabled By Default | Citation Data | Specialization |
|---|---|---|---|
| Semantic Scholar | Yes | Yes | Broad academic coverage (all fields) |
| Arxiv | Yes (new!) | No | Preprints, CS/physics/math, rate-limit resistant |
| World Bank | No (opt-in) | No | Development economics, policy reports |
Enabling/Disabling Sources
Edit config/literature_sources.yaml to control which sources are active:
sources:
semantic_scholar:
enabled: true # Primary source, has citations
arxiv:
enabled: true # Good fallback for Semantic Scholar rate limits
world_bank:
enabled: false # Set to true to enable
Semantic Scholar API Setup
Semantic Scholar offers a free-tier API (100 requests/minute). For higher limits, get an API key:
- Visit https://www.semanticscholar.org/product/api#api-key
- Sign up for a free API key
- Add to your global .env file:
  SEMANTIC_SCHOLAR_API_KEY=your_key_here
Arxiv API
Arxiv is a free, open-access archive that serves as an excellent fallback when Semantic Scholar rate limits are reached. When enabled, it provides:
- Preprints and published papers across CS, physics, math, and more
- No API key required
- No strict rate limits (1 request per 3 seconds recommended)
- Open-access PDF availability
Note: Arxiv papers lack citation counts and will rank lower than Semantic Scholar papers in results. Use as a fallback when Semantic Scholar is rate-limited.
World Bank API
The World Bank API is free and requires no authentication. When enabled, it provides:
- Development economics working papers and reports
- Policy documents and publications
- Open-access PDF availability
Note: World Bank papers lack citation counts and will rank lower than Semantic Scholar papers in results.
The literature grounding feature is optional and its quality depends on the literature search results:
- Search Quality: The Librarian agent searches for relevant papers, but quality varies by field. Well-established fields (e.g., development economics) yield better results than niche or emerging topics.
- No Results Found: If the Librarian cannot find relevant papers, the baseline reference will be empty or low-quality, reducing the value of novelty assessment and research trajectory analysis.
- Manual Baseline: For critical reviews, consider manually providing baseline papers in the literature/ directory to ensure a high-quality comparison.
- Verification Limits: Fact-checking only verifies specific suspicious claims (e.g., "first study"). It does not comprehensively verify all claims in the paper.
Recommendation: Test literature grounding with a small sample first to assess search quality for your field before running on large batches.
Literature-Grounded Output
When enabled, reviews include additional sections:
- Research Trajectory: Position of the paper in the literature landscape
- Novelty Assessment: Per-criterion novelty rankings (1-5 scale)
- Fact-Check Summary: Verification of bold or suspicious claims
- Novelty-Adjusted Score: Score adjusted based on actual novelty
Configuration Options
The config/literature_sources.yaml file controls all aspects:
| Setting | Default | Description |
|---|---|---|
| librarian.baseline_papers_count | 5 | Number of papers to fetch for the baseline |
| librarian.recency_years | 5 | How many years back to search |
| fact_checker.max_verifications_total | 10 | Max verification searches per review |
| fact_checker.triggers.*.enabled | true | Enable/disable specific triggers |
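If you want to confirm what the pipeline will use before a run, the settings can be inspected with a few lines of Python. This is a minimal sketch that assumes the nesting shown in this guide (librarian.*, fact_checker.*, sources.*); check your own config/literature_sources.yaml for the authoritative layout.

```python
# Minimal sketch: print the literature settings described above.
# Assumes the key nesting shown in this guide; your YAML layout may differ.
import yaml

with open("config/literature_sources.yaml") as f:
    cfg = yaml.safe_load(f)

print("Baseline papers :", cfg["librarian"]["baseline_papers_count"])       # default 5
print("Recency (years) :", cfg["librarian"]["recency_years"])               # default 5
print("Max fact checks :", cfg["fact_checker"]["max_verifications_total"])  # default 10
for name, source in cfg.get("sources", {}).items():
    print(f"Source {name}: enabled={source.get('enabled')}")
```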
Batch Processing with Literature Grounding
For batch runs with literature grounding:
python setup_batch_runs_literature.py \
  --master-papers-dir papers_master \
  --base-run-dir lit_run \
  --num-runs 5 \
  --create-batch-script
Use literature grounding for final reviews or important decisions. For initial screening of large batches, consider running without literature grounding first to identify the most promising papers, then re-run with literature grounding on the top candidates.
4c. Standalone Literature Review Pipeline
The standalone literature review pipeline is a dedicated workflow for papers that require comprehensive literature analysis. Unlike the literature-grounded enhancement (Section 4b), this pipeline uses only the literature-grounded agents and does not include a fallback to standard review.
Pipeline Architecture
The standalone pipeline (run_review_literature.py) implements a pure 4-stage literature-grounded workflow:
run_review_literature.py
Standalone Pipeline
create_baseline_reference() → process_paper_extractions_literature() → run_fact_checks() → synthesize_grounded_review() → Output
GroundedReview
- Research Trajectory
- Novelty Rankings (1-5)
- Fact-Check Results
- Novelty-Adjusted Score
Key characteristics:
- Configured via literature_sources.yaml (not criteria.yaml)
- Always runs all 4 stages (no fallback to standard review)
- Uses a dedicated literature/ directory for baseline papers
- Stage-specific LLM parameters (Librarian, Reader, Fact-Checker, Critic)
Key Differences from Literature-Grounded Enhancement
| Feature | Literature Enhancement (4b) | Standalone Pipeline (4c) |
|---|---|---|
| Configuration File | Uses criteria.yaml for evaluation criteria | Uses literature_sources.yaml for literature configuration |
| Pipeline | Optional: can disable and fall back to standard review | Always runs the 4-stage literature pipeline |
| Setup Script | setup_run_literature.py | setup_literature_review.py |
| Runner Script | run_review_with_dir_literature.py --literature-grounding | run_review_literature.py |
| Output Type | Review (with optional literature sections) | GroundedReview (always includes literature analysis) |
When to Use the Standalone Pipeline
The standalone pipeline is designed for:
- Systematic Literature Reviews: When you need to position papers within the research landscape
- Novelty Assessment: When evaluating claims of originality or "first study" assertions
- Citation Analysis: When checking for missing prior work or proper attribution
- Research Trajectory: When understanding how a paper advances the field
Getting Started with Standalone Literature Review
Step 1: Set Up the Run Directory
Use the dedicated setup script:
python setup_literature_review.py \
--run-dir my_literature_review
This creates:
- my_literature_review/papers/: place papers to review here
- my_literature_review/literature/: place baseline reference papers here (optional)
- my_literature_review/input/literature_sources.yaml: configure literature sources
- my_literature_review/input/.env: LLM parameters for each stage
Step 2: Configure Literature Sources
Edit literature_sources.yaml to define your baseline literature:
# Literature sources (baseline papers)
sources:
- id: "baseline_001"
title: "Foundational Paper in Your Field"
authors: ["Author One", "Author Two"]
year: 2024
venue: "Top Conference"
topics: ["topic1", "topic2"]
key_contributions:
- "Established the methodology"
- "Introduced key framework"
# Research trajectory definition
research_trajectory:
starting_point: "earliest_work"
progression:
- stage: "initial_concepts"
description: "Early foundational work"
- stage: "current_state"
description: "State-of-the-art approaches"
- stage: "open_challenges"
description: "Current limitations"
Step 3: Configure LLM Parameters
The standalone pipeline uses stage-specific LLM settings in .env:
# Stage 1: Librarian (Baseline Reference Creation)
PROVIDER_LIBRARIAN=openai
LIBRARIAN_MODEL=gpt-4o
LIBRARIAN_TEMPERATURE=0.2
# Stage 2: Reader (Novelty Ranking & Extraction)
PROVIDER_READER=openai
READER_MODEL=gpt-4o-mini
READER_TEMPERATURE=0.3
# Stage 3: Fact-Checker (Claim Verification)
PROVIDER_FACT_CHECKER=openai
FACT_CHECKER_MODEL=gpt-4o
FACT_CHECKER_TEMPERATURE=0.1
# Stage 4: Critic (Grounded Synthesis)
PROVIDER_CRITIC=deepseek
CRITIC_MODEL=deepseek-reasoner
CRITIC_TEMPERATURE=0.2
- Librarian: Use a capable model like gpt-4o for accurate paper retrieval and summarization
- Reader: Use a faster model like gpt-4o-mini for extraction tasks
- Fact-Checker: Use a reliable model like gpt-4o for verification
- Critic: Use your best reasoning model like deepseek-reasoner or claude-sonnet-4-5 for synthesis
Step 4: Run the Literature Review
Execute the standalone pipeline:
python run_review_literature.py --run-dir my_literature_review
Visual Progress Indicators
The pipeline provides clear visual feedback for each stage:
LITERATURE-GROUNDED REVIEW [1/5]: paper.pdf
============================================================
[Stage 1/4] Librarian: Creating baseline reference...
[Librarian] ✓ Created baseline with 5 papers
[Stage 2/4] Reader: Extracting evidence with novelty ranking...
[Reader] ✓ Completed 8 criterion extractions
[Stage 3/4] Fact-Checker: Running verification checks...
[Fact-Checker] ✓ Completed 3 verification checks
[Fact-Checker] ⚠️ 2 claims require further review
[Stage 4/4] Critic: Synthesizing grounded review...
[Critic] ✓ Review synthesized (score: 65.2)
[Summary]
Overall Score: 65.2
Recommendation: REVISE AND RESUBMIT
Avg Novelty: 3.45/5
Novelty Adjustment: +2.5
Batch Processing with Standalone Pipeline
For processing multiple directories with the standalone pipeline:
python setup_batch_literature_review.py \
--master-papers-dir papers_to_review \
--master-literature-dir baseline_literature \
--base-run-dir literature_run \
--num-runs 3 \
--create-batch-script
This creates batch directories with:
- Papers distributed across literature_run1/, literature_run2/, etc.
- Baseline literature papers copied to each run's literature/ directory
- A run_batch_literature_review.py script for batch execution
Run the batch:
# Sequential execution
python run_batch_literature_review.py
# Parallel execution
python run_batch_literature_review.py --parallel --max-workers 4
Output Differences
Standalone literature reviews always include:
- Research Trajectory Section: Detailed analysis of how the paper fits into the literature
- Novelty Rankings: Per-criterion novelty scores (1-5 scale)
- Fact-Check Results: Verification of suspicious claims
- Novelty-Adjusted Score: Base score adjusted for actual contribution
Note: The standalone pipeline requires literature_sources.yaml. If this file is missing, the system will create a default template that you must customize before running.
Comparison: Which Should You Use?
Literature Enhancement (4b)
Use when:
- You want optional literature analysis
- You have existing criteria.yaml files
- You need flexibility to enable/disable
- Standard review is acceptable as fallback
Standalone Pipeline (4c)
Use when:
- Literature analysis is essential
- You're doing systematic reviews
- You need research trajectory analysis
- You want dedicated 4-stage pipeline
5a. The Most Important Part: Customizing Your Criteria
This is the system's most powerful feature. You do **not** need to touch any Python code to completely change the review. The *only* file you need to edit is my_review_run/input/criteria.yaml.
The One Big Rule: Weights Must = 100
The system uses a 100-point scale. The weight: value you give to each criterion tells the system how much it "matters." Before you run the script, you **must** ensure all weight values in your file add up to exactly 100.
Anatomy of a Criterion
Let's break down one criterion block from the file:
- id: empirical_rigor
name: Empirical Rigor
description: |
Assesses the quality of the empirical methods, data,
and execution. Look for research design, causal
identification, and statistical analysis.
weight: 20
scale:
type: numeric
range: [1, 5]
labels:
1: "Fundamentally flawed"
2: "Significant weaknesses"
3: "Adequate"
4: "Strong and robust"
5: "Exceptional / state-of-the-art"
- id: The unique, one-word ID. This is used as the header in the final CSV file (e.g., empirical_rigor_score).
- name: The "pretty" name used in the Markdown report.
- description: This is the most important field. This text is injected directly into the prompt for Agent 1. A clear, specific description will give you a high-quality review; a vague description will give you a vague review.
- weight: How much this score matters out of 100.
- scale: Defines the scoring. The labels tell the AI what the 1-5 scale means.
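The weights feed directly into the final 0-100 score. The exact formula lives in the synthesis code, but one plausible version, assuming each 1-5 criterion score is rescaled to a 0-1 fraction before weighting, looks like this:

```python
# Illustrative only: one plausible way a 0-100 overall score could be derived
# from 1-5 criterion scores and weights that sum to 100. The project's actual
# formula may differ; treat this as a reading aid, not the implementation.

def weighted_score(scores: dict, weights: dict) -> float:
    assert sum(weights.values()) == 100, "weights must sum to exactly 100"
    total = 0.0
    for criterion, weight in weights.items():
        fraction = (scores[criterion] - 1) / 4   # map 1-5 onto 0.0-1.0
        total += fraction * weight
    return round(total, 1)

# Example: a strong empirical paper with middling originality
print(weighted_score({"empirical_rigor": 4, "originality": 3},
                     {"empirical_rigor": 60, "originality": 40}))   # -> 65.0
```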
Example: How to Add a New Criterion
Let's add a new criterion for "Statistical Robustness" with a weight of 10.
Step 1: Add the new block
Copy-paste an existing block and edit it. Now your file looks like this:
criteria:
- id: theoretical_contribution
name: Theoretical Contribution
weight: 15
# ... (other fields) ...
- id: empirical_rigor
name: Empirical Rigor
weight: 20
# ... (other fields) ...
# ... (all your other criteria) ...
# --- OUR NEW CRITERION ---
- id: statistical_robustness
name: Statistical Robustness
description: |
Evaluates the quality of statistical tests, power
analysis, and sensitivity/robustness checks.
weight: 10
scale:
type: numeric
range: [1, 5]
labels:
1: "Statistically flawed"
2: "Weak / Inappropriate"
3: "Adequate"
4: "Robust"
5: "Exceptional"
Step 2: Adjust the weights to sum to 100
Our total weight is now 110 (the original 100 + our new 10). We must remove 10 points from other criteria.
Let's reduce theoretical_contribution from 15 to 10, and empirical_rigor from 20 to 15.
(15 + 20) = 35. (10 + 15) = 25. We've removed 10 points. Our total is 100 again.
Your file is now valid and ready to run. The system will automatically add a new "Statistical Robustness" section to all reviews and a new statistical_robustness_score column to your CSV.
How to Remove a Criterion
1. Delete the entire block (from - id: ... to the last label: ...).
2. Adjust the remaining weights to add up to 100.
Recommendation Rules Based on Final Score
The system determines the recommendation using the following default thresholds, which can be changed in my_review_run/input/criteria.yaml:
- If the final_score is 85 or higher → Accept
- If the final_score is between 70 and 84 → Accept with Revisions
- If the final_score is between 50 and 69 → Revise and Resubmit
- If the final_score is below 50 → Reject
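In code terms, the default mapping is a simple threshold lookup. This sketch just restates the rules above; the thresholds themselves live in criteria.yaml and can be edited there.

```python
# Default score-to-recommendation mapping, as documented above.
# The thresholds are configurable in my_review_run/input/criteria.yaml.

def recommend(final_score: float) -> str:
    if final_score >= 85:
        return "Accept"
    if final_score >= 70:
        return "Accept with Revisions"
    if final_score >= 50:
        return "Revise and Resubmit"
    return "Reject"

print(recommend(65.2))   # -> "Revise and Resubmit"
```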
Specialization Domain
The domain section in criteria.yaml is what "specializes" the agents. You can reference it in my_review_run/input/prompts/extractor_system.txt, for example:
"You are an expert reviewer in the field of {domain}."
"Analyze the following paper based on criteria for {domain}."
By changing this one line in criteria.yaml, you can pivot your entire review system to a new field (e.g., "machine_learning" or "clinical_psychology") and the extractor agents will adjust their analysis accordingly.
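As a rough illustration of how that substitution could work (the project's actual prompt-loading code may differ, and the domain value shown is hypothetical):

```python
# Rough illustration of {domain} substitution; the real prompt-loading code
# in the project may differ. The domain value is whatever your criteria.yaml defines.
import yaml

with open("my_review_run/input/criteria.yaml") as f:
    criteria = yaml.safe_load(f)

domain = criteria["domain"]   # e.g., "machine_learning"

with open("my_review_run/input/prompts/extractor_system.txt") as f:
    template = f.read()

system_prompt = template.replace("{domain}", str(domain))
print(system_prompt[:120])
```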
5b. Controlling Agent Behavior with Prompts
The "brain" of your AI agents is controlled by a set of simple text files located in your run directory. By editing these "prompts," you can change the agents' personalities, their analytical focus, and the style of their writing.
The "System" vs. "User" Prompt: A Key Concept
You'll notice that both the Extractor and Synthesizer agents have two prompt files. This is the most important concept to understand:
1. The _system.txt Prompt (The "Role")
This file defines the agent's personality, role, and general instructions. It's the "job description" you give the AI.
- Example: "You are a senior academic editor..." or "You are an expert academic reviewer...".
- Analogy: You are telling an actor who they are (e.g., "You are a skeptical detective").
2. The _user.txt Prompt (The "Task")
This file provides the specific task and data for the agent to work on. It's the "assignment" you give the AI for this specific job.
- Example: "Here is the paper. Fill out the following JSON based on the criteria...".
- Analogy: You are telling the actor what to do (e.g., "Analyze this evidence and report your findings").
File-by-File Breakdown
| File Name | Agent | What It Controls |
|---|---|---|
| extractor_system.txt | Extractor | The Extractor's role and general instructions (its "job description") |
| extractor_user.txt | Extractor | The Extractor's specific task, the criteria, and the JSON structure it must return |
| synthesizer_system.txt | Synthesizer | The Synthesizer's role, tone, and writing style |
| synthesizer_user.txt | Synthesizer | The Synthesizer's specific task and the JSON structure of the final review |
Practical Examples: What Can I Change?
- To make the synthesizer's tone more critical: open synthesizer_system.txt and add a line to its role, e.g., "Be especially critical of methodological flaws and overstatements of findings. Do not be overly complimentary."
- To add a new instruction to the extractor: open extractor_system.txt and add a new rule, e.g., "You must always provide at least one direct quote from the paper to support your 'strengths' and 'weaknesses' rationale."
- To change the final review's summary: open synthesizer_user.txt and modify the description for executive_summary. For example, you could change it from "3-4 sentence summary..." to "a 3-bullet-point summary of the most critical findings."
β οΈ Important Warning: Be Careful with JSON Structure
The Python scripts (agent_extractor.py and agent_synthesizer.py) expect the AI to output a valid JSON object that matches the exact schema defined in the _user.txt files.
You can safely edit the parts of the prompts that control style, tone, or perspective (like the _system.txt files or the prose descriptions in the _user.txt files).
However, if you change the actual JSON keys (e.g., renaming "executive_summary" to "summary") in the prompt *without* also updating the Pydantic models in core/data_models.py, the program will crash when it tries to parse the AI's response.
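To illustrate the coupling (the class and field names here are hypothetical, not the real contents of core/data_models.py): if the prompt asks for "summary" but the model still declares executive_summary, validation fails.

```python
# Hypothetical illustration of the prompt-key / Pydantic-field coupling.
# The class and field names are assumptions, NOT the real core/data_models.py.
from pydantic import BaseModel, ValidationError

class SynthesizedReview(BaseModel):
    executive_summary: str    # must match the JSON key requested in synthesizer_user.txt
    overall_score: float
    recommendation: str

try:
    # Prompt was edited to return "summary" instead of "executive_summary"
    SynthesizedReview(summary="...", overall_score=72.0,
                      recommendation="Accept with Revisions")
except ValidationError as err:
    print(err)   # reports that "executive_summary" is a required field
```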
6. Understanding Your Results
After a run, you will find your results in the my_review_run/outputs/ folder.
1. The Detailed Reviews: my_review_run/outputs/reviews/
This folder contains the full, human-readable Markdown (.md) file for every paper. The filename tells you exactly how it was generated:
[PaperName]_[ExtractorModel]_[SynthesizerModel]_[Timestamp].md
2. The Master Spreadsheet: my_review_run/outputs/reports/
This folder contains your report_consolidated_...csv file. This is the "big picture" view of your entire run. It's perfect for opening in Excel or Google Sheets to sort and filter.
We've designed this report for usability:
- paper_filename: The original filename you provided, so you can find it easily.
- title: The paper's title. (If the PDF has a bad title, this defaults to the filename so it's never blank or confusing.)
- extractor_model_used: Which "Specialist" AI was used.
- synthesizer_model_used: Which "Editor" AI was used.
- overall_score: The final weighted score from 0-100.
- recommendation: The AI's final call (e.g., "Accept", "Reject").
- ..._score: A separate column for each of your criteria scores.
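If you prefer a quick look from the command line rather than Excel, a few lines of pandas will do. Adjust the glob pattern to match your run directory and report filenames.

```python
# Quick triage sketch: rank papers by overall_score in the latest consolidated
# report. Adjust the path/glob to match your actual filenames.
import glob
import pandas as pd

latest = sorted(glob.glob("my_review_run/outputs/reports/report_consolidated_*.csv"))[-1]
df = pd.read_csv(latest)

ranked = df.sort_values("overall_score", ascending=False)
print(ranked[["paper_filename", "overall_score", "recommendation"]].head(10))
```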
7. Advanced: Comparing Models with an AI Judge
Want to know if gpt-5 is a better reviewer than gemini-2.5-pro? This system is built for that. You can not only compare their results but also have a **third AI model act as an "AI Judge"** to review any disagreements.
Tip: Because parsed papers are cached (in my_review_run/ingestion_cache.json), your second, third, and fourth runs will be *much faster* than the first!
The Complete 3-Step Workflow
Comparing models and resolving disagreements is a manual 3-step process that gives you full control over each stage:
Step 1: Run Multiple Reviews
First, run the review system with different LLM configurations to generate multiple reports. For example:
Run 1: GPT-4o Configuration
python run_with_custom_params.py \
--run-dir my_review_run \
--provider-extraction openai \
--extractor-model gpt-4o-mini \
--provider-synthesis openai \
--synthesizer-model gpt-4o-mini
Run 2: DeepSeek Configuration
python run_with_custom_params.py \
--run-dir my_review_run \
--provider-extraction deepseek \
--extractor-model deepseek-reasoner \
--provider-synthesis deepseek \
--synthesizer-model deepseek-chat
After these runs, you'll have multiple report_consolidated_...csv files in your my_review_run/outputs/reports/ directory.
Step 2: Compare Reports to Find Conflicts
Now, run the comparison script to identify papers where the different models disagreed:
python compare_reports.py --run-dir my_review_run
This script will:
- Read all the consolidated report files
- Identify papers with conflicting recommendations
- Create a new HUMAN_REVIEW_discrepancies_...csv file listing all conflicts
Step 3: Run the AI Judge to Resolve Conflicts
Finally, run the judge script to have an AI review the conflicts and make a final decision:
python judge_conflicts.py --run-dir my_review_run
The judge will:
- Read each paper flagged in the most recent discrepancies file from Step 2
- Read both conflicting reviews
- Make a final, authoritative decision
- Create a JUDGE_VERDICTS_report_...csv file with the final verdicts
Understanding Your Final Results
The JUDGE_VERDICTS_report_...csv file is your final review list. It contains all the papers the models disagreed on, with these key columns:
- paper_filename: The paper in question.
- model_a_recommendation: e.g., "Accept"
- model_b_recommendation: e.g., "Reject"
- judge_recommendation: The Judge's final decision (e.g., "Accept").
- winning_review: Which review the Judge thought was more accurate (e.g., "A" or "B").
- judge_rationale: A written explanation from the Judge on *why* it made that decision.
This saves you from manually reading all the conflicts. You can now simply read the Judge's rationale and finalize the decision, turning hours of re-review into minutes.
The Judge Role: Deep Dive
The Judge component is a sophisticated arbitration system that resolves disagreements between multiple AI models when they produce conflicting reviews of the same paper. Understanding how it works will help you trust its verdicts and use it effectively.
Core Responsibilities
The Judge serves three primary functions in the review system:
- Conflict Detection: Identifies papers where different AI models disagree on recommendations (e.g., one says "Accept", another says "Reject")
- Champion Selection: Intelligently selects the best representative review from each conflicting faction
- Final Adjudication: Makes an authoritative decision based on careful analysis of the original paper and conflicting reviews
The "Champion vs Champion" Strategy
Instead of adjudicating between all possible pairs of reviews, the Judge uses an efficient "champion selection" approach:
Step 1: Faction Grouping
When multiple models review a paper, the Judge groups reviews by their recommendation type:
- Accept Faction: All reviews recommending "Accept" or "Accept with Revisions"
- Reject Faction: All reviews recommending "Reject" or "Revise and Resubmit"
Step 2: Champion Selection
From each faction, the Judge selects the highest-scoring review as that faction's "champion":
- The champion is the review with the highest weighted score across all criteria
- This ensures only the best arguments from each side are presented to the Judge
- Dramatically reduces computational cost compared to pairwise comparison
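A simplified sketch of that selection logic (the real judge script may differ in detail; the review dictionaries below are hypothetical):

```python
# Simplified sketch of "champion vs champion" selection: group reviews by
# recommendation faction, then keep the highest-scoring review in each faction.
# The actual judge script may differ in detail.

ACCEPT_RECS = {"Accept", "Accept with Revisions"}

def select_champions(reviews):
    """reviews: list of dicts with 'model', 'recommendation', 'overall_score'."""
    factions = {"accept": [], "reject": []}
    for review in reviews:
        side = "accept" if review["recommendation"] in ACCEPT_RECS else "reject"
        factions[side].append(review)
    # Champion = highest overall score within each non-empty faction
    return {side: max(group, key=lambda r: r["overall_score"])
            for side, group in factions.items() if group}

champs = select_champions([
    {"model": "gpt-4o-mini", "recommendation": "Accept", "overall_score": 78.0},
    {"model": "deepseek-chat", "recommendation": "Reject", "overall_score": 41.5},
])
print(champs["accept"]["model"], "vs", champs["reject"]["model"])
```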
Step 3: LLM-as-Judge Adjudication
An independent AI model (default: GPT-4o) acts as the Judge and receives:
- The Original Paper: Full text content (truncated to 400,000 characters if needed)
- Both Champion Reviews: Complete reviews with scores, justifications, and model identities
- Impartiality Instructions: System prompt emphasizing objectivity and expertise
The Judge evaluates both reviews and produces a structured verdict:
{
"judge_recommendation": "Accept w/ Revisions",
"judge_rationale": "Review A more accurately captures...",
"winning_review": "A",
"winning_review_rationale": "Review A demonstrates deeper analysis..."
}
Scoring System & Weighted Rules
The Judge does not assign scores; that's done by the Extractor and Editor agents. Instead, the Judge uses the existing weighted scoring system to evaluate champion quality:
Eight Weighted Criteria
Each review is evaluated on eight dimensions with specific weights:
| Criterion | Weight |
|---|---|
| Empirical Rigor | 18 |
| Identification Strategy | 15 |
| Statistical Robustness | 15 |
| Theoretical Contribution | 12 |
| Policy Relevance | 12 |
| Originality/Novelty | 10 |
| Data Quality | 10 |
| Presentation/Clarity | 8 |
Recommendation Thresholds
Final scores (0-100 scale) map to recommendations:
- 85+: Accept
- 70-84: Accept with Revisions
- 50-69: Revise and Resubmit
- <50: Reject
Configurable in criteria.yaml
Configuring the Judge
You can customize which AI model acts as Judge by setting environment variables in your run directory's .env file:
# Default configuration (can be omitted)
JUDGE_PROVIDER=openai
JUDGE_MODEL=gpt-4o
# Alternative: Use Gemini as Judge
JUDGE_PROVIDER=gemini
JUDGE_MODEL=gemini-2.5-pro
# Alternative: Use Claude as Judge
JUDGE_PROVIDER=anthropic
JUDGE_MODEL=claude-sonnet-4-5
Use a highly capable model for the Judge role. The Judge's decisions are final and should represent the most careful, impartial analysis possible. GPT-4o, Claude Sonnet 4.5, or Gemini 2.5 Pro are excellent choices.
Progress Tracking & Cost Management
The Judge system includes sophisticated progress tracking to avoid redundant adjudication:
- Configuration Hashing: Detects changes in Judge configuration and re-adjudicates if settings change
- Progress Files: Stores adjudication results in judge_progress.json
- Resume Capability: Skips already-adjudicated papers when the configuration is unchanged
- Cost Tracking: Monitors API usage for budget management
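The exact hashing scheme isn't documented here, but the idea is simple: fingerprint the Judge settings and re-adjudicate only when the fingerprint changes. A minimal sketch, with an assumed set of settings:

```python
# Minimal sketch of configuration hashing for resume support. The fields
# included in the real hash (and the judge_progress.json layout) may differ.
import hashlib
import json

def judge_config_hash(provider: str, model: str) -> str:
    payload = json.dumps({"provider": provider, "model": model}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# If this value matches the hash stored with a paper's verdict in
# judge_progress.json, the paper is skipped; otherwise it is re-adjudicated.
print(judge_config_hash("openai", "gpt-4o"))
```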
Tip: Check the HUMAN_REVIEW_discrepancies_...csv file before running the Judge to see how many conflicts exist. Consider adjudicating only the most important conflicts if budget is a concern.
Understanding Judge Verdicts
The final JUDGE_VERDICTS_...csv file contains these key columns:
- paper_filename: The paper being adjudicated
- model_a_recommendation: The first model's recommendation
- model_b_recommendation: The second model's recommendation
- judge_recommendation: The Judge's final decision
- winning_review: Which review the Judge preferred ("A", "B", or "Neither")
- judge_rationale: A detailed explanation of the decision
The Judge's rationale is your most valuable asset: it provides a neutral, expert assessment of which review was more accurate and why. This can help you:
- Understand why models disagreed
- Identify systematic biases in specific models
- Make final acceptance/rejection decisions with confidence
- Improve your review criteria over time
Key Strengths of the Judge System
- Efficiency: Champion selection reduces adjudication complexity from O(n²) to O(n)
- Transparency: Detailed rationales explain every decision
- Robustness: Progress tracking prevents redundant work and enables resumption
- Configurability: Weights, thresholds, and Judge model are all customizable
- Cost-Effectiveness: Tracks API usage and avoids unnecessary adjudications
The Judge component transforms the multi-agent review system from a simple consensus tool into a sophisticated peer review mechanism, ensuring that final decisions are based on the best available analysis, not just the first available agreement.
8. Processing Large Batches with Run Directories
For large-scale processing, the system distributes papers across multiple run directories, enabling controlled resumption and improved load management to minimize the risk of exceeding LLM rate limits. This configuration facilitates experimentation with different settings, including review criteria, and supports parallel execution when rate limits are not a concern.
Setting Up Multiple Run Directories
If you have a master directory with hundreds of papers, you can distribute them across multiple run directories:
python setup_batch_runs.py \
--master-papers-dir /path/to/master/papers \
--base-run-dir run_dir \
--num-runs 10 \
--papers-per-run 50 \
--create-batch-script
This will create 10 run directories (run_dir1, run_dir2, etc.) with 50 papers each, plus a batch script to run them all.
Running Multiple Directories
After setting up multiple directories, you can run them all at once:
# Sequential execution
python run_batch.py
# Parallel execution (up to 4 workers)
python run_batch.py --parallel
# Parallel execution with custom number of workers
python run_batch.py --parallel --max-workers 8
Resuming After Interruptions
The system automatically tracks progress and can resume from where it left off if interrupted. Each run directory maintains its own progress tracking, so you can:
- Run all directories in parallel
- Stop and resume individual runs as needed
- Track progress separately for each configuration
9. Troubleshooting & Common Questions
Q: I'm getting a `FAILED criterion` or `JSON error`!
A: This almost always means the AI's response was too long and got cut off. Your AI is being *too* detailed!
The Fix: Open core/llm_wrapper.py and find the max_completion_tokens or max_tokens settings. Increase them from 4096 to 8192 and try again.
Q: How do I change the "Judge" LLM?
A: The Judge is set by two environment variables in your my_review_run/input/.env file: JUDGE_PROVIDER and JUDGE_MODEL. If they aren't there, the script defaults to openai and gpt-4o. You can add these lines to change it to a powerful model, like Gemini Pro:
JUDGE_PROVIDER=gemini
JUDGE_MODEL=gemini-2.5-pro
Q: I'm getting a `BadRequestError: Unsupported parameter: 'max_tokens'`!
A: This is a known issue with some custom endpoints. The system is designed to "surgically bypass" this problem for the custom_openai/gpt-5 combination. If you see this error with a *different* model, it means a developer needs to add a new bypass rule to core/llm_wrapper.py.
Q: How much will this cost?
A: Be very careful with large batches. The cost is:
(Number of Papers) x (Number of Criteria) x (Cost of Extractor Model)
+
(Number of Papers) x (Cost of Synthesizer Model)
+
(Number of Conflicts) x (Cost of Judge Model)
A 500-paper batch with 8 criteria is **4,000** calls to the Extractor, plus 500 calls to the Synthesizer. Always test with a small batch of 5-10 papers first!
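For a quick sanity check before a big run, you can count the API calls yourself (per-call prices vary by model and over time, so plug in your own estimates; the conflict count below is an arbitrary illustration):

```python
# Count API calls for a planned batch. This only counts calls; multiply by
# your own per-call cost estimates for each model.
papers, criteria, conflicts = 500, 8, 40   # 40 conflicts is an arbitrary example

extractor_calls   = papers * criteria   # 500 x 8 = 4,000 Specialist calls
synthesizer_calls = papers              # 500 Editor calls
judge_calls       = conflicts           # one adjudication per conflicting paper

print(f"Extractor: {extractor_calls}, Synthesizer: {synthesizer_calls}, Judge: {judge_calls}")
```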
Q: How do I clear the cache and re-parse my papers?
A: Simply delete the my_review_run/ingestion_cache.json file. The system will fully re-parse all your papers from scratch on the next run.
Q: How do I clear the paper review cache?
A: Simply delete the my_review_run/progress.json file. The system will review all the papers from scratch on the next run.
Q: How do I clear the judge review cache?
A: Simply delete the my_review_run/judge_progress.json file. The system will adjudicate all the papers from scratch on the next run.
Q: Can I mix and match different LLM providers?
A: Yes! The system fully supports using different LLM providers for extraction and synthesis. For example:
python run_with_custom_params.py \
--run-dir my_review_run \
--provider-extraction deepseek \
--extractor-model deepseek-reasoner \
--provider-synthesis openai \
--synthesizer-model gpt-4o-mini
This will use DeepSeek's reasoning model for extraction and OpenAI's GPT-4o-mini for synthesis.
Q: What if I have fewer papers than run directories?
A: The system will distribute papers as evenly as possible. For example, if you have 10 papers and 3 run directories, it will create directories with 4, 3, and 3 papers respectively.
Q: Do I need to run compare_reports.py before judge_conflicts.py?
A: Yes! The 3-step workflow is designed to be manual:
- Run reviews with different configurations
- Run compare_reports.py to find conflicts
- Run judge_conflicts.py to resolve conflicts