Active Projects Extractor

Two-Pass LLM-Powered Web Scraper for Extracting Active Research Projects from Institute Listings

Overview

The extract.py script extracts active/ongoing research projects from institute project listing pages using a two-pass approach with automatic pagination detection and LLM-based extraction.

Workflow

1. First Pass (Candidate Extraction with Pagination): Fetches each institute's listing page and uses the LLM to extract candidate projects. Automatically detects pagination links and follows them (up to the max_pages limit) to ensure comprehensive coverage. Extracts titles, URLs, brief descriptions, and status indicators.
2. Pagination Detection & Intelligent Probing: Analyzes the HTML for common pagination patterns, including:
  • Link text: "next", "older", "more", arrows (» ‹ →), page numbers, "page X" patterns
  • CSS classes: pagination, pager, pager__item, page-item, page-numbers, next
  • ARIA labels: aria-label="pagination", aria-label="next", aria-label="go to page X"
  • Standard attributes: rel="next"
  Smart Filtering: Only follows pagination links that preserve the base URL path (e.g., for /projects, keeps /projects?page=1 but not / or /other). Skips ?page=0 links since they duplicate the base page content. (See the link-filtering sketch after this list.)
  Intelligent Probing: After detecting pagination links (e.g., ?page=0 through ?page=4), automatically probes additional pages incrementally (?page=5, ?page=6, ...) until no new candidates are found or the max_pages limit (default: 8) is reached. This ensures complete coverage without wasting requests on non-existent pages.
3. Filtering: Excludes candidates that match exclusion keywords (job postings, navigation elements) or are marked as completed.
4. Date-Based Expiration Check: Analyzes project duration/status text for end dates. Projects with past end dates (e.g., "Expected End Date: 31 May 2023") are automatically excluded.
5. Second Pass (Full Details): For each remaining candidate, fetches the individual project page and extracts comprehensive details, including description, objectives, duration, funding, partners, and contact info.
6. Status Verification: Confirms the project is active/ongoing. Checks both status keywords and the duration field for expired dates. If no status is mentioned, defaults to "active".
7. Output: Saves all verified active projects to a JSON file with complete metadata.
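
The smart-filtering rule in step 2 can be sketched in a few lines of Python. This is an illustrative helper, not the actual extract.py code; the function name and signature are assumptions:

from urllib.parse import urljoin, urlparse, parse_qs

def filter_pagination_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Keep only pagination links that stay on the listing's own path and domain."""
    base = urlparse(base_url)
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.netloc != base.netloc:                        # same-domain only
            continue
        if parsed.path.rstrip("/") != base.path.rstrip("/"):    # must preserve the base path
            continue
        if parse_qs(parsed.query).get("page", [""])[0] == "0":  # ?page=0 duplicates the base page
            continue
        if absolute != base_url and absolute not in kept:
            kept.append(absolute)
    return kept

# Only the ?page=1 link survives: wrong path, wrong domain, and ?page=0 are dropped
links = ["?page=0", "?page=1", "/cpr/about", "https://elsewhere.org/cpr/projects?page=2"]
print(filter_pagination_links("https://unu.edu/cpr/projects", links))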

Pagination Support

  • Automatic Detection: The extractor automatically detects and follows pagination links on listing pages.
  • Intelligent Probing: After detecting pagination links (e.g., ?page=0-4), automatically probes additional pages incrementally (?page=5, ?page=6, ...) until no new candidates are found or the max_pages limit (default: 8) is reached.
  • Smart Filtering: Only follows links that preserve the base URL path (e.g., /projects?page=1), skips ?page=0 duplicates, and excludes navigation links.
  • Deduplication: Tracks project URLs across pages to avoid duplicates.
  • Same-Domain Only: Only follows pagination links within the same domain as the original listing page.

Date-Based Expiration Checking

  • Automatic Detection: The extractor automatically detects and parses end dates from project descriptions and status indicators.
  • Supported Formats: "Expected End Date: 31 May 2023", "End Date: 31 May 2023", "31 May 2023", "May 31, 2023", "2023-05-31" (ISO), "31/05/2023", "31-05-2023".
  • Status Precedence: Explicit status keywords take priority over date checks: "Active" or "Ongoing" → project is kept (dates ignored); "Completed" → project is excluded (dates ignored). Dates are only checked when the status is unknown or ambiguous.
  • Two-Stage Check: First checks listing page data (status indicator + brief description), then verifies against the full details (duration field).

Configuration (.env_extraction)

.env_extraction
# Input/Output Files
INPUT_FILE=input/institutes.json
OUTPUT_FILE=output/active_projects.json

# Extraction Limits
MAX_PROJECTS=50

# Maximum pagination pages per institute. 1 = initial page only; values >1 enable pagination and intelligent probing when max_pages exceeds the highest detected page.
MAX_PAGES=8

# LLM Configuration for Crawl4AI
EXTRACTION_PROVIDER=anthropic/claude-sonnet-4-20250514
EXTRACTIONLLM_API_KEY=your_api_key_here

# Skip already processed institutes (set to "true" to enable)
SKIP_EXISTING=false

# Rate limiting settings
LLM_MAX_RETRIES=3              # Max retry attempts for failed LLM calls
LLM_REQUESTS_PER_SECOND=1.0    # Rate limit (1.0 = 1 request per second)

| Variable | Description | Default |
| --- | --- | --- |
| INPUT_FILE | Path to institutes JSON file | input/institutes.json |
| OUTPUT_FILE | Path to output JSON file | active_projects.json |
| MAX_PROJECTS | Maximum projects to extract per institute | 50 |
| MAX_PAGES | Maximum pagination pages to follow per institute. Set to 1 for the initial page only, or higher to enable pagination following and intelligent probing. Probing activates when max_pages exceeds the highest detected page number. | 8 |
| EXTRACTION_PROVIDER | LLM provider for extraction | anthropic/claude-sonnet-4-20250514 |
| EXTRACTIONLLM_API_KEY | API key for LLM provider | Required |
| LLM_MAX_RETRIES | Maximum number of retry attempts for failed LLM calls | 3 |
| LLM_REQUESTS_PER_SECOND | Rate limit for LLM API calls (requests per second). Use 0 to disable rate limiting. | 1.0 |
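
For reference, a minimal sketch of reading these variables with python-dotenv. The defaults mirror the table above; this is illustrative and not necessarily how extract.py loads its configuration:

import os
from dotenv import load_dotenv

# Load variables from the extraction-specific env file
load_dotenv(".env_extraction")

input_file = os.getenv("INPUT_FILE", "input/institutes.json")
output_file = os.getenv("OUTPUT_FILE", "output/active_projects.json")
max_projects = int(os.getenv("MAX_PROJECTS", "50"))
max_pages = int(os.getenv("MAX_PAGES", "8"))
provider = os.getenv("EXTRACTION_PROVIDER", "anthropic/claude-sonnet-4-20250514")
api_key = os.getenv("EXTRACTIONLLM_API_KEY")                      # required
skip_existing = os.getenv("SKIP_EXISTING", "false").lower() == "true"
max_retries = int(os.getenv("LLM_MAX_RETRIES", "3"))
requests_per_second = float(os.getenv("LLM_REQUESTS_PER_SECOND", "1.0"))

if not api_key:
    raise SystemExit("EXTRACTIONLLM_API_KEY must be set in .env_extraction")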

LLM Rate Limiting & Retry Logic

The extractor includes built-in resilience against LLM rate limiting and transient failures. All LLM calls are automatically wrapped with retry logic and rate limiting to ensure reliable extraction.

Features

  • Automatic Retries: Failed LLM calls are automatically retried with exponential backoff
  • Rate Limit Detection: Detects common rate limit error patterns (429, "rate limit", "quota exceeded", etc.)
  • Configurable Rate Limiting: Control the maximum requests per second to stay within API quotas
  • Exponential Backoff: Retry delay doubles after each attempt (1s → 2s → 4s → ... → max 60s)

Retry Behavior

| Setting | Default | Description |
| --- | --- | --- |
| LLM_MAX_RETRIES | 3 | Maximum number of retry attempts after the initial failure. Total attempts = retries + 1. |
| LLM_REQUESTS_PER_SECOND | 1.0 | Target rate of LLM requests. The extractor waits between requests to maintain this rate. |
| Initial Delay | 1.0s | Starting delay before the first retry |
| Backoff Factor | 2.0 | Delay multiplier after each retry (1s → 2s → 4s → 8s) |
| Max Delay | 60s | Maximum wait time between retries |

Retry Sequence Example (with default settings)

Attempt 1: Immediate request → Fails with rate limit
Retry 1: Wait 1s → Request → Still rate limited
Retry 2: Wait 2s → Request → Still rate limited
Retry 3: Wait 4s → Request → Success! (or fails after max retries)

Total time if successful: ~7 seconds (1s + 2s + 4s waits)
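
A simplified sketch of this retry-plus-rate-limit schedule (initial delay 1 s, backoff factor 2, 60 s cap, pacing between requests). The class name is illustrative; the extractor's internal implementation may differ:

import logging
import time

logger = logging.getLogger(__name__)

class RetryingLimiter:
    """Illustrative wrapper reproducing the documented retry and rate-limit behavior."""

    def __init__(self, max_retries=3, requests_per_second=1.0,
                 initial_delay=1.0, backoff_factor=2.0, max_delay=60.0):
        self.max_retries = max_retries
        self.requests_per_second = requests_per_second
        self.initial_delay = initial_delay
        self.backoff_factor = backoff_factor
        self.max_delay = max_delay
        self._last_call = 0.0

    def _throttle(self):
        if self.requests_per_second <= 0:                # 0 disables rate limiting
            return
        min_interval = 1.0 / self.requests_per_second
        wait = min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)

    def call(self, fn, *args, **kwargs):
        delay = self.initial_delay
        for attempt in range(self.max_retries + 1):      # total attempts = retries + 1
            self._throttle()
            try:
                self._last_call = time.monotonic()
                return fn(*args, **kwargs)
            except Exception as exc:
                if attempt == self.max_retries:
                    raise                                # give up after max retries
                logger.warning("LLM call failed (attempt %d/%d). Retrying in %.1fs... Error: %s",
                               attempt + 1, self.max_retries + 1, delay, exc)
                time.sleep(delay)
                delay = min(delay * self.backoff_factor, self.max_delay)  # 1s -> 2s -> 4s -> ... -> 60s

# Usage: wrap any LLM call
limiter = RetryingLimiter(max_retries=3, requests_per_second=1.0)
# result = limiter.call(some_llm_call, prompt="...")     # some_llm_call is hypothetical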

Rate Limit Detection

The extractor automatically detects rate limit errors by checking for these patterns in error messages:

  • HTTP status code: 429
  • Error messages: "rate limit", "too many requests", "quota exceeded", "rate_limit_exceeded"
  • Service messages: "try again later", "service unavailable", "temporarily unavailable"
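
A sketch of how such pattern matching might look; the pattern tuple mirrors the bullets above, and the function name is an assumption:

RATE_LIMIT_PATTERNS = (
    "429", "rate limit", "too many requests", "quota exceeded",
    "rate_limit_exceeded", "try again later", "service unavailable",
    "temporarily unavailable",
)

def is_rate_limit_error(error: Exception) -> bool:
    """Return True if the error message matches a known rate-limit pattern."""
    message = str(error).lower()
    return any(pattern in message for pattern in RATE_LIMIT_PATTERNS)

print(is_rate_limit_error(RuntimeError("429 Client Error: Rate Limit Exceeded")))  # True
print(is_rate_limit_error(ValueError("invalid JSON in response")))                 # False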

Configuration Recommendations

| Use Case | LLM_REQUESTS_PER_SECOND | LLM_MAX_RETRIES |
| --- | --- | --- |
| Free tier APIs (limited quota, strict limits) | 0.5 - 1.0 | 3 - 5 |
| Paid tier, standard (reasonable limits) | 2.0 - 5.0 | 2 - 3 |
| Enterprise tier (high quotas) | 5.0 - 10.0 | 2 |
| Local LLM / no limits (self-hosted, unlimited) | 0 (disabled) | 1 - 2 |

Example Configurations

.env_extraction (Conservative - Free Tier)

# Conservative settings for free tier APIs
LLM_MAX_RETRIES=5              # More retries for stricter limits
LLM_REQUESTS_PER_SECOND=0.5    # 1 request every 2 seconds

.env_extraction (Aggressive - Paid Tier)

# Aggressive settings for paid tier APIs
LLM_MAX_RETRIES=2              # Fewer retries (fail fast)
LLM_REQUESTS_PER_SECOND=5.0    # 5 requests per second

.env_extraction (Disabled - Local/Unlimited)

# No rate limiting for local/unlimited LLMs
LLM_MAX_RETRIES=2              # Keep some retries for transient failures
LLM_REQUESTS_PER_SECOND=0      # 0 disables rate limiting

Logging

When retries occur, the extractor logs detailed information:

WARNING - Rate limit detected (attempt 1/4). Retrying in 1.0s... Error: 429 Client Error: Rate Limit Exceeded
WARNING - LLM call failed (attempt 2/4). Retrying in 2.0s... Error: ConnectionError: Connection timed out
ERROR - LLM call failed after 4 attempts. Error: APIError: Server error (500)

Important Notes

  • Applies to all LLM calls: Both listing page extraction and project detail extraction use rate limiting
  • Per-crawler rate limiting: Rate limiting is tracked per extractor instance, not globally
  • No persistence: Retry state is not saved across runs - if extraction is interrupted, start fresh
  • Works with all providers: Rate limiting is provider-agnostic and works with any LLM configured via EXTRACTION_PROVIDER

Input Format: institutes.json

The input file contains a list of institutes to process. Two formats are supported:

Array Format

[
  {
    "text": "UNU-CPR",
    "url": "https://unu.edu/cpr/projects",
    "id": 1
  },
  {
    "text": "UNU-BIOLAC",
    "url": "https://unu.edu/biolac/projects",
    "id": 2
  }
]

Object with References

{
  "references": [
    {
      "text": "UNU-CPR",
      "url": "https://unu.edu/cpr/projects",
      "id": 1
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| text | string | Institute name |
| url | string | URL to the institute's projects listing page |
| id | integer | Unique identifier for the institute |
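
A minimal loader that accepts both formats could look like this (the helper name is illustrative):

import json

def load_institutes(path: str) -> list[dict]:
    """Accept either a bare array of institutes or an object with a "references" key."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        data = data.get("references", [])
    return [
        inst for inst in data
        if inst.get("text") and inst.get("url") and inst.get("id") is not None
    ]

institutes = load_institutes("input/institutes.json")
for inst in institutes:
    print(inst["id"], inst["text"], inst["url"])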

Output Format: active_projects.json

An array of active project objects with the following schema:

[
  {
    "institute_name": "string",
    "institute_id": "integer",
    "project_url": "string",
    "title": "string",
    "status": "string",
    "description": "string",
    "objectives": "string | null",
    "duration": "string | null",
    "funding": "string | null",
    "partners": ["string"] | null,
    "contact": "string | null",
    "additional_info": { ... } | null
  }
]

Field Descriptions

| Field | Required? | Type | Description |
| --- | --- | --- | --- |
| institute_name | required | string | Name of the hosting institute |
| institute_id | required | integer | ID of the hosting institute |
| project_url | required | string | URL to the project's detail page |
| title | required | string | Full project title |
| status | required | string | Either "active" or "ongoing" |
| description | required | string | Complete project description |
| objectives | optional | string | Project objectives or goals |
| duration | optional | string | Project timeline, start/end dates |
| funding | optional | string | Funding sources, amounts, grants |
| partners | optional | array | List of partner organizations |
| contact | optional | string | Contact information (email, phone) |
| additional_info | optional | object | Any extra extracted data |
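
The schema maps naturally onto a Pydantic model. This is a sketch assuming Pydantic v2; the model name is illustrative and not necessarily what extract.py defines:

from typing import Any, Optional
from pydantic import BaseModel

class ActiveProject(BaseModel):
    """Illustrative model mirroring the output schema above."""
    institute_name: str
    institute_id: int
    project_url: str
    title: str
    status: str                                # "active" or "ongoing"
    description: str
    objectives: Optional[str] = None
    duration: Optional[str] = None
    funding: Optional[str] = None
    partners: Optional[list[str]] = None
    contact: Optional[str] = None
    additional_info: Optional[dict[str, Any]] = None

project = ActiveProject(
    institute_name="UNU-CPR",
    institute_id=1,
    project_url="https://example.com/project",
    title="Example Project",
    status="active",
    description="Example description.",
)
print(project.model_dump())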

Common additional_info Keys

Based on analysis across 87 projects, 91% include additional_info. Here are the information types extracted:

| Category | Keys Found | Frequency |
| --- | --- | --- |
| People/Team | project_manager, project_team, team_members, PhD Fellow, Promotor, Co-promotor, project_team_leads, coordinators | 61 |
| Topics/Themes | keywords, keyword, theme, themes, thematic_pillars, thematic_areas, related_research_area | 52 |
| Locations | location, locations | 17 |
| Organizations | institute, institutes, affiliation, organization | 13 |
| Outputs | outputs, research_outputs, activities | 12 |
| Metadata | project_type, date_published, project_website, website, expected_start_date, expected_end_date, project_status | 8 |
| Events | upcoming_briefings, training_sessions, online_courses | 5 |
| SDGs | related_sdgs, contributions_to_sdg | 2 |
| Other | benefits, platform, training, coordination, communications, membership_info, registration_link, achievements | various |

Frequency indicates how many projects contain at least one key from that category.

Debug Output: exclusions_debug.jsonl

All excluded projects are logged to output/exclusions_debug.jsonl (JSONL format, one entry per line):

{
  "timestamp": "2025-12-25T10:30:00.123456",
  "institute_id": 1,
  "institute_name": "UNU-CPR",
  "institute_url": "https://unu.edu/cpr/projects",
  "project_title": "Example Project",
  "project_url": "https://example.com/project",
  "project_brief_description": "Brief description...",
  "project_status_indicator": "Completed",
  "exclusion_trigger": "Completed",
  "exclusion_reason": "completed_keyword"
}
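
Appending such entries is a one-liner per exclusion; a sketch of the JSONL logging (helper name assumed):

import json
from datetime import datetime
from pathlib import Path

def log_exclusion(entry: dict, debug_file: str = "output/exclusions_debug.jsonl") -> None:
    """Append one exclusion record as a single JSON line."""
    path = Path(debug_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": datetime.now().isoformat(), **entry}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

log_exclusion({
    "institute_id": 1,
    "institute_name": "UNU-CPR",
    "project_title": "Example Project",
    "exclusion_trigger": "Completed",
    "exclusion_reason": "completed_keyword",
})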

Exclusion Reasons

| Reason Code | Description |
| --- | --- |
| exclusion_keyword | Matched job posting/navigation keywords |
| completed_keyword | Status indicator on the listing page showed "completed" |
| expired_date | End date in listing page data has passed (e.g., "Expected End Date: 31 May 2023") |
| details_completed | Full page extraction confirmed "completed" status |
| duration_expired | End date in the duration field has passed |

Exclusion Keywords

The following keywords trigger automatic exclusion (found in title or brief description):

  • Job postings: vacancy, hiring, consultant, officer, internship, call for application, apply now, submit cv, job opening, career opportunity
  • Page elements: cookie, privacy policy, terms of use, subscribe, newsletter, follow us, rights reserved, skip to content, menu, contact us, about us

Completed Status Keywords

Projects matching these in status indicators are excluded: completed, finished, closed, past projects, archived
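
A sketch of how these keyword checks could be combined into a single exclusion test. The keyword tuples copy the lists above; the function name is illustrative:

EXCLUSION_KEYWORDS = (
    "vacancy", "hiring", "consultant", "officer", "internship",
    "call for application", "apply now", "submit cv", "job opening",
    "career opportunity", "cookie", "privacy policy", "terms of use",
    "subscribe", "newsletter", "follow us", "rights reserved",
    "skip to content", "menu", "contact us", "about us",
)
COMPLETED_KEYWORDS = ("completed", "finished", "closed", "past projects", "archived")

def exclusion_reason(title: str, brief: str, status: str) -> str | None:
    """Return a reason code if the candidate should be dropped, else None."""
    text = f"{title} {brief}".lower()
    if any(keyword in text for keyword in EXCLUSION_KEYWORDS):
        return "exclusion_keyword"
    if any(keyword in status.lower() for keyword in COMPLETED_KEYWORDS):
        return "completed_keyword"
    return None

print(exclusion_reason("Research Officer (Vacancy)", "", ""))           # exclusion_keyword
print(exclusion_reason("Water Governance", "", "Completed"))            # completed_keyword
print(exclusion_reason("Water Governance", "Ongoing work", "Active"))   # None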

Date-Based Expiration

IMPORTANT: Explicit status takes precedence over date checks. If status explicitly says "Active" or "Ongoing", the project is kept regardless of end date. If status says "Completed", the project is excluded. Dates are only checked when status is unknown/ambiguous.

Projects with ambiguous/unknown status are automatically excluded if an end date is detected and the date has passed. The extractor scans:

  • Listing page data: status_indicator and brief_description fields
  • Full details: duration field from the project page

Date patterns detected:

  • "Expected End Date: 31 May 2023"
  • "End Date: 31 May 2023"
  • "31 May 2023", "May 31, 2023"
  • "2023-05-31" (ISO format)
  • "31/05/2023", "31-05-2023"

CLI Arguments & Skip-Existing Processing

The extractor supports incremental processing to resume interrupted runs or add new institutes without reprocessing existing ones:

CLI Arguments Reference

| Argument | Short | Description | Default |
| --- | --- | --- | --- |
| --skip-existing | | Skip institutes that already have projects in the output file | false |
| --force | | Force reprocess all institutes (overrides --skip-existing and --reprocess) | false |
| --reprocess | | Comma-separated institute IDs, short names, or full names to reprocess (e.g., '1,2,3' or 'UNU-WIDER,MERIT') | null |
| --output | -o | Path to output JSON file | active_projects.json |
| --input | -i | Path to institutes.json | input/institutes.json |
| --max-projects | | Max projects per institute (overrides env) | 20 |
| --max-pages | | Max pagination pages to follow per institute. Set to 1 for the initial page only, or higher to enable pagination following and intelligent probing. Probing activates when max_pages exceeds the highest detected page number. | 8 |

Pagination Behavior

Intelligent Probing: The intelligent probing feature activates when max_pages > max_detected_page + 1. It works by:
1. Visiting the initial listing page
2. Detecting pagination links (e.g., ?page=0 through ?page=4)
3. Visiting detected links (skipping ?page=0 as it duplicates the base page)
4. Only if max_pages exceeds the highest detected page number: probing additional pages incrementally beyond detected ones

Probing Condition: if max_pages > max_detected_page + 1, then probe

Early Stopping: Probing stops automatically when no new candidates are found (page doesn't exist or is duplicate), avoiding wasted requests on non-existent pages.
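
The visiting plan can be illustrated with a small helper (names assumed; the real extractor also stops probing early when a page yields no new candidates):

def pages_to_visit(base_url: str, detected_pages: list[int], max_pages: int) -> list[str]:
    """Plan which pages to fetch for one listing URL (illustrative, not extract.py's code).

    detected_pages holds the page numbers found in pagination links, e.g. [0, 1, 2, 3, 4].
    """
    visits = [base_url]                                          # the initial listing page
    detected = sorted(p for p in set(detected_pages) if p > 0)   # skip ?page=0 (duplicate of base)
    for page in detected:
        if len(visits) >= max_pages:
            return visits
        visits.append(f"{base_url}?page={page}")
    max_detected = max(detected_pages, default=0)
    if max_pages > max_detected + 1:                             # probing condition
        page = max_detected + 1
        while len(visits) < max_pages:
            # The real extractor stops probing once a page yields no new candidates.
            visits.append(f"{base_url}?page={page}")
            page += 1
    return visits

# With ?page=0..4 detected and the default max_pages=8: initial page, ?page=1..4, probed ?page=5..7
print(pages_to_visit("https://unu.edu/cpr/projects", [0, 1, 2, 3, 4], max_pages=8))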

Detailed Examples (assuming detected ?page=0 through ?page=4)

| max_pages | Detected Links | Probing? | Pages Visited | Total |
| --- | --- | --- | --- | --- |
| 1 | — | No | Initial page only | 1 |
| 2 | ?page=1-4 (4 links after ?page=0 skip) | No (2 ≤ 5) | Initial + 1 detected link | 2 |
| 3 | ?page=1-4 (4 links) | No (3 ≤ 5) | Initial + 2 detected links | 3 |
| 5 | ?page=1-4 (4 links) | No (5 = 5) | Initial + 4 detected links | 5 |
| 6 | ?page=1-4 (4 links) | Yes (6 > 5) | Initial + 4 detected + 1 probed (?page=5) | 6 |
| 7 | ?page=1-4 (4 links) | Yes (7 > 5) | Initial + 4 detected + 2 probed (?page=5, ?page=6) | 7 |
| 8 (default) | ?page=1-4 (4 links) | Yes (8 > 5) | Initial + 4 detected + 3 probed (?page=5, ?page=6, ?page=7) | 8 |
| 10 | ?page=1-4 (4 links) | Yes (10 > 5) | Initial + 4 detected + 5 probed (?page=5 through ?page=9) | 10 |
| 15 | ?page=1-4 (4 links) | Yes (15 > 5) | Initial + 4 detected + 10 probed (?page=5 through ?page=14) | 15 |

Note: "Detected links" count excludes ?page=0 since it duplicates the base page content. The max_detected_page value is the highest page number found (e.g., 4 for ?page=0 through ?page=4).

Precedence Order

When multiple sources specify processing behavior, the following precedence applies (highest to lowest):

  1. --force CLI flag - Always process all institutes (overrides everything)
  2. --reprocess CLI flag - Reprocess only specified institutes (overrides --skip-existing)
  3. --skip-existing CLI flag - Skip institutes with existing projects (overrides env)
  4. Environment variable - Used only if no CLI flags provided
# Run with defaults (process all)
python extract.py

# Skip already processed institutes
python extract.py --skip-existing

# Force reprocess all (overrides --skip-existing and --reprocess)
python extract.py --force

# Reprocess specific institutes by ID
python extract.py --reprocess 1,2,3

# Reprocess specific institutes by short name (e.g., UNU-WIDER)
python extract.py --reprocess UNU-WIDER,MERIT

# Reprocess by partial name match (case-insensitive)
python extract.py --reprocess "Biotechnology","Comparative Regional"

# Custom output file
python extract.py -o custom/projects.json

# Override max projects per institute
python extract.py --max-projects 30

# Combining flags (--force takes precedence)
python extract.py --skip-existing --force # will force reprocess all

Use Cases for Processing Modes

  • Skip-Existing - Resuming Interrupted Runs: If extraction crashes mid-run, use --skip-existing to continue from where it left off. The extractor loads existing projects, tracks processed institute IDs, and only processes new institutes.
  • Skip-Existing - Adding New Institutes: When adding new institutes to input/institutes.json, only the new institutes are processed. Existing projects are preserved and new ones are appended.
  • Reprocess - Fixing Specific Institutes: If an institute's data needs updating (e.g., its website structure changed), use --reprocess with specific institute IDs or names. Only those institutes are re-crawled, and their old data is replaced.
  • Cost & Time Savings: Avoiding re-crawls of already processed institutes saves both API costs and time.

How Skip-Existing Works

When --skip-existing is enabled:

  1. Checks if output file exists (e.g., active_projects.json)
  2. Loads existing projects and extracts unique institute_id values
  3. Filters input institutes to exclude those with existing projects
  4. Processes only remaining (new) institutes
  5. Accumulates existing + new projects and saves to output file
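
A sketch of that filtering step (the helper name and return shape are assumptions):

import json
from pathlib import Path

def filter_unprocessed(institutes: list[dict], output_file: str) -> tuple[list[dict], list[dict]]:
    """Return (institutes still to process, projects already extracted)."""
    path = Path(output_file)
    if not path.exists():
        return institutes, []
    existing_projects = json.loads(path.read_text(encoding="utf-8"))
    processed_ids = {p["institute_id"] for p in existing_projects}
    remaining = [inst for inst in institutes if inst["id"] not in processed_ids]
    return remaining, existing_projects

# Usage: process `remaining`, then save existing + newly extracted projects together
remaining, existing_projects = filter_unprocessed(
    [{"text": "UNU-CPR", "url": "https://unu.edu/cpr/projects", "id": 1}],
    "output/active_projects.json",
)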

How Reprocess Works

When --reprocess is specified with institute IDs, short names, or full names:

  1. Loads all existing projects from the output file
  2. Matches institutes against search terms (supports: ID, exact short name like "UNU-WIDER", partial full name match)
  3. Removes projects belonging to matched institutes from the existing set
  4. Reprocesses only the specified institutes
  5. Merges the newly extracted data with the remaining existing projects
  6. Overwrites the output file with the combined data

Matching is case-insensitive. Short names (in parentheses) are matched exactly. Full names support partial substring matching.
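
A sketch of the matching rules described above (the helper name and the way the short name is derived from parentheses are assumptions):

def matches_reprocess_term(institute: dict, terms: list[str]) -> bool:
    """Case-insensitive match by ID, exact short name, or partial full-name substring."""
    name = institute["text"]
    # Short name is taken from parentheses if present, e.g. "... (UNU-WIDER)"
    short = name[name.rfind("(") + 1:name.rfind(")")] if "(" in name else name
    for term in terms:
        term = term.strip()
        if term.isdigit() and int(term) == institute["id"]:
            return True                                  # match by numeric ID
        if term.lower() == short.lower():
            return True                                  # exact short-name match
        if term.lower() in name.lower():
            return True                                  # partial full-name match
    return False

institute = {"id": 7, "text": "World Institute for Development Economics Research (UNU-WIDER)"}
print(matches_reprocess_term(institute, ["UNU-WIDER"]))         # True (short name)
print(matches_reprocess_term(institute, ["development econ"]))  # True (partial name)
print(matches_reprocess_term(institute, ["3"]))                 # False (different ID)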

Note: Since extract.py outputs a single combined file (not per-institute files), it tracks processed institutes by checking which institute_id values already exist in the output file.

Usage

# Basic usage (process all institutes)
python extract.py

# Skip already processed institutes
python extract.py --skip-existing

# Force reprocess all
python extract.py --force

# Reprocess specific institutes by ID
python extract.py --reprocess 1,2,3

# Reprocess by short name (recommended)
python extract.py --reprocess UNU-WIDER,MERIT

# Reprocess by partial full name
python extract.py --reprocess "Biotechnology","Comparative Regional"

# Custom files
python extract.py -i custom/institutes.json -o custom/projects.json

# Override max projects
python extract.py --max-projects 30

# Control pagination depth (probing only when max_pages > max_detected_page)
python extract.py --max-pages 1 # Single page only (no pagination)
python extract.py --max-pages 5 # Initial + detected links only (no probing if 5 pages detected)
python extract.py --max-pages 7 # Initial + detected + probing beyond detected (if 4 detected, probes 2 more)
python extract.py --max-pages 15 # Deep crawl with extensive probing beyond detected

# With custom environment file
dotenv run -f .env_extraction -- python extract.py

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_projects | int | from env or 50 | Max projects per institute |
| max_pages | int | 8 | Max pagination pages to follow (1 = initial page only; higher values enable pagination following and intelligent probing) |
| extraction_provider | str | from env | LLM provider for extraction |
| extraction_api_key | str | from env | API key for the LLM |
| debug_file | str | "output/exclusions_debug.jsonl" | Path to exclusion debug log |
| skip_existing | bool | false | Skip institutes with existing projects in the output file |
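
For illustration, constructing the extractor directly might look like the following. The class name ActiveProjectsExtractor and the import are assumptions (only extract.py's CLI is documented here); adjust to the actual class defined in extract.py:

import os
from dotenv import load_dotenv
from extract import ActiveProjectsExtractor  # assumed class name

load_dotenv(".env_extraction")

# Constructor parameters mirror the table above
extractor = ActiveProjectsExtractor(
    max_projects=int(os.getenv("MAX_PROJECTS", "50")),
    max_pages=int(os.getenv("MAX_PAGES", "8")),
    extraction_provider=os.getenv("EXTRACTION_PROVIDER"),
    extraction_api_key=os.getenv("EXTRACTIONLLM_API_KEY"),
    debug_file="output/exclusions_debug.jsonl",
    skip_existing=os.getenv("SKIP_EXISTING", "false").lower() == "true",
)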

Requirements

  • Python 3.10+
  • crawl4ai
  • pydantic
  • python-dotenv
  • Playwright browser (installed via crawl4ai)
pip install crawl4ai pydantic python-dotenv