The extract.py script extracts active/ongoing research projects from institute project listing pages using a two-pass approach with automatic pagination detection and LLM-based extraction.
Pagination Detection: Pagination links are identified by common CSS class names (`pagination`, `pager`, `pager__item`, `page-item`, `page-numbers`, `next`), ARIA attributes (`aria-label="pagination"`, `aria-label="next"`, `aria-label="go to page X"`), and `rel="next"`. Path Filtering: Only links that preserve the base URL path are followed (e.g., for `/projects`, keeps `/projects?page=1` but NOT `/` or `/other`). Skips `?page=0` links since they duplicate the base page content.
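The path-filtering rules can be sketched as a small predicate (a minimal sketch; `is_followable_pagination_link` and its signature are illustrative, not the script's actual API):

```python
from urllib.parse import urlparse, parse_qs

def is_followable_pagination_link(base_url: str, link: str) -> bool:
    """Follow only same-domain links that keep the base path and point past page 0."""
    base, cand = urlparse(base_url), urlparse(link)
    if cand.netloc and cand.netloc != base.netloc:
        return False  # same-domain only
    if cand.path != base.path:
        return False  # must preserve the base URL path (e.g., /projects)
    page_values = parse_qs(cand.query).get("page", [])
    if page_values and page_values[0] == "0":
        return False  # ?page=0 duplicates the base page content
    return True
```

Relative links (no netloc) are treated as same-domain here; the real extractor may resolve them first.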
- Automatic Detection: The extractor automatically detects and follows pagination links on listing pages.
- Intelligent Probing: After detecting pagination links (e.g., ?page=0 through ?page=4), automatically probes additional pages incrementally (?page=5, ?page=6, ...) until no new candidates are found or the max_pages limit (default: 8) is reached. This ensures complete coverage without wasting requests on non-existent pages.
- Smart Filtering: Only follows links that preserve the base URL path (e.g., /projects?page=1), skips ?page=0 duplicates, and excludes navigation links.
- Deduplication: Tracks project URLs across pages to avoid duplicates.
- Same-Domain Only: Only follows pagination links within the same domain as the original listing page.
- Automatic Detection: The extractor automatically detects and parses end dates from project descriptions and status indicators.
- Supported Formats: "Expected End Date: 31 May 2023", "End Date: 31 May 2023", "31 May 2023", "May 31, 2023", "2023-05-31" (ISO), "31/05/2023", "31-05-2023".
- Status Precedence: Explicit status keywords take priority over date checks: "Active" or "Ongoing" → project is kept (dates ignored); "Completed" → project is excluded (dates ignored). Dates are only checked when status is unknown/ambiguous.
- Two-Stage Check: First checks listing page data (status indicator + brief description), then verifies against full details (duration field).
# Input/Output Files
INPUT_FILE=input/institutes.json
OUTPUT_FILE=output/active_projects.json
# Extraction Limits
MAX_PROJECTS=50
# Maximum pagination pages per institute. 1 = initial page only; values >1 enable pagination and intelligent probing when max_pages exceeds the highest detected page.
MAX_PAGES=8
# LLM Configuration for Crawl4AI
EXTRACTION_PROVIDER=anthropic/claude-sonnet-4-20250514
EXTRACTIONLLM_API_KEY=your_api_key_here
# Skip already processed institutes (set to "true" to enable)
SKIP_EXISTING=false
# Rate limiting settings
LLM_MAX_RETRIES=3 # Max retry attempts for failed LLM calls
LLM_REQUESTS_PER_SECOND=1.0 # Rate limit (1.0 = 1 request per second)
| Variable | Description | Default |
|---|---|---|
| `INPUT_FILE` | Path to institutes JSON file | `input/institutes.json` |
| `OUTPUT_FILE` | Path to output JSON file | `active_projects.json` |
| `MAX_PROJECTS` | Maximum projects to extract per institute | 50 |
| `MAX_PAGES` | Maximum pagination pages to follow per institute. Set to 1 for initial page only, or higher to enable pagination following and intelligent probing. Probing activates when max_pages exceeds the highest detected page number. | 8 |
| `EXTRACTION_PROVIDER` | LLM provider for extraction | `anthropic/claude-sonnet-4-20250514` |
| `EXTRACTIONLLM_API_KEY` | API key for LLM provider | Required |
| `LLM_MAX_RETRIES` | Maximum number of retry attempts for failed LLM calls | 3 |
| `LLM_REQUESTS_PER_SECOND` | Rate limit for LLM API calls (requests per second). Use 0 to disable rate limiting. | 1.0 |
The extractor includes built-in resilience against LLM rate limiting and transient failures. All LLM calls are automatically wrapped with retry logic and rate limiting to ensure reliable extraction.
| Setting | Default | Description |
|---|---|---|
| `LLM_MAX_RETRIES` | 3 | Maximum number of retry attempts after initial failure. Total attempts = retries + 1. |
| `LLM_REQUESTS_PER_SECOND` | 1.0 | Target rate of LLM requests. The extractor waits between requests to maintain this rate. |
| Initial Delay | 1.0s | Starting delay before first retry |
| Backoff Factor | 2.0 | Delay multiplier after each retry (1s → 2s → 4s → 8s) |
| Max Delay | 60s | Maximum wait time between retries |
Attempt 1: Immediate request → Fails with rate limit
Retry 1: Wait 1s → Request → Still rate limited
Retry 2: Wait 2s → Request → Still rate limited
Retry 3: Wait 4s → Request → Success! (or fails after max retries)
Total time if successful: ~7 seconds (1s + 2s + 4s waits)
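The retry schedule above can be sketched as a generic wrapper using the documented defaults (a minimal sketch; `call_with_retries` is an illustrative name, not the extractor's actual code):

```python
import time

def call_with_retries(fn, max_retries=3, initial_delay=1.0,
                      backoff_factor=2.0, max_delay=60.0):
    """Run fn(), retrying on failure with exponential backoff (1s -> 2s -> 4s ...)."""
    delay = initial_delay
    for attempt in range(max_retries + 1):  # total attempts = retries + 1
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries:
                raise  # retries exhausted: surface the final error
            print(f"LLM call failed (attempt {attempt + 1}/{max_retries + 1}). "
                  f"Retrying in {delay:.1f}s... Error: {exc}")
            time.sleep(delay)
            delay = min(delay * backoff_factor, max_delay)
```

With the defaults, a call that succeeds on the third retry waits 1s + 2s + 4s ≈ 7 seconds total, matching the timeline above.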
The extractor automatically detects rate limit errors by checking for patterns such as `429` in error messages.

| Use Case | LLM_REQUESTS_PER_SECOND | LLM_MAX_RETRIES |
|---|---|---|
| Free tier APIs (limited quota, strict limits) | 0.5 - 1.0 | 3 - 5 |
| Paid tier (standard, reasonable limits) | 2.0 - 5.0 | 2 - 3 |
| Enterprise tier (high quotas) | 5.0 - 10.0 | 2 |
| Local LLM / no limits (self-hosted, unlimited) | 0 (disabled) | 1 - 2 |
# Conservative settings for free tier APIs
LLM_MAX_RETRIES=5 # More retries for stricter limits
LLM_REQUESTS_PER_SECOND=0.5 # 1 request every 2 seconds
# Aggressive settings for paid tier APIs
LLM_MAX_RETRIES=2 # Fewer retries (fail fast)
LLM_REQUESTS_PER_SECOND=5.0 # 5 requests per second
# No rate limiting for local/unlimited LLMs
LLM_MAX_RETRIES=2 # Keep some retries for transient failures
LLM_REQUESTS_PER_SECOND=0 # 0 disables rate limiting
When retries occur, the extractor logs detailed information:
WARNING - Rate limit detected (attempt 1/4). Retrying in 1.0s... Error: 429 Client Error: Rate Limit Exceeded
WARNING - LLM call failed (attempt 2/4). Retrying in 2.0s... Error: ConnectionError: Connection timed out
ERROR - LLM call failed after 4 attempts. Error: APIError: Server error (500)
The input file contains a list of institutes to process. Two formats are supported:
[
{
"text": "UNU-CPR",
"url": "https://unu.edu/cpr/projects",
"id": 1
},
{
"text": "UNU-BIOLAC",
"url": "https://unu.edu/biolac/projects",
"id": 2
}
]
{
"references": [
{
"text": "UNU-CPR",
"url": "https://unu.edu/cpr/projects",
"id": 1
}
]
}
| Field | Type | Description |
|---|---|---|
| `text` | string | Institute name |
| `url` | string | URL to institute's projects listing page |
| `id` | integer | Unique identifier for the institute |
An array of active project objects with the following schema:
[
{
"institute_name": "string",
"institute_id": "integer",
"project_url": "string",
"title": "string",
"status": "string",
"description": "string",
"objectives": "string | null",
"duration": "string | null",
"funding": "string | null",
"partners": ["string"] | null,
"contact": "string | null",
"additional_info": { ... } | null
}
]
| Field | Type | Description |
|---|---|---|
| `institute_name` | required string | Name of the hosting institute |
| `institute_id` | required integer | ID of the hosting institute |
| `project_url` | required string | URL to the project's detail page |
| `title` | required string | Full project title |
| `status` | required string | Either "active" or "ongoing" |
| `description` | required string | Complete project description |
| `objectives` | optional string | Project objectives or goals |
| `duration` | optional string | Project timeline, start/end dates |
| `funding` | optional string | Funding sources, amounts, grants |
| `partners` | optional array | List of partner organizations |
| `contact` | optional string | Contact information (email, phone) |
| `additional_info` | optional object | Any extra extracted data |
`additional_info` Keys: Based on analysis across 87 projects, 91% include `additional_info`. Here are the information types extracted:
| Category | Keys Found | Frequency |
|---|---|---|
| People/Team | `project_manager`, `project_team`, `team_members`, `PhD Fellow`, `Promotor`, `Co-promotor`, `project_team_leads`, `coordinators` | 61 |
| Topics/Themes | `keywords`, `keyword`, `theme`, `themes`, `thematic_pillars`, `thematic_areas`, `related_research_area` | 52 |
| Locations | `location`, `locations` | 17 |
| Organizations | `institute`, `institutes`, `affiliation`, `organization` | 13 |
| Outputs | `outputs`, `research_outputs`, `activities` | 12 |
| Metadata | `project_type`, `date_published`, `project_website`, `website`, `expected_start_date`, `expected_end_date`, `project_status` | 8 |
| Events | `upcoming_briefings`, `training_sessions`, `online_courses` | 5 |
| SDGs | `related_sdgs`, `contributions_to_sdg` | 2 |
| Other | `benefits`, `platform`, `training`, `coordination`, `communications`, `membership_info`, `registration_link`, `achievements` | various |
Frequency indicates how many projects contain at least one key from that category.
All excluded projects are logged to output/exclusions_debug.jsonl (JSONL format, one entry per line):
{
"timestamp": "2025-12-25T10:30:00.123456",
"institute_id": 1,
"institute_name": "UNU-CPR",
"institute_url": "https://unu.edu/cpr/projects",
"project_title": "Example Project",
"project_url": "https://example.com/project",
"project_brief_description": "Brief description...",
"project_status_indicator": "Completed",
"exclusion_trigger": "Completed",
"exclusion_reason": "completed_keyword"
}
| Reason Code | Description |
|---|---|
| `exclusion_keyword` | Matched job posting/navigation keywords |
| `completed_keyword` | Status indicator on listing page showed "completed" |
| `expired_date` | End date in listing page data has passed (e.g., "Expected End Date: 31 May 2023") |
| `details_completed` | Full page extraction confirmed "completed" status |
| `duration_expired` | End date in duration field has passed |
The following keywords trigger automatic exclusion (found in title or brief description):
Projects matching these in status indicators are excluded: completed, finished, closed, past projects, archived
IMPORTANT: Explicit status takes precedence over date checks. If status explicitly says "Active" or "Ongoing", the project is kept regardless of end date. If status says "Completed", the project is excluded. Dates are only checked when status is unknown/ambiguous.
Projects with ambiguous/unknown status are automatically excluded if an end date is detected and the date has passed. The extractor scans:
Date patterns detected:
The extractor supports incremental processing to resume interrupted runs or add new institutes without reprocessing existing ones:
| Argument | Short | Description | Default |
|---|---|---|---|
| `--skip-existing` | | Skip institutes that already have projects in output file | false |
| `--force` | | Force reprocess all institutes (overrides `--skip-existing` and `--reprocess`) | false |
| `--reprocess` | | Comma-separated institute IDs, short names, or full names to reprocess (e.g., '1,2,3' or 'UNU-WIDER,MERIT') | null |
| `--output` | `-o` | Path to output JSON file | active_projects.json |
| `--input` | `-i` | Path to institutes.json | input/institutes.json |
| `--max-projects` | | Max projects per institute (overrides env) | 20 |
| `--max-pages` | | Max pagination pages to follow per institute. Set to 1 for initial page only, or higher to enable pagination following and intelligent probing. Probing activates when max_pages exceeds the highest detected page number. | 8 |
Intelligent Probing: The intelligent probing feature activates when max_pages > max_detected_page + 1. It works by:
1. Visiting the initial listing page
2. Detecting pagination links (e.g., ?page=0 through ?page=4)
3. Visiting detected links (skipping ?page=0 as it duplicates the base page)
4. Only if max_pages exceeds the highest detected page number: probing additional pages incrementally beyond detected ones
Probing Condition: if max_pages > max_detected_page + 1, then probe
Early Stopping: Probing stops automatically when no new candidates are found (page doesn't exist or is duplicate), avoiding wasted requests on non-existent pages.
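The probing budget can be sketched as a small generator (a minimal sketch; the function name is illustrative, and the caller is assumed to stop iterating as soon as a probed page yields no new candidates):

```python
def pages_to_probe(max_pages: int, max_detected_page: int):
    """Yield page numbers beyond the detected range, up to the max_pages budget."""
    # Pages already visited: initial page + detected links ?page=1..max_detected_page
    visited = 1 + max_detected_page
    if max_pages <= visited:
        return  # probing only activates when max_pages > max_detected_page + 1
    for page in range(max_detected_page + 1, max_detected_page + 1 + (max_pages - visited)):
        yield page
```

For example, with `?page=0` through `?page=4` detected (`max_detected_page = 4`) and the default `max_pages = 8`, the probe sequence is pages 5, 6, and 7, matching the table below.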
| max_pages | Detected Links | Probing? | Pages Visited | Total |
|---|---|---|---|---|
| 1 | - | No | Initial page only | 1 |
| 2 | ?page=1-4 (4 links after ?page=0 skip) | No (2 ≤ 5) | Initial + 1st detected link | 2 |
| 3 | ?page=1-4 (4 links) | No (3 ≤ 5) | Initial + 2 detected links | 3 |
| 5 | ?page=1-4 (4 links) | No (5 = 5) | Initial + 4 detected links | 5 |
| 6 | ?page=1-4 (4 links) | Yes (6 > 5) | Initial + 4 detected + 1 probed (?page=5) | 6 |
| 7 | ?page=1-4 (4 links) | Yes (7 > 5) | Initial + 4 detected + 2 probed (?page=5, ?page=6) | 7 |
| 8 (default) | ?page=1-4 (4 links) | Yes (8 > 5) | Initial + 4 detected + 3 probed (?page=5, ?page=6, ?page=7) | 8 |
| 10 | ?page=1-4 (4 links) | Yes (10 > 5) | Initial + 4 detected + 5 probed (?page=5 through ?page=9) | 10 |
| 15 | ?page=1-4 (4 links) | Yes (15 > 5) | Initial + 4 detected + 10 probed (?page=5 through ?page=14) | 15 |

Note: "Detected Links" counts exclude ?page=0 since it duplicates the base page content. The max_detected_page value is the highest page number found (e.g., 4 for ?page=0 through ?page=4).
When multiple sources specify processing behavior, the following precedence applies (highest to lowest):
1. `--force` CLI flag - Always process all institutes (overrides everything)
2. `--reprocess` CLI flag - Reprocess only specified institutes (overrides `--skip-existing`)
3. `--skip-existing` CLI flag - Skip institutes with existing projects (overrides env)

# Run with defaults (process all)
python extract.py
# Skip already processed institutes
python extract.py --skip-existing
# Force reprocess all (overrides --skip-existing and --reprocess)
python extract.py --force
# Reprocess specific institutes by ID
python extract.py --reprocess 1,2,3
# Reprocess specific institutes by short name (e.g., UNU-WIDER)
python extract.py --reprocess UNU-WIDER,MERIT
# Reprocess by partial name match (case-insensitive)
python extract.py --reprocess "Biotechnology","Comparative Regional"
# Custom output file
python extract.py -o custom/projects.json
# Override max projects per institute
python extract.py --max-projects 30
# Combining flags (--force takes precedence)
python extract.py --skip-existing --force # will force reprocess all
Skip-Existing - Resuming Interrupted Runs: If extraction crashes mid-run, use --skip-existing to continue from where it left off. The extractor loads existing projects, tracks processed institute IDs, and only processes new institutes.
Skip-Existing - Adding New Institutes: When adding new institutes to input/institutes.json, only new institutes will be processed. Existing projects are preserved and new ones are appended.
Reprocess - Fixing Specific Institutes: If an institute's data needs updating (e.g., website structure changed), use --reprocess with specific institute IDs or names. Only those institutes are re-crawled, and their old data is replaced.
Cost & Time Savings: Avoid re-crawling already processed institutes, saving both API costs and time.
When `--skip-existing` is enabled:
- Loads existing projects from the output file (`active_projects.json`)
- Tracks already-processed `institute_id` values
- Only processes institutes not yet present in the output

When `--reprocess` is specified with institute IDs, short names, or full names:
Matching is case-insensitive. Short names (in parentheses) are matched exactly. Full names support partial substring matching.
Note: Since extract.py outputs a single combined file (not per-institute files), it tracks processed institutes by checking which institute_id values already exist in the output file.
# Basic usage (process all institutes)
python extract.py
# Skip already processed institutes
python extract.py --skip-existing
# Force reprocess all
python extract.py --force
# Reprocess specific institutes by ID
python extract.py --reprocess 1,2,3
# Reprocess by short name (recommended)
python extract.py --reprocess UNU-WIDER,MERIT
# Reprocess by partial full name
python extract.py --reprocess "Biotechnology","Comparative Regional"
# Custom files
python extract.py -i custom/institutes.json -o custom/projects.json
# Override max projects
python extract.py --max-projects 30
# Control pagination depth (probing only when max_pages > max_detected_page)
python extract.py --max-pages 1 # Single page only (no pagination)
python extract.py --max-pages 5 # Initial + detected links only (no probing if 5 pages detected)
python extract.py --max-pages 7 # Initial + detected + probing beyond detected (if 4 detected, probes 2 more)
python extract.py --max-pages 15 # Deep crawl with extensive probing beyond detected
# With custom environment file
dotenv run -f .env_extraction -- python extract.py
| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_projects` | int | from env or 50 | Max projects per institute |
| `max_pages` | int | 8 | Max pagination pages to follow (1 = initial page only; higher values enable pagination and intelligent probing) |
| `extraction_provider` | str | from env | LLM provider for extraction |
| `extraction_api_key` | str | from env | API key for LLM |
| `debug_file` | str | `output/exclusions_debug.jsonl` | Path to exclusion debug log |
| `skip_existing` | bool | false | Skip institutes with existing projects in output file |
pip install crawl4ai pydantic python-dotenv