Active Projects Extractor

Two-Pass LLM-Powered Web Scraper for Extracting Active Research Projects from Institute Listings

Overview

The extract.py script extracts active/ongoing research projects from institute project listing pages using a two-pass approach with automatic pagination detection and LLM-based extraction.

Workflow

1. First Pass (Candidate Extraction with Pagination): Fetches each institute's listing page and uses the LLM to extract candidate projects. Automatically detects pagination links and follows them (up to the max_pages limit) to ensure comprehensive coverage. Extracts titles, URLs, brief descriptions, and status indicators.
2. Pagination Detection & Intelligent Probing: Analyzes the HTML for common pagination patterns, including:
  • Link text: "next", "older", "more", arrows (» ‹ →), page numbers, "page X" patterns
  • CSS classes: pagination, pager, pager__item, page-item, page-numbers, next
  • ARIA labels: aria-label="pagination", aria-label="next", aria-label="go to page X"
  • Standard attributes: rel="next"
  Smart Filtering: Only follows pagination links that preserve the base URL path (e.g., for /projects, keeps /projects?page=1 but not / or /other). Skips ?page=0 links since they duplicate the base page content. (See the link-filtering sketch after this list.)
  Intelligent Probing: After detecting pagination links (e.g., ?page=0 through ?page=4), automatically probes additional pages incrementally (?page=5, ?page=6, ...) until no new candidates are found or the max_pages limit (default: 8) is reached. This ensures complete coverage without wasting requests on non-existent pages.
3. Filtering: Excludes candidates that match exclusion keywords (job postings, navigation elements) or are marked as completed.
4. Date-Based Expiration Check: Analyzes project duration/status text for end dates. Projects with past end dates (e.g., "Expected End Date: 31 May 2023") are automatically excluded.
5. Second Pass (Full Details): For each remaining candidate, fetches the individual project page and extracts comprehensive details, including description, objectives, duration, funding, partners, and contact info.
6. Status Verification: Confirms the project is active/ongoing. Checks both status keywords and the duration field for expired dates. If no status is mentioned, defaults to "active".
7. Output: Saves all verified active projects to a JSON file with complete metadata.
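
The smart-filtering rule in step 2 can be sketched in a few lines of Python. This is an illustrative helper, not the actual extract.py code; the function name and signature are assumptions:

from urllib.parse import urljoin, urlparse, parse_qs

def filter_pagination_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Keep only pagination links that stay on the listing's own path and domain."""
    base = urlparse(base_url)
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.netloc != base.netloc:                        # same-domain only
            continue
        if parsed.path.rstrip("/") != base.path.rstrip("/"):    # must preserve the base path
            continue
        if parse_qs(parsed.query).get("page", [""])[0] == "0":  # ?page=0 duplicates the base page
            continue
        if absolute != base_url and absolute not in kept:
            kept.append(absolute)
    return kept

# Only the ?page=1 link survives: wrong path, wrong domain, and ?page=0 are dropped
links = ["?page=0", "?page=1", "/cpr/about", "https://elsewhere.org/cpr/projects?page=2"]
print(filter_pagination_links("https://unu.edu/cpr/projects", links))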

Pagination Support

  • Automatic Detection: The extractor automatically detects and follows pagination links on listing pages.
  • Intelligent Probing: After detecting pagination links (e.g., ?page=0-4), automatically probes additional pages incrementally (?page=5, ?page=6, ...) until no new candidates are found or the max_pages limit (default: 8) is reached.
  • Smart Filtering: Only follows links that preserve the base URL path (e.g., /projects?page=1), skips ?page=0 duplicates, and excludes navigation links.
  • Deduplication: Tracks project URLs across pages to avoid duplicates.
  • Same-Domain Only: Only follows pagination links within the same domain as the original listing page.

Date-Based Expiration Checking

  • Automatic Detection: The extractor automatically detects and parses end dates from project descriptions and status indicators.
  • Supported Formats: "Expected End Date: 31 May 2023", "End Date: 31 May 2023", "31 May 2023", "May 31, 2023", "2023-05-31" (ISO), "31/05/2023", "31-05-2023".
  • Status Precedence: Explicit status keywords take priority over date checks: "Active" or "Ongoing" → project is kept (dates ignored); "Completed" → project is excluded (dates ignored). Dates are only checked when the status is unknown or ambiguous.
  • Two-Stage Check: First checks listing page data (status indicator + brief description), then verifies against the full details (duration field).

Configuration (.env_extraction)

.env_extraction
# Input/Output Files
INPUT_FILE=input/institutes.json
OUTPUT_FILE=output/active_projects.json

# Extraction Limits
MAX_PROJECTS=50

# Maximum pagination pages per institute. 1 = initial page only; values >1 enable pagination and intelligent probing when max_pages exceeds the highest detected page.
MAX_PAGES=8

# LLM Configuration for Crawl4AI
EXTRACTION_PROVIDER=anthropic/claude-sonnet-4-20250514
EXTRACTIONLLM_API_KEY=your_api_key_here

# Skip already processed institutes (set to "true" to enable)
SKIP_EXISTING=false

# Rate limiting settings
LLM_MAX_RETRIES=3              # Max retry attempts for failed LLM calls
LLM_REQUESTS_PER_SECOND=1.0    # Rate limit (1.0 = 1 request per second)

| Variable | Description | Default |
| --- | --- | --- |
| INPUT_FILE | Path to institutes JSON file | input/institutes.json |
| OUTPUT_FILE | Path to output JSON file | active_projects.json |
| MAX_PROJECTS | Maximum projects to extract per institute | 50 |
| MAX_PAGES | Maximum pagination pages to follow per institute. Set to 1 for the initial page only, or higher to enable pagination following and intelligent probing. Probing activates when max_pages exceeds the highest detected page number. | 8 |
| EXTRACTION_PROVIDER | LLM provider for extraction | anthropic/claude-sonnet-4-20250514 |
| EXTRACTIONLLM_API_KEY | API key for LLM provider | Required |
| LLM_MAX_RETRIES | Maximum number of retry attempts for failed LLM calls | 3 |
| LLM_REQUESTS_PER_SECOND | Rate limit for LLM API calls (requests per second). Use 0 to disable rate limiting. | 1.0 |
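
For reference, a minimal sketch of reading these variables with python-dotenv. The defaults mirror the table above; this is illustrative and not necessarily how extract.py loads its configuration:

import os
from dotenv import load_dotenv

# Load variables from the extraction-specific env file
load_dotenv(".env_extraction")

input_file = os.getenv("INPUT_FILE", "input/institutes.json")
output_file = os.getenv("OUTPUT_FILE", "output/active_projects.json")
max_projects = int(os.getenv("MAX_PROJECTS", "50"))
max_pages = int(os.getenv("MAX_PAGES", "8"))
provider = os.getenv("EXTRACTION_PROVIDER", "anthropic/claude-sonnet-4-20250514")
api_key = os.getenv("EXTRACTIONLLM_API_KEY")                      # required
skip_existing = os.getenv("SKIP_EXISTING", "false").lower() == "true"
max_retries = int(os.getenv("LLM_MAX_RETRIES", "3"))
requests_per_second = float(os.getenv("LLM_REQUESTS_PER_SECOND", "1.0"))

if not api_key:
    raise SystemExit("EXTRACTIONLLM_API_KEY must be set in .env_extraction")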

LLM Rate Limiting & Retry Logic

The extractor includes built-in resilience against LLM rate limiting and transient failures. All LLM calls are automatically wrapped with retry logic and rate limiting to ensure reliable extraction.

Features

  • Automatic Retries: Failed LLM calls are automatically retried with exponential backoff
  • Rate Limit Detection: Detects common rate limit error patterns (429, "rate limit", "quota exceeded", etc.)
  • Configurable Rate Limiting: Control the maximum requests per second to stay within API quotas
  • Exponential Backoff: Retry delay doubles after each attempt (1s → 2s → 4s → ... → max 60s)

Retry Behavior

| Setting | Default | Description |
| --- | --- | --- |
| LLM_MAX_RETRIES | 3 | Maximum number of retry attempts after the initial failure. Total attempts = retries + 1. |
| LLM_REQUESTS_PER_SECOND | 1.0 | Target rate of LLM requests. The extractor waits between requests to maintain this rate. |
| Initial Delay | 1.0s | Starting delay before the first retry |
| Backoff Factor | 2.0 | Delay multiplier after each retry (1s → 2s → 4s → 8s) |
| Max Delay | 60s | Maximum wait time between retries |

Retry Sequence Example (with default settings)

Attempt 1: Immediate request → Fails with rate limit
Retry 1: Wait 1s → Request → Still rate limited
Retry 2: Wait 2s → Request → Still rate limited
Retry 3: Wait 4s → Request → Success! (or fails after max retries)

Total time if successful: ~7 seconds (1s + 2s + 4s waits)
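
A simplified sketch of this retry-plus-rate-limit schedule (initial delay 1 s, backoff factor 2, 60 s cap, pacing between requests). The class name is illustrative; the extractor's internal implementation may differ:

import logging
import time

logger = logging.getLogger(__name__)

class RetryingLimiter:
    """Illustrative wrapper reproducing the documented retry and rate-limit behavior."""

    def __init__(self, max_retries=3, requests_per_second=1.0,
                 initial_delay=1.0, backoff_factor=2.0, max_delay=60.0):
        self.max_retries = max_retries
        self.requests_per_second = requests_per_second
        self.initial_delay = initial_delay
        self.backoff_factor = backoff_factor
        self.max_delay = max_delay
        self._last_call = 0.0

    def _throttle(self):
        if self.requests_per_second <= 0:                # 0 disables rate limiting
            return
        min_interval = 1.0 / self.requests_per_second
        wait = min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)

    def call(self, fn, *args, **kwargs):
        delay = self.initial_delay
        for attempt in range(self.max_retries + 1):      # total attempts = retries + 1
            self._throttle()
            try:
                self._last_call = time.monotonic()
                return fn(*args, **kwargs)
            except Exception as exc:
                if attempt == self.max_retries:
                    raise                                # give up after max retries
                logger.warning("LLM call failed (attempt %d/%d). Retrying in %.1fs... Error: %s",
                               attempt + 1, self.max_retries + 1, delay, exc)
                time.sleep(delay)
                delay = min(delay * self.backoff_factor, self.max_delay)  # 1s -> 2s -> 4s -> ... -> 60s

# Usage: wrap any LLM call
limiter = RetryingLimiter(max_retries=3, requests_per_second=1.0)
# result = limiter.call(some_llm_call, prompt="...")     # some_llm_call is hypothetical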

Rate Limit Detection

The extractor automatically detects rate limit errors by checking for these patterns in error messages:

  • HTTP status code: 429
  • Error messages: "rate limit", "too many requests", "quota exceeded", "rate_limit_exceeded"
  • Service messages: "try again later", "service unavailable", "temporarily unavailable"
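
A sketch of how such pattern matching might look; the pattern tuple mirrors the bullets above, and the function name is an assumption:

RATE_LIMIT_PATTERNS = (
    "429", "rate limit", "too many requests", "quota exceeded",
    "rate_limit_exceeded", "try again later", "service unavailable",
    "temporarily unavailable",
)

def is_rate_limit_error(error: Exception) -> bool:
    """Return True if the error message matches a known rate-limit pattern."""
    message = str(error).lower()
    return any(pattern in message for pattern in RATE_LIMIT_PATTERNS)

print(is_rate_limit_error(RuntimeError("429 Client Error: Rate Limit Exceeded")))  # True
print(is_rate_limit_error(ValueError("invalid JSON in response")))                 # False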

Configuration Recommendations

| Use Case | LLM_REQUESTS_PER_SECOND | LLM_MAX_RETRIES |
| --- | --- | --- |
| Free tier APIs (limited quota, strict limits) | 0.5 - 1.0 | 3 - 5 |
| Paid tier, standard (reasonable limits) | 2.0 - 5.0 | 2 - 3 |
| Enterprise tier (high quotas) | 5.0 - 10.0 | 2 |
| Local LLM / no limits (self-hosted, unlimited) | 0 (disabled) | 1 - 2 |

Example Configurations

.env_extraction (Conservative - Free Tier)

# Conservative settings for free tier APIs
LLM_MAX_RETRIES=5              # More retries for stricter limits
LLM_REQUESTS_PER_SECOND=0.5    # 1 request every 2 seconds

.env_extraction (Aggressive - Paid Tier)

# Aggressive settings for paid tier APIs
LLM_MAX_RETRIES=2              # Fewer retries (fail fast)
LLM_REQUESTS_PER_SECOND=5.0    # 5 requests per second

.env_extraction (Disabled - Local/Unlimited)

# No rate limiting for local/unlimited LLMs
LLM_MAX_RETRIES=2              # Keep some retries for transient failures
LLM_REQUESTS_PER_SECOND=0      # 0 disables rate limiting

Logging

When retries occur, the extractor logs detailed information:

WARNING - Rate limit detected (attempt 1/4). Retrying in 1.0s... Error: 429 Client Error: Rate Limit Exceeded
WARNING - LLM call failed (attempt 2/4). Retrying in 2.0s... Error: ConnectionError: Connection timed out
ERROR - LLM call failed after 4 attempts. Error: APIError: Server error (500)

Important Notes

  • Applies to all LLM calls: Both listing page extraction and project detail extraction use rate limiting
  • Per-crawler rate limiting: Rate limiting is tracked per extractor instance, not globally
  • No persistence: Retry state is not saved across runs - if extraction is interrupted, start fresh
  • Works with all providers: Rate limiting is provider-agnostic and works with any LLM configured via EXTRACTION_PROVIDER

Input Format: institutes.json

The input file contains a list of institutes to process. Two formats are supported:

Array Format

[
  {
    "text": "UNU-CPR",
    "url": "https://unu.edu/cpr/projects",
    "id": 1
  },
  {
    "text": "UNU-BIOLAC",
    "url": "https://unu.edu/biolac/projects",
    "id": 2
  }
]

Object with References

{
  "references": [
    {
      "text": "UNU-CPR",
      "url": "https://unu.edu/cpr/projects",
      "id": 1
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| text | string | Institute name |
| url | string | URL to the institute's projects listing page |
| id | integer | Unique identifier for the institute |
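
A minimal loader that accepts both formats could look like this (the helper name is illustrative):

import json

def load_institutes(path: str) -> list[dict]:
    """Accept either a bare array of institutes or an object with a "references" key."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        data = data.get("references", [])
    return [
        inst for inst in data
        if inst.get("text") and inst.get("url") and inst.get("id") is not None
    ]

institutes = load_institutes("input/institutes.json")
for inst in institutes:
    print(inst["id"], inst["text"], inst["url"])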

Output Format: active_projects.json

An array of active project objects with the following schema:

[
  {
    "institute_name": "string",
    "institute_id": "integer",
    "project_url": "string",
    "title": "string",
    "status": "string",
    "description": "string",
    "objectives": "string | null",
    "duration": "string | null",
    "funding": "string | null",
    "partners": ["string"] | null,
    "contact": "string | null",
    "additional_info": { ... } | null
  }
]

Field Descriptions

| Field | Required? | Type | Description |
| --- | --- | --- | --- |
| institute_name | required | string | Name of the hosting institute |
| institute_id | required | integer | ID of the hosting institute |
| project_url | required | string | URL to the project's detail page |
| title | required | string | Full project title |
| status | required | string | Either "active" or "ongoing" |
| description | required | string | Complete project description |
| objectives | optional | string | Project objectives or goals |
| duration | optional | string | Project timeline, start/end dates |
| funding | optional | string | Funding sources, amounts, grants |
| partners | optional | array | List of partner organizations |
| contact | optional | string | Contact information (email, phone) |
| additional_info | optional | object | Any extra extracted data |
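
The schema maps naturally onto a Pydantic model. This is a sketch assuming Pydantic v2; the model name is illustrative and not necessarily what extract.py defines:

from typing import Any, Optional
from pydantic import BaseModel

class ActiveProject(BaseModel):
    """Illustrative model mirroring the output schema above."""
    institute_name: str
    institute_id: int
    project_url: str
    title: str
    status: str                                # "active" or "ongoing"
    description: str
    objectives: Optional[str] = None
    duration: Optional[str] = None
    funding: Optional[str] = None
    partners: Optional[list[str]] = None
    contact: Optional[str] = None
    additional_info: Optional[dict[str, Any]] = None

project = ActiveProject(
    institute_name="UNU-CPR",
    institute_id=1,
    project_url="https://example.com/project",
    title="Example Project",
    status="active",
    description="Example description.",
)
print(project.model_dump())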

Common additional_info Keys

Based on analysis across 87 projects, 91% include additional_info. Here are the information types extracted:

| Category | Keys Found | Frequency |
| --- | --- | --- |
| People/Team | project_manager, project_team, team_members, PhD Fellow, Promotor, Co-promotor, project_team_leads, coordinators | 61 |
| Topics/Themes | keywords, keyword, theme, themes, thematic_pillars, thematic_areas, related_research_area | 52 |
| Locations | location, locations | 17 |
| Organizations | institute, institutes, affiliation, organization | 13 |
| Outputs | outputs, research_outputs, activities | 12 |
| Metadata | project_type, date_published, project_website, website, expected_start_date, expected_end_date, project_status | 8 |
| Events | upcoming_briefings, training_sessions, online_courses | 5 |
| SDGs | related_sdgs, contributions_to_sdg | 2 |
| Other | benefits, platform, training, coordination, communications, membership_info, registration_link, achievements | various |

Frequency indicates how many projects contain at least one key from that category.

Debug Output: exclusions_debug.jsonl

All excluded projects are logged to output/exclusions_debug.jsonl (JSONL format, one entry per line):

{
  "timestamp": "2025-12-25T10:30:00.123456",
  "institute_id": 1,
  "institute_name": "UNU-CPR",
  "institute_url": "https://unu.edu/cpr/projects",
  "project_title": "Example Project",
  "project_url": "https://example.com/project",
  "project_brief_description": "Brief description...",
  "project_status_indicator": "Completed",
  "exclusion_trigger": "Completed",
  "exclusion_reason": "completed_keyword"
}
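
Appending such entries is a one-liner per exclusion; a sketch of the JSONL logging (helper name assumed):

import json
from datetime import datetime
from pathlib import Path

def log_exclusion(entry: dict, debug_file: str = "output/exclusions_debug.jsonl") -> None:
    """Append one exclusion record as a single JSON line."""
    path = Path(debug_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": datetime.now().isoformat(), **entry}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

log_exclusion({
    "institute_id": 1,
    "institute_name": "UNU-CPR",
    "project_title": "Example Project",
    "exclusion_trigger": "Completed",
    "exclusion_reason": "completed_keyword",
})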

Exclusion Reasons

| Reason Code | Description |
| --- | --- |
| exclusion_keyword | Matched job posting/navigation keywords |
| completed_keyword | Status indicator on the listing page showed "completed" |
| expired_date | End date in listing page data has passed (e.g., "Expected End Date: 31 May 2023") |
| details_completed | Full page extraction confirmed "completed" status |
| duration_expired | End date in the duration field has passed |

Exclusion Keywords

The following keywords trigger automatic exclusion (found in title or brief description):

  • Job postings: vacancy, hiring, consultant, officer, internship, call for application, apply now, submit cv, job opening, career opportunity
  • Page elements: cookie, privacy policy, terms of use, subscribe, newsletter, follow us, rights reserved, skip to content, menu, contact us, about us

Completed Status Keywords

Projects matching these in status indicators are excluded: completed, finished, closed, past projects, archived
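
A sketch of how these keyword checks could be combined into a single exclusion test. The keyword tuples copy the lists above; the function name is illustrative:

EXCLUSION_KEYWORDS = (
    "vacancy", "hiring", "consultant", "officer", "internship",
    "call for application", "apply now", "submit cv", "job opening",
    "career opportunity", "cookie", "privacy policy", "terms of use",
    "subscribe", "newsletter", "follow us", "rights reserved",
    "skip to content", "menu", "contact us", "about us",
)
COMPLETED_KEYWORDS = ("completed", "finished", "closed", "past projects", "archived")

def exclusion_reason(title: str, brief: str, status: str) -> str | None:
    """Return a reason code if the candidate should be dropped, else None."""
    text = f"{title} {brief}".lower()
    if any(keyword in text for keyword in EXCLUSION_KEYWORDS):
        return "exclusion_keyword"
    if any(keyword in status.lower() for keyword in COMPLETED_KEYWORDS):
        return "completed_keyword"
    return None

print(exclusion_reason("Research Officer (Vacancy)", "", ""))           # exclusion_keyword
print(exclusion_reason("Water Governance", "", "Completed"))            # completed_keyword
print(exclusion_reason("Water Governance", "Ongoing work", "Active"))   # None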

Date-Based Expiration

IMPORTANT: Explicit status takes precedence over date checks. If status explicitly says "Active" or "Ongoing", the project is kept regardless of end date. If status says "Completed", the project is excluded. Dates are only checked when status is unknown/ambiguous.

Projects with ambiguous/unknown status are automatically excluded if an end date is detected and the date has passed. The extractor scans:

  • Listing page data: status_indicator and brief_description fields
  • Full details: duration field from the project page

Date patterns detected:

  • "Expected End Date: 31 May 2023"
  • "End Date: 31 May 2023"
  • "31 May 2023", "May 31, 2023"
  • "2023-05-31" (ISO format)
  • "31/05/2023", "31-05-2023"

CLI Arguments & Skip-Existing Processing

The extractor supports incremental processing to resume interrupted runs or add new institutes without reprocessing existing ones:

CLI Arguments Reference

| Argument | Short | Description | Default |
| --- | --- | --- | --- |
| --skip-existing | | Skip institutes that already have projects in the output file | false |
| --force | | Force reprocess all institutes (overrides --skip-existing and --reprocess) | false |
| --reprocess | | Comma-separated institute IDs, short names, or full names to reprocess (e.g., '1,2,3' or 'UNU-WIDER,MERIT') | null |
| --output | -o | Path to output JSON file | active_projects.json |
| --input | -i | Path to institutes.json | input/institutes.json |
| --max-projects | | Max projects per institute (overrides env) | 20 |
| --max-pages | | Max pagination pages to follow per institute. Set to 1 for the initial page only, or higher to enable pagination following and intelligent probing. Probing activates when max_pages exceeds the highest detected page number. | 8 |

Pagination Behavior

Intelligent Probing: The intelligent probing feature activates when max_pages > max_detected_page + 1. It works by:
1. Visiting the initial listing page
2. Detecting pagination links (e.g., ?page=0 through ?page=4)
3. Visiting detected links (skipping ?page=0 as it duplicates the base page)
4. Only if max_pages exceeds the highest detected page number: probing additional pages incrementally beyond detected ones

Probing Condition: if max_pages > max_detected_page + 1, then probe

Early Stopping: Probing stops automatically when no new candidates are found (page doesn't exist or is duplicate), avoiding wasted requests on non-existent pages.
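
The visiting plan can be illustrated with a small helper (names assumed; the real extractor also stops probing early when a page yields no new candidates):

def pages_to_visit(base_url: str, detected_pages: list[int], max_pages: int) -> list[str]:
    """Plan which pages to fetch for one listing URL (illustrative, not extract.py's code).

    detected_pages holds the page numbers found in pagination links, e.g. [0, 1, 2, 3, 4].
    """
    visits = [base_url]                                          # the initial listing page
    detected = sorted(p for p in set(detected_pages) if p > 0)   # skip ?page=0 (duplicate of base)
    for page in detected:
        if len(visits) >= max_pages:
            return visits
        visits.append(f"{base_url}?page={page}")
    max_detected = max(detected_pages, default=0)
    if max_pages > max_detected + 1:                             # probing condition
        page = max_detected + 1
        while len(visits) < max_pages:
            # The real extractor stops probing once a page yields no new candidates.
            visits.append(f"{base_url}?page={page}")
            page += 1
    return visits

# With ?page=0..4 detected and the default max_pages=8: initial page, ?page=1..4, probed ?page=5..7
print(pages_to_visit("https://unu.edu/cpr/projects", [0, 1, 2, 3, 4], max_pages=8))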

Detailed Examples (assuming detected ?page=0 through ?page=4)

| max_pages | Detected Links | Probing? | Pages Visited | Total |
| --- | --- | --- | --- | --- |
| 1 | — | No | Initial page only | 1 |
| 2 | ?page=1-4 (4 links after ?page=0 skip) | No (2 ≤ 5) | Initial + 1 detected link | 2 |
| 3 | ?page=1-4 (4 links) | No (3 ≤ 5) | Initial + 2 detected links | 3 |
| 5 | ?page=1-4 (4 links) | No (5 = 5) | Initial + 4 detected links | 5 |
| 6 | ?page=1-4 (4 links) | Yes (6 > 5) | Initial + 4 detected + 1 probed (?page=5) | 6 |
| 7 | ?page=1-4 (4 links) | Yes (7 > 5) | Initial + 4 detected + 2 probed (?page=5, ?page=6) | 7 |
| 8 (default) | ?page=1-4 (4 links) | Yes (8 > 5) | Initial + 4 detected + 3 probed (?page=5, ?page=6, ?page=7) | 8 |
| 10 | ?page=1-4 (4 links) | Yes (10 > 5) | Initial + 4 detected + 5 probed (?page=5 through ?page=9) | 10 |
| 15 | ?page=1-4 (4 links) | Yes (15 > 5) | Initial + 4 detected + 10 probed (?page=5 through ?page=14) | 15 |

Note: "Detected links" count excludes ?page=0 since it duplicates the base page content. The max_detected_page value is the highest page number found (e.g., 4 for ?page=0 through ?page=4).

Precedence Order

When multiple sources specify processing behavior, the following precedence applies (highest to lowest):

  1. --force CLI flag - Always process all institutes (overrides everything)
  2. --reprocess CLI flag - Reprocess only specified institutes (overrides --skip-existing)
  3. --skip-existing CLI flag - Skip institutes with existing projects (overrides env)
  4. Environment variable - Used only if no CLI flags provided
# Run with defaults (process all)
python extract.py

# Skip already processed institutes
python extract.py --skip-existing

# Force reprocess all (overrides --skip-existing and --reprocess)
python extract.py --force

# Reprocess specific institutes by ID
python extract.py --reprocess 1,2,3

# Reprocess specific institutes by short name (e.g., UNU-WIDER)
python extract.py --reprocess UNU-WIDER,MERIT

# Reprocess by partial name match (case-insensitive)
python extract.py --reprocess "Biotechnology","Comparative Regional"

# Custom output file
python extract.py -o custom/projects.json

# Override max projects per institute
python extract.py --max-projects 30

# Combining flags (--force takes precedence)
python extract.py --skip-existing --force # will force reprocess all

Use Cases for Processing Modes

  • Skip-Existing - Resuming Interrupted Runs: If extraction crashes mid-run, use --skip-existing to continue from where it left off. The extractor loads existing projects, tracks processed institute IDs, and only processes new institutes.
  • Skip-Existing - Adding New Institutes: When adding new institutes to input/institutes.json, only the new institutes are processed. Existing projects are preserved and new ones are appended.
  • Reprocess - Fixing Specific Institutes: If an institute's data needs updating (e.g., its website structure changed), use --reprocess with specific institute IDs or names. Only those institutes are re-crawled, and their old data is replaced.
  • Cost & Time Savings: Avoiding re-crawls of already processed institutes saves both API costs and time.

How Skip-Existing Works

When --skip-existing is enabled:

  1. Checks if output file exists (e.g., active_projects.json)
  2. Loads existing projects and extracts unique institute_id values
  3. Filters input institutes to exclude those with existing projects
  4. Processes only remaining (new) institutes
  5. Accumulates existing + new projects and saves to output file
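
A sketch of that filtering step (the helper name and return shape are assumptions):

import json
from pathlib import Path

def filter_unprocessed(institutes: list[dict], output_file: str) -> tuple[list[dict], list[dict]]:
    """Return (institutes still to process, projects already extracted)."""
    path = Path(output_file)
    if not path.exists():
        return institutes, []
    existing_projects = json.loads(path.read_text(encoding="utf-8"))
    processed_ids = {p["institute_id"] for p in existing_projects}
    remaining = [inst for inst in institutes if inst["id"] not in processed_ids]
    return remaining, existing_projects

# Usage: process `remaining`, then save existing + newly extracted projects together
remaining, existing_projects = filter_unprocessed(
    [{"text": "UNU-CPR", "url": "https://unu.edu/cpr/projects", "id": 1}],
    "output/active_projects.json",
)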

How Reprocess Works

When --reprocess is specified with institute IDs, short names, or full names:

  1. Loads all existing projects from the output file
  2. Matches institutes against search terms (supports: ID, exact short name like "UNU-WIDER", partial full name match)
  3. Removes projects belonging to matched institutes from the existing set
  4. Reprocesses only the specified institutes
  5. Merges the newly extracted data with the remaining existing projects
  6. Overwrites the output file with the combined data

Matching is case-insensitive. Short names (in parentheses) are matched exactly. Full names support partial substring matching.
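
A sketch of the matching rules described above (the helper name and the way the short name is derived from parentheses are assumptions):

def matches_reprocess_term(institute: dict, terms: list[str]) -> bool:
    """Case-insensitive match by ID, exact short name, or partial full-name substring."""
    name = institute["text"]
    # Short name is taken from parentheses if present, e.g. "... (UNU-WIDER)"
    short = name[name.rfind("(") + 1:name.rfind(")")] if "(" in name else name
    for term in terms:
        term = term.strip()
        if term.isdigit() and int(term) == institute["id"]:
            return True                                  # match by numeric ID
        if term.lower() == short.lower():
            return True                                  # exact short-name match
        if term.lower() in name.lower():
            return True                                  # partial full-name match
    return False

institute = {"id": 7, "text": "World Institute for Development Economics Research (UNU-WIDER)"}
print(matches_reprocess_term(institute, ["UNU-WIDER"]))         # True (short name)
print(matches_reprocess_term(institute, ["development econ"]))  # True (partial name)
print(matches_reprocess_term(institute, ["3"]))                 # False (different ID)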

Note: Since extract.py outputs a single combined file (not per-institute files), it tracks processed institutes by checking which institute_id values already exist in the output file.

Usage

# Basic usage (process all institutes)
python extract.py

# Skip already processed institutes
python extract.py --skip-existing

# Force reprocess all
python extract.py --force

# Reprocess specific institutes by ID
python extract.py --reprocess 1,2,3

# Reprocess by short name (recommended)
python extract.py --reprocess UNU-WIDER,MERIT

# Reprocess by partial full name
python extract.py --reprocess "Biotechnology","Comparative Regional"

# Custom files
python extract.py -i custom/institutes.json -o custom/projects.json

# Override max projects
python extract.py --max-projects 30

# Control pagination depth (probing only when max_pages > max_detected_page)
python extract.py --max-pages 1 # Single page only (no pagination)
python extract.py --max-pages 5 # Initial + detected links only (no probing if 5 pages detected)
python extract.py --max-pages 7 # Initial + detected + probing beyond detected (if 4 detected, probes 2 more)
python extract.py --max-pages 15 # Deep crawl with extensive probing beyond detected

# With custom environment file
dotenv run -f .env_extraction -- python extract.py

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_projects | int | from env or 50 | Max projects per institute |
| max_pages | int | 8 | Max pagination pages to follow (1 = initial page only; higher values enable pagination following and intelligent probing) |
| extraction_provider | str | from env | LLM provider for extraction |
| extraction_api_key | str | from env | API key for the LLM |
| debug_file | str | "output/exclusions_debug.jsonl" | Path to exclusion debug log |
| skip_existing | bool | false | Skip institutes with existing projects in the output file |
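
For illustration, constructing the extractor directly might look like the following. The class name ActiveProjectsExtractor and the import are assumptions (only extract.py's CLI is documented here); adjust to the actual class defined in extract.py:

import os
from dotenv import load_dotenv
from extract import ActiveProjectsExtractor  # assumed class name

load_dotenv(".env_extraction")

# Constructor parameters mirror the table above
extractor = ActiveProjectsExtractor(
    max_projects=int(os.getenv("MAX_PROJECTS", "50")),
    max_pages=int(os.getenv("MAX_PAGES", "8")),
    extraction_provider=os.getenv("EXTRACTION_PROVIDER"),
    extraction_api_key=os.getenv("EXTRACTIONLLM_API_KEY"),
    debug_file="output/exclusions_debug.jsonl",
    skip_existing=os.getenv("SKIP_EXISTING", "false").lower() == "true",
)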

Requirements

  • Python 3.10+
  • crawl4ai
  • pydantic
  • python-dotenv
  • Playwright browser (installed via crawl4ai)
pip install crawl4ai pydantic python-dotenv