Multi-Agent System for Research Project Analysis & SDG Alignment with Interactive Dashboard
v3.0 | Two-Phase Architecture: Extract → Analyze → Visualize

The SDG Analysis Pipeline is a comprehensive multi-agent system that extracts active research projects from institute websites, analyzes their alignment with the UN Sustainable Development Goals, assesses technology integration, and generates enhancement recommendations. The system produces an interactive dashboard for visualization and exploration.
- Separates web crawling (extract.py) from analysis (s2p.py) for better error handling, checkpointing, and reusability.
- Driven by input/institutes.json with two-level parallelism: institutes are processed concurrently, and projects within each institute are processed concurrently.
- Agents leverage objectives, funding, period, keywords, themes, team, and outputs from additional_info for deeper analysis.
- Generates both a combined output file and individual files per institute in the output/institutes/ directory.
- Reports real-time progress with elapsed time, ETA calculations, and detailed timing statistics on completion.
- dashboard20.py produces advanced visualizations with sunburst, Sankey, and radar charts, plus detailed project modals.
The pipeline consists of two main phases plus a final visualization step:

1. Extraction (extract.py): Two-pass web scraping with Crawl4AI. The first pass extracts candidates from listing pages; the second pass fetches full project details. Jobs, completed projects, and non-relevant content are filtered out.
2. Analysis (s2p.py): Institute-driven parallel processing. Reads institutes.json, filters projects from active_projects.json by institute_id, and runs SDG classification, technology analysis, and enhancement recommendations.
3. Visualization (dashboard20.py): Generates an interactive HTML dashboard with global and institute-level views, SDG hierarchy visualization, technology assessment profiles, and detailed project cards with modal deep-dives.
- Separation of concerns: Extraction and analysis are separate, so you can re-run analysis without re-crawling.
- Checkpointing: If analysis fails, you still have the extracted data to resume from.
- Scalability: Two-level parallelism maximizes throughput while preventing API overload.
The extractor uses a two-pass approach with Crawl4AI and LLM-based structured extraction (a minimal sketch follows the list):

1. Pass 1, candidate discovery: Fetch the institute listing page and extract candidate projects with titles, URLs, brief descriptions, and status indicators.
2. Filtering: Exclude jobs (vacancy, hiring), page elements (cookies, privacy), and completed projects (finished, closed).
3. Pass 2, detail extraction: For each candidate, fetch the project page and extract description, objectives, duration, funding, partners, contact, and additional_info.
4. Status verification: Confirm the project is active/ongoing; default to "active" if no status is explicitly mentioned.
5. Output: Write to input/active_projects.json with complete metadata and timing statistics.
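A minimal sketch of the two-pass flow, assuming Crawl4AI's AsyncWebCrawler / arun() interface; parse_candidates() and extract_details() are hypothetical stand-ins for the LLM-based structured extraction steps:

```python
from crawl4ai import AsyncWebCrawler

EXCLUDED_TERMS = ("vacancy", "hiring", "cookie", "privacy", "finished", "closed")

def parse_candidates(markdown: str) -> list[dict]:
    return []  # hypothetical pass-1 LLM extraction of {title, url, status}

def extract_details(markdown: str) -> dict:
    return {}  # hypothetical pass-2 LLM extraction of full project metadata

def looks_relevant(title: str) -> bool:
    # Filter out jobs, page chrome, and completed projects.
    lowered = title.lower()
    return not any(term in lowered for term in EXCLUDED_TERMS)

async def extract_institute(listing_url: str) -> list[dict]:
    async with AsyncWebCrawler() as crawler:
        # Pass 1: candidate discovery on the listing page.
        listing = await crawler.arun(url=listing_url)
        candidates = [c for c in parse_candidates(listing.markdown)
                      if looks_relevant(c.get("title", ""))]
        projects = []
        for cand in candidates:
            # Pass 2: fetch each candidate's page for full details.
            page = await crawler.arun(url=cand["url"])
            details = extract_details(page.markdown)
            details.setdefault("status", "active")  # default when unstated
            projects.append(details)
        return projects
```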
| Argument | Short | Description |
|---|---|---|
| --skip-existing | | Skip institutes with existing projects in the output file |
| --force | | Force reprocess all institutes (overrides --skip-existing and --reprocess) |
| --reprocess | | Comma-separated institute IDs, short names, or full names to reprocess (e.g., '1,2,3' or 'UNU-WIDER,MERIT') |
| --output | -o | Custom output file path |
| --input | -i | Custom institutes.json path |
| --max-projects | | Max projects per institute (overrides env) |
```bash
# Run with defaults
python extract.py

# Skip already processed institutes
python extract.py --skip-existing

# Force reprocess all
python extract.py --force

# Reprocess specific institutes by ID
python extract.py --reprocess 1,2,3

# Reprocess by short name (recommended)
python extract.py --reprocess UNU-WIDER,MERIT

# Reprocess by partial full name
python extract.py --reprocess "Biotechnology","Comparative Regional"
```
See EXTRACT_PROJECTS.html for complete extraction documentation.
The analysis pipeline is institute-driven for maximum parallelism, using two levels of concurrency for optimal throughput:
| Level | Concurrency Setting | Description |
|---|---|---|
| Institute-level | MAX_INSTITUTE_CONCURRENCY (default: 2) | Number of institutes processed simultaneously |
| Project-level | MAX_LLM_CONCURRENCY (default: 3) | Max concurrent LLM calls per institute (SDG + Tech + Recommendation agents) |
Example: with an institute concurrency of 2 and an LLM concurrency of 3, up to 6 LLM calls run in parallel (2 institutes × 3 calls each), as the sketch below illustrates.
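A minimal sketch of the two-level parallelism using asyncio semaphores; the setting names mirror the documentation, while run_agents() is an illustrative stub:

```python
import asyncio

MAX_INSTITUTE_CONCURRENCY = 2
MAX_LLM_CONCURRENCY = 3

institute_sem = asyncio.Semaphore(MAX_INSTITUTE_CONCURRENCY)

async def run_agents(project: dict) -> dict:
    return {}  # stand-in for the per-project SDG/tech/recommendation agents

async def analyze_institute(projects: list[dict]) -> list[dict]:
    # A per-institute semaphore caps concurrent LLM calls, so at most
    # MAX_INSTITUTE_CONCURRENCY * MAX_LLM_CONCURRENCY calls run globally.
    llm_sem = asyncio.Semaphore(MAX_LLM_CONCURRENCY)

    async def analyze_project(project: dict) -> dict:
        async with llm_sem:
            return await run_agents(project)

    async with institute_sem:
        return list(await asyncio.gather(*(analyze_project(p) for p in projects)))

async def main(projects_by_institute: list[list[dict]]) -> None:
    await asyncio.gather(*(analyze_institute(p) for p in projects_by_institute))
```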
The pipeline supports incremental processing to resume interrupted runs or add new institutes without reprocessing existing ones:
| Feature | Description |
|---|---|
| SKIP_EXISTING | When enabled, skips institutes with existing output files in output/institutes/ |
| Automatic detection | Checks whether output/institutes/{institute_name}.json exists before processing |
| Smart logging | Shows "Loading existing result" or "Skipped (loaded N projects)" for cached institutes |
- Resuming interrupted runs: If the pipeline crashes mid-run, set SKIP_EXISTING=true to continue from where it left off.
- Adding new institutes: When new institutes are added to input/institutes.json, only those institutes are processed.
- Cost savings: Avoids re-processing completed institutes when re-running the analysis (the detection logic is sketched below).
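A minimal sketch of the skip-existing check, assuming per-institute JSON files under output/institutes/ with the filename scheme documented above:

```python
import json
from pathlib import Path

OUTPUT_DIR = Path("output/institutes")

def load_or_process(institute_name: str, skip_existing: bool):
    out_file = OUTPUT_DIR / f"{institute_name}.json"
    if skip_existing and out_file.exists():
        # Cached result: reuse it instead of re-running the agents.
        projects = json.loads(out_file.read_text())
        print(f"Skipped (loaded {len(projects)} projects): {institute_name}")
        return projects
    return None  # caller proceeds with full analysis
```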
```bash
# Skip existing output files (CLI flag)
python s2p.py --skip-existing

# Or using an environment variable
SKIP_EXISTING=true python s2p.py

# Reprocess specific institutes by ID
python s2p.py --reprocess 1,2,3

# Reprocess by short name (recommended)
python s2p.py --reprocess UNU-WIDER,MERIT
```
| Argument | Short | Description |
|---|---|---|
| --skip-existing | | Skip institutes with existing output files |
| --force | | Force reprocess all institutes (overrides --skip-existing, --reprocess, and env) |
| --reprocess | | Comma-separated institute IDs, short names, or full names to reprocess |
| --output | -o | Custom output file path |
| --institutes | -i | Custom institutes.json path |
| --projects | -p | Custom active_projects.json path |
For each project, three agents run in parallel using asyncio.to_thread() (see the sketch after this list):

- SDG Classification Agent: multi-stage analysis: keywords → semantic LLM → calibration with priority regions
- Technology Analysis Agent: assesses technology integration and maturity, and identifies gaps
- Enhancement Recommendation Agent: suggests emerging technologies with feasibility scoring
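A minimal sketch of running the three agents in parallel with asyncio.to_thread(); the agent functions are illustrative stubs for the synchronous LLM calls:

```python
import asyncio

def classify_sdgs(project: dict) -> dict:
    return {}  # blocking SDG classification call (stub)

def analyze_technology(project: dict) -> dict:
    return {}  # blocking technology analysis call (stub)

def recommend_enhancements(project: dict) -> dict:
    return {}  # blocking enhancement recommendation call (stub)

async def analyze_project(project: dict) -> dict:
    # to_thread() offloads each synchronous call to a worker thread,
    # so the three agents for one project run concurrently.
    sdg, tech, recs = await asyncio.gather(
        asyncio.to_thread(classify_sdgs, project),
        asyncio.to_thread(analyze_technology, project),
        asyncio.to_thread(recommend_enhancements, project),
    )
    return {**project, "sdg": sdg, "technology": tech, "recommendations": recs}
```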
Agents use project metadata to support more detailed analysis. When fields are missing, agents use graceful fallbacks (e.g., "Not specified") to ensure analysis continues:
| Context Source | Fields Used | Benefits | Fallback |
|---|---|---|---|
| Core Project Data | title, description, location, partners | Basic project understanding | Empty string / empty list |
| Objectives & Period | objectives, period, funding | Timeline-aware recommendations | "Not specified" |
| Additional Info | keywords, themes, team, outputs | Domain-specific analysis | "Not specified" / "None" |
Note: Not all projects have complete metadata. Agents are designed to work with partial information, using whatever context is available and inserting "Not specified" placeholders for missing fields in their analysis prompts. A sketch of this fallback pattern follows.
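A minimal sketch of building an agent prompt context with graceful fallbacks, mirroring the fallbacks in the table above; the field names follow the documented project schema, but the helper itself is illustrative:

```python
def build_context(project: dict) -> dict:
    info = project.get("additional_info", {}) or {}
    return {
        # Core project data: empty-string / empty-list fallbacks.
        "title": project.get("title", ""),
        "description": project.get("description", ""),
        "partners": project.get("partners", []),
        # Objectives & period: "Not specified" placeholders.
        "objectives": project.get("objectives", "Not specified"),
        "period": project.get("period", "Not specified"),
        "funding": project.get("funding", "Not specified"),
        # Additional info: "Not specified" / "None" placeholders.
        "keywords": info.get("keywords", "Not specified"),
        "themes": info.get("themes", "Not specified"),
        "outputs": info.get("outputs", "None"),
    }
```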
```bash
python s2p.py
```
During execution, the pipeline provides real-time feedback:
```
Progress: 5/20 institutes completed | Elapsed: 125.3s (2.1m) | ETA: 376.0s (6.3m)
```
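A minimal sketch of the progress/ETA arithmetic behind this output (ETA extrapolates the average time per completed institute to the remaining ones); the helper name and formatting are illustrative:

```python
import time

def progress_line(done: int, total: int, start: float) -> str:
    elapsed = time.monotonic() - start
    # Extrapolate remaining time from the average pace so far.
    eta = (elapsed / done) * (total - done) if done else float("inf")
    return (f"Progress: {done}/{total} institutes completed | "
            f"Elapsed: {elapsed:.1f}s ({elapsed/60:.1f}m) | "
            f"ETA: {eta:.1f}s ({eta/60:.1f}m)")
```

With 5 of 20 institutes done after 125.3s, this yields an ETA of 125.3/5 × 15 = 376.0s, matching the example above.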
On completion, detailed timing statistics are displayed:
```
========================================
TIMING STATISTICS
========================================
Total time: 1500.00s (25.00m)
Average per institute: 75.00s
Average per project: 3.50s
Projects per minute: 17.14
```
The Maturity Score is a comprehensive multi-dimensional metric that evaluates technology readiness across four key dimensions. Each dimension is scored independently and then combined using a weighted formula.
| Dimension | Weight | Description | Key Indicators |
|---|---|---|---|
| Technical Readiness | 35% | How developed and proven the technology is | Production keywords, deployment status, tech count, temporal analysis |
| Operational Status | 25% | Current deployment and operational state | Deployed/piloting/development keywords, scale indicators |
| Adoption Level | 20% | User adoption and scale of implementation | User/beneficiary counts, partner count, geographic reach |
| Evidence Base | 20% | Validation and research backing | Peer-reviewed publications, evaluations, measured results, outputs |
| Score Range | Level | Characteristics | Example Indicators |
|---|---|---|---|
| 0.90-1.00 | Production | Fully deployed, proven at scale, operational in multiple sites | "deployed", "operational", "live", "scaling", "proven", "fully operational" |
| 0.70-0.89 | Advanced | Field-tested, validated pilots, ready for scale-up | "field-tested", "validated", "beta", "commercialized", "implemented" |
| 0.50-0.69 | Intermediate | Working prototypes, active pilots, showing promise | "pilot", "prototype", "testing", "demonstration", "user testing" |
| 0.25-0.49 | Early | Early prototypes, proof of concept, experimental | "experimental", "proof of concept", "exploratory", "initial development" |
| 0.00-0.24 | Planning | Concept phase, planning, research only | "proposed", "planned", "roadmap", "design phase", "conceptual" |
1. Technical Readiness (35% weight)
2. Operational Status (25% weight)
3. Adoption Level (20% weight)
4. Evidence Base (20% weight)
5. Quantitative Boosts (added to overall score)
```
overall = (technical × 0.35) + (operational × 0.25) +
          (adoption × 0.20) + (evidence × 0.20) +
          quantitative_boost
```
High Maturity Project (0.89): AI diagnostic platform deployed across 50 clinics, 100K+ patients processed, peer-reviewed study published.
→ Technical: 0.90 (AI proven), Operational: 0.95 (deployed at scale), Adoption: 0.80 (50 clinics), Evidence: 0.90 (published)
→ Overall: (0.90×0.35) + (0.95×0.25) + (0.80×0.20) + (0.90×0.20) = 0.89
Medium Maturity Project (0.55): Blockchain land registry pilot in 3 communities, prototype working, conference proceedings published.
→ Technical: 0.55 (prototype), Operational: 0.60 (pilot), Adoption: 0.40 (3 communities), Evidence: 0.65 (conference)
→ Overall: (0.55×0.35) + (0.60×0.25) + (0.40×0.20) + (0.65×0.20) = 0.55
Low Maturity Project (0.22): Exploring IoT sensors for water monitoring, researching options, seeking funding.
→ Technical: 0.25 (research), Operational: 0.20 (planning), Adoption: 0.10 (no users), Evidence: 0.30 (exploratory)
→ Overall: (0.25×0.35) + (0.20×0.25) + (0.10×0.20) + (0.30×0.20) = 0.22
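A minimal sketch of the weighted maturity formula above; the quantitative boost is passed in rather than derived, since its triggers are project-specific, and the cap at 1.0 is an assumption:

```python
def maturity_score(technical: float, operational: float, adoption: float,
                   evidence: float, quantitative_boost: float = 0.0) -> float:
    # Weighted sum of the four dimensions plus any quantitative boost.
    overall = (technical * 0.35) + (operational * 0.25) \
            + (adoption * 0.20) + (evidence * 0.20) + quantitative_boost
    return min(overall, 1.0)  # assumption: scores are capped at 1.0

# Reproduces the high-maturity example above: prints 0.89
print(round(maturity_score(0.90, 0.95, 0.80, 0.90), 2))
```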
The Innovation Score is a multi-dimensional metric that evaluates how innovative and forward-looking a project's technology approach is. It considers not just what technologies are used, but how they're combined, the novelty of the approach, and the visionary nature of the project.
| Dimension | Weight | Description | Key Indicators |
|---|---|---|---|
| Emerging Technology Usage | 40% | Presence of cutting-edge technologies | AI/ML, blockchain, IoT, spatial tech, data analytics, cloud, mobile |
| Novelty Indicators | 30% | Language indicating innovative approaches | Breakthrough, novel, pioneering, cutting-edge, proprietary, patented |
| Technology Combination | 20% | Combinatorial innovation from tech synergy | Premium tech pairs (AI+IoT, AI+Blockchain, etc.), diversity bonus |
| Forward-Looking Language | 10% | Future-oriented, visionary statements | Will transform, next-generation, paradigm shift, revolutionize |
| Score Range | Level | Characteristics | Example Indicators |
|---|---|---|---|
| 0.80-1.00 | Transformative | Breakthrough innovation combining multiple cutting-edge technologies | Multiple premium tech combos, novel language, strong vision |
| 0.60-0.79 | High Innovation | Advanced use of emerging technologies with novel approaches | AI/ML + spatial, cutting-edge language, some combinations |
| 0.40-0.59 | Moderate Innovation | Good technology mix with some innovative elements | Single emerging tech, moderate novelty, basic combinations |
| 0.20-0.39 | Low Innovation | Conventional technology approach with limited novelty | Commoditized tech only, standard approaches, minimal novelty |
| 0.00-0.19 | Minimal Innovation | Traditional or no significant technology component | No emerging tech, established/conventional language only |
1. Emerging Technology Usage (40% weight)
2. Novelty Indicators (30% weight)
3. Technology Combination (20% weight)
4. Forward-Looking Language (10% weight)
```
innovation = (tech_innovation × 0.40) + (novelty_score × 0.30) +
             (combinatorial_score × 0.20) + (forward_looking_score × 0.10)
```
High Innovation Project (0.71): AI-powered satellite imagery analysis for disaster response. Uses cutting-edge deep learning with geospatial data. Breakthrough approach with proprietary algorithms.
→ Tech: 0.60 (AI 0.35 + Spatial 0.25), Novelty: 1.00 (breakthrough + cutting-edge + proprietary), Combination: 0.62 (AI+Spatial premium), Forward: 0.50 (will transform + next-generation)
→ Overall: (0.60×0.40) + (1.00×0.30) + (0.62×0.20) + (0.50×0.10) = 0.71
Medium Innovation Project (0.41): Mobile app for data collection in rural health clinics. Uses cloud storage and basic analytics. Emerging technology approach with a modern UI.
→ Tech: 0.30 (Mobile 0.15 + Cloud 0.15), Novelty: 0.65 (emerging + modern + advanced), Combination: 0.30 (2 categories), Forward: 0.38 (scalable to + next phase)
→ Overall: (0.30×0.40) + (0.65×0.30) + (0.30×0.20) + (0.38×0.10) = 0.41
Low Innovation Project (0.09): Traditional database system for record management. Established technology with conventional methods. Standard approach.
→ Tech: 0.00 (no emerging tech), Novelty: 0.20 (established + conventional, -0.20), Combination: 0.00 (1 category), Forward: 0.30 (minimal forward-looking)
→ Overall: (0.00×0.40) + (0.20×0.30) + (0.00×0.20) + (0.30×0.10) = 0.09
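A minimal sketch of the weighted innovation formula, paired with an illustrative novelty detector built from the indicator terms listed above; the term list is from the dimension table, but the scoring increments are assumptions:

```python
NOVELTY_TERMS = ("breakthrough", "novel", "pioneering", "cutting-edge",
                 "proprietary", "patented")

def novelty_score(text: str) -> float:
    # Illustrative: count distinct novelty terms, saturating at three hits.
    lowered = text.lower()
    hits = sum(term in lowered for term in NOVELTY_TERMS)
    return min(hits / 3, 1.0)

def innovation_score(tech: float, novelty: float,
                     combination: float, forward: float) -> float:
    # Weighted sum matching the documented formula.
    return (tech * 0.40) + (novelty * 0.30) + (combination * 0.20) + (forward * 0.10)
```

On the high-innovation example above, "breakthrough", "cutting-edge", and "proprietary" give three hits, so novelty_score() saturates at 1.00, consistent with the worked numbers.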
The Integration Level is a sophisticated multi-dimensional metric that evaluates how deeply and effectively technologies are embedded within a project. Unlike simple technology counts, this assessment considers breadth, depth, interconnectedness, and architectural sophistication.
| Dimension | Weight | Description | Key Indicators |
|---|---|---|---|
| Breadth | 25% | Technology count and category diversity | Number of technologies, diversity across categories (AI, IoT, cloud, etc.) |
| Depth | 30% | How deeply technologies are embedded in project implementation | Tech in objectives, implementation language, multiple mentions across sections |
| Interconnectedness | 25% | Whether technologies form an integrated ecosystem | Premium tech combinations, integration keywords, data flow indicators |
| Sophistication | 20% | Architectural complexity and modern development practices | Advanced patterns (microservices, serverless), DevOps, scalability considerations |
| Score Range | Level | Characteristics | Example Profile |
|---|---|---|---|
| 0.75-1.00 | High | Multiple diverse technologies deeply integrated with sophisticated architecture | 5+ technologies across 3+ categories, integrated ecosystem, advanced patterns |
| 0.45-0.74 | Medium | Good technology mix with some integration and implementation depth | 3+ technologies or 2+ categories, moderate interconnectedness |
| 0.00-0.44 | Low | Limited technology use with minimal integration | Few technologies, single category, shallow implementation |
1. Breadth Score (25% weight)
2. Depth Score (30% weight)
3. Interconnectedness Score (25% weight)
4. Sophistication Score (20% weight)
```
integration = (breadth × 0.25) + (depth × 0.30) +
              (interconnectedness × 0.25) + (sophistication × 0.20)
```
High Integration Project (0.83): Distributed AI platform with microservices architecture integrating IoT sensors, blockchain for data integrity, and cloud infrastructure.
→ Breadth: 0.80 (6 techs, 4 categories), Depth: 0.90 (tech in objectives + implementation language), Interconnectedness: 0.85 (AI+IoT+Blockchain combos + ecosystem), Sophistication: 0.75 (microservices + DevOps)
→ Overall: (0.80×0.25) + (0.90×0.30) + (0.85×0.25) + (0.75×0.20) = 0.83
Medium Integration Project (0.51): Mobile application using cloud storage and basic data analytics for health data collection.
→ Breadth: 0.55 (3 techs, 2 categories), Depth: 0.60 (use/utilize language), Interconnectedness: 0.40 (mobile+data combo), Sophistication: 0.45 (API-based + scalable)
→ Overall: (0.55×0.25) + (0.60×0.30) + (0.40×0.25) + (0.45×0.20) = 0.51
Low Integration Project (0.26): Basic website using standard HTML/CSS with minimal technology stack.
→ Breadth: 0.15 (1 tech, 1 category), Depth: 0.35 (superficial mention), Interconnectedness: 0.20 (no combinations), Sophistication: 0.35 (basic patterns)
→ Overall: (0.15×0.25) + (0.35×0.30) + (0.20×0.25) + (0.35×0.20) = 0.26
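A minimal sketch of the weighted integration formula, with an illustrative premium-pair detector for the interconnectedness dimension; the pair list follows the examples in the text (AI+IoT, AI+Blockchain, etc.), but the base value and per-pair increment are assumptions:

```python
PREMIUM_PAIRS = {frozenset(p) for p in [("ai", "iot"), ("ai", "blockchain"),
                                        ("ai", "spatial"), ("iot", "blockchain")]}

def interconnectedness(categories: set[str]) -> float:
    # Illustrative: reward premium combinations among detected tech categories.
    present = {c.lower() for c in categories}
    pairs = sum(1 for pair in PREMIUM_PAIRS if pair <= present)
    return min(0.3 + 0.25 * pairs, 1.0)  # assumed base and per-pair increment

def integration_level(breadth: float, depth: float,
                      interconnect: float, sophistication: float) -> float:
    # Weighted sum matching the documented formula.
    return (breadth * 0.25) + (depth * 0.30) \
         + (interconnect * 0.25) + (sophistication * 0.20)
```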
The pipeline uses a hybrid approach that combines deterministic rule-based algorithms with LLM semantic understanding. This provides the best of both worlds: consistent, reproducible scoring with deep semantic nuance.
- Rule-based layer: deterministic algorithms compute baseline scores for Integration Level, Maturity, and Innovation using keyword matching, pattern detection, and weighted formulas.
- LLM layer: the LLM receives the rule-based scores as context, performs deep semantic analysis, and can refine the scores based on nuanced understanding that rule-based methods might miss.
| Dimension | Rule-Based Contribution | LLM Enhancement Role | Override Behavior |
|---|---|---|---|
| Integration Level | Multi-dimensional analysis across breadth, depth, interconnectedness, and sophistication with 4 sub-methods | Semantic understanding of how technologies meaningfully contribute to project goals and work together | LLM can override - When deep semantic analysis suggests different integration than rule-based patterns indicate |
| Maturity Score | Four-dimensional model (Technical 35%, Operational 25%, Adoption 20%, Evidence 20%) with context-weighted scoring | Interprets nuanced deployment status, validates evidence quality, assesses real-world operational readiness | LLM can override - When semantic context indicates different maturity than keyword patterns suggest |
| Innovation Score | Four-dimensional model (Tech 40%, Novelty 30%, Combination 20%, Forward-Looking 10%) with weighted tech categories | Detects genuine breakthrough approaches vs marketing buzzwords, assesses true novelty in domain context | LLM can override - When semantic analysis reveals genuine innovation or identifies inflated claims |
- Reproducibility: rule-based scoring ensures the same input always produces the same baseline score, enabling comparability across projects.
- Semantic depth: the LLM adds human-like understanding of context, intent, and genuine innovation that keyword matching cannot capture.
- Grounding: the rule-based baseline anchors the LLM analysis, reducing the risk of hallucination by providing a grounded starting point.
- Reliability: if the LLM fails or returns incomplete data, the system falls back to the rule-based scores.
Step 1: Rule-based algorithms compute initial scores with full debug logging
Step 2: Initial scores are provided to LLM as context/suggested values in the prompt
Step 3: LLM can choose to:
• Keep the score if it seems accurate based on semantic analysis
• Adjust the score if deep understanding suggests different assessment
• Return nothing (falls back to rule-based via data.get("key", rule_based_value))
Step 4: The final score uses the LLM value if provided, otherwise the rule-based value (sketched below)
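A minimal sketch of this override-with-fallback merge; the response-dict keys are illustrative, but the data.get(...) pattern is the one named in Step 3:

```python
def merge_scores(rule_based: dict, llm_response: dict | None) -> dict:
    if not llm_response:
        # LLM failed entirely: keep the deterministic baseline.
        return dict(rule_based)
    return {
        # Prefer the LLM's refinement; fall back to the baseline per key.
        key: llm_response.get(key, baseline)
        for key, baseline in rule_based.items()
    }

# Example: the LLM refines maturity but stays silent on the other scores.
baseline = {"maturity": 0.55, "innovation": 0.41, "integration": 0.51}
print(merge_scores(baseline, {"maturity": 0.62}))
# {'maturity': 0.62, 'innovation': 0.41, 'integration': 0.51}
```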
Beyond quantitative scores, the LLM provides essential qualitative analysis that rule-based methods cannot generate:
| Output Field | Purpose | Why LLM is Essential |
|---|---|---|
| analysis | Narrative explanation of technology approach | Synthesizes complex information into coherent human-readable summary |
| strengths | What's done well technically | Identifies specific technical merits and implementation quality |
| gaps | Missing technologies or capabilities | Recognizes what's missing based on domain knowledge and best practices |
| scalability_assessment | Scaling potential and limitations | Evaluates architectural decisions for scalability implications |
| interoperability_notes | Integration with existing systems | Assesses compatibility with standards and existing infrastructure |
| tech_recommendations | Specific technology improvements | Suggests relevant, actionable enhancements based on project context |
The dashboard generator creates an interactive HTML visualization with both Global Analysis and Institute View tabs:
```bash
python dashboard20.py
```
Output: dashboard_advanced.html. Open it in any modern web browser.