Project documentation: https://c3.unu.edu/projects/ai/hr/cv-screener.html
This tool is also documented in AI for Good, a United Nations–led initiative that explores how AI can address global challenges and advance the Sustainable Development Goals.
Overview
As AI language models become increasingly capable, many organizations are exploring how these tools might assist with recruitment and hiring. At UNU Campus Computing Centre (C3), we wanted to understand this space through hands-on research rather than speculation. Our central question: How can AI assist with CV screening while keeping human judgment firmly in the decision-making seat?
This document describes the design principles, architecture, and key findings from our research prototype, a web-based application that demonstrates AI-assisted candidate screening at multiple levels of depth and rigor.
What the Prototype Does
The prototype provides three core capabilities:
Job Description Parsing extracts structured requirements from job postings, accepting input via URL, PDF, or HTML upload.
CV Analysis evaluates candidate CVs and optional written screening questionnaire responses against those structured requirements, producing detailed, explainable assessments.
Multi-Model Comparison enables re-running candidate evaluations across different AI providers (OpenAI, Anthropic, Google Gemini, DeepSeek, Azure OpenAI, and local Ollama models), making it possible to compare results and understand model-specific variance.
Design Principles
1. Human-in-the-Loop by Design
The system is explicitly designed not to make autonomous hiring decisions. Every evaluation is structured to inform and support human judgment, not replace it. Each output includes a “Qualified” flag based on explicit, rules-driven criteria (education, experience, required skills), a “Recommendation” field capturing more nuanced assessment, and a detailed gap analysis with suggested interview focus areas for human follow-up.
2. Transparency and Explainability
Unlike black-box AI systems, the prototype makes its reasoning explicit at every step. Each evaluation surfaces the specific skills matched and missing, how a candidate’s prior responsibilities align with the role, and how identified gaps might be compensated by other documented strengths. Hiring teams can see not just what the system concluded, but why.
3. Multi-Model Evaluation
Different AI models can produce meaningfully different assessments of the same candidate. Supporting multiple providers allows researchers to compare evaluations side by side, identify model-specific biases or tendencies, and make informed decisions about which model best suits a given role’s requirements — balancing cost, speed, and quality.
Key Features
Structured Evaluation Levels
The prototype offers five prompt levels, ranging from Basic (a 30-second pass/fail screen) to Advanced (a 15–20 minute forensic analysis incorporating skill recency assessment). This spectrum lets researchers explore the trade-offs between speed and depth, between simple qualification flags and nuanced scoring, and between treating experience as current or dated.
Screening Questionnaire Integration
The prototype supports optional written screening questionnaires sent to candidates before their CV is reviewed. Skills can be evidenced in either source — or ideally both. When a CV and questionnaire response consistently support the same skill claim, confidence is highest. The prototype explicitly tracks four patterns:
- Convergence: Strong technical explanations in questionnaire responses confirm CV claims. This is the most reliable form of evidence.
- Divergence: Vague or contradictory responses relative to CV claims may indicate overstatement.
- Hidden Strength: Some candidates demonstrate genuine expertise in questionnaire responses not mentioned on their CV at all.
- CV-Only Evidence: Some skills are well-documented on the CV but simply not probed in the questionnaire.
At higher evaluation levels, screening questionnaire responses can carry up to 40–45% of the total score, with top marks reserved for cases where both sources converge.
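The four evidence patterns above can be sketched as a small classification function. This is an illustrative reconstruction, not the prototype's actual API; the function name, labels, and the `response_quality` parameter are assumptions.

```python
def classify_evidence(on_cv, in_questionnaire, response_quality=None):
    """Classify where (and how well) a skill is evidenced.

    response_quality: "strong", "weak", or None when the questionnaire
    did not probe the skill at all.
    """
    if on_cv and in_questionnaire:
        # Both sources speak to the skill: strong answers confirm the CV,
        # vague or contradictory ones suggest overstatement.
        return "convergence" if response_quality == "strong" else "divergence"
    if in_questionnaire:
        # Genuine expertise shown in responses but absent from the CV.
        return "hidden_strength"
    if on_cv:
        # Documented on the CV but simply not probed in the questionnaire.
        return "cv_only"
    return "no_evidence"
```

In this sketch, only the convergence case would earn top marks, consistent with the weighting described above.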
AND/OR Logic in Skill Requirements
Job descriptions encode important logical nuance. “X, Y, and Z” means all are required; “X or Y” means either suffices; “e.g., X, Y” signals that any one example works. The system parses these cues and sets appropriate required_count values to ensure candidates are assessed accurately — and fairly.
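A rough sketch of the connector heuristic described above follows. In the prototype the parsing is LLM-driven, so this regex version is only illustrative of how the cues map to a `required_count` value.

```python
import re

def required_count(requirement_text):
    """Return how many of the listed skills a candidate must hold.

    Heuristic: "e.g." or an "or" connector means any one suffices;
    otherwise ("X, Y, and Z") every listed skill is required.
    """
    text = requirement_text.lower()
    if "e.g." in text or re.search(r"\bor\b", text):
        return 1  # any one example or alternative suffices
    # Split the list on commas and "and" to count the required skills.
    skills = [s.strip() for s in re.split(r",|\band\b", requirement_text) if s.strip()]
    return len(skills)
```

For example, "Python, SQL, and Docker" yields a required count of 3, while "Python or Java" yields 1.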
Alternative Qualification Pathways
Job descriptions commonly offer parallel qualification routes, such as “6 years’ experience with a High School diploma, OR 4 years’ experience with a Bachelor’s degree.” The prototype treats these as genuine OR conditions: candidates need only satisfy one pathway, preventing unfair penalization of those who exceed requirements in one dimension but not another.
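The OR-pathway logic can be sketched as follows. The field names and education ranking are assumptions made for illustration; the key point is that satisfying any single pathway qualifies the candidate.

```python
# Rank education levels so "at least a Bachelor's" comparisons work.
EDUCATION_RANK = {"high_school": 0, "bachelor": 1, "master": 2}

def meets_any_pathway(candidate, pathways):
    """True if the candidate satisfies at least one qualification pathway."""
    return any(
        EDUCATION_RANK[candidate["education"]] >= EDUCATION_RANK[p["education"]]
        and candidate["years_experience"] >= p["min_years"]
        for p in pathways
    )

# The example from the text: 6 years with a High School diploma,
# OR 4 years with a Bachelor's degree.
EXAMPLE_PATHWAYS = [
    {"education": "high_school", "min_years": 6},
    {"education": "bachelor", "min_years": 4},
]
```

A candidate with a Bachelor's degree and 5 years' experience fails the first pathway but passes the second, and therefore qualifies.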
Skill Recency Assessment
In technology roles particularly, “when” matters as much as “what”. A candidate claiming ten years of Hadoop experience that ended in 2015 is fundamentally different from one with current PyTorch experience. The Advanced evaluation mode tracks recency explicitly, applying scoring multipliers that down-weight outdated skills and flag expired certifications.
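A minimal sketch of such a recency multiplier is shown below. The thresholds and weights here are assumptions for illustration, not the prototype's tuned values.

```python
from datetime import date

def recency_multiplier(last_used_year, today=None):
    """Down-weight skills the candidate has not used recently."""
    today = today or date.today()
    years_stale = max(0, today.year - last_used_year)
    if years_stale <= 2:
        return 1.0  # current skill: full credit
    if years_stale <= 5:
        return 0.7  # aging skill: partial credit
    return 0.4      # dated skill: heavily down-weighted
```

Under these assumed weights, Hadoop last used in 2015 would count at less than half the value of a currently practiced skill.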
Technical Architecture
Python-Based Scoring to Prevent Hallucination
One of the most important architectural decisions is offloading all quantitative score calculations from the LLM to Python code. Large language models are prone to “hallucinating” numerical results — producing confident but incorrect calculations. The prototype addresses this by having the LLM extract structured qualitative data (which skills are present, how responsibilities align, what gaps exist), then computing final scores using deterministic Python formulas that the LLM cannot override.
For example, the Required Skills score is computed as:
Final Score = Base Ratio × Recency Multiplier × Q&A Multiplier
The system also tracks discrepancies between the LLM’s own stated score and the Python-computed score — a useful diagnostic for understanding how often models get their own math wrong.
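The deterministic scoring step and the discrepancy diagnostic might look like the following sketch. Field names and the output shape are assumptions; the structure mirrors the formula above.

```python
def required_skills_score(llm_extraction, recency_mult, qa_mult):
    """Compute the Required Skills score in Python, not in the LLM.

    The LLM supplies only qualitative structure (which skills matched);
    the arithmetic is deterministic and cannot be overridden by the model.
    Returns (score, discrepancy vs. the LLM's own stated score, if any).
    """
    matched = llm_extraction["matched_skills"]
    required = llm_extraction["required_skills"]
    base_ratio = len(matched) / len(required) if required else 0.0
    score = base_ratio * recency_mult * qa_mult
    # Diagnostic: how far off was the model's own arithmetic?
    llm_claimed = llm_extraction.get("llm_stated_score")
    discrepancy = None if llm_claimed is None else abs(llm_claimed - score)
    return score, discrepancy
```

If a model matches 3 of 4 required skills but confidently reports a score of 0.8, the Python-computed 0.75 wins, and the 0.05 discrepancy is logged.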
Customizable Evidence Types by Role
Different roles demand different forms of evidence. The prototype’s evaluation prompts can be configured for role-specific indicators:
- Research positions: publications, citation counts, conference presentations, grant funding
- Technical roles: GitHub repositories, open source contributions, technical writing
- Teaching roles: curriculum development, pedagogical publications, student evaluations
- Management roles: team size, budget responsibility, measurable performance outcomes
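A configuration mirroring the role-specific evidence types listed above could be as simple as the following mapping. The structure and names are hypothetical; the prototype's actual prompt configuration may differ.

```python
# Hypothetical role-family -> evidence-type configuration.
EVIDENCE_TYPES = {
    "research": ["publications", "citation counts", "conference presentations", "grant funding"],
    "technical": ["GitHub repositories", "open source contributions", "technical writing"],
    "teaching": ["curriculum development", "pedagogical publications", "student evaluations"],
    "management": ["team size", "budget responsibility", "measurable performance outcomes"],
}

def evidence_prompt_fragment(role_family):
    """Render the evidence list for a role family into a prompt snippet."""
    types = EVIDENCE_TYPES.get(role_family, [])
    return "Look for evidence of: " + ", ".join(types)
```

Keeping this mapping in configuration rather than hard-coded prompts lets hiring teams adjust what counts as evidence without rewriting the evaluation logic.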
Research Findings
Model Quality and Structured Output
AI providers vary substantially in their ability to produce structured outputs that comply with complex schemas. Major providers (OpenAI, Anthropic, Google Gemini, DeepSeek) consistently produce well-formed JSON with appropriate categorization. Smaller models — particularly those available via local deployments like Ollama — frequently struggle with nested structures. A recurring example: when extracting responsibility areas from job descriptions, smaller models tend to collapse all categories into a single generic bucket rather than preserving the named areas from the source document. Model scale appears to matter meaningfully; models with 70B+ parameters perform significantly better on this task.
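The collapsed-categories failure mode can be caught with a lightweight structural check before an evaluation is accepted. This stdlib-only sketch assumes a `responsibility_areas` key in the output schema, which is an illustrative name, not necessarily the prototype's.

```python
import json

def check_responsibility_areas(raw_json, expected_min_areas=2):
    """Detect malformed or collapsed structured output from a model."""
    try:
        parsed = json.loads(raw_json)
    except json.JSONDecodeError:
        return "malformed_json"
    areas = parsed.get("responsibility_areas") if isinstance(parsed, dict) else None
    if not isinstance(areas, list):
        return "missing_or_wrong_type"
    if len(areas) < expected_min_areas:
        # The typical small-model failure: every named area from the job
        # description collapsed into a single generic bucket.
        return "collapsed_categories"
    return "ok"
```

Running such a check per response makes it possible to quantify, per model, how often structured output degrades rather than discovering it downstream in candidate scores.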
Job Parsing Depth Affects Candidate Outcomes
The choice of parsing level has real downstream consequences. Level 1 (Strict) extracts skills only from the “Qualifications” section, producing a shorter and more permissive requirement list. Level 2 (Deep) extracts skills from both “Qualifications” and “Responsibilities” sections, capturing implied requirements that appear in job duties but not in explicit qualification criteria. Using Level 2 can turn a “nice-to-have” skill from the responsibilities section into what the system treats as a mandatory requirement, significantly impacting candidate scores.
Screening Questionnaire Design Matters
A well-designed questionnaire does something a CV cannot: it puts every candidate in the same situation and asks them to respond from their own experience. That only works if the questions are built around the role, not around assumptions about who will apply.
Effective questionnaires:
- Align questions to key skills and responsibilities in the job description
- Favor “how” and “why” questions over “what” questions (depth reveals genuine expertise)
- Include scenario-based prompts that require candidates to draw on multiple skills
- Cross-reference related technologies to probe actual understanding
- Leave room for honest self-assessment, so that admitting limited experience is a valid and valued response
Question design is only half the picture — instructions matter just as much. Where instructions are unclear, candidates may over- or under-state their experience in ways their CVs don’t support, introducing avoidable noise into the data used for AI-assisted evaluation.
Good instructions tell candidates that responses will be reviewed alongside their CV, encourage specific examples from real projects, and make clear that admitting unfamiliarity with a topic is valued over inflated claims. The tone should be transparent and welcoming, not intimidating. Well-crafted instructions don’t just set expectations — they actively improve the signal-to-noise ratio of every response the system receives.
Looking Forward
This research continues to develop. Current and planned areas of investigation include:
Improving structured output from smaller models. The performance gap between large API-hosted models and smaller locally deployed models on complex schemas is significant. Understanding whether this gap can be closed — through better prompting, fine-tuning, or schema simplification — is a priority for organizations with data privacy constraints that require local deployment.
Richer skill recency and currency assessment. The current recency multiplier is a useful starting point, but more sophisticated approaches could draw on technology lifecycle data, track when specific tools fell out of mainstream use, or distinguish between foundational skills that age slowly and rapidly evolving frameworks.
Prompt strategy research. Different prompt formulations produce meaningfully different evaluations of the same candidate. Systematic study of how prompt structure, specificity, and framing affect evaluation quality, and of what constitutes a fair and defensible prompt for a given role, is an underexplored area.
Bias detection and mitigation. AI-assisted screening carries the risk of encoding or amplifying existing biases from training data. Future work should examine whether models evaluate equivalent credentials differently based on demographic signals present in CV formatting, naming conventions, educational institutions, or writing style — and develop evaluation strategies that are robust to these effects.
Calibration and human alignment. Ultimately, an AI screening tool is only useful if its assessments correlate meaningfully with the judgments of experienced human recruiters and hiring managers. Developing calibration methods — comparing AI evaluations against ground-truth hiring outcomes — would allow the system to be tuned toward genuine predictive validity.
Multi-round and longitudinal evaluation. The current prototype focuses on initial screening. Extending the framework to later hiring stages — structured interviews, reference checks, or post-hire performance tracking — could create a feedback loop for continuous improvement and provide richer data on what early signals actually predict success.
Important Context
This is experimental research, not a production system. The prototype is designed to explore how AI might assist with CV screening. It does not represent UNU hiring policy or procedures, and is not used for actual hiring decisions at the United Nations University.