advanced-evaluation — quality + safety report
In the Skillier index (antigravity__advanced-evaluation) · scanned 2026-06-03 · engine: builtin+triage
✓ Clean — no heuristic safety flags surfaced.
Heuristic flags from the builtin scanner, which is known to over-flag (it trips on legitimate env-reading integrations, security skills, and library .eval calls). This is NOT an authoritative malicious verdict — re-scan with SkillSpector for the authoritative result. Run the authoritative scan →
📇 This skill is in the Skillier index (curated · deduped · quality-filtered). Install Skillier to route & load it into your AI client.
Quality notes
About this skill
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
📄 Read the SKILL.md
---
name: advanced-evaluation
description: This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
risk: safe
source: community
date_added: 2026-03-18
---
# Advanced Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
## When to Use
Activate this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments
## Core Concepts
### The Evaluation Taxonomy
Evaluation approaches fall into two primary categories with distinct reliability profiles:
**Direct Scoring**: A single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift, inconsistent scale interpretation
**Pairwise Comparison**: An LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure mode: Position bias, length bias
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.
### The Bias Landscape
LLM judges exhibit systematic biases that must be actively mitigated:
**Position Bias**: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
**Length Bias**: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
**Self-Enhancement Bias**: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.
**Verbosity Bias**: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
**Authority Bias**: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.
### Metric Selection Framework
Choose metrics based on the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
|-----------|-----------------|-------------------|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
## Evaluation Approaches
### Direct Scoring Implementation
Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.
**Criteria Definition Pattern**:
```
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
```
**Scale Calibration**:
- 1-3 scales: Binary with neutral option, lowest cognitive load
- 1-5 scales: Standard Likert, good balance of granularity and reliability
- 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics
**Prompt Structure for Direct Scoring**:
```
You are an expert evaluator assessing response quality.
## Task
Evaluate the following response against each criterion.
## Original Prompt
{prompt}
## Response to Evaluate
{response}
## Criteria
{for each criterion: name, description, weight}
## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement
## Output Format
Respond with structured JSON containing scores, justifications, and summary.
```
**Chain-of-Thought Requirement**: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
### Pairwise Comparison Implementation
Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
**Position Bias Mitigation Protocol**:
1. First pass: Response A in first position, Response B in second
2. Second pass: Response B in first position, Response A in second
3. Consistency check: If passes disagree, return TIE with reduced confidence
4. Final verdict: Consistent winner with averaged confidence
**Prompt Structure for Pairwise Comparison**:
```
You are an expert evaluator comparing two AI responses.
## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent
## Original Prompt
{prompt}
## Response A
{response_a}
## Response B
{response_b}
## Comparison Criteria
{criteria list}
## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level
## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
```
**Confidence Calibration**: Confidence scores should reflect position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE
### Rubric Generation
Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.
**Rubric Components**:
1. **Level descriptions**: Clear boundaries for each score level
2. **Characteristics**: Observable features that define each level
3. **Examples**: Representative text for each level (optional but valuable)
4. **Edge cases**: Guidance for ambiguous situations
5. **Scoring guidelines**: General principles for consistent application
**Strictness Calibration**:
- **Lenient**: Lower bar for passing scores, appropriate for encouraging iteration
- **Balanced**: Fair, typical expectations for production use
- **Strict**: High standards, appropriate for safety-critical or high-stakes evaluation
**Domain Adaptation**: Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards.
## Practical Guidance
### Evaluation Pipeline Design
Production evaluation systems require multiple layers:
```
┌─────────────────────────────────────────────────┐
│ Evaluation Pipeline │
├─────────────────────────────────────────────────┤
│ │
│ Input: Response + Prompt + Context │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Criteria Loader │ ◄── Rubrics, weights │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Primary Scorer │ ◄── Direct or Pairwise │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Bias Mitigation │ ◄── Position swap, etc. │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Confidence Scoring │ ◄── Calibration │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ Output: Scores + Justifications + Confidence │
│ │
└─────────────────────────────────────────────────┘
```
### Common Anti-Patterns
**Anti-pattern: Scoring without justification**
- Problem: Scores lack grounding, difficult to debug or improve
- Solution: Always require evidence-based justification before score
**Anti-pattern: Single-pass pairwise comparison**
- Problem: Position bias corrupts results
- Solution: Always swap positions and check consistency
**Anti-pattern: Overloaded criteria**
- Problem: Criteria measuring multiple things are unreliable
- Solution: One criterion = one measurable aspect
**Anti-pattern: Missing edge case guidance**
- Problem: Evaluators handle ambiguous cases inconsistently
- Solution: Include edge cases in rubrics with explicit guidance
**Anti-pattern: Ignoring confidence calibration**
- Problem: High-confidence wrong judgments are worse than low-confidence
- Solution: Calibrate confidence to position consistency and evidence strength
### Decision Framework: Direct vs. Pairwise
Use this decision tree:
```
Is there an objective ground truth?
├── Yes → Direct Scoring
│ └── Examples: factual accuracy, instruction following, format compliance
│
└── No → Is it a preference or quality judgment?
├── Yes → Pairwise Comparison
│ └── Examples: tone, style, persuasiveness, creativity
│
└── No → Consider reference-based evaluation
└── Examples: summarization (compare to source), translation (compare to reference)
```
### Scaling Evaluation
For high-volume evaluation:
1. **Panel of LLMs (PoLL)**: Use multiple models as judges, aggregate votes
- Reduces individual model bias
- More expensive but more reliable for high-stakes decisions
2. **Hierarchical evaluation**: Fast cheap model for screening, expensive model for edge cases
- Cost-effective for large volumes
- Requires calibration of screening threshold
3. **Human-in-the-loop**: Automated evaluation for clear cases, human review for low-confidence
- Best reliability for critical applications
- Design feedback loop to improve automated evaluation
## Examples
### Example 1: Direct Scoring for Accuracy
**Input**:
```
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
```
**Output**:
```json
{
"criterion": "Factual Accuracy",
"score": 5,
"evidence": [
"Correctly identifies axial tilt as primary cause",
"Correctly explains differential sunlight by hemisphere",
"No factual errors present"
],
"justification": "Response accurately explains the cau
… (truncated)Want a live grade + an embeddable README badge? Run your skill through the free scanner.
Graded independently by Skillproof — nothing to sell the author. Quality is mechanical + corpus-grounded; safety flags are heuristic (builtin+triage), not a malicious verdict.