eval — quality + safety report
In the Skillier index (alireza__eval) · scanned 2026-06-03 · engine: builtin+triage
1 heuristic flag to review
Heuristic flags from the builtin scanner, which is known to over-flag (it trips on legitimate env-reading integrations, security skills, and library .eval calls). This is NOT an authoritative malicious verdict; re-scan with SkillSpector for the authoritative result.
About this skill
Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
📄 Read the SKILL.md
---
name: "eval"
description: "Evaluate and rank agent results by metric or LLM judge for an AgentHub session."
command: /hub:eval
---
# /hub:eval — Evaluate Agent Results
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
## Usage
```
/hub:eval # Eval latest session using configured criteria
/hub:eval 20260317-143022 # Eval specific session
/hub:eval --judge # Force LLM judge mode (ignore metric config)
```
## What It Does
### Metric Mode (eval command configured)
Run the evaluation command in each agent's worktree:
```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```
Output:
```
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1
Winner: agent-2 (142ms)
```
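The ranking logic behind that table can be sketched as follows. This is a minimal illustration, not the actual `result_ranker.py`: the `AgentResult` type, the `rank_results` helper, and the `"lower"`/`"higher"` direction values are hypothetical, and the 180ms baseline is only inferred from the deltas in the sample output.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    agent: str
    metric: float  # measured value, e.g. latency in ms
    files: int     # number of files the agent changed


def rank_results(results, baseline, direction="lower"):
    """Return (rank, result, delta) tuples, best first.

    direction="lower" means smaller metric values win (assumed naming).
    delta is the metric minus the baseline measurement.
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    ordered = sorted(results, key=lambda r: r.metric,
                     reverse=(direction == "higher"))
    return [(rank, r, r.metric - baseline)
            for rank, r in enumerate(ordered, start=1)]


# Values taken from the sample output; baseline of 180 is an assumption.
results = [
    AgentResult("agent-1", 165, 3),
    AgentResult("agent-2", 142, 2),
    AgentResult("agent-3", 190, 1),
]
ranking = rank_results(results, baseline=180, direction="lower")
for rank, r, delta in ranking:
    print(rank, r.agent, f"{r.metric:g}ms", f"{delta:+g}ms", r.files)
print(f"Winner: {ranking[0][1].agent} ({ranking[0][1].metric:g}ms)")
```

Passing `direction="higher"` reverses the sort for metrics where larger is better, such as throughput.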
### LLM Judge Mode (no eval command, or --judge flag)
For each agent:
1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
   - **Correctness**: Does it solve the task?
   - **Simplicity**: Fewer changed lines is better when correctness is equal
   - **Quality**: Clean execution, good structure, no regressions
Present rankings with justification.
Example LLM judge output for a content task:
```
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350
Winner: agent-1 (strongest narrative arc and call-to-action)
```
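Gathering the judge's inputs (steps 1 and 2 above) might look like the sketch below. The helper names and the shape of `agent_branches` are hypothetical; only the `git diff {base_branch}...{agent_branch}` form and the `.agenthub/board/results/agent-{i}-result.md` path come from this document. The changed-line count is a rough proxy for the Simplicity criterion, not a documented part of the skill.

```python
import subprocess
from pathlib import Path


def diff_command(base_branch, agent_branch):
    """Three-dot diff: changes on agent_branch since it diverged from base."""
    return ["git", "diff", f"{base_branch}...{agent_branch}"]


def result_post_path(root, index):
    """Path of the result post written by agent number `index`."""
    return (Path(root) / ".agenthub" / "board" / "results"
            / f"agent-{index}-result.md")


def count_changed_lines(diff_text):
    """Rough count of added/removed lines, skipping the +++/--- file headers."""
    return sum(
        1 for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


def collect_judge_inputs(root, base_branch, agent_branches):
    """agent_branches maps agent index -> branch name (assumed layout)."""
    inputs = []
    for index, branch in agent_branches.items():
        diff = subprocess.run(
            diff_command(base_branch, branch),
            cwd=root, capture_output=True, text=True, check=True,
        ).stdout
        post = result_post_path(root, index)
        inputs.append({
            "agent": f"agent-{index}",
            "diff": diff,
            "changed_lines": count_changed_lines(diff),
            "result_post": post.read_text() if post.exists() else "",
        })
    return inputs
```

The collected dictionaries would then be handed to the LLM judge, which ranks them against the three criteria and writes the justification.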
### Hybrid Mode
1. Run metric evaluation first
2. If top agents are within 10% of each other, use LLM judge to break ties
3. Present both metric and qualitative rankings
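
The tie-break check in step 2 could be sketched like this. It assumes one particular reading of "within 10%": the gap between the best and second-best metric, relative to the best. The function name, the `tolerance` parameter, and the `"lower"`/`"higher"` direction values are hypothetical.

```python
def needs_llm_judge(metrics, direction="lower", tolerance=0.10):
    """Return True when the runner-up is within `tolerance` of the leader.

    metrics maps agent name -> metric value. With fewer than two agents
    there is nothing to break a tie between, so the result is False.
    """
    if len(metrics) < 2:
        return False
    ordered = sorted(metrics.values(), reverse=(direction == "higher"))
    best, runner_up = ordered[0], ordered[1]
    if best == 0:
        # Avoid dividing by zero; only an exact tie counts as close.
        return runner_up == 0
    return abs(runner_up - best) / abs(best) <= tolerance


# Metric-mode sample values: 165ms is about 16% slower than 142ms.
clear_win = needs_llm_judge({"agent-1": 165, "agent-2": 142, "agent-3": 190})
# A closer race: 150ms is under 6% slower than 142ms.
close_race = needs_llm_judge({"agent-1": 150, "agent-2": 142})
```

With the sample metrics the leader is clear, so the metric ranking stands on its own; in the closer race the LLM judge would be asked to decide between the top agents.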
## After Eval
1. Update session state:
```bash
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
```
2. Tell the user:
- Ranked results with winner highlighted
- Next step: `/hub:merge` to merge the winner
- Or `/hub:merge {session-id} --agent {winner}` to be explicit
Graded independently by Skillproof — nothing to sell the author. Quality is mechanical + corpus-grounded; safety flags are heuristic (builtin+triage), not a malicious verdict.