behavioral-evals — quality + safety report

In the Skillier index (gemini-cli__behavioral-evals) · scanned 2026-06-03 · engine: builtin+triage

Grade: A
Quality: 96/100

Safety

✓ Clean — no heuristic safety flags surfaced.

Heuristic flags come from the builtin scanner, which is known to over-flag (it trips on legitimate env-reading integrations, security skills, and library .eval calls). This is not an authoritative malicious verdict; re-scan with SkillSpector for the authoritative result.

Quality notes

No example (low · quality · body)
→ Add at least one worked example (input → expected action/output).

No explicit output format / contract (low · quality · body)
→ State the expected output format (structure, sections, or schema).

About this skill

Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests.

📄 Read the SKILL.md
---
name: behavioral-evals
description: Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests.
---

# Behavioral Evals

## Overview

Behavioral evaluations (evals) are tests that validate the **agent's decision-making** (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.

> [!NOTE]
> **Single Source of Truth**: For core concepts, policies, running tests, and general best practices, always refer to **[evals/README.md](file:///Users/abhipatel/code/gemini-cli/docs/evals/README.md)**.

---

## 🔄 Workflow Decision Tree

1.  **Does a prompt/tool change need validation?**
    *   *No* -> Normal integration tests.
    *   *Yes* -> Continue below.
2.  **Is it UI/Interaction heavy?**
    *   *Yes* -> Use `appEvalTest` (`AppRig`). See **[creating.md](references/creating.md)**.
    *   *No* -> Use `evalTest` (`TestRig`). See **[creating.md](references/creating.md)**.
3.  **Is it a new test?**
    *   *Yes* -> Set policy to `USUALLY_PASSES`.
    *   *No* -> Set policy to `ALWAYS_PASSES` (locks in regression).
4.  **Are you fixing a failure or promoting a test?**
    *   *Fixing* -> See **[fixing.md](references/fixing.md)**.
    *   *Promoting* -> See **[promoting.md](references/promoting.md)**.
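
The first three questions of the tree can be sketched as a small pure function. This is an illustrative sketch only: `ChangeInfo`, `Plan`, and `planEval` are hypothetical names, and only `evalTest`, `appEvalTest`, `USUALLY_PASSES`, and `ALWAYS_PASSES` come from this skill.

```typescript
// Hypothetical types for illustration; they are not part of the eval framework.
type Helper = 'evalTest' | 'appEvalTest';
type Policy = 'USUALLY_PASSES' | 'ALWAYS_PASSES';

interface ChangeInfo {
  needsBehavioralValidation: boolean; // step 1: prompt/tool change to validate?
  uiHeavy: boolean;                   // step 2: UI/interaction heavy?
  isNewTest: boolean;                 // step 3: new test or existing one?
}

type Plan =
  | { kind: 'integration-tests' }
  | { kind: 'behavioral-eval'; helper: Helper; policy: Policy };

// Mirrors steps 1 to 3 of the decision tree above.
function planEval(change: ChangeInfo): Plan {
  if (!change.needsBehavioralValidation) {
    return { kind: 'integration-tests' };
  }
  return {
    kind: 'behavioral-eval',
    helper: change.uiHeavy ? 'appEvalTest' : 'evalTest',
    policy: change.isNewTest ? 'USUALLY_PASSES' : 'ALWAYS_PASSES',
  };
}
```

For example, a new, non-UI eval for a prompt change resolves to `evalTest` with `USUALLY_PASSES`.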

---

## 📋 Quick Checklist

### 1. Setup Workspace
Seed the workspace with the necessary files using the `files` object to simulate a realistic scenario (e.g., a Node.js project with a `package.json`).
*   *Details in **[creating.md](references/creating.md)***
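
A minimal sketch of what such a seed might look like, assuming `files` maps workspace-relative paths to string contents. The exact shape the eval helpers expect is documented in creating.md, and the project name and file contents here are invented for illustration.

```typescript
// Hypothetical seed for a small Node.js workspace.
// Assumption: keys are paths relative to the workspace root, values are file contents.
const files: Record<string, string> = {
  'package.json': JSON.stringify(
    {
      name: 'demo-app',
      version: '1.0.0',
      scripts: { test: 'vitest run' },
    },
    null,
    2,
  ),
  'src/index.js': 'export const greet = (name) => `hello, ${name}`;\n',
};
```

Keeping the seed small but realistic gives the agent the same cues (a manifest, a source directory) it would see in a real project.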

### 2. Write Assertions
Audit agent decisions using `rig.setBreakpoint()` (AppRig only) or by verifying the position (index) of expected calls in `rig.readToolLogs()`.
*   *Details in **[creating.md](references/creating.md)***
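
One way to picture index verification is an ordering check over recorded tool calls. The sketch below runs against a hypothetical log shape: `ToolLogEntry`, `calledBefore`, and the tool names are assumptions, and the real return type of `rig.readToolLogs()` may differ (see creating.md).

```typescript
// Hypothetical log entry shape, for illustration only.
interface ToolLogEntry {
  toolName: string;
}

// Returns true when both tools were called and the first call to `first`
// appears before the first call to `second`.
function calledBefore(
  logs: ToolLogEntry[],
  first: string,
  second: string,
): boolean {
  const i = logs.findIndex((entry) => entry.toolName === first);
  const j = logs.findIndex((entry) => entry.toolName === second);
  return i !== -1 && j !== -1 && i < j;
}

// Sample data standing in for recorded tool logs (tool names are illustrative).
const sampleLogs: ToolLogEntry[] = [
  { toolName: 'read_file' },
  { toolName: 'replace' },
];
```

An assertion such as "the agent read the file before editing it" then reduces to checking `calledBefore(sampleLogs, 'read_file', 'replace')`.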

### 3. Verify
Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.
*   *See **[evals/README.md](file:///Users/abhipatel/code/gemini-cli/docs/evals/README.md)** for running commands.*
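
To make "confirm stability" concrete, repeated local runs can be summarized as a pass rate. This is a sketch under stated assumptions: `summarizeRuns`, `isStable`, and the `minRate` threshold are hypothetical and not part of the eval framework. The real thresholds are described in promoting.md and evals/README.md.

```typescript
interface RunSummary {
  runs: number;
  passed: number;
  rate: number; // fraction of runs that passed, between 0 and 1
}

// Summarizes recorded pass/fail outcomes from repeated local runs of one eval.
function summarizeRuns(outcomes: boolean[]): RunSummary {
  if (outcomes.length === 0) {
    throw new RangeError('need at least one recorded run');
  }
  const passed = outcomes.filter(Boolean).length;
  return { runs: outcomes.length, passed, rate: passed / outcomes.length };
}

// Hypothetical stability check: minRate defaults to 1 (every run must pass).
function isStable(outcomes: boolean[], minRate = 1): boolean {
  return summarizeRuns(outcomes).rate >= minRate;
}
```

For instance, three passes out of four recorded runs gives a rate of 0.75, which fails the default check but passes a looser one such as `isStable(outcomes, 0.75)`.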

---

## 📦 Bundled Resources

Detailed procedural guides:
*   **[creating.md](references/creating.md)**: Assertion strategies, Rig selection, Mock MCPs.
*   **[fixing.md](references/fixing.md)**: Step-by-step automated investigation, architecture diagnosis guidelines.
*   **[promoting.md](references/promoting.md)**: Candidate identification criteria and threshold guidelines.

Graded independently by Skillproof — nothing to sell the author. Quality is mechanical + corpus-grounded; safety flags are heuristic (builtin+triage), not a malicious verdict.