Building & Publishing · 5 min read

Understanding Your Verification Score

Your package scored 85 but isn't Gold. Here's how to read the score breakdown, diagnose blockers, and reach the next tier.

By agentnode

You published your package, verification ran, and you see a score and a tier. But what do those numbers actually mean? And why might a package score 95 but still sit at Verified instead of Gold?

This guide breaks down the entire scoring system.

The Score Breakdown

Every tool pack is scored on a 0-100 scale across seven dimensions:

| Step | Max Points | What It Measures |
| --- | --- | --- |
| Install | 15 | Package installs without errors |
| Import | 15 | Tool entrypoint imports successfully |
| Smoke | 25 | Tool produces a valid return value when called |
| Tests | 15 | Publisher-provided tests pass |
| Contract | 10 | Return value is serializable, non-None, type-stable |
| Reliability | 10 | Same input produces success on repeated runs (3x) |
| Determinism | 5 | Same input produces the same output hash |

Additionally, runtime warnings deduct up to 10 points (2 per warning).
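To make the arithmetic concrete, here is a minimal sketch of how the per-step points and warning deductions could combine into the total. The function and constant names (`STEP_MAX`, `compute_score`) are illustrative, not the real runner's API:

```python
# Illustrative aggregation of the seven step scores plus warning
# deductions, matching the table and the "2 per warning, up to 10" rule.
STEP_MAX = {
    "install": 15, "import": 15, "smoke": 25, "tests": 15,
    "contract": 10, "reliability": 10, "determinism": 5,
}

def compute_score(points: dict, warning_count: int) -> int:
    """Sum per-step points (clamped to each step's max), then deduct
    2 points per runtime warning, capped at 10, flooring at 0."""
    base = sum(min(points.get(step, 0), cap) for step, cap in STEP_MAX.items())
    deduction = min(warning_count * 2, 10)
    return max(base - deduction, 0)
```

Note that the listed maxima sum to 95, which matches the all-maxes example response below.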

How to Check Your Score

Via the API:

```shell
curl https://agentnode.net/v1/packages/my-pack/versions/1.0.0 | jq '.verification'
```

The response includes:

```json
{
  "score": 95,
  "tier": "gold",
  "confidence": "high",
  "breakdown": {
    "install": {"points": 15, "max": 15, "reason": "Installed in 2.3s"},
    "import": {"points": 15, "max": 15, "reason": "All tools imported successfully"},
    "smoke": {"points": 25, "max": 25, "reason": "Returned valid result"},
    "tests": {"points": 15, "max": 15, "reason": "Publisher-provided tests passed"},
    "contract": {"points": 10, "max": 10, "reason": "Serializable, typed return"},
    "reliability": {"points": 10, "max": 10, "reason": "3/3 runs passed"},
    "determinism": {"points": 5, "max": 5, "reason": "Consistent output across runs"}
  }
}
```

Score → Tier Mapping

| Score Range | Base Tier |
| --- | --- |
| 90-100 | Gold |
| 70-89 | Verified |
| 50-69 | Partial |
| 0-49 | Unverified |

But this is just the base tier. Hard caps can override it downward.
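The base mapping is a straightforward threshold lookup. A minimal sketch (the function name is illustrative):

```python
# Map a 0-100 score to its base tier, per the table above.
def base_tier(score: int) -> str:
    if score >= 90:
        return "gold"
    if score >= 70:
        return "verified"
    if score >= 50:
        return "partial"
    return "unverified"
```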

Hard Tier Caps

Even with a high score, certain conditions cap your maximum tier:

| Condition | Maximum Tier |
| --- | --- |
| No verification cases (has_explicit_cases=false) | Verified |
| Smoke test not passed | Verified |
| Contract invalid | Verified |
| Credential boundary reached (no publisher tests) | Partial |
| verification_mode=limited | Verified |

This is why a score of 95 doesn't guarantee Gold. The most common blocker: no explicit verification cases.
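Conceptually, the final tier is the lower of the base tier and every active cap. A hypothetical sketch of that logic (names are illustrative):

```python
# Tiers in ascending order; the final tier is the minimum of the
# base tier and all applicable caps.
TIER_ORDER = ["unverified", "partial", "verified", "gold"]

def apply_caps(base: str, caps: list) -> str:
    """Return the lowest tier among the base tier and active caps."""
    return min([base] + caps, key=TIER_ORDER.index)
```

So a package with a base tier of gold but a "no verification cases" cap of verified lands at Verified, exactly the score-95 scenario above.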

Common Scenarios and Fixes

Score 95, Tier Verified

Cause: No verification.cases in your manifest. The pipeline used auto-generated inputs and everything passed — but without publisher-declared cases, Gold is not reachable.

Fix: Add a verification.cases block to your agentnode.yaml. See the verification cases guide.

Smoke: 12/25 (Credential Boundary)

Cause: Your tool tried to call an external API and got an auth error. The sandbox has no API keys.

Fix: Either add a VCR cassette (for the API path) or add publisher tests that mock the API call. With passing tests, the smoke score bumps to 15/25 and you can reach Verified.
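A self-contained sketch of what such a publisher test could look like, assuming a hypothetical tool that fetches JSON over HTTP (the tool, URL, and response shape are made up for illustration):

```python
import json
import urllib.request
from unittest import mock

# Hypothetical tool: calls an external API, which would hit the
# credential boundary in the sandbox.
def get_forecast(city: str) -> dict:
    url = f"https://api.example.com/forecast?q={city}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# Publisher test: patch the network call so the test passes offline.
def test_get_forecast_offline():
    fake_resp = mock.MagicMock()
    fake_resp.read.return_value = b'{"temp_c": 21}'
    fake_resp.__enter__.return_value = fake_resp
    with mock.patch("urllib.request.urlopen", return_value=fake_resp):
        assert get_forecast("Berlin") == {"temp_c": 21}
```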

Contract: 0/10

Cause: Your tool returned None, a non-serializable object, or the return type changed between runs.

Fix: Ensure your tool always returns a JSON-serializable value (dict, list, str, int, float, bool). Never return None on success.
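A minimal example of a contract-safe tool, with a local approximation of the check (the tool itself is hypothetical):

```python
import json

# Returns a stable, JSON-serializable dict shape on every code path —
# never None on success, never a custom object.
def word_count(text: str) -> dict:
    words = text.split()
    return {"count": len(words), "unique": len(set(words))}

# Approximate the contract check locally before publishing:
result = word_count("the quick brown fox")
assert result is not None
json.dumps(result)  # raises TypeError if non-serializable
```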

Reliability: 6/10 (2/3 runs passed)

Cause: One of three identical calls failed. Common reasons: rate limiting, network timeouts (in real mode), or non-deterministic state.

Fix: If using verification cases with a VCR cassette, this shouldn't happen (replay is deterministic). If in cases_real mode, ensure your tool handles edge cases gracefully.

Determinism: 0/5

Cause: Same input produced different output hashes across runs. This is expected for tools that include timestamps, random IDs, or live data in their output.

Fix: The pipeline normalizes outputs before hashing (sorts dict keys, strips whitespace). If your tool legitimately produces different output each time (e.g., a news aggregator), partial determinism credit (2-3/5) is normal and acceptable for Gold.
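You can approximate the normalize-then-hash step locally to see whether your output is stable across runs. This is a sketch; the real pipeline's normalization rules may differ:

```python
import hashlib
import json

def output_hash(value) -> str:
    """Hash a canonical JSON rendering: sorted keys and compact
    separators, so dict key order can't break determinism."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Same content in a different key order hashes identically:
assert output_hash({"a": 1, "b": 2}) == output_hash({"b": 2, "a": 1})
```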

Tests: 0/15

Cause: Publisher tests failed in the container sandbox. Common reasons: tests try to access network, tests depend on system binaries not in the container, tests reference absolute paths.

Fix: Ensure your tests work in an isolated environment. Use pytest.mark.skipif for tests that need optional dependencies. Use relative paths or /tmp for file operations.
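For example, a sandbox-friendly test file might look like this (the ffmpeg dependency is a hypothetical stand-in for any optional system binary):

```python
import shutil
import tempfile
import pytest

# Skip cleanly when an optional system binary is absent from the
# container, instead of failing the whole Tests step.
@pytest.mark.skipif(shutil.which("ffmpeg") is None,
                    reason="ffmpeg not installed in the sandbox")
def test_transcode():
    ...

# Write only under /tmp, never to absolute project paths.
def test_writes_to_tmp():
    with tempfile.NamedTemporaryFile(dir="/tmp", suffix=".txt") as f:
        f.write(b"hello")
        f.flush()
```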

The Three Verification Modes

After verification runs, your package is assigned a mode that appears in the score detail:

| Mode | Meaning | Gold Eligible |
| --- | --- | --- |
| fixture | Cases ran with VCR cassette replay | Yes |
| cases_real | Cases ran with real local execution | Yes |
| real_auto | No explicit cases, auto-generated inputs | No |

Gold Checklist

All of these must be true simultaneously:

  • verification.cases present in manifest (at least 1 case)
  • Smoke status: passed
  • Contract valid: true
  • Reliability: >= 0.9 (3/3 in the standard 3-run check, or at least 9/10 in a 10-run check)
  • Score: >= 90
  • Mode: fixture or cases_real (not real_auto)

If any one condition fails, the tier caps at Verified regardless of score.
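The checklist above can be sketched as a pre-flight check. The field names here are hypothetical; map them to whatever your verification response actually exposes:

```python
# Illustrative Gold pre-flight check mirroring the checklist:
# all conditions must hold simultaneously.
def gold_eligible(v: dict) -> bool:
    return (
        v.get("has_explicit_cases") is True
        and v.get("smoke_status") == "passed"
        and v.get("contract_valid") is True
        and v.get("reliability", 0.0) >= 0.9
        and v.get("score", 0) >= 90
        and v.get("mode") in {"fixture", "cases_real"}
    )
```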

Re-Verification

Your package is re-verified when:

  • You publish a new version
  • An admin triggers re-verification (after infrastructure updates)
  • The verification runner is upgraded (new capabilities)

Re-verification can upgrade your tier (e.g., after you add cases) or downgrade it (if something broke). The tier is always computed fresh from the latest run.

Debugging Tips

  1. Check the smoke log — the API returns smoke_log with detailed output from each case run
  2. Check smoke_reason — values like credential_boundary_reached, missing_system_dependency, or needs_binary_input tell you exactly what blocked the smoke test
  3. Check stability_log — shows each stability run's success/failure and output hash
  4. Check contract_details — shows why contract validation failed (non-serializable, None return, type mismatch)

All of these fields are available in the version detail API response.
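As a small convenience, you can pull the four debug fields out of a fetched version detail in one pass. This sketch assumes the response nests them under the "verification" key, as in the API example earlier:

```python
# Extract the debugging fields named in the tips above from a
# version detail response (already parsed from JSON).
def extract_debug_fields(detail: dict) -> dict:
    v = detail.get("verification", {})
    keys = ("smoke_reason", "smoke_log", "stability_log", "contract_details")
    return {k: v.get(k) for k in keys}
```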