Understanding Your Verification Score
Your package scored 95 but isn't Gold. Here's how to read the score breakdown, diagnose blockers, and reach the next tier.
You published your package, verification ran, and you see a score and a tier. But what do those numbers actually mean? And why might a package score 95 but still sit at Verified instead of Gold?
This guide breaks down the entire scoring system.
The Score Breakdown
Every tool pack is scored on a 0-100 scale across seven dimensions:
| Step | Max Points | What It Measures |
|---|---|---|
| Install | 15 | Package installs without errors |
| Import | 15 | Tool entrypoint imports successfully |
| Smoke | 25 | Tool produces a valid return value when called |
| Tests | 15 | Publisher-provided tests pass |
| Contract | 10 | Return value is serializable, non-None, type-stable |
| Reliability | 10 | Same input produces success on repeated runs (3x) |
| Determinism | 5 | Same input produces the same output hash |
Additionally, runtime warnings deduct up to 10 points (2 per warning).
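To sanity-check a score by hand, sum the step points and subtract the warning deduction. A minimal sketch of that arithmetic (the helper is illustrative, not part of the AgentNode API):

```python
def total_score(breakdown: dict, warnings: int) -> int:
    """Sum per-step points, then deduct 2 per runtime warning (capped at 10)."""
    base = sum(step["points"] for step in breakdown.values())
    return max(0, base - min(2 * warnings, 10))

# A pack that drops 13 smoke points and triggers 2 warnings:
breakdown = {
    "install": {"points": 15}, "import": {"points": 15},
    "smoke": {"points": 12}, "tests": {"points": 15},
    "contract": {"points": 10}, "reliability": {"points": 10},
    "determinism": {"points": 5},
}
print(total_score(breakdown, warnings=2))  # 82 - 4 = 78
```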
How to Check Your Score
Via the API:
```bash
curl https://agentnode.net/v1/packages/my-pack/versions/1.0.0 | jq '.verification'
```

The response includes:

```json
{
  "score": 95,
  "tier": "gold",
  "confidence": "high",
  "breakdown": {
    "install": {"points": 15, "max": 15, "reason": "Installed in 2.3s"},
    "import": {"points": 15, "max": 15, "reason": "All tools imported successfully"},
    "smoke": {"points": 25, "max": 25, "reason": "Returned valid result"},
    "tests": {"points": 15, "max": 15, "reason": "Publisher-provided tests passed"},
    "contract": {"points": 10, "max": 10, "reason": "Serializable, typed return"},
    "reliability": {"points": 10, "max": 10, "reason": "3/3 runs passed"},
    "determinism": {"points": 5, "max": 5, "reason": "Consistent output across runs"}
  }
}
```

Score → Tier Mapping
| Score Range | Base Tier |
|---|---|
| 90-100 | Gold |
| 70-89 | Verified |
| 50-69 | Partial |
| 0-49 | Unverified |
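In code, the base mapping is a simple threshold lookup. A minimal sketch, following the ranges in the table above:

```python
def base_tier(score: int) -> str:
    """Map a 0-100 verification score to its base tier."""
    if score >= 90:
        return "gold"
    if score >= 70:
        return "verified"
    if score >= 50:
        return "partial"
    return "unverified"
```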
But this is just the base tier. Hard caps can override it downward.
Hard Tier Caps
Even with a high score, certain conditions cap your maximum tier:
| Condition | Maximum Tier |
|---|---|
| No verification cases (`has_explicit_cases=false`) | Verified |
| Smoke test not passed | Verified |
| Contract invalid | Verified |
| Credential boundary reached (no publisher tests) | Partial |
| `verification_mode=limited` | Verified |
This is why a score of 95 doesn't guarantee Gold. The most common blocker: no explicit verification cases.
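Conceptually, the final tier is the minimum of the base tier and the strictest applicable cap. A sketch of that logic (the flag names here are illustrative, not the pipeline's internal fields):

```python
TIER_ORDER = ["unverified", "partial", "verified", "gold"]

def final_tier(score, *, has_explicit_cases, smoke_passed,
               contract_valid, credential_boundary, limited_mode):
    """Base tier from score, then lower it to the strictest applicable cap."""
    base = ("gold" if score >= 90 else "verified" if score >= 70
            else "partial" if score >= 50 else "unverified")
    caps = []
    if not has_explicit_cases or not smoke_passed or not contract_valid or limited_mode:
        caps.append("verified")
    if credential_boundary:  # credential boundary reached with no publisher tests
        caps.append("partial")
    return min([base, *caps], key=TIER_ORDER.index)

# Score 95, everything green except explicit cases -> capped at "verified"
print(final_tier(95, has_explicit_cases=False, smoke_passed=True,
                 contract_valid=True, credential_boundary=False,
                 limited_mode=False))
```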
Common Scenarios and Fixes
Score 95, Tier Verified
Cause: No verification.cases in your manifest. The pipeline used auto-generated inputs and everything passed — but without publisher-declared cases, Gold is not reachable.
Fix: Add a verification.cases block to your agentnode.yaml. See the verification cases guide.
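If you want to confirm the block is actually present before republishing, a quick local check works. A sketch, assuming your manifest lives at agentnode.yaml and PyYAML is installed:

```python
import yaml  # pip install pyyaml

with open("agentnode.yaml") as f:
    manifest = yaml.safe_load(f)

cases = (manifest.get("verification") or {}).get("cases") or []
assert cases, "No verification.cases found: Gold is unreachable without them"
print(f"Found {len(cases)} verification case(s)")
```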
Smoke: 12/25 (Credential Boundary)
Cause: Your tool tried to call an external API and got an auth error. The sandbox has no API keys.
Fix: Either add a VCR cassette (for the API path) or add publisher tests that mock the API call. With passing tests, the smoke score bumps to 15/25 and you can reach Verified.
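A publisher test along these lines can mock the API call. This sketch assumes a hypothetical tool fetch_weather in my_pack.tools that calls requests.get internally; substitute your own names:

```python
from unittest.mock import patch

from my_pack.tools import fetch_weather  # hypothetical tool under test

def test_fetch_weather_mocked():
    # The sandbox has no API keys, so stub out the HTTP layer entirely.
    fake = {"temp_c": 21.5, "city": "Oslo"}
    with patch("my_pack.tools.requests.get") as mock_get:
        mock_get.return_value.status_code = 200
        mock_get.return_value.json.return_value = fake
        result = fetch_weather("Oslo")
    assert result["temp_c"] == 21.5
```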
Contract: 0/10
Cause: Your tool returned None, a non-serializable object, or the return type changed between runs.
Fix: Ensure your tool always returns a JSON-serializable value (dict, list, str, int, float, bool). Never return None on success.
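In practice that means returning plain data, never objects or None. A contrast sketch with a hypothetical tool:

```python
import json

def my_tool_bad(query: str):
    ...
    return None  # scores 0/10: returning None on success fails the contract

def my_tool_good(query: str) -> dict:
    result = {"query": query, "matches": [], "count": 0}
    json.dumps(result)  # cheap self-check: raises if not JSON-serializable
    return result       # always a dict, same type on every run
```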
Reliability: 6/10 (2/3 runs passed)
Cause: One of three identical calls failed. Common reasons: rate limiting, network timeouts (in real mode), or non-deterministic state.
Fix: If using verification cases with a VCR cassette, this shouldn't happen (replay is deterministic). If in cases_real mode, ensure your tool handles edge cases gracefully.
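One way to handle transient failures gracefully in cases_real mode is a retry wrapper around the flaky call. A sketch, assuming the underlying operation is a hypothetical fetch() callable:

```python
import time

def call_with_retry(fetch, attempts: int = 3, backoff: float = 1.0):
    """Retry transient failures instead of letting one bad run cost 4 points."""
    last_exc = None
    for i in range(attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError) as exc:
            last_exc = exc
            time.sleep(backoff * (2 ** i))  # simple exponential backoff
    raise last_exc
```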
Determinism: 0/5
Cause: Same input produced different output hashes across runs. This is expected for tools that include timestamps, random IDs, or live data in their output.
Fix: The pipeline normalizes outputs before hashing (sorts dict keys, strips whitespace). If your tool legitimately produces different output each time (e.g., a news aggregator), partial determinism credit (2-3/5) is normal and acceptable for Gold.
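The normalization described above is roughly: canonicalize, then hash. A sketch of the idea (not the pipeline's exact implementation):

```python
import hashlib
import json

def output_hash(value) -> str:
    """Hash a normalized form: sorted dict keys, stripped strings."""
    def normalize(v):
        if isinstance(v, dict):
            return {k: normalize(v[k]) for k in sorted(v)}
        if isinstance(v, list):
            return [normalize(x) for x in v]
        if isinstance(v, str):
            return v.strip()
        return v
    canonical = json.dumps(normalize(value), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Note that a timestamp or random ID field still changes the hash after normalization, which is exactly why live-data tools land on partial credit.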
Tests: 0/15
Cause: Publisher tests failed in the container sandbox. Common reasons: tests try to access network, tests depend on system binaries not in the container, tests reference absolute paths.
Fix: Ensure your tests work in an isolated environment. Use pytest.mark.skipif for tests that need optional dependencies. Use relative paths or /tmp for file operations.
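For example, a test that needs an optional system binary can skip itself cleanly instead of failing in the container. A sketch (the ffmpeg dependency is hypothetical):

```python
import shutil

import pytest

@pytest.mark.skipif(shutil.which("ffmpeg") is None,
                    reason="ffmpeg not available in the sandbox")
def test_transcode(tmp_path):
    out = tmp_path / "clip.mp4"  # tmp_path avoids absolute host paths
    ...
```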
The Three Verification Modes
After verification runs, your package is assigned a mode that appears in the score detail:
| Mode | Meaning | Gold Eligible |
|---|---|---|
| `fixture` | Cases ran with VCR cassette replay | Yes |
| `cases_real` | Cases ran with real local execution | Yes |
| `real_auto` | No explicit cases, auto-generated inputs | No |
Gold Checklist
All of these must be true simultaneously:
- `verification.cases` present in manifest (at least 1 case)
- Smoke status: `passed`
- Contract valid: `true`
- Reliability: `>= 0.9` (at least 3/3 or 9/10 runs pass)
- Score: `>= 90`
- Mode: `fixture` or `cases_real` (not `real_auto`)
If any one condition fails, the tier caps at Verified regardless of score.
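You can check all six conditions programmatically against the verification detail. A sketch; field names beyond those shown in this guide are assumptions about the response shape:

```python
def gold_blockers(v: dict) -> list[str]:
    """Return the Gold conditions a verification result fails."""
    blockers = []
    if not v.get("has_explicit_cases"):
        blockers.append("no verification.cases in manifest")
    if v.get("smoke_status") != "passed":
        blockers.append("smoke not passed")
    if not v.get("contract_valid"):
        blockers.append("contract invalid")
    if v.get("reliability", 0) < 0.9:
        blockers.append("reliability below 0.9")
    if v.get("score", 0) < 90:
        blockers.append("score below 90")
    if v.get("mode") not in ("fixture", "cases_real"):
        blockers.append("mode is real_auto")
    return blockers
```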
Re-Verification
Your package is re-verified when:
- You publish a new version
- An admin triggers re-verification (after infrastructure updates)
- The verification runner is upgraded (new capabilities)
Re-verification can upgrade your tier (e.g., after you add cases) or downgrade it (if something broke). The tier is always computed fresh from the latest run.
Debugging Tips
- Check the smoke log: the API returns `smoke_log` with detailed output from each case run
- Check `smoke_reason`: values like `credential_boundary_reached`, `missing_system_dependency`, or `needs_binary_input` tell you exactly what blocked the smoke test
- Check `stability_log`: shows each stability run's success/failure and output hash
- Check `contract_details`: shows why contract validation failed (non-serializable, `None` return, type mismatch)
All of these fields are available in the version detail API response.
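A small script can pull those fields in one pass, assuming the requests library and the endpoint shown earlier:

```python
import requests

url = "https://agentnode.net/v1/packages/my-pack/versions/1.0.0"
v = requests.get(url, timeout=10).json()["verification"]

for field in ("smoke_reason", "smoke_log", "stability_log", "contract_details"):
    print(f"--- {field} ---")
    print(v.get(field, "<not present>"))
```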