Verification Cases: How Packages Prove They Work
We shipped a declarative verification system that lets publishers prove their tools work. No manual review, no special-case logic — declare your test cases and the pipeline handles the rest.
When you install a package from a registry, how do you know it actually works? Most ecosystems rely on unit tests that run during CI — but those tests are written by the publisher and run in the publisher's environment. They prove the code works somewhere. Not necessarily on your machine, in your Python version, with your dependency tree.
AgentNode takes a different approach. Every package is verified by the platform, in a sandboxed container, with reproducible inputs. Today we're shipping the final piece of that system: verification cases.
The Problem We Solved
Before today, our verification pipeline ran packages through four steps: install, import, smoke test, and publisher tests. The smoke test used auto-generated inputs — educated guesses based on the tool's input schema. This worked well for simple tools, but had limitations:
- API tools hit credential boundaries (no real API keys in the sandbox)
- File-based tools had no files to process
- Complex input tools received syntactically valid but semantically meaningless inputs
The result: many high-quality packages scored well but couldn't reach Gold tier because the smoke test couldn't fully exercise them.
The Solution: Publisher-Declared Verification Cases
Verification cases flip the model. Instead of the platform guessing what inputs to try, the publisher declares exactly how to test their tool:
```yaml
verification:
  cases:
    - name: "analyze_sample_csv"
      tool: "describe_csv"
      input:
        file_path: "/workspace/fixtures/test_data.csv"
      expected:
        return_type: "dict"
        required_keys: ["rows", "columns", "statistics"]
    - name: "filter_by_city"
      tool: "filter_csv"
      input:
        file_path: "/workspace/fixtures/test_data.csv"
        column: "city"
        value: "Berlin"
        operator: "=="
      expected:
        return_type: "dict"
        required_keys: ["matched_rows", "rows"]
```

The platform runs these cases in the same sandboxed container as before — same security guarantees, same isolation. But now it has meaningful inputs and concrete expectations.
Three Patterns for Three Types of Tools
1. File-Based Tools
Tools that process files (CSV, JSON, images, audio) bundle a small test fixture in their package:
```
my-pack/
├── fixtures/
│   └── test_data.csv
├── src/
│   └── my_pack/tool.py
└── agentnode.yaml
```

The sandbox mounts the package at `/workspace/`, so fixtures are accessible at `/workspace/fixtures/test_data.csv`. No network access needed. The tool runs exactly as it would in production, just with a known input.
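For illustration, here is a minimal sketch of a `describe_csv` tool that would satisfy the first case above. The tool name and expected keys come from the example; the implementation itself is hypothetical:

```python
import csv

def describe_csv(file_path: str) -> dict:
    """Summarize a CSV file; returns the keys the case above expects."""
    with open(file_path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = list(rows[0].keys()) if rows else []
    return {
        "rows": len(rows),
        "columns": columns,
        "statistics": {},  # per-column stats elided in this sketch
    }
```

During verification the pipeline calls it with the bundled fixture at `/workspace/fixtures/test_data.csv` and checks the declared `return_type` and `required_keys`.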
2. API Tools (VCR Cassettes)
Tools that call external APIs face a challenge: the sandbox blocks all network access. The solution is VCR cassette replay — pre-recorded HTTP responses that are replayed during verification:
```yaml
verification:
  cases:
    - name: "search_python"
      tool: "search_web"
      input:
        query: "Python programming"
      cassette: "fixtures/cassettes/search.yaml"
      expected:
        return_type: "dict"
        required_keys: ["results"]
```

The publisher records the cassette once (locally, with real API access), commits the YAML file, and the verification pipeline replays those exact responses on every run (see the recording sketch after this list). This gives us:
- Determinism — same response every time
- No credentials needed — sandbox stays locked down
- Real behavior — the tool processes real API responses, not mocks
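Until dedicated CLI tooling ships (see "What's Next" below), one way to record a cassette locally is the vcrpy library. This is a sketch, assuming `search_web` is the tool from the example and that your recording setup produces the cassette format the pipeline replays:

```python
import vcr

# Hypothetical import path for the tool under test.
from my_pack.tool import search_web

# Run once locally with real API credentials. The first run records live
# HTTP traffic into fixtures/cassettes/search.yaml; later runs replay it.
with vcr.use_cassette("fixtures/cassettes/search.yaml"):
    result = search_web(query="Python programming")
    assert "results" in result  # mirrors the declared required_keys
```

Commit the resulting YAML file alongside your package and reference it in the case's `cassette` field.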
3. Pure Computation
Tools that don't need files or network (formatters, parsers, calculators) just declare their inputs directly:
```yaml
verification:
  cases:
    - name: "format_json"
      input:
        data: '{"key": "value"}'
        indent: 2
      expected:
        return_type: "str"
```

The Gold Tier Policy
With verification cases, the tier system now has a clear policy:
| Condition | Maximum Tier |
|---|---|
| No verification cases | Verified |
| Cases present, all passing | Gold |
This is deliberate. A package that scores 95/100 but has no explicit verification cases maxes out at Verified. The auto-generated smoke test proves the tool installs and imports correctly. But only publisher-declared cases prove it works as intended.
Gold means: "This tool was tested with meaningful inputs and produced expected outputs in a sandboxed environment."
What Gets Validated
Each case can declare expectations about the return value:
| Field | What it checks |
|---|---|
| `return_type` | Python type name (`dict`, `str`, `list`, etc.) |
| `required_keys` | Keys that must exist in a dict return |
| `min_lengths` | Minimum length of specific collections |
| `min_length` | Minimum total length of the return value |
Beyond these static checks, the pipeline also runs stability checks (same input, 3 runs, compare outputs) to measure reliability and determinism. All of this feeds into the final score.
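To make these checks concrete, here is a rough sketch of how a case's `expected` block could be evaluated. The function and its internals are our illustration, not the pipeline's actual code:

```python
def check_expectations(result, expected: dict) -> list[str]:
    """Return failure messages for one case; an empty list means it passed."""
    failures = []
    if "return_type" in expected and type(result).__name__ != expected["return_type"]:
        failures.append(
            f"expected {expected['return_type']}, got {type(result).__name__}"
        )
    if "required_keys" in expected:
        missing = [k for k in expected["required_keys"] if k not in result]
        if missing:
            failures.append(f"missing keys: {missing}")
    if "min_length" in expected and len(result) < expected["min_length"]:
        failures.append("return value shorter than min_length")
    for key, n in expected.get("min_lengths", {}).items():
        # min_lengths applies to collections inside a dict return
        if len(result.get(key, ())) < n:
            failures.append(f"{key!r} shorter than {n}")
    return failures
```

The stability check then reruns the same case and compares the three outputs for equality; consistent results feed the determinism component of the score.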
System Requirements
Some tools need system-level dependencies that aren't available in the base container. Verification cases support a `system_requirements` field that tells the pipeline which specialized container image to use:
```yaml
verification:
  system_requirements: ["browser"]
  cases:
    - name: "render_page"
      input:
        url: "https://example.com"
      cassette: "fixtures/cassettes/render.yaml"
```

Currently supported: `browser` (Playwright), `ffmpeg`, `tesseract`, `imagemagick`. Unknown requirements produce a warning but don't block publishing — forward-compatible by design.
Backward Compatibility
If you published before today using the older `verification.fixtures` or `verification.test_input` format: nothing breaks. The pipeline automatically normalizes legacy formats into the new `cases` structure. Your existing Gold tier is preserved.
New packages should use `verification.cases` directly. The older formats remain functional but are considered deprecated.
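A sketch of what that normalization could look like. The legacy field names come from this post, but the shape of the old formats, and therefore this transformation, is an assumption:

```python
def normalize_verification(block: dict) -> dict:
    """Upgrade a legacy verification block to the cases structure (sketch)."""
    if "cases" in block:
        return block  # already in the new format
    cases = []
    if "test_input" in block:
        # the old single-input style becomes one case with no expectations
        cases.append({"name": "legacy_test_input", "input": block.pop("test_input")})
    for i, fixture in enumerate(block.pop("fixtures", [])):
        cases.append({"name": f"legacy_fixture_{i}", **fixture})
    block["cases"] = cases
    return block
```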
The Numbers
After deploying verification cases and re-verifying our starter pack library:
- 95 tool packs at Gold tier (score ≥ 90, all criteria met)
- 0 regressions — no existing Gold pack lost its tier
- 3 verification modes tracked: `fixture` (VCR replay), `cases_real` (local execution), `real_auto` (auto-generated)
Getting Started
Adding verification cases to your package takes about 5 minutes:
- Create a `fixtures/` directory with test data (if your tool processes files)
- Add a `MANIFEST.in` with `recursive-include fixtures *`
- Add the `verification.cases` block to your `agentnode.yaml`
- Publish a new version
The pipeline runs your cases automatically. If they pass along with the standard checks (contract, reliability, determinism), you'll reach Gold on the next verification run.
Full schema reference: Publishing Guide — Verification Cases
What's Next
Verification cases are the foundation for several upcoming features:
- Publisher CLI tooling — record VCR cassettes with `agentnode record-fixture`
- Compatibility matrix — run cases across Python versions and dependency sets
- Public verification reports — show exact inputs/outputs on the package detail page
The goal is simple: every package in the registry should be provably functional. Verification cases make that possible without manual review, without special-case logic, and without trusting the publisher's CI.
Trust, but verify — automatically.