Verification Cases: How Packages Prove They Work
We shipped a declarative verification system that lets publishers prove their tools work. No manual review, no special-case logic — declare your test cases and the pipeline handles the rest.
When you install a package from a registry, how do you know it actually works? Most ecosystems rely on unit tests that run during CI — but those tests are written by the publisher and run in the publisher's environment. They prove the code works somewhere. Not necessarily on your machine, in your Python version, with your dependency tree.
AgentNode takes a different approach. Every package is verified by the platform, in a sandboxed container, with reproducible inputs. Today we're shipping the final piece of that system: verification cases.
The Problem We Solved
Before today, our verification pipeline ran packages through four steps: install, import, smoke test, and publisher tests. The smoke test used auto-generated inputs — educated guesses based on the tool's input schema. This worked well for simple tools, but had limitations:
- API tools hit credential boundaries (no real API keys in the sandbox)
- File-based tools had no files to process
- Complex input tools received syntactically valid but semantically meaningless inputs
The result: many high-quality packages scored well but couldn't reach Gold tier because the smoke test couldn't fully exercise them.
The Solution: Publisher-Declared Verification Cases
Verification cases flip the model. Instead of the platform guessing what inputs to try, the publisher declares exactly how to test their tool:
```yaml
verification:
  cases:
    - name: "analyze_sample_csv"
      tool: "describe_csv"
      input:
        file_path: "/workspace/fixtures/test_data.csv"
      expected:
        return_type: "dict"
        required_keys: ["rows", "columns", "statistics"]
    - name: "filter_by_city"
      tool: "filter_csv"
      input:
        file_path: "/workspace/fixtures/test_data.csv"
        column: "city"
        value: "Berlin"
        operator: "=="
      expected:
        return_type: "dict"
        required_keys: ["matched_rows", "rows"]
```

The platform runs these cases in the same sandboxed container as before — same security guarantees, same isolation. But now it has meaningful inputs and concrete expectations.
Three Patterns for Three Types of Tools
1. File-Based Tools
Tools that process files (CSV, JSON, images, audio) bundle a small test fixture in their package:
```
my-pack/
├── fixtures/
│   └── test_data.csv
├── src/
│   └── my_pack/tool.py
└── agentnode.yaml
```

The sandbox mounts the package at `/workspace/`, so fixtures are accessible at `/workspace/fixtures/test_data.csv`. No network access needed. The tool runs exactly as it would in production, just with a known input.
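For illustration, here is a minimal sketch of a `describe_csv` tool that would satisfy the first case above. The tool name and expected keys come from the example; the implementation itself is hypothetical:

```python
import csv

def describe_csv(file_path: str) -> dict:
    """Summarize a CSV file; returns the keys the case above expects."""
    with open(file_path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = list(rows[0].keys()) if rows else []
    return {
        "rows": len(rows),
        "columns": columns,
        "statistics": {},  # per-column stats elided in this sketch
    }
```

During verification the pipeline calls it with the bundled fixture at `/workspace/fixtures/test_data.csv` and checks the declared `return_type` and `required_keys`.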
2. API Tools (VCR Cassettes)
Tools that call external APIs face a challenge: the sandbox blocks all network access. The solution is VCR cassette replay — pre-recorded HTTP responses that are replayed during verification:
```yaml
verification:
  cases:
    - name: "search_python"
      tool: "search_web"
      input:
        query: "Python programming"
      cassette: "fixtures/cassettes/search.yaml"
      expected:
        return_type: "dict"
        required_keys: ["results"]
```

The publisher records the cassette once (locally, with real API access), commits the YAML file, and the verification pipeline replays those exact responses on every run (see the recording sketch after this list). This gives us:
- Determinism — same response every time
- No credentials needed — sandbox stays locked down
- Real behavior — the tool processes real API responses, not mocks
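Until dedicated CLI tooling ships (see "What's Next" below), one way to record a cassette locally is the vcrpy library. This is a sketch, assuming `search_web` is the tool from the example and that your recording setup produces the cassette format the pipeline replays:

```python
import vcr

# Hypothetical import path for the tool under test.
from my_pack.tool import search_web

# Run once locally with real API credentials. The first run records live
# HTTP traffic into fixtures/cassettes/search.yaml; later runs replay it.
with vcr.use_cassette("fixtures/cassettes/search.yaml"):
    result = search_web(query="Python programming")
    assert "results" in result  # mirrors the declared required_keys
```

Commit the resulting YAML file alongside your package and reference it in the case's `cassette` field.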
3. Pure Computation
Tools that don't need files or network (formatters, parsers, calculators) just declare their inputs directly:
```yaml
verification:
  cases:
    - name: "format_json"
      input:
        data: '{"key": "value"}'
        indent: 2
      expected:
        return_type: "str"
```

The Gold Tier Policy
With verification cases, the tier system now has a clear policy:
| Condition | Maximum Tier |
|---|---|
| No verification cases | Verified |
| Cases present, all passing | Gold |
This is deliberate. A package that scores 95/100 but has no explicit verification cases maxes out at Verified. The auto-generated smoke test proves the tool installs and imports correctly. But only publisher-declared cases prove it works as intended.
Gold means: "This tool was tested with meaningful inputs and produced expected outputs in a sandboxed environment."
What Gets Validated
Each case can declare expectations about the return value:
| Field | What it checks |
|---|---|
| `return_type` | Python type name (`dict`, `str`, `list`, etc.) |
| `required_keys` | Keys that must exist in a dict return |
| `min_lengths` | Minimum length of specific collections |
| `min_length` | Minimum total length of the return value |
Beyond these static checks, the pipeline also runs stability checks (same input, 3 runs, compare outputs) to measure reliability and determinism. All of this feeds into the final score.
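To make these checks concrete, here is a rough sketch of how a case's `expected` block could be evaluated. The function and its internals are our illustration, not the pipeline's actual code:

```python
def check_expectations(result, expected: dict) -> list[str]:
    """Return failure messages for one case; an empty list means it passed."""
    failures = []
    if "return_type" in expected and type(result).__name__ != expected["return_type"]:
        failures.append(
            f"expected {expected['return_type']}, got {type(result).__name__}"
        )
    if "required_keys" in expected:
        missing = [k for k in expected["required_keys"] if k not in result]
        if missing:
            failures.append(f"missing keys: {missing}")
    if "min_length" in expected and len(result) < expected["min_length"]:
        failures.append("return value shorter than min_length")
    for key, n in expected.get("min_lengths", {}).items():
        # min_lengths applies to collections inside a dict return
        if len(result.get(key, ())) < n:
            failures.append(f"{key!r} shorter than {n}")
    return failures
```

The stability check then reruns the same case and compares the three outputs for equality; consistent results feed the determinism component of the score.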
System Requirements
Some tools need system-level dependencies that aren't available in the base container. Verification cases support a `system_requirements` field that tells the pipeline which specialized container image to use:
```yaml
verification:
  system_requirements: ["browser"]
  cases:
    - name: "render_page"
      input:
        url: "https://example.com"
      cassette: "fixtures/cassettes/render.yaml"
```

Currently supported: `browser` (Playwright), `ffmpeg`, `tesseract`, `imagemagick`. Unknown requirements produce a warning but don't block publishing — forward-compatible by design.
Backward Compatibility
If you published before today using the older `verification.fixtures` or `verification.test_input` format: nothing breaks. The pipeline automatically normalizes legacy formats into the new `cases` structure. Your existing Gold tier is preserved.
New packages should use `verification.cases` directly. The older formats remain functional but are considered deprecated.
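A sketch of what that normalization could look like. The legacy field names come from this post, but the shape of the old formats, and therefore this transformation, is an assumption:

```python
def normalize_verification(block: dict) -> dict:
    """Upgrade a legacy verification block to the cases structure (sketch)."""
    if "cases" in block:
        return block  # already in the new format
    cases = []
    if "test_input" in block:
        # the old single-input style becomes one case with no expectations
        cases.append({"name": "legacy_test_input", "input": block.pop("test_input")})
    for i, fixture in enumerate(block.pop("fixtures", [])):
        cases.append({"name": f"legacy_fixture_{i}", **fixture})
    block["cases"] = cases
    return block
```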
The Numbers
After deploying verification cases and re-verifying our starter pack library:
- 95 tool packs at Gold tier (score ≥ 90, all criteria met)
- 0 regressions — no existing Gold pack lost its tier
- 3 verification modes tracked: `fixture` (VCR replay), `cases_real` (local execution), `real_auto` (auto-generated)
Getting Started
Adding verification cases to your package takes about 5 minutes:
- Create a `fixtures/` directory with test data (if your tool processes files)
- Add a `MANIFEST.in` with `recursive-include fixtures *`
- Add the `verification.cases` block to your `agentnode.yaml`
- Publish a new version
The pipeline runs your cases automatically. If they pass along with the standard checks (contract, reliability, determinism), you'll reach Gold on the next verification run.
Full schema reference: Publishing Guide — Verification Cases
What's Next
Verification cases are the foundation for several upcoming features:
- Publisher CLI tooling — record VCR cassettes with `agentnode record-fixture`
- Compatibility matrix — run cases across Python versions and dependency sets
- Public verification reports — show exact inputs/outputs on the package detail page
The goal is simple: every package in the registry should be provably functional. Verification cases make that possible without manual review, without special-case logic, and without trusting the publisher's CI.
Trust, but verify — automatically.