How to Write Tests for Your Agent Skill and Maximize Your Verification Score
Your verification score determines how much agents trust your package. Learn how to write tests that push your score into Gold tier.
Why Your Verification Score Matters
When an AI agent searches the AgentNode registry for a tool, it sees scores and tier badges. A package with a Gold badge (90+ score) signals that the tool installs cleanly, runs correctly, produces consistent output, and has been tested. A package sitting at Partial (50-69) or Unverified (<50) tells the agent — and the human reviewing the agent's choices — that this tool is a gamble.
Verification happens automatically when you publish. The pipeline runs your tool in a sandbox and scores it across multiple dimensions. You cannot game the score, but you can engineer it by understanding what each component measures and writing code and tests that satisfy them.
The Scoring Breakdown
The total score is out of 100 points, broken down as follows:
| Component | Points | What It Measures |
|---|---|---|
| Install | 15 | Dependencies install without errors |
| Import | 15 | Entrypoint module imports cleanly |
| Smoke Test | 25 | Tool runs with sample input and returns valid output |
| Unit Tests | 15 | Tests in tests/ directory pass |
| Reliability | 10 | Consistent results across multiple smoke runs |
| Determinism | 5 | Same input produces same output |
| Contract | 10 | Output matches declared output_schema |
| Warnings | -2 each | Deprecation warnings, resource leaks, noisy stderr |
Let us go through each component and discuss exactly how to maximize it.
Install (15 Points)
This step installs everything listed in `runtime.dependencies` from PyPI into a clean virtual environment. To earn the full 15 points:
- List every dependency your code imports. If you use `requests`, it must be in the manifest even if it happens to be pre-installed on your machine.
- Use version ranges, not exact pins. `requests>=2.28` is better than `requests==2.31.0` because pinned versions may conflict with other packages in the sandbox.
- Avoid dependencies with native extensions that require system libraries. The verification sandbox has a standard Python environment. Packages like `psycopg2` (which requires libpq) will fail; use `psycopg2-binary` instead.
- Test your dependencies in a clean virtual environment locally before publishing. Create a fresh venv, install only what the manifest lists, and confirm the tool still works.
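As a quick local sanity check for the first bullet, you can scan your module's source for top-level imports and compare them against what your manifest declares. This is a minimal sketch using only the standard library; note that it collects import names, which do not always match PyPI distribution names (`bs4` vs. `beautifulsoup4`, for example), so treat its output as a checklist, not an oracle.

```python
import ast

def imported_packages(source: str) -> set:
    """Collect the top-level package names a module imports."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            packages.add(node.module.split(".")[0])
    return packages

# Compare against the manifest's dependency list (hypothetical example)
declared = {"requests"}
found = imported_packages("import requests\nfrom bs4 import BeautifulSoup\n")
missing = found - declared  # {'bs4'}: add beautifulsoup4 to the manifest
```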
Import (15 Points)
The pipeline does the equivalent of `import your_module.tool`. To pass:
- No side effects on import. If your module connects to a database, reads a file, or makes an API call at the top level (outside any function), it will fail in the sandbox. Move all initialization inside functions.
- Include `__init__.py` files. Every directory in your module path needs one.
- No conditional imports that fail silently. If you catch `ImportError` and fall back to something, make sure the fallback works without the missing package.
```python
# BAD — runs on import, will fail in sandbox
import redis

cache = redis.Redis()  # Connection attempt at import time
```

```python
# GOOD — deferred to function call
import redis

def my_tool(key: str) -> dict:
    cache = redis.Redis()
    return {"value": cache.get(key)}
```
Smoke Test (25 Points)
This is the single largest scoring component. The pipeline generates sample input based on your `input_schema` and calls your function. To maximize these 25 points:
- Make your `input_schema` descriptive. The smoke test generator uses field names, types, and descriptions to create realistic input. A field named `url` with `type: string` will get a URL. A field named `x` with no description might get garbage.
- Handle edge cases gracefully. The generated input may be empty strings, zero-length arrays, or boundary values. Do not crash on unexpected input: return an error result or raise a clear exception.
- Respect timeouts. The smoke test has a 30-second timeout. If your tool makes network calls, use reasonable timeouts on those calls too.
- Avoid requiring authentication in the smoke path. If your tool needs an API key, design it so that a missing key produces a clear error rather than an unhandled crash. The smoke test runs without your environment variables.
Dealing with Network-Dependent Tools
If your tool calls external APIs, the smoke test will attempt real network calls. That is fine if the API is public and free. If the tool requires authentication, you have two options:
- Return a structured error when the API key is missing, such as `{"error": "OPENWEATHER_API_KEY not set"}`. This still passes the smoke test because the function returned valid output without crashing.
- Include a fallback mode with sample data when credentials are absent. The smoke test will use the fallback, and real users will get live data.
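The first option can be sketched as follows. The function and environment variable names are illustrative, and the live branch is omitted so the sketch stays self-contained; the point is that a missing key yields a valid structured result instead of an exception.

```python
import os

def get_weather(city: str) -> dict:
    """Hypothetical weather tool; env var name is illustrative."""
    api_key = os.environ.get("OPENWEATHER_API_KEY")
    if not api_key:
        # Structured error: valid output, no crash, so the smoke test still passes
        return {"error": "OPENWEATHER_API_KEY not set", "city": city}
    # Live API call would go here; omitted from this sketch
    raise NotImplementedError("live path omitted from this sketch")
```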
Unit Tests (15 Points)
This is where most developers can make the biggest improvement. The pipeline runs `pytest tests/` and scores based on how many tests pass.
Writing Effective Tests
Place your tests in a `tests/` directory at the package root. Use standard pytest conventions:

```
tests/
├── __init__.py     # optional but recommended
├── test_tool.py
└── conftest.py     # optional, for shared fixtures
```
Here is a template for a well-structured test file:
```python
import pytest

from my_tool.tool import my_function


class TestHappyPath:
    """Tests for expected, valid inputs."""

    def test_basic_input(self):
        result = my_function(text="hello world")
        assert result["word_count"] == 2

    def test_with_optional_params(self):
        result = my_function(text="hello world", max_words=1)
        assert result["word_count"] <= 1


class TestEdgeCases:
    """Tests for boundary conditions."""

    def test_empty_input(self):
        result = my_function(text="")
        assert result["word_count"] == 0

    def test_very_long_input(self):
        result = my_function(text="word " * 10000)
        assert isinstance(result["word_count"], int)

    def test_unicode_input(self):
        result = my_function(text="café naïve résumé")
        assert result["word_count"] > 0


class TestOutputShape:
    """Tests that verify the output matches the declared schema."""

    def test_returns_dict(self):
        result = my_function(text="test")
        assert isinstance(result, dict)

    def test_has_required_keys(self):
        result = my_function(text="test")
        assert "word_count" in result
        assert "text" in result

    def test_correct_types(self):
        result = my_function(text="test")
        assert isinstance(result["word_count"], int)
        assert isinstance(result["text"], str)


class TestErrorHandling:
    """Tests for invalid or problematic inputs."""

    def test_none_input_raises(self):
        with pytest.raises((TypeError, ValueError)):
            my_function(text=None)
```
What Makes Tests Score Well
- Coverage of distinct behaviors. Five tests that all check the happy path are worth less than five tests that each check a different scenario.
- All tests must pass. A single failing test reduces your score. Remove or fix broken tests before publishing.
- Fast execution. Tests that take too long may be killed by the pipeline timeout. Mock network calls if needed.
- No external dependencies in tests. Do not rely on a running database, a live API, or files that exist on your machine but not in the sandbox. Use fixtures and mocks.
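The last bullet is worth a concrete illustration. One common approach, sketched here with a hypothetical `fetch_title` tool built on the standard library, is to stub out the network call with `unittest.mock.patch` so the test runs entirely offline:

```python
import json
import urllib.request
from unittest.mock import patch

def fetch_title(url: str) -> dict:
    """Hypothetical tool: download a JSON document and return its title field."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        data = json.load(resp)
    return {"title": data.get("title", "")}

def test_fetch_title_without_network():
    # Stub out urlopen so the test needs no live endpoint
    with patch("urllib.request.urlopen") as mock_urlopen:
        mock_urlopen.return_value.__enter__.return_value.read.return_value = b'{"title": "hello"}'
        assert fetch_title("https://example.com/doc.json") == {"title": "hello"}
```

The same pattern applies to any external dependency: patch the boundary your tool calls through, not the tool itself.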
Reliability (10 Points)
The pipeline runs the smoke test multiple times and checks that the results are consistent. To earn full reliability points:
- Avoid randomness in output. If your tool uses random sampling, seed it or make the output independent of the randomness (e.g., always returning the same structure even if internal processing varies).
- Handle transient failures. If your tool makes network calls, a single timeout can cost you reliability points. Use retries with short backoffs for external calls.
- Clean up after yourself. If your tool writes temp files, delete them. Leftover state from one run can affect the next.
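For the randomness bullet, one option is to seed a local RNG per call, so repeated smoke runs see identical output and nothing leaks through global random state. A minimal sketch (the function and seed are illustrative):

```python
import random

def sample_lines(lines: list, k: int) -> dict:
    """Sampling with a fixed, per-call seed: repeated runs give identical output."""
    rng = random.Random(0)  # local RNG, so global random state never matters
    chosen = rng.sample(lines, min(k, len(lines)))
    return {"sample": chosen}
```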
Determinism (5 Points)
Related to reliability but stricter: given the exact same input, does the tool produce the exact same output? This earns you 5 bonus points.
- Avoid timestamps in output unless they are meaningful. An `extracted_at` field that changes every call makes the tool non-deterministic.
- Sort collections. If your tool returns a list of items built from a set or dictionary, the order may vary across runs. Sort them.
- Use deterministic algorithms. If you can choose between a randomized and deterministic approach, choose deterministic.
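The sorting bullet in particular trips people up, because set iteration order is not guaranteed to be stable across Python processes. A small illustrative sketch:

```python
def unique_tags(records: list) -> dict:
    """Deduplicate via a set, then sort: raw set iteration order can vary across runs."""
    tags = {tag for record in records for tag in record.get("tags", [])}
    return {"tags": sorted(tags)}
```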
Contract Compliance (10 Points)
The pipeline checks that the actual output of your tool matches the `output_schema` in your manifest. This is a common source of lost points:
- Declare every field you return. If your function returns a `metadata` key that is not in the output schema, you may lose points for unexpected fields.
- Match types exactly. If the schema says `integer`, do not return a `float`. Python's `len()` returns `int`, but `count / total` returns `float`; be deliberate.
- Handle nullable fields. If a field can be `None`, declare it as `type: ["string", "null"]` in the schema.
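You can catch the first two problems locally before publishing. This is a deliberately minimal checker, not a full JSON Schema validator (no nested objects, no nullable unions); for anything serious, a library like `jsonschema` is the better tool.

```python
def contract_problems(output: dict, schema: dict) -> list:
    """Flag undeclared fields and simple top-level type mismatches."""
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "array": list, "object": dict}
    props = schema.get("properties", {})
    problems = []
    for key, value in output.items():
        if key not in props:
            problems.append(f"undeclared field: {key}")
        else:
            expected = type_map.get(props[key].get("type"))
            if expected and not isinstance(value, expected):
                problems.append(f"{key}: schema says {props[key]['type']}, "
                                f"got {type(value).__name__}")
    return problems
```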
```python
# Schema says: word_count is integer

# BAD:
return {"word_count": float(len(words))}  # Returns 42.0, not 42

# GOOD:
return {"word_count": len(words)}  # Returns 42
```
Warnings (-2 Points Each)
The pipeline captures stderr and warnings during execution. Each distinct warning costs 2 points. Common offenders:
- Deprecation warnings from outdated library usage. Pin to a newer version of the dependency.
- Unclosed resource warnings (files, sockets, connections). Always use context managers (`with` statements).
- Print statements to stderr. Some logging configurations write to stderr by default. Redirect or silence them during tool execution.
```python
# BAD — may trigger ResourceWarning
import requests

def fetch(url):
    resp = requests.get(url)
    return resp.text
```

```python
# GOOD — connection properly managed
import requests

def fetch(url):
    with requests.Session() as session:
        resp = session.get(url)
        return resp.text
```
A Scoring Checklist
Before publishing, run through this checklist:
- All dependencies listed in `runtime.dependencies`? (Install: 15 pts)
- Module imports with no side effects? (Import: 15 pts)
- Tool runs with typical input without crashing? (Smoke: 25 pts)
- At least 3-5 tests in `tests/`, all passing? (Tests: 15 pts)
- Tool produces the same output on repeated runs? (Reliability: 10 pts, Determinism: 5 pts)
- Output matches `output_schema` exactly? (Contract: 10 pts)
- No warnings or resource leaks? (Avoid -2 per warning)
A package that checks every box starts at 95 points — well into Gold territory. Most packages that fall short do so because of missing tests (0 out of 15 points), a schema mismatch (0 out of 10 points), or warnings they did not know about.
Iterating on Your Score
If your first publish lands below Gold, do not worry — that is normal. Read the verification logs on your package page to see exactly where you lost points. Fix the issues, bump the version in your manifest, and publish again:
```bash
# After fixing issues:
# update the version in manifest.yaml from "0.1.0" to "0.1.1", then republish
agentnode publish
```
Each version is verified independently. Your latest version's score is what agents see, so there is no penalty for iterating. Some of the highest-rated packages on AgentNode went through three or four versions before hitting Gold.
Start writing better tests today, and give agents a reason to trust your tools.