How to Write Tests for Your Agent Skill and Maximize Your Verification Score
Your verification score determines how much agents trust your package. Learn how to write tests that push your score into Gold tier.
Why Your Verification Score Matters
When an AI agent searches the AgentNode registry for a tool, it sees scores and tier badges. A package with a Gold badge (90+ score) signals that the tool installs cleanly, runs correctly, produces consistent output, and has been tested. A package sitting at Partial (50-69) or Unverified (<50) tells the agent — and the human reviewing the agent's choices — that this tool is a gamble.
Verification happens automatically when you publish. The pipeline runs your tool in a sandbox and scores it across multiple dimensions. You cannot game the score, but you can engineer it by understanding what each component measures and writing code and tests that satisfy them.
The Scoring Breakdown
The total score is out of 100 points, broken down as follows:
| Component | Points | What It Measures |
|---|---|---|
| Install | 15 | Dependencies install without errors |
| Import | 15 | Entrypoint module imports cleanly |
| Smoke Test | 25 | Tool runs with sample input and returns valid output |
| Unit Tests | 15 | Tests in tests/ directory pass |
| Reliability | 10 | Consistent results across multiple smoke runs |
| Determinism | 5 | Same input produces same output |
| Contract | 10 | Output matches declared output_schema |
| Warnings | -2 each | Deprecation warnings, resource leaks, noisy stderr |
Let us go through each component and discuss exactly how to maximize it.
Install (15 Points)
This step installs everything listed in `runtime.dependencies` from PyPI into a clean virtual environment. To earn the full 15 points:
- List every dependency your code imports. If you use `requests`, it must be in the manifest even if it happens to be pre-installed on your machine.
- Use version ranges, not exact pins. `requests>=2.28` is better than `requests==2.31.0` because pinned versions may conflict with other packages in the sandbox.
- Avoid dependencies with native extensions that require system libraries. The verification sandbox has a standard Python environment. Packages like `psycopg2` (which requires libpq) will fail; use `psycopg2-binary` instead.
- Test your dependencies in a clean virtual environment locally before publishing. Create a fresh venv, install only what the manifest lists, and confirm the tool still works.
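As a quick local sanity check for the first bullet, you can scan your module's source for top-level imports and compare them against what your manifest declares. This is a minimal sketch using only the standard library; note that it collects import names, which do not always match PyPI distribution names (`bs4` vs. `beautifulsoup4`, for example), so treat its output as a checklist, not an oracle.

```python
import ast

def imported_packages(source: str) -> set:
    """Collect the top-level package names a module imports."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            packages.add(node.module.split(".")[0])
    return packages

# Compare against the manifest's dependency list (hypothetical example)
declared = {"requests"}
found = imported_packages("import requests\nfrom bs4 import BeautifulSoup\n")
missing = found - declared  # {'bs4'}: add beautifulsoup4 to the manifest
```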
Import (15 Points)
The pipeline does the equivalent of `import your_module.tool`. To pass:
- No side effects on import. If your module connects to a database, reads a file, or makes an API call at the top level (outside any function), it will fail in the sandbox. Move all initialization inside functions.
- Include `__init__.py` files. Every directory in your module path needs one.
- No conditional imports that fail silently. If you catch `ImportError` and fall back to something, make sure the fallback works without the missing package.
```python
# BAD — runs on import, will fail in sandbox
import redis

cache = redis.Redis()  # Connection attempt at import time
```

```python
# GOOD — deferred to function call
import redis

def my_tool(key: str) -> dict:
    cache = redis.Redis()
    return {"value": cache.get(key)}
```
Smoke Test (25 Points)
This is the single largest scoring component. The pipeline generates sample input based on your `input_schema` and calls your function. To maximize these 25 points:
- Make your `input_schema` descriptive. The smoke test generator uses field names, types, and descriptions to create realistic input. A field named `url` with `type: string` will get a URL. A field named `x` with no description might get garbage.
- Handle edge cases gracefully. The generated input may be empty strings, zero-length arrays, or boundary values. Do not crash on unexpected input: return an error result or raise a clear exception.
- Respect timeouts. The smoke test has a 30-second timeout. If your tool makes network calls, use reasonable timeouts on those calls too.
- Avoid requiring authentication in the smoke path. If your tool needs an API key, design it so that a missing key produces a clear error rather than an unhandled crash. The smoke test runs without your environment variables.
Dealing with Network-Dependent Tools
If your tool calls external APIs, the smoke test will attempt real network calls. That is fine if the API is public and free. If the tool requires authentication, you have two options:
- Return a structured error when the API key is missing, such as `{"error": "OPENWEATHER_API_KEY not set"}`. This still passes the smoke test because the function returned valid output without crashing.
- Include a fallback mode with sample data when credentials are absent. The smoke test will use the fallback, and real users will get live data.
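The first option can be sketched as follows. The function and environment variable names are illustrative, and the live branch is omitted so the sketch stays self-contained; the point is that a missing key yields a valid structured result instead of an exception.

```python
import os

def get_weather(city: str) -> dict:
    """Hypothetical weather tool; env var name is illustrative."""
    api_key = os.environ.get("OPENWEATHER_API_KEY")
    if not api_key:
        # Structured error: valid output, no crash, so the smoke test still passes
        return {"error": "OPENWEATHER_API_KEY not set", "city": city}
    # Live API call would go here; omitted from this sketch
    raise NotImplementedError("live path omitted from this sketch")
```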
Unit Tests (15 Points)
This is where most developers can make the biggest improvement. The pipeline runs `pytest tests/` and scores based on how many tests pass.
Writing Effective Tests
Place your tests in a `tests/` directory at the package root. Use standard pytest conventions:

```
tests/
├── __init__.py     # optional but recommended
├── test_tool.py
└── conftest.py     # optional, for shared fixtures
```
Here is a template for a well-structured test file:
```python
import pytest

from my_tool.tool import my_function


class TestHappyPath:
    """Tests for expected, valid inputs."""

    def test_basic_input(self):
        result = my_function(text="hello world")
        assert result["word_count"] == 2

    def test_with_optional_params(self):
        result = my_function(text="hello world", max_words=1)
        assert result["word_count"] <= 1


class TestEdgeCases:
    """Tests for boundary conditions."""

    def test_empty_input(self):
        result = my_function(text="")
        assert result["word_count"] == 0

    def test_very_long_input(self):
        result = my_function(text="word " * 10000)
        assert isinstance(result["word_count"], int)

    def test_unicode_input(self):
        result = my_function(text="café naïve résumé")
        assert result["word_count"] > 0


class TestOutputShape:
    """Tests that verify the output matches the declared schema."""

    def test_returns_dict(self):
        result = my_function(text="test")
        assert isinstance(result, dict)

    def test_has_required_keys(self):
        result = my_function(text="test")
        assert "word_count" in result
        assert "text" in result

    def test_correct_types(self):
        result = my_function(text="test")
        assert isinstance(result["word_count"], int)
        assert isinstance(result["text"], str)


class TestErrorHandling:
    """Tests for invalid or problematic inputs."""

    def test_none_input_raises(self):
        with pytest.raises((TypeError, ValueError)):
            my_function(text=None)
```
What Makes Tests Score Well
- Coverage of distinct behaviors. Five tests that all check the happy path are worth less than five tests that each check a different scenario.
- All tests must pass. A single failing test reduces your score. Remove or fix broken tests before publishing.
- Fast execution. Tests that take too long may be killed by the pipeline timeout. Mock network calls if needed.
- No external dependencies in tests. Do not rely on a running database, a live API, or files that exist on your machine but not in the sandbox. Use fixtures and mocks.
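The last bullet is worth a concrete illustration. One common approach, sketched here with a hypothetical `fetch_title` tool built on the standard library, is to stub out the network call with `unittest.mock.patch` so the test runs entirely offline:

```python
import json
import urllib.request
from unittest.mock import patch

def fetch_title(url: str) -> dict:
    """Hypothetical tool: download a JSON document and return its title field."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        data = json.load(resp)
    return {"title": data.get("title", "")}

def test_fetch_title_without_network():
    # Stub out urlopen so the test needs no live endpoint
    with patch("urllib.request.urlopen") as mock_urlopen:
        mock_urlopen.return_value.__enter__.return_value.read.return_value = b'{"title": "hello"}'
        assert fetch_title("https://example.com/doc.json") == {"title": "hello"}
```

The same pattern applies to any external dependency: patch the boundary your tool calls through, not the tool itself.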
Reliability (10 Points)
The pipeline runs the smoke test multiple times and checks that the results are consistent. To earn full reliability points:
- Avoid randomness in output. If your tool uses random sampling, seed it or make the output independent of the randomness (e.g., always returning the same structure even if internal processing varies).
- Handle transient failures. If your tool makes network calls, a single timeout can cost you reliability points. Use retries with short backoffs for external calls.
- Clean up after yourself. If your tool writes temp files, delete them. Leftover state from one run can affect the next.
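For the randomness bullet, one option is to seed a local RNG per call, so repeated smoke runs see identical output and nothing leaks through global random state. A minimal sketch (the function and seed are illustrative):

```python
import random

def sample_lines(lines: list, k: int) -> dict:
    """Sampling with a fixed, per-call seed: repeated runs give identical output."""
    rng = random.Random(0)  # local RNG, so global random state never matters
    chosen = rng.sample(lines, min(k, len(lines)))
    return {"sample": chosen}
```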
Determinism (5 Points)
Related to reliability but stricter: given the exact same input, does the tool produce the exact same output? This earns you 5 bonus points.
- Avoid timestamps in output unless they are meaningful. An `extracted_at` field that changes every call makes the tool non-deterministic.
- Sort collections. If your tool returns a list of items built from a set or dictionary, the order may vary across runs. Sort them.
- Use deterministic algorithms. If you can choose between a randomized and deterministic approach, choose deterministic.
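The sorting bullet in particular trips people up, because set iteration order is not guaranteed to be stable across Python processes. A small illustrative sketch:

```python
def unique_tags(records: list) -> dict:
    """Deduplicate via a set, then sort: raw set iteration order can vary across runs."""
    tags = {tag for record in records for tag in record.get("tags", [])}
    return {"tags": sorted(tags)}
```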
Contract Compliance (10 Points)
The pipeline checks that the actual output of your tool matches the `output_schema` in your manifest. This is a common source of lost points:
- Declare every field you return. If your function returns a `metadata` key that is not in the output schema, you may lose points for unexpected fields.
- Match types exactly. If the schema says `integer`, do not return a `float`. Python's `len()` returns `int`, but `count / total` returns `float`; be deliberate.
- Handle nullable fields. If a field can be `None`, declare it as `type: ["string", "null"]` in the schema.
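You can catch the first two problems locally before publishing. This is a deliberately minimal checker, not a full JSON Schema validator (no nested objects, no nullable unions); for anything serious, a library like `jsonschema` is the better tool.

```python
def contract_problems(output: dict, schema: dict) -> list:
    """Flag undeclared fields and simple top-level type mismatches."""
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "array": list, "object": dict}
    props = schema.get("properties", {})
    problems = []
    for key, value in output.items():
        if key not in props:
            problems.append(f"undeclared field: {key}")
        else:
            expected = type_map.get(props[key].get("type"))
            if expected and not isinstance(value, expected):
                problems.append(f"{key}: schema says {props[key]['type']}, "
                                f"got {type(value).__name__}")
    return problems
```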
```python
# Schema says: word_count is integer

# BAD:
return {"word_count": float(len(words))}  # Returns 42.0, not 42

# GOOD:
return {"word_count": len(words)}  # Returns 42
```
Warnings (-2 Points Each)
The pipeline captures stderr and warnings during execution. Each distinct warning costs 2 points. Common offenders:
- Deprecation warnings from outdated library usage. Pin to a newer version of the dependency.
- Unclosed resource warnings (files, sockets, connections). Always use context managers (`with` statements).
- Print statements to stderr. Some logging configurations write to stderr by default. Redirect or silence them during tool execution.
```python
# BAD — may trigger ResourceWarning
import requests

def fetch(url):
    resp = requests.get(url)
    return resp.text
```

```python
# GOOD — connection properly managed
import requests

def fetch(url):
    with requests.Session() as session:
        resp = session.get(url)
        return resp.text
```
A Scoring Checklist
Before publishing, run through this checklist:
- All dependencies listed in `runtime.dependencies`? (Install: 15 pts)
- Module imports with no side effects? (Import: 15 pts)
- Tool runs with typical input without crashing? (Smoke: 25 pts)
- At least 3-5 tests in `tests/`, all passing? (Tests: 15 pts)
- Tool produces the same output on repeated runs? (Reliability: 10 pts, Determinism: 5 pts)
- Output matches `output_schema` exactly? (Contract: 10 pts)
- No warnings or resource leaks? (Avoid -2 per warning)
A package that checks every box starts at 95 points — well into Gold territory. Most packages that fall short do so because of missing tests (0 out of 15 points), a schema mismatch (0 out of 10 points), or warnings they did not know about.
Iterating on Your Score
If your first publish lands below Gold, do not worry — that is normal. Read the verification logs on your package page to see exactly where you lost points. Fix the issues, bump the version in your manifest, and publish again:
```bash
# After fixing issues:
# update the version in manifest.yaml from "0.1.0" to "0.1.1", then republish
agentnode publish
```
Each version is verified independently. Your latest version's score is what agents see, so there is no penalty for iterating. Some of the highest-rated packages on AgentNode went through three or four versions before hitting Gold.
Start writing better tests today, and give agents a reason to trust your tools.