Building Agent Skills Tests: Maximize Your Verification Score
Learn how to write tests that maximize your AgentNode verification score. Covers test structure, unit vs integration tests, mock strategies, and tips for reaching Gold tier.
Your verification score determines whether developers trust and install your agent skill. A low score means fewer downloads, lower search rankings, and a Bronze badge that signals uncertainty. A high score — Silver or Gold — tells the world that your tool works as advertised, handles edge cases, and will not break in production.
The difference between Bronze and Gold often comes down to one thing: the quality of your tests. This guide shows you exactly how to write tests that maximize your AgentNode verification score, avoid common pitfalls, and reach Gold tier on your first submission.
How the Verification Pipeline Works
Before writing tests, you need to understand what the verification pipeline actually checks. When you submit a skill to AgentNode, the pipeline runs through several stages:
- Schema validation — Checks that your skill's manifest, input schema, and output schema conform to the ANP specification
- Static analysis — Scans your code for known vulnerability patterns, unsafe imports, and resource usage
- Test execution — Runs your test suite in a sandboxed environment and measures pass rate, coverage, and execution time
- Behavioral verification — Tests your skill against the declared input/output schemas with generated inputs
- Trust scoring — Aggregates results into a final score that determines your trust tier
Each stage contributes to your overall score. Tests directly affect the test execution and behavioral verification stages, and indirectly affect trust scoring. The testing guide for agent skills covers each stage in detail.
Trust Tiers Explained
AgentNode assigns one of three trust tiers based on your verification score:
- Bronze (60-74) — Basic verification passed. Schema is valid, no critical security issues, but test coverage or quality may be low.
- Silver (75-89) — Good verification. Solid test coverage, no security issues, behavioral tests pass.
- Gold (90-100) — Excellent verification. Comprehensive tests, edge cases covered, fast execution, no warnings.
Gold-tier skills get higher search rankings, a prominent badge, and significantly more downloads. Understanding why verification matters is the first step toward building skills that developers actually trust.
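The tier boundaries above can be expressed as a small helper. This is an illustrative sketch only — `score_to_tier` is a hypothetical name, not part of any AgentNode SDK, and the handling of scores below 60 (treated here as unverified) is an assumption based on the Bronze tier starting at 60:

```python
def score_to_tier(score: float) -> str:
    """Map a verification score (0-100) to its AgentNode trust tier."""
    if score >= 90:
        return "Gold"
    if score >= 75:
        return "Silver"
    if score >= 60:
        return "Bronze"
    return "Unverified"  # assumption: below 60, verification is not passed
```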
Test Structure: The Foundation
Every agent skill should have three categories of tests. Here is the recommended file structure:
```
my-skill/
├── src/
│   └── skill.py
├── tests/
│   ├── __init__.py
│   ├── test_unit.py          # Unit tests for internal functions
│   ├── test_integration.py   # Integration tests for the skill interface
│   ├── test_edge_cases.py    # Edge cases and error handling
│   └── conftest.py           # Shared fixtures
├── manifest.yaml
└── pyproject.toml
```
Unit Tests
Unit tests verify individual functions and methods in isolation. They should be fast, deterministic, and cover your core logic:
```python
import pytest

from my_skill.skill import parse_input, transform_data, format_output


class TestParseInput:
    def test_valid_string_input(self):
        result = parse_input({"text": "Hello world"})
        assert result.text == "Hello world"
        assert result.language is None  # default

    def test_input_with_all_options(self):
        result = parse_input({
            "text": "Hello world",
            "language": "en",
            "max_length": 100,
        })
        assert result.text == "Hello world"
        assert result.language == "en"
        assert result.max_length == 100

    def test_empty_text_raises(self):
        with pytest.raises(ValueError, match="text cannot be empty"):
            parse_input({"text": ""})

    def test_missing_required_field_raises(self):
        with pytest.raises(KeyError):
            parse_input({})
```
Integration Tests
Integration tests verify your skill's public interface — the run() method that AgentNode calls. These are the most important tests for your verification score:
```python
import time

import pytest

from my_skill import MySkill


class TestSkillIntegration:
    @pytest.fixture
    def skill(self):
        return MySkill()

    def test_basic_execution(self, skill):
        result = skill.run({"text": "The quick brown fox jumps over the lazy dog."})
        assert "output" in result
        assert isinstance(result["output"], dict)
        assert "summary" in result["output"]

    def test_output_matches_schema(self, skill):
        result = skill.run({"text": "Sample text for analysis."})
        output = result["output"]
        # Verify all declared output fields are present
        assert isinstance(output["summary"], str)
        assert isinstance(output["confidence"], float)
        assert 0.0 <= output["confidence"] <= 1.0

    def test_execution_time(self, skill):
        start = time.time()
        skill.run({"text": "Performance test input."})
        elapsed = time.time() - start
        assert elapsed < 5.0, "Skill execution should complete within 5 seconds"
```
Edge Case Tests
Edge case tests are what separate Silver from Gold. The verification pipeline specifically checks for these:
```python
import pytest


class TestEdgeCases:
    def test_very_long_input(self, skill):
        long_text = "word " * 10000
        result = skill.run({"text": long_text})
        assert result["output"]["summary"] is not None

    def test_unicode_input(self, skill):
        result = skill.run({"text": "日本語テスト 中文测试 한국어 테스트"})
        assert result["output"]["summary"] is not None

    def test_special_characters(self, skill):
        result = skill.run({"text": "Hello <script>alert('xss')</script> & 'quotes'"})
        assert "<script>" not in result["output"]["summary"]

    def test_whitespace_only(self, skill):
        with pytest.raises(ValueError):
            skill.run({"text": "   \n\t  "})

    def test_none_input(self, skill):
        with pytest.raises((TypeError, ValueError)):
            skill.run({"text": None})

    def test_numeric_input_coercion(self, skill):
        result = skill.run({"text": "12345"})
        assert result["output"]["summary"] is not None
```
Mock Strategies for API-Dependent Tools
Many agent skills depend on external APIs — language models, databases, third-party services. The verification pipeline runs in a sandboxed environment without network access, so you must mock these dependencies.
Strategy 1: Dependency Injection
```python
class MySkill:
    def __init__(self, llm_client=None):
        # Accept an injected client so tests can substitute a mock.
        self._llm = llm_client or DefaultLLMClient()

    def run(self, input_data):
        response = self._llm.complete(input_data["text"])
        return {"output": {"summary": response.text}}


# In tests:
class MockResponse:
    def __init__(self, text):
        self.text = text


class MockLLMClient:
    def complete(self, text):
        return MockResponse(text=f"Summary of: {text[:50]}")


def test_with_mock():
    skill = MySkill(llm_client=MockLLMClient())
    result = skill.run({"text": "Test input"})
    assert "Summary of:" in result["output"]["summary"]
```
Strategy 2: pytest Fixtures with monkeypatch
```python
@pytest.fixture
def mock_api(monkeypatch):
    def mock_call(self, prompt):
        return {"text": "Mocked response", "tokens": 10}

    monkeypatch.setattr("my_skill.api_client.APIClient.call", mock_call)


def test_skill_with_mocked_api(skill, mock_api):
    result = skill.run({"text": "Test"})
    assert result["output"]["summary"] == "Mocked response"
```
Strategy 3: Response Fixtures
The fixture file, tests/fixtures/api_responses.json:

```json
{
  "summarize_short": {"text": "Brief summary.", "tokens": 5},
  "summarize_long": {"text": "Detailed summary with multiple sentences.", "tokens": 25}
}
```

And the fixture that loads it, in tests/conftest.py:

```python
import json
from pathlib import Path

import pytest


@pytest.fixture
def api_responses():
    fixture_path = Path(__file__).parent / "fixtures" / "api_responses.json"
    with open(fixture_path) as f:
        return json.load(f)
```
What the Verification Pipeline Checks
Here is a detailed breakdown of what contributes to your score:
| Check | Weight | What It Measures |
|---|---|---|
| Test pass rate | 25% | Percentage of tests that pass |
| Code coverage | 20% | Lines and branches covered by tests |
| Edge case coverage | 15% | Tests for boundary conditions, error handling |
| Schema conformance | 15% | Output matches declared schema for all inputs |
| Execution time | 10% | Tests complete within time budget |
| Static analysis | 10% | No security issues, clean code patterns |
| Documentation | 5% | Docstrings, type hints, README quality |
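The table above suggests the final score is a weighted aggregate of per-check scores. A minimal sketch of that arithmetic, assuming each check is graded on a 0-100 scale (the function and dictionary names here are illustrative, not part of the actual pipeline):

```python
# Weights from the verification table above; they sum to 1.0.
WEIGHTS = {
    "test_pass_rate": 0.25,
    "code_coverage": 0.20,
    "edge_case_coverage": 0.15,
    "schema_conformance": 0.15,
    "execution_time": 0.10,
    "static_analysis": 0.10,
    "documentation": 0.05,
}


def aggregate_score(checks: dict) -> float:
    """Weighted sum of per-check scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * checks[name] for name in WEIGHTS)
```

Under this model, a skill scoring 100 on everything except documentation (0) would still land at 95 — comfortably Gold — while weak edge case coverage alone can cost up to 15 points.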
Tips for Reaching Gold Tier
Based on analysis of hundreds of Gold-tier skills, here are the patterns that consistently produce high scores:
1. Aim for 90%+ Code Coverage
Gold tier requires at least 85% line coverage and 75% branch coverage. Use `pytest --cov=my_skill --cov-report=term-missing` to identify gaps.
2. Test Every Input Schema Field
For each field in your input schema, write tests for: valid values, boundary values, invalid types, missing/null values, and extreme values.
3. Test Every Output Schema Field
Verify that every declared output field is present and correctly typed in every test that exercises the run() method.
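A shared helper in conftest.py keeps these checks consistent across tests. This sketch mirrors the `summary`/`confidence` fields from the earlier examples — substitute your own declared fields:

```python
def assert_output_schema(output: dict) -> None:
    """Verify every declared output field is present and correctly typed.

    Field names mirror the examples above; adjust to your own output schema.
    """
    assert isinstance(output.get("summary"), str), "summary must be a string"
    assert isinstance(output.get("confidence"), float), "confidence must be a float"
    assert 0.0 <= output["confidence"] <= 1.0, "confidence out of [0, 1] range"
```

Calling this helper from every integration test that exercises run() means a schema regression fails loudly everywhere, not just in the one test that happened to check the affected field.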
4. Include Negative Tests
The pipeline specifically looks for tests that verify error behavior. Every pytest.raises block contributes to your edge case coverage score.
5. Keep Tests Fast
The pipeline has a 60-second timeout for your entire test suite. If you are mocking correctly, this is generous — but large fixture files or expensive setup can eat into it.
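One way to keep expensive setup out of the per-test budget is a session-scoped fixture, which pays the cost once for the whole suite instead of once per test. A sketch (the fixture name and its contents are placeholders for your own expensive setup):

```python
import pytest


@pytest.fixture(scope="session")
def shared_resources():
    # Built once for the entire test session rather than per test.
    # Stand-in for expensive setup such as loading a large fixture file.
    return {"model": "loaded-once"}
```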
6. Add Type Hints Everywhere
Type hints on your skill's public interface contribute to the documentation score and help the behavioral verification stage generate better test inputs.
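A fully annotated public interface might look like the following sketch. It mirrors the earlier examples; the body is a stand-in, and the exact return shape is whatever your output schema declares:

```python
from typing import Any, Optional


class MySkill:
    """Typed public interface; annotations double as machine-readable docs."""

    def run(self, input_data: dict[str, Any]) -> dict[str, dict[str, Any]]:
        text: str = input_data["text"]
        language: Optional[str] = input_data.get("language")  # optional field
        summary = text[:50]  # stand-in for real summarization logic
        return {"output": {"summary": summary, "confidence": 1.0}}
```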
7. Use Parameterized Tests
```python
@pytest.mark.parametrize("input_text,expected_lang", [
    ("Hello world", "en"),
    ("Bonjour le monde", "fr"),
    ("Hola mundo", "es"),
    ("こんにちは世界", "ja"),
])
def test_language_detection(skill, input_text, expected_lang):
    result = skill.run({"text": input_text})
    assert result["output"]["detected_language"] == expected_lang
```
Common Pitfalls That Lower Your Score
Avoid these mistakes that commonly drop skills from Gold to Silver or even Bronze:
- No error handling tests — If your skill can raise exceptions, you need tests that prove it raises the right ones
- Hardcoded API keys in tests — The pipeline will flag these as security issues, immediately dropping your static analysis score
- Tests that depend on network access — The sandbox has no internet. Tests that make real API calls will fail.
- Missing conftest.py — Without shared fixtures, you may have test isolation issues that cause flaky results
- No assertion messages — While not required, assertion messages help the pipeline generate better feedback when tests fail
- Testing implementation details — Test behavior, not internals. If a refactor breaks your tests but not your interface, the tests are too tightly coupled.
Running the Verification Locally
Before submitting, run the verification checks locally to catch issues early:
```shell
# Run tests with coverage
pytest tests/ --cov=my_skill --cov-report=term-missing --cov-branch -v

# Run the local verification check
agentnode verify --local

# Preview your score
agentnode verify --dry-run
```
The `--dry-run` flag simulates the full pipeline locally and shows your estimated score. Fix any issues before moving on to the publish and verify your skill page.
After Verification: Next Steps
Once your tests pass and you have a Gold-tier score, you are ready to publish. The publish your ANP package tutorial walks through the full submission process, including metadata, versioning, and release notes.
Strong tests are not just about verification scores — they protect your users, make your skill maintainable, and build the kind of trust that drives adoption. Every Gold-tier badge represents a commitment to quality that the entire AgentNode ecosystem benefits from.
Frequently Asked Questions
How do I write tests for agent skills?
Write three categories of tests: unit tests for internal functions, integration tests for the skill's run() interface, and edge case tests for boundary conditions and error handling. Use pytest as your test framework, mock external dependencies with dependency injection or monkeypatch, and aim for 90% or higher code coverage. Run the local verification check with agentnode verify --local before submitting.
What is a good verification score?
A score of 75 or above earns Silver tier, which is considered good. A score of 90 or above earns Gold tier, which is excellent and significantly increases your skill's visibility and download count. The average published skill scores 72, so anything above 80 puts you in the top quartile.
How do I reach Gold tier?
Gold tier requires 90+ verification score. Focus on four areas: comprehensive test coverage (90%+ lines, 75%+ branches), edge case tests for every input field, fast test execution (under 60 seconds total), and clean static analysis with no security warnings. Use parameterized tests to efficiently cover multiple input variations.
Do I need tests to publish?
Tests are not strictly required to publish — you can submit a skill with zero tests and receive a Bronze tier rating if it passes schema validation and static analysis. However, skills without tests are limited to Bronze tier, receive lower search rankings, and are far less likely to be installed by developers who rely on verification badges for trust decisions.