
Building Agent Skills Tests: Maximize Your Verification Score

Learn how to write tests that maximize your AgentNode verification score. Covers test structure, unit vs integration tests, mock strategies, and tips for reaching Gold tier.

By agentnode

Your verification score determines whether developers trust and install your agent skill. A low score means fewer downloads, lower search rankings, and a Bronze badge that signals uncertainty. A high score — Silver or Gold — tells the world that your tool works as advertised, handles edge cases, and will not break in production.

The difference between Bronze and Gold often comes down to one thing: the quality of your tests. This guide shows you exactly how to write tests that maximize your AgentNode verification score, avoid common pitfalls, and reach Gold tier on your first submission.

How the Verification Pipeline Works

Before writing tests, you need to understand what the verification pipeline actually checks. When you submit a skill to AgentNode, the pipeline runs through several stages:

  1. Schema validation — Checks that your skill's manifest, input schema, and output schema conform to the ANP specification
  2. Static analysis — Scans your code for known vulnerability patterns, unsafe imports, and resource usage
  3. Test execution — Runs your test suite in a sandboxed environment and measures pass rate, coverage, and execution time
  4. Behavioral verification — Tests your skill against the declared input/output schemas with generated inputs
  5. Trust scoring — Aggregates results into a final score that determines your trust tier

Each stage contributes to your overall score. Tests directly affect stages 3 and 4, and indirectly affect stage 5. The testing guide for agent skills covers each stage in detail.

Trust Tiers Explained

AgentNode assigns one of three trust tiers based on your verification score:

  • Bronze (60-74) — Basic verification passed. Schema is valid, no critical security issues, but test coverage or quality may be low.
  • Silver (75-89) — Good verification. Solid test coverage, no security issues, behavioral tests pass.
  • Gold (90-100) — Excellent verification. Comprehensive tests, edge cases covered, fast execution, no warnings.

Gold-tier skills get higher search rankings, a prominent badge, and significantly more downloads. Understanding why verification matters is the first step toward building skills that developers actually trust.
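The tier boundaries above are easy to encode. As a sketch, the thresholds (60 / 75 / 90) come straight from the table; the function name and the "Unverified" label for scores below 60 are illustrative assumptions, not part of any AgentNode API:

```python
def trust_tier(score: int) -> str:
    """Map a verification score (0-100) to its trust tier.

    Thresholds follow the tier table above; "Unverified" for
    sub-60 scores is an assumption for illustration.
    """
    if score >= 90:
        return "Gold"
    if score >= 75:
        return "Silver"
    if score >= 60:
        return "Bronze"
    return "Unverified"
```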

Test Structure: The Foundation

Every agent skill should have three categories of tests. Here is the recommended file structure:

my-skill/
├── src/
│   └── skill.py
├── tests/
│   ├── __init__.py
│   ├── test_unit.py          # Unit tests for internal functions
│   ├── test_integration.py   # Integration tests for the skill interface
│   ├── test_edge_cases.py    # Edge cases and error handling
│   └── conftest.py           # Shared fixtures
├── manifest.yaml
└── pyproject.toml

Unit Tests

Unit tests verify individual functions and methods in isolation. They should be fast, deterministic, and cover your core logic:

import pytest
from my_skill.skill import parse_input, transform_data, format_output

class TestParseInput:
    def test_valid_string_input(self):
        result = parse_input({"text": "Hello world"})
        assert result.text == "Hello world"
        assert result.language is None  # default

    def test_input_with_all_options(self):
        result = parse_input({
            "text": "Hello world",
            "language": "en",
            "max_length": 100
        })
        assert result.text == "Hello world"
        assert result.language == "en"
        assert result.max_length == 100

    def test_empty_text_raises(self):
        with pytest.raises(ValueError, match="text cannot be empty"):
            parse_input({"text": ""})

    def test_missing_required_field_raises(self):
        with pytest.raises(KeyError):
            parse_input({})

Integration Tests

Integration tests verify your skill's public interface — the run() method that AgentNode calls. These are the most important tests for your verification score:

import pytest
from my_skill import MySkill

class TestSkillIntegration:
    @pytest.fixture
    def skill(self):
        return MySkill()

    def test_basic_execution(self, skill):
        result = skill.run({"text": "The quick brown fox jumps over the lazy dog."})
        assert "output" in result
        assert isinstance(result["output"], dict)
        assert "summary" in result["output"]

    def test_output_matches_schema(self, skill):
        result = skill.run({"text": "Sample text for analysis."})
        output = result["output"]
        # Verify all declared output fields are present
        assert isinstance(output["summary"], str)
        assert isinstance(output["confidence"], float)
        assert 0.0 <= output["confidence"] <= 1.0

    def test_execution_time(self, skill):
        import time
        start = time.time()
        skill.run({"text": "Performance test input."})
        elapsed = time.time() - start
        assert elapsed < 5.0, "Skill execution should complete within 5 seconds"

Edge Case Tests

Edge case tests are what separate Silver from Gold. The verification pipeline specifically checks for these:

class TestEdgeCases:
    # The `skill` fixture here should live in conftest.py so it is
    # shared across test files, not scoped to a single test class.
    def test_very_long_input(self, skill):
        long_text = "word " * 10000
        result = skill.run({"text": long_text})
        assert result["output"]["summary"] is not None

    def test_unicode_input(self, skill):
        result = skill.run({"text": "日本語テスト 中文测试 한국어 테스트"})
        assert result["output"]["summary"] is not None

    def test_special_characters(self, skill):
        result = skill.run({"text": "Hello <script>alert('xss')</script> & 'quotes'"})
        assert "<script>" not in result["output"]["summary"]

    def test_whitespace_only(self, skill):
        with pytest.raises(ValueError):
            skill.run({"text": "   \n\t  "})

    def test_none_input(self, skill):
        with pytest.raises((TypeError, ValueError)):
            skill.run({"text": None})

    def test_numeric_string_input(self, skill):
        result = skill.run({"text": "12345"})
        assert result["output"]["summary"] is not None

Mock Strategies for API-Dependent Tools

Many agent skills depend on external APIs — language models, databases, third-party services. The verification pipeline runs in a sandboxed environment without network access, so you must mock these dependencies.

Strategy 1: Dependency Injection

class MySkill:
    def __init__(self, llm_client=None):
        self._llm = llm_client or DefaultLLMClient()

    def run(self, input_data):
        response = self._llm.complete(input_data["text"])
        return {"output": {"summary": response.text}}

# In tests:
class MockResponse:
    def __init__(self, text: str):
        self.text = text

class MockLLMClient:
    def complete(self, text):
        return MockResponse(text=f"Summary of: {text[:50]}")

def test_with_mock():
    skill = MySkill(llm_client=MockLLMClient())
    result = skill.run({"text": "Test input"})
    assert "Summary of:" in result["output"]["summary"]

Strategy 2: pytest Fixtures with monkeypatch

import pytest

@pytest.fixture
def mock_api(monkeypatch):
    def mock_call(self, prompt):
        return {"text": "Mocked response", "tokens": 10}
    monkeypatch.setattr("my_skill.api_client.APIClient.call", mock_call)

def test_skill_with_mocked_api(skill, mock_api):
    result = skill.run({"text": "Test"})
    assert result["output"]["summary"] == "Mocked response"

Strategy 3: Response Fixtures

# tests/fixtures/api_responses.json
{
  "summarize_short": {"text": "Brief summary.", "tokens": 5},
  "summarize_long": {"text": "Detailed summary with multiple sentences.", "tokens": 25}
}

# conftest.py
import json
from pathlib import Path

import pytest

@pytest.fixture
def api_responses():
    fixture_path = Path(__file__).parent / "fixtures" / "api_responses.json"
    with open(fixture_path) as f:
        return json.load(f)

What the Verification Pipeline Checks

Here is a detailed breakdown of what contributes to your score:

Check                Weight   What It Measures
Test pass rate       25%      Percentage of tests that pass
Code coverage        20%      Lines and branches covered by tests
Edge case coverage   15%      Tests for boundary conditions, error handling
Schema conformance   15%      Output matches declared schema for all inputs
Execution time       10%      Tests complete within time budget
Static analysis      10%      No security issues, clean code patterns
Documentation        5%       Docstrings, type hints, README quality

Tips for Reaching Gold Tier

Based on analysis of hundreds of Gold-tier skills, here are the patterns that consistently produce high scores:

1. Aim for 90%+ Code Coverage

Gold tier requires at least 85% line coverage and 75% branch coverage. Use pytest --cov=my_skill --cov-report=term-missing --cov-branch to identify gaps in both.

2. Test Every Input Schema Field

For each field in your input schema, write tests for: valid values, boundary values, invalid types, missing/null values, and extreme values.

3. Test Every Output Schema Field

Verify that every declared output field is present and correctly typed in every test that exercises the run() method.

4. Include Negative Tests

The pipeline specifically looks for tests that verify error behavior. Every pytest.raises block contributes to your edge case coverage score.

5. Keep Tests Fast

The pipeline has a 60-second timeout for your entire test suite. If you are mocking correctly, this is generous — but large fixture files or expensive setup can eat into it.
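Per-test setup is a common culprit. A session-scoped fixture, sketched below with a hypothetical fixture file name, loads expensive data once for the whole suite:

```python
import json
from pathlib import Path

import pytest

# scope="session" caches the return value, so the file is read once per
# test run instead of once per test. `large_corpus.json` is a placeholder.
@pytest.fixture(scope="session")
def large_corpus():
    path = Path(__file__).parent / "fixtures" / "large_corpus.json"
    return json.loads(path.read_text())
```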

6. Add Type Hints Everywhere

Type hints on your skill's public interface contribute to the documentation score and help the behavioral verification stage generate better test inputs.
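A sketch of a fully annotated interface using TypedDict; the field names mirror the earlier examples and should be adjusted to your declared schemas:

```python
from typing import Any, TypedDict

# Typed output shapes: field names follow the earlier examples.
class SkillOutput(TypedDict):
    summary: str
    confidence: float

class SkillResult(TypedDict):
    output: SkillOutput

class MySkill:
    def run(self, input_data: dict[str, Any]) -> SkillResult:
        """Execute the skill against validated input."""
        raise NotImplementedError
```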

7. Use Parameterized Tests

@pytest.mark.parametrize("input_text,expected_lang", [
    ("Hello world", "en"),
    ("Bonjour le monde", "fr"),
    ("Hola mundo", "es"),
    ("こんにちは世界", "ja"),
])
def test_language_detection(skill, input_text, expected_lang):
    result = skill.run({"text": input_text})
    assert result["output"]["detected_language"] == expected_lang

Common Pitfalls That Lower Your Score

Avoid these mistakes that commonly drop skills from Gold to Silver or even Bronze:

  • No error handling tests — If your skill can raise exceptions, you need tests that prove it raises the right ones
  • Hardcoded API keys in tests — The pipeline will flag these as security issues, immediately dropping your static analysis score
  • Tests that depend on network access — The sandbox has no internet. Tests that make real API calls will fail.
  • Missing conftest.py — Without shared fixtures, you may have test isolation issues that cause flaky results
  • No assertion messages — While not required, assertion messages help the pipeline generate better feedback when tests fail
  • Testing implementation details — Test behavior, not internals. If a refactor breaks your tests but not your interface, the tests are too tightly coupled.
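On the assertion-message point, the cost is one f-string. A minimal illustration (the function is hypothetical):

```python
def word_count(text: str) -> int:
    return len(text.split())

# A bare assert reports only the mismatched values; the message below
# also tells the pipeline (and future you) what the test intended.
def test_word_count():
    result = word_count("the quick brown fox")
    assert result == 4, f"expected 4 words, got {result}"
```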

Running the Verification Locally

Before submitting, run the verification checks locally to catch issues early:

# Run tests with coverage
pytest tests/ --cov=my_skill --cov-report=term-missing --cov-branch -v

# Run the local verification check
agentnode verify --local

# Preview your score
agentnode verify --dry-run

The --dry-run flag simulates the full pipeline locally and shows your estimated score. Fix any issues before submitting to the publish and verify your skill page.

After Verification: Next Steps

Once your tests pass and you have a Gold-tier score, you are ready to publish. The publish your ANP package tutorial walks through the full submission process, including metadata, versioning, and release notes.

Strong tests are not just about verification scores — they protect your users, make your skill maintainable, and build the kind of trust that drives adoption. Every Gold-tier badge represents a commitment to quality that the entire AgentNode ecosystem benefits from.

Frequently Asked Questions

How to write tests for agent skills?

Write three categories of tests: unit tests for internal functions, integration tests for the skill's run() interface, and edge case tests for boundary conditions and error handling. Use pytest as your test framework, mock external dependencies with dependency injection or monkeypatch, and aim for 90% or higher code coverage. Run the local verification check with agentnode verify --local before submitting.

What is a good verification score?

A score of 75 or above earns Silver tier, which is considered good. A score of 90 or above earns Gold tier, which is excellent and significantly increases your skill's visibility and download count. The average published skill scores 72, so anything above 80 puts you in the top quartile.

How to reach Gold tier?

Gold tier requires a 90+ verification score. Focus on four areas: comprehensive test coverage (at least 85% lines and 75% branches; aim for 90%+), edge case tests for every input field, fast test execution (under 60 seconds total), and clean static analysis with no security warnings. Use parameterized tests to efficiently cover multiple input variations.

Do I need tests to publish?

Tests are not strictly required to publish — you can submit a skill with zero tests and receive a Bronze tier rating if it passes schema validation and static analysis. However, skills without tests are limited to Bronze tier, receive lower search rankings, and are far less likely to be installed by developers who rely on verification badges for trust decisions.
