Understanding Package Verification: How AgentNode Trust Scores Work
A deep dive into AgentNode's 4-step verification pipeline, score breakdown, tier system, inconclusive handling, and how to interpret what a verification score actually means.
Why Verification Matters for AI Agent Tools
When you install a library from PyPI or npm, you're largely trusting the publisher. There's no automated check that the code does what it claims, that the declared functions are actually callable, or that the package won't crash at runtime. For traditional libraries, this is a manageable risk—you read the README, check the star count, and test it yourself.
AI agent tools are different. An LLM calls these tools autonomously, often without human review of each invocation. If a tool claims to "extract text from PDFs" but silently fails on every input, your agent will produce garbage without anyone noticing until downstream results are wrong. Verification is how AgentNode ensures that tools actually work before they reach your agent.
The 4-Step Verification Pipeline
Every package version published to AgentNode goes through a four-step automated verification pipeline. Each step builds on the previous one, and the results feed into the final trust score.
Step 1: Install (15 points)
The first step creates an isolated Python virtual environment and installs the package using pip install. This verifies that:
- The package is a valid Python distribution (correct setup.py or pyproject.toml)
- All declared dependencies can be resolved and installed
- No installation errors occur (compilation failures, version conflicts, etc.)
If installation fails, verification stops immediately. A package that can't install can't be verified further, and it receives a score of 0.
The install step records timing data. A package that installs in 2 seconds versus one that takes 45 seconds tells you something about dependency weight, which the score explanation surfaces.
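The install step can be sketched as a small helper that creates a fresh virtual environment, runs pip, and records the outcome and timing. This is an illustrative reconstruction, not AgentNode's actual implementation; the function name and result shape are assumptions.

```python
import subprocess
import time
import venv


def run_install_step(package_dir: str, env_dir: str) -> dict:
    """Create an isolated venv and try to pip-install the package.

    Hypothetical sketch of the install step: returns pass/fail,
    the 15-point award, timing data, and any error output.
    """
    venv.create(env_dir, with_pip=True)
    pip = f"{env_dir}/bin/pip"  # POSIX layout assumed

    start = time.monotonic()
    proc = subprocess.run(
        [pip, "install", package_dir],
        capture_output=True, text=True,
    )
    elapsed = time.monotonic() - start

    return {
        "passed": proc.returncode == 0,   # a failed install ends verification
        "points": 15 if proc.returncode == 0 else 0,
        "install_seconds": round(elapsed, 2),
        "error": proc.stderr if proc.returncode else None,
    }
```

A failed result here short-circuits the pipeline, matching the "score of 0" rule above.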
Step 2: Import (15 points)
After installation, the pipeline tries to import every tool declared in the package manifest. For each tool, it checks that:
- The module path is importable (e.g., import mypackage.tools succeeds)
- The declared function exists in the module
- The function is callable
- The function signature roughly matches the declared input schema
This step also performs automatic schema introspection. If a tool's manifest declares an incomplete input schema, the import step inspects the function signature and generates a corrected schema. This auto-generated schema is used by subsequent steps to construct valid test inputs.
The import step catches a surprising number of issues: typos in entrypoint declarations, missing __init__.py files, and import-time side effects that crash before the function is ever called.
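The checks above can be approximated with the standard library's importlib and inspect modules. This is a minimal sketch, assuming a simplified result shape; the real pipeline's schema introspection is richer.

```python
import importlib
import inspect


def run_import_check(module_path: str, func_name: str) -> dict:
    """Import one declared tool and verify it is a callable function.

    Also introspects the signature so later steps can build test inputs.
    Hypothetical helper; return shape is illustrative.
    """
    try:
        module = importlib.import_module(module_path)  # e.g. "mypackage.tools"
    except ImportError as exc:
        return {"passed": False, "reason": f"import failed: {exc}"}

    func = getattr(module, func_name, None)
    if func is None:
        return {"passed": False, "reason": "declared function not found"}
    if not callable(func):
        return {"passed": False, "reason": "declared name is not callable"}

    # Derive a minimal schema from the signature for the smoke step.
    params = {
        name: (p.annotation if p.annotation is not inspect.Parameter.empty
               else "any")
        for name, p in inspect.signature(func).parameters.items()
    }
    return {"passed": True, "schema": params}
```

Note how an import-time crash surfaces here as a failed check rather than crashing the verifier itself.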
Step 3: Smoke Test (25 points)
The smoke test is the most sophisticated step. It actually executes each tool with generated test inputs and analyzes the results. This is where verification goes beyond "does it import?" to "does it do something useful?"
For each tool, the smoke step:
- Generates test candidates. Using the tool's input schema, the pipeline generates plausible test inputs. It understands common parameter patterns: if a field is called file_path, it creates a stub file; if a field has enum constraints, it picks valid values.
- Executes the tool. Each candidate is run in the verification sandbox with network access blocked. The pipeline captures the return value, its type, whether it's JSON-serializable, and any exceptions thrown.
- Classifies the result. The pipeline categorizes the outcome: passed (returned a valid result), failed (threw an unexpected exception), or inconclusive (the result couldn't be determined due to sandbox limitations).
The smoke test awards up to 25 points for a clean pass. But this is where AgentNode's scoring gets nuanced—not every failure is the package's fault.
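The passed/failed/inconclusive classification can be sketched as follows. The keyword matching here is a crude stand-in for illustration; the real classifier is more sophisticated.

```python
from typing import Optional


def classify_smoke_result(exc: Optional[Exception]) -> str:
    """Classify a single smoke-test run as passed / failed / inconclusive.

    Illustrative heuristic only: real boundary detection uses more
    signals than exception-message keywords.
    """
    if exc is None:
        return "passed"

    message = str(exc).lower()
    # Authentication failures point at a credential boundary, not a bug.
    if any(k in message for k in ("api key", "unauthorized", "401", "credential")):
        return "inconclusive"
    # Blocked network access is a sandbox limitation, not a package defect.
    if any(k in message for k in ("connection refused", "network", "name resolution")):
        return "inconclusive"
    return "failed"
```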
Step 4: Tests (15 points)
The final step runs the package's test suite using pytest. This checks whether the publisher included their own tests and whether those tests pass in the verification environment.
The scoring here distinguishes between four cases:
- Publisher-provided tests pass (15 points): The best outcome. The publisher wrote tests and they work, which is strong evidence the package is well-maintained.
- Auto-generated tests only (8 points): If the publisher didn't include tests, AgentNode generates basic tests from the manifest. Passing these is positive but less convincing than publisher tests.
- No tests provided (3 points): The package has no test directory at all. A small number of points are still awarded because the package passed the earlier steps.
- Tests fail (0 points): Tests exist but fail. This suggests quality issues.
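The four cases map directly to point values, which can be expressed as a small lookup function (a hypothetical helper mirroring the documented numbers):

```python
def score_test_step(has_tests: bool, tests_passed: bool,
                    auto_generated: bool) -> int:
    """Map the test-step outcome to points, per the four cases above."""
    if not has_tests:
        return 3           # no test directory at all
    if not tests_passed:
        return 0           # tests exist but fail: quality concern
    return 8 if auto_generated else 15  # publisher tests beat generated ones
```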
Score Breakdown: The Full Picture
The total score ranges from 0 to 100. Beyond the four pipeline steps (which account for 70 points), additional quality metrics contribute up to 25 more points, with possible deductions:
Contract Validation (10 points)
Contract validation checks the smoke test output for quality signals:
- Non-None return: The tool returned something (not None)
- JSON-serializable: The output can be serialized to JSON, which matters for agent frameworks that pass tool results as messages
- Type plausibility: The return type exists and isn't NoneType
- Structure plausibility: If the return is a dict, it has keys; if a string, it's not empty
The contract validator also runs light semantic checks. For example, a tool named summarize_text that returns an output longer than its input gets flagged (though this is never fatal—it just reduces the contract score by a small amount).
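The contract checks can be sketched as an additive scorer. The per-check point split below is an assumption; the source only fixes the 10-point total.

```python
import json


def score_contract(output) -> int:
    """Award up to 10 points based on the contract checks listed above.

    Hypothetical weights: non-None (3), serializable (3),
    plausible type (2), plausible structure (2).
    """
    points = 0
    if output is not None:
        points += 3                       # non-None return
    try:
        json.dumps(output)
        points += 3                       # JSON-serializable
    except (TypeError, ValueError):
        pass
    if type(output).__name__ != "NoneType":
        points += 2                       # plausible return type
    # Structure plausibility: dicts need keys, strings need content.
    if (isinstance(output, dict) and output) or (isinstance(output, str) and output):
        points += 2
    elif not isinstance(output, (dict, str)) and output is not None:
        points += 2                       # other non-None types pass this check
    return min(points, 10)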
Reliability (10 points)
Reliability measures consistency across multiple runs. The pipeline executes the same test input three times and checks how many runs succeed. A tool that passes 3/3 times gets full points. One that passes 2/3 times gets partial credit. A flaky tool that only works sometimes scores lower, even if it occasionally produces correct output.
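A sketch of the repeat-run check, assuming a linear split of the 10-point budget (the exact partial-credit values are not documented):

```python
def score_reliability(runner, test_input, runs: int = 3) -> int:
    """Run the same input several times and award points for consistency.

    `runner` is any callable executing the tool; an exception counts
    as a failed run. 3/3 -> 10 points, 2/3 -> 7, 1/3 -> 3, 0/3 -> 0.
    """
    successes = 0
    for _ in range(runs):
        try:
            runner(test_input)
            successes += 1
        except Exception:
            pass
    return round(10 * successes / runs)
```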
Determinism (5 points)
Determinism checks whether the tool produces the same output given the same input. The pipeline hashes the output of each run and compares them. A fully deterministic tool (same hash every time) gets 5 points. Non-deterministic tools (different outputs each run) get partial credit.
Note that non-determinism isn't necessarily bad—tools that fetch live data or use timestamps will naturally produce different outputs. The score reflects this as a data point, not a judgment.
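The hash-and-compare check is straightforward to sketch; the 2-point partial credit for non-deterministic tools is an assumed value:

```python
import hashlib
import json


def score_determinism(runner, test_input, runs: int = 3) -> int:
    """Hash each run's output and award full points only if all match.

    Outputs are canonicalized via sorted-key JSON before hashing so
    dict ordering doesn't cause false non-determinism.
    """
    hashes = set()
    for _ in range(runs):
        output = runner(test_input)
        blob = json.dumps(output, sort_keys=True, default=str)
        hashes.add(hashlib.sha256(blob.encode()).hexdigest())
    return 5 if len(hashes) == 1 else 2   # partial credit is an assumption
```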
Warning Deductions (up to -10 points)
The pipeline counts deprecation warnings, runtime warnings, and other diagnostic messages emitted during tool execution. Each warning deducts 2 points, up to a maximum deduction of 10 points. This penalizes packages that rely on deprecated APIs or produce noisy output.
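The deduction rule is simple enough to state in one line:

```python
def warning_deduction(warning_count: int) -> int:
    """Each warning costs 2 points, capped at a 10-point total deduction."""
    return -min(2 * warning_count, 10)
```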
The Tier System
The numeric score maps to a tier that provides a quick summary of the package's verification status:
- Gold (score 90+): The highest tier. The package passed all four verification steps, has valid contracts, high reliability, and no significant warnings. Gold requires real execution (not mocked or limited mode), passed smoke tests, and at least 90% reliability.
- Verified (score 70-89): The package works correctly in the verification environment. It may have minor issues like missing publisher tests or slight non-determinism, but it functions as declared.
- Partial (score 50-69): The package installs and imports correctly but has limited runtime verification. This commonly happens with tools that require API credentials or system dependencies that aren't available in the sandbox.
- Unverified (score below 50): The package failed critical verification steps. It may not install, import, or function correctly.
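The score-to-tier mapping above, before any caps or floors are applied, is a plain threshold function:

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 verification score to its base tier."""
    if score >= 90:
        return "Gold"
    if score >= 70:
        return "Verified"
    if score >= 50:
        return "Partial"
    return "Unverified"
```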
Tier Caps: Hard Rules That Override Scores
The tier system includes hard caps that can demote a package's tier regardless of its numeric score. These enforce logical constraints:
- Smoke not passed → maximum Verified. A package can never reach Gold if its smoke test didn't pass cleanly. Even if the score is 92 due to strong install, import, and test results, the tier caps at Verified.
- Credential boundary → maximum Partial. If the smoke test reached a credential boundary (the tool tried to call an API that requires authentication), the tier is capped at Partial. The tool might work perfectly with valid credentials, but verification can't confirm that.
- Mock mode → maximum Partial. Packages verified in mock mode (simulated execution) are capped at Partial, since the real execution path wasn't tested.
- Limited mode → maximum Verified. Packages verified with limited capabilities are capped at Verified.
Tier Floors: Generous Minimums for Legitimate Limitations
Conversely, the system includes floors that prevent unfairly low tiers. If a package installs and imports correctly but hits a sandbox limitation (needs credentials, needs a system dependency, needs binary input, or network was blocked), it receives at least a Partial tier. This prevents a well-coded API wrapper from being labeled "Unverified" just because the sandbox doesn't have its API key.
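The caps and floor can be combined into one adjustment pass over the base tier. This sketch assumes boolean flags for each condition; the real pipeline's inputs are richer.

```python
TIER_ORDER = ["Unverified", "Partial", "Verified", "Gold"]


def apply_caps_and_floors(tier: str, *, smoke_passed: bool,
                          credential_boundary: bool, mock_mode: bool,
                          limited_mode: bool, sandbox_limited: bool,
                          installed_and_imported: bool) -> str:
    """Apply the hard caps and the floor described above to a base tier."""
    idx = TIER_ORDER.index(tier)

    # Caps: demote regardless of numeric score.
    if not smoke_passed:
        idx = min(idx, TIER_ORDER.index("Verified"))
    if credential_boundary or mock_mode:
        idx = min(idx, TIER_ORDER.index("Partial"))
    if limited_mode:
        idx = min(idx, TIER_ORDER.index("Verified"))

    # Floor: a sandbox limitation on a working package earns at least Partial.
    if sandbox_limited and installed_and_imported:
        idx = max(idx, TIER_ORDER.index("Partial"))

    return TIER_ORDER[idx]
```

Note that caps are applied before the floor, so a package that both hits a cap and qualifies for the floor lands at Partial.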
Inconclusive Results: The Honest Middle Ground
One of the most important design decisions in AgentNode's verification is the concept of inconclusive results. Not every verification outcome is binary pass/fail. Many real-world tools can't be fully tested in an automated sandbox:
- Credential boundary reached: The tool tried to authenticate with an external API. The sandbox doesn't have credentials, so the tool failed—but the failure was an authentication error, not a code bug. High-confidence credential boundary detection awards 15 points; medium-confidence awards 12.
- Missing system dependency: The tool requires a system library (like ffmpeg or poppler) that isn't in the verification container. This is a sandbox limitation, not a package defect. Awards 12 points.
- Needs binary input: The tool expects a specific binary file format (like a real PDF or image) that stub files can't adequately simulate. Awards 12 points.
- External network blocked: The tool needs network access that the sandbox blocks for security. Awards 12 points.
These inconclusive reasons are tracked and displayed to users. When you see a package with a "Partial" tier and a note about credential boundary, you know it probably works fine—AgentNode just couldn't prove it without credentials.
Confidence Levels
Each verification result also carries a confidence level (high, medium, or low) that indicates how much you should trust the score. Confidence is computed from multiple signals:
- Positive signals: Smoke test passed (+2), contract valid (+1), reliability ≥ 90% (+1), publisher tests passed (+1)
- Negative signals: Smoke inconclusive (-2), credential boundary hit (-1)
A high-confidence result (a net signal score of 4 or more) means the score is well-supported by evidence. A low-confidence result means the pipeline couldn't collect enough evidence to be sure—the score might be misleadingly high or low.
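The signal weights above can be combined as follows. The +2/+1/-2/-1 weights come from the text; the medium/low cutoff is an assumed threshold for illustration.

```python
def confidence_level(*, smoke_passed: bool, contract_valid: bool,
                     reliability: float, publisher_tests_passed: bool,
                     smoke_inconclusive: bool, credential_boundary: bool) -> str:
    """Combine verification signals into a high/medium/low label."""
    score = 0
    score += 2 if smoke_passed else 0            # strongest positive signal
    score += 1 if contract_valid else 0
    score += 1 if reliability >= 0.9 else 0
    score += 1 if publisher_tests_passed else 0
    score -= 2 if smoke_inconclusive else 0
    score -= 1 if credential_boundary else 0

    if score >= 4:
        return "high"
    if score >= 1:      # assumed cutoff, not documented
        return "medium"
    return "low"
```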
Sandbox Isolation
All verification runs in an isolated environment. In container mode, each package gets its own Docker container with --network=none for smoke tests, preventing packages from making network calls during the runtime test phase. This means:
- A package can't exfiltrate data during verification
- A package can't download additional code at runtime
- Side effects are contained to the disposable container
Network access is allowed only during the install step (because pip install needs to download dependencies). After that, the sandbox locks down.
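The isolation described above can be sketched as a command builder for the smoke phase. The image name is hypothetical and the real invocation likely includes additional hardening flags (resource limits, read-only filesystem, and so on):

```python
def smoke_container_argv(image: str, command: list[str]) -> list[str]:
    """Build a docker invocation for a disposable, network-isolated run."""
    return [
        "docker", "run",
        "--rm",               # container is disposable: side effects discarded
        "--network=none",     # block all network calls during runtime tests
        image,
        *command,
    ]
```

The resulting argv can be handed to subprocess.run; the install step would run separately without --network=none so pip can download dependencies.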
How to Interpret a Verification Score
When evaluating a package's verification results, consider the full picture:
- Check the tier first. Gold and Verified packages have demonstrated functionality. Partial packages likely work but couldn't be fully tested. Unverified packages have real issues.
- Read the explanation. Each result includes a human-readable explanation like "Package installs and imports correctly. Requires API credentials for full verification. No custom tests provided." This tells you more than the number alone.
- Look at the breakdown. The per-step scores show exactly where points were earned or lost. A package with 15/15 install, 15/15 import, 12/25 smoke (credential boundary), 3/15 tests (no tests) tells a clear story: the code is solid but the publisher didn't provide tests and the tool needs API keys to fully verify.
- Consider confidence. A high-confidence score of 75 is more trustworthy than a low-confidence score of 85. The confidence level tells you how much evidence backs the score.
- Check the smoke reason. If the smoke test was inconclusive, the reason matters. "Credential boundary" is very different from "fatal runtime error." The former suggests the tool would work with proper credentials; the latter suggests a real bug.
What Publishers Can Do to Improve Scores
If you're publishing packages to AgentNode, here's how to maximize your verification score:
- Include tests. Publisher-provided tests that pass are worth 15 points (versus 8 for auto-generated or 3 for no tests). This is the easiest way to improve your score.
- Declare accurate input schemas. The smoke test generates inputs from your schema. A well-defined schema with correct types and clear field names produces better test candidates.
- Handle missing credentials gracefully. Instead of crashing with a traceback when an API key is missing, raise a clear authentication error. This helps the verifier correctly classify the result as "credential boundary" rather than "runtime error."
- Minimize warnings. Each deprecation or runtime warning costs 2 points. Update deprecated API calls and suppress noisy but harmless warnings.
- Return serializable, non-None values. Tools should return JSON-serializable data. Returning None on success costs contract points. Return a dict or string with meaningful content.
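The credential-handling advice can be illustrated with a tool that fails fast with an explicit authentication error. The exception class, function, and environment variable names are hypothetical; any clear auth error achieves the same effect.

```python
import os


class MissingCredentialsError(RuntimeError):
    """Raised when the tool needs an API key that isn't configured."""


def search_web(query: str) -> dict:
    """Example tool that checks credentials before doing any work."""
    # Fail fast with an explicit auth error instead of a deep traceback,
    # so the verifier can classify this as a credential boundary.
    api_key = os.environ.get("SEARCH_API_KEY")
    if not api_key:
        raise MissingCredentialsError(
            "SEARCH_API_KEY is not set; this tool requires API credentials"
        )
    # ... real API call would go here ...
    return {"query": query, "results": []}
```

On success the tool returns a non-empty, JSON-serializable dict, which also satisfies the contract checks described earlier.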
Verification Is Per-Version
An important detail: verification happens per version, not per package. When you publish version 1.2.0, it gets its own verification run independent of version 1.1.0. This means a bug introduced in a new version will be caught even if the previous version had a Gold score. It also means fixing a bug and publishing a new version will update the verification results without carrying over the old failures.
Summary
AgentNode's verification pipeline checks that packages install correctly, tools import as declared, runtime execution produces valid results, and tests (if provided) pass. The resulting score (0-100) and tier (Gold, Verified, Partial, Unverified) give you a data-driven way to evaluate package quality. Inconclusive results for sandbox limitations are handled honestly with partial credit and clear explanations, so you always know exactly what was tested and what wasn't.