PDF Extractor Pack

★ Trusted · v1.0.0 · MIT · ✔ Verified · 88

by AgentNode · published 22 days ago · toolpack

Extract text, tables, and images from PDF documents.

Parse PDF files to extract structured text, data tables, embedded images, and metadata. Supports OCR for scanned documents via pytesseract.

langchain · crewai · generic

Quick Start

bash

agentnode install pdf-extractor-pack

Usage

From package

python

from pdf_extractor_pack.tool import run

result = run(
    action="extract_pdf",
    file_path="/tmp/q4-earnings-report.pdf",
    extract=["tables", "text"],
    pages="3-8"
)

print(f"Pages processed: {result['page_count']}")
print(f"Tables found: {len(result['tables'])}")

for i, table in enumerate(result["tables"]):
    print(f"\nTable {i+1} (page {table['page']})")
    print(f"  Columns: {table['headers']}")
    for row in table["rows"][:3]:
        print(f"  {row}")
    print(f"  ... {len(table['rows'])} total rows")
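
The loop above prints headers and rows separately. For downstream processing it can help to pair them up. The following is a minimal sketch, assuming each row is a sequence whose cells line up positionally with `table["headers"]` (the listing does not confirm this). `table_to_records` and `sample_table` are hypothetical names used only for illustration.

```python
from typing import Any


def table_to_records(table: dict[str, Any]) -> list[dict[str, Any]]:
    """Pair each row's cells with the table headers.

    Assumption: rows are sequences aligned by position with the headers.
    """
    headers = table["headers"]
    records = []
    for row in table["rows"]:
        # zip() stops at the shorter sequence, so extra cells are dropped
        # and missing cells leave their header out of the record.
        records.append(dict(zip(headers, row)))
    return records


# Hypothetical sample shaped like the tables printed above.
sample_table = {
    "page": 3,
    "headers": ["Quarter", "Revenue"],
    "rows": [["Q3", "1.2M"], ["Q4", "1.5M"]],
}
records = table_to_records(sample_table)
print(records[0])
```

With real output, the same helper could be applied to each entry of `result["tables"]`.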

Verification

high confidence · 88/100 · ✔ Verified

Check	Result	Score
smoke	Returned valid result	+25/25
tests	Auto-generated tests only	+8/15
import	All tools imported successfully	+15/15
install	Installed in 2.5s	+15/15
contract	All contract checks passed	+10/10
warnings	No warnings	0/0
determinism	Output consistency check	+5/5
reliability	3/3 runs passed	+10/10

Package installs and imports correctly. Runtime checks passed.

✔ install: 2.5s
✔ import: 705ms
✔ smoke: 410ms
✔ tests: 1.5s

This package was executed and validated by AgentNode before listing. Install, import, and runtime checks passed.

Python 3.12.3 · ffmpeg · poppler · tesseract · uv

Last verified 18d ago · Runner v2.0.0

Use this when you need to...

›Extract financial tables from quarterly earnings PDF reports
›Pull embedded images and charts from research papers
›Convert multi-page PDF contracts into structured JSON sections
›Extract metadata and document properties from legal filings
›OCR scanned PDF documents that contain no selectable text

README

PDF Extractor Pack

Extract text, tables, and images from PDF documents. Parse structured text, data tables, embedded images, and metadata with optional OCR for scanned documents.

Quick Start

agentnode install pdf-extractor-pack

from pdf_extractor_pack.tool import run

result = run(action="extract_pdf", file_path="/tmp/document.pdf")
print(result["text"])

Usage

Extract Text and Tables

from pdf_extractor_pack.tool import run

result = run(
    action="extract_pdf",
    file_path="/tmp/report.pdf",
    extract=["text", "tables"],
    pages="1-10"
)
print(result["text"])
for table in result["tables"]:
    print(table["headers"], table["rows"][:2])
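
To hand an extracted table to a spreadsheet or another tool, one option is to write it out as CSV. This is a minimal sketch using only the standard library, assuming `headers` is a list of column names and `rows` is a list of row sequences. `table_to_csv` and `sample_table` are hypothetical names, not part of the package.

```python
import csv
import io
from typing import Any


def table_to_csv(table: dict[str, Any]) -> str:
    """Render one extracted table as CSV text.

    Assumption: table["headers"] is a list of column names and
    table["rows"] is a list of row sequences.
    """
    buffer = io.StringIO()
    # lineterminator="\n" avoids the csv module's default "\r\n".
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buffer.getvalue()


# Hypothetical table; the comma in "South, East" forces quoting.
sample_table = {
    "headers": ["Region", "Units"],
    "rows": [["North", 120], ["South, East", 95]],
}
csv_text = table_to_csv(sample_table)
print(csv_text)
```

The `csv` module quotes cells that contain commas, so values such as "South, East" survive a round trip.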

Extract Embedded Images

from pdf_extractor_pack.tool import run

result = run(
    action="extract_pdf",
    file_path="/tmp/brochure.pdf",
    extract=["images"],
    image_output_dir="/tmp/extracted_images/"
)
for img in result["images"]:
    print(f"Saved {img['path']} ({img['width']}x{img['height']})")
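
PDFs often embed small decorative images such as icons and rules. The sketch below filters the image list by pixel size, assuming each entry carries integer `width` and `height` values as the print statement above implies. `filter_images`, the thresholds, and the sample paths are hypothetical.

```python
from typing import Any


def filter_images(
    images: list[dict[str, Any]],
    min_width: int = 200,
    min_height: int = 200,
) -> list[dict[str, Any]]:
    """Keep only images at least min_width x min_height pixels.

    Assumption: each entry has integer "width" and "height" keys.
    """
    return [
        img
        for img in images
        if img["width"] >= min_width and img["height"] >= min_height
    ]


# Hypothetical entries shaped like result["images"] above.
sample_images = [
    {"path": "/tmp/extracted_images/a.png", "width": 1024, "height": 768},
    {"path": "/tmp/extracted_images/b.png", "width": 16, "height": 16},
]
large_images = filter_images(sample_images)
print([img["path"] for img in large_images])
```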

OCR Scanned Documents

from pdf_extractor_pack.tool import run

result = run(
    action="extract_pdf",
    file_path="/tmp/scanned-contract.pdf",
    extract=["text"],
    ocr_fallback=True,
    language="eng"
)
print(result["text"])
print(f"OCR used: {result['ocr_applied']}")
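
If OCR is slow on large files, one possible pattern is to try plain text extraction first and request OCR only when no text comes back. The name `ocr_fallback` suggests the tool may already do this internally, but the listing does not say, so treat this as a sketch under that uncertainty. `extract_text_with_ocr_retry` is a hypothetical helper. It takes the `run` function as an argument, and `fake_run` stands in for it here so the example executes without the package installed.

```python
from typing import Any, Callable


def extract_text_with_ocr_retry(
    run_fn: Callable[..., dict[str, Any]],
    file_path: str,
    language: str = "eng",
) -> dict[str, Any]:
    """Extract text without OCR first, then retry with OCR if empty."""
    result = run_fn(action="extract_pdf", file_path=file_path, extract=["text"])
    if (result.get("text") or "").strip():
        return result
    # No selectable text found: ask for OCR on the second pass.
    return run_fn(
        action="extract_pdf",
        file_path=file_path,
        extract=["text"],
        ocr_fallback=True,
        language=language,
    )


calls: list[dict[str, Any]] = []


def fake_run(**kwargs: Any) -> dict[str, Any]:
    """Stand-in for run(): returns text only when OCR is requested."""
    calls.append(kwargs)
    if kwargs.get("ocr_fallback"):
        return {"text": "Scanned contract text", "ocr_applied": True}
    return {"text": "", "ocr_applied": False}


ocr_result = extract_text_with_ocr_retry(fake_run, "/tmp/scanned-contract.pdf")
print(ocr_result["ocr_applied"], len(calls))
```

In real use, the imported `run` would be passed in place of `fake_run`.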

API Reference

Capability	Description
`extract_pdf`	Extract text, tables, images, and metadata from PDF files with optional OCR

Requirements

No API keys required. All processing runs locally.

License

MIT

Version History

v1.0.0 · latest · verified

3/16/2026

Capabilities

pdf_extraction · extract_pdf · tool

Permissions

This package declares the following access levels. Review before installing.

Network: none
Filesystem: temp
Code Execution: none
Data Access: input_only
User Approval: never

Files (3)

Stats

Downloads: 0
Installs: 0
Version: v1.0.0
Published: 3/16/2026
Channel: stable
Type: toolpack
Entrypoint: pdf_extractor_pack.tool

Compatibility

Frameworks

langchain · crewai · generic

Runtime

python

Python Version

>=3.10
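
A quick way to confirm that an interpreter meets this requirement before installing is sketched below. `meets_minimum` and `MINIMUM_PYTHON` are hypothetical names; only `sys.version_info` comes from the standard library.

```python
import sys

# Minimum version taken from the ">=3.10" requirement listed above.
MINIMUM_PYTHON = (3, 10)


def meets_minimum(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least 3.10."""
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


print(f"Running Python {sys.version_info.major}.{sys.version_info.minor}")
print(f"Meets minimum: {meets_minimum()}")
```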

Trust & Security

Publisher: ★ Trusted
Signature: None
Provenance: None
Security Issues: 0

Publisher

AgentNode

@agentnode
