PDF Extractor Pack

Trusted · v1.0.0 · MIT · Verified 88/100

by AgentNode · published 22 days ago · toolpack

Extract text, tables, and images from PDF documents.

Parse PDF files to extract structured text, data tables, embedded images, and metadata. Supports OCR for scanned documents via pytesseract.

langchain · crewai · generic

Quick Start

bash
agentnode install pdf-extractor-pack

Usage

From package
python
from pdf_extractor_pack.tool import run

result = run(
    action="extract_pdf",
    file_path="/tmp/q4-earnings-report.pdf",
    extract=["tables", "text"],
    pages="3-8"
)

print(f"Pages processed: {result['page_count']}")
print(f"Tables found: {len(result['tables'])}")

for i, table in enumerate(result["tables"]):
    print(f"\nTable {i+1} (page {table['page']})")
    print(f"  Columns: {table['headers']}")
    for row in table["rows"][:3]:
        print(f"  {row}")
    print(f"  ... {len(table['rows'])} total rows")
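
Once tables are extracted, a common next step is writing them to CSV. The following is a minimal sketch, assuming each entry in `result["tables"]` has a `headers` list and `rows` given as lists of cell values in header order (the row type is not documented here). `table_to_csv` and `sample_result` are hypothetical names for illustration, not part of the pack.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table dict ('headers' and 'rows') to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buffer.getvalue()


# Hypothetical sample shaped like the result above; not real extractor output.
sample_result = {
    "page_count": 6,
    "tables": [
        {
            "page": 4,
            "headers": ["Quarter", "Revenue"],
            "rows": [["Q3", "1.2"], ["Q4", "1.5"]],
        },
    ],
}

csv_text = table_to_csv(sample_result["tables"][0])
print(csv_text)
```

The `csv` module handles quoting, so cells that contain commas or quotes stay intact.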

Verification

High confidence · 88/100 · ✔ Verified

smoke: Returned valid result (+25/25)
tests: Auto-generated tests only (+8/15)
import: All tools imported successfully (+15/15)
install: Installed in 2.5s (+15/15)
contract: All contract checks passed (+10/10)
warnings: No warnings (0/0)
determinism: Output consistency check (+5/5)
reliability: 3/3 runs passed (+10/10)

Package installs and imports correctly. Runtime checks passed.

install: 2.5s
import: 705ms
smoke: 410ms
tests: 1.5s

This package was executed and validated by AgentNode before listing. Install, import, and runtime checks passed.

Python 3.12.3 · ffmpeg · poppler · tesseract · uv

Last verified 18d ago · Runner v2.0.0

Use this when you need to...

  • Extract financial tables from quarterly earnings PDF reports
  • Pull embedded images and charts from research papers
  • Convert multi-page PDF contracts into structured JSON sections
  • Extract metadata and document properties from legal filings
  • OCR scanned PDF documents that contain no selectable text

README

PDF Extractor Pack

Extract text, tables, and images from PDF documents. Parse structured text, data tables, embedded images, and metadata with optional OCR for scanned documents.

Quick Start

bash
agentnode install pdf-extractor-pack

python
from pdf_extractor_pack.tool import run

result = run(action="extract_pdf", file_path="/tmp/document.pdf")
print(result["text"])

Usage

Extract Text and Tables

result = run(
    action="extract_pdf",
    file_path="/tmp/report.pdf",
    extract=["text", "tables"],
    pages="1-10"
)
print(result["text"])
for table in result["tables"]:
    print(table["headers"], table["rows"][:2])
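
To hand extracted tables to downstream code as JSON, each row can be paired with the table's headers. This is a sketch under the assumption that `rows` are lists of cells aligned with `headers`. `tables_to_records` and `sample_tables` are hypothetical names, not part of the pack's API.

```python
import json


def tables_to_records(tables):
    """Turn each table into a list of {header: cell} dicts.

    zip() stops at the shorter sequence, so a row with fewer cells than
    headers yields a dict with fewer keys rather than raising.
    """
    return [
        [dict(zip(table["headers"], row)) for row in table["rows"]]
        for table in tables
    ]


# Hypothetical sample shaped like result["tables"]; not real extractor output.
sample_tables = [
    {"headers": ["Item", "Qty"], "rows": [["Widget", "3"], ["Gadget", "5"]]},
]

records = tables_to_records(sample_tables)
print(json.dumps(records, indent=2))
```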

Extract Embedded Images

result = run(
    action="extract_pdf",
    file_path="/tmp/brochure.pdf",
    extract=["images"],
    image_output_dir="/tmp/extracted_images/"
)
for img in result["images"]:
    print(f"Saved {img['path']} ({img['width']}x{img['height']})")
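
PDFs often embed small decorative images (icons, rules, logos) alongside the figures you actually want. Below is a minimal sketch of filtering the returned image records by size, using only the `path`, `width`, and `height` keys shown above. `filter_images`, the size thresholds, and the sample paths are assumptions for illustration.

```python
def filter_images(images, min_width=100, min_height=100):
    """Keep image records that are at least min_width x min_height pixels."""
    return [
        img
        for img in images
        if img["width"] >= min_width and img["height"] >= min_height
    ]


# Hypothetical records shaped like result["images"]; file names are made up.
sample_images = [
    {"path": "/tmp/extracted_images/img-1.png", "width": 640, "height": 480},
    {"path": "/tmp/extracted_images/img-2.png", "width": 16, "height": 16},
]

kept = filter_images(sample_images)
for img in kept:
    print(f"Keeping {img['path']} ({img['width']}x{img['height']})")
```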

OCR Scanned Documents

result = run(
    action="extract_pdf",
    file_path="/tmp/scanned-contract.pdf",
    extract=["text"],
    ocr_fallback=True,
    language="eng"
)
print(result["text"])
print(f"OCR used: {result['ocr_applied']}")
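
OCR output is usually noisier than natively extracted text, so when processing a batch it can help to track which files went through OCR. The sketch below partitions results by the `ocr_applied` flag shown above. `partition_by_ocr` and the sample mapping are hypothetical, and a missing flag is treated as "OCR not applied", which is an assumption.

```python
def partition_by_ocr(results):
    """Split a {file_path: result} mapping into (ocr_paths, native_paths)."""
    ocr_paths, native_paths = [], []
    for path, result in results.items():
        # Assumption: a missing 'ocr_applied' key means OCR was not used.
        if result.get("ocr_applied"):
            ocr_paths.append(path)
        else:
            native_paths.append(path)
    return ocr_paths, native_paths


# Hypothetical results keyed by file path; not real extractor output.
sample_results = {
    "/tmp/scanned-contract.pdf": {"text": "sample text", "ocr_applied": True},
    "/tmp/report.pdf": {"text": "sample text", "ocr_applied": False},
}

ocr_paths, native_paths = partition_by_ocr(sample_results)
print(f"Needs review (OCR): {ocr_paths}")
print(f"Native text: {native_paths}")
```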

API Reference

Capability     Description
extract_pdf    Extract text, tables, images, and metadata from PDF files with optional OCR

Requirements

No API keys required. All processing runs locally.

License

MIT

Capabilities

pdf_extraction · extract_pdf · tool

Permissions

This package declares the following access levels. Review before installing.

Network: none
Filesystem: temp
Code Execution: none
Data Access: input_only
User Approval: never

Stats

Downloads: 0
Installs: 0
Version: v1.0.0
Published: 3/16/2026
Channel: stable
Type: toolpack
Entrypoint: pdf_extractor_pack.tool

Compatibility

Frameworks

langchain · crewai · generic

Runtime

python

Python Version

>=3.10

Trust & Security

Publisher: Trusted
Signature: None
Provenance: None
Security Issues: 0

Publisher

AgentNode

@agentnode