PDF Extractor Pack
★ Trusted · v1.0.0 · MIT · ✔ Verified · 88 · by AgentNode · published 22 days ago · toolpack
Extract text, tables, and images from PDF documents.
Parse PDF files to extract structured text, data tables, embedded images, and metadata. Supports OCR for scanned documents via pytesseract.
Quick Start
agentnode install pdf-extractor-pack
Usage
From package
from pdf_extractor_pack.tool import run
result = run(
    action="extract_pdf",
    file_path="/tmp/q4-earnings-report.pdf",
    extract=["tables", "text"],
    pages="3-8"
)
print(f"Pages processed: {result['page_count']}")
print(f"Tables found: {len(result['tables'])}")
for i, table in enumerate(result["tables"]):
    print(f"\nTable {i+1} (page {table['page']})")
    print(f" Columns: {table['headers']}")
    for row in table["rows"][:3]:
        print(f" {row}")
    print(f" ... {len(table['rows'])} total rows")
Verification
Package installs and imports correctly. Runtime checks passed.
This package was executed and validated by AgentNode before listing. Install, import, and runtime checks passed.
Last verified 18d ago · Runner v2.0.0
Use this when you need to...
- Extract financial tables from quarterly earnings PDF reports
- Pull embedded images and charts from research papers
- Convert multi-page PDF contracts into structured JSON sections
- Extract metadata and document properties from legal filings
- OCR scanned PDF documents that contain no selectable text
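For the contract-to-JSON use case in the list above, one possible approach is to serialize the returned dictionary with the standard `json` module. The sketch below does not call the package. `sample_result` is a hand-written stand-in that mirrors only the keys shown in the usage example (`page_count`, `tables`, `page`, `headers`, `rows`), and `result_to_json` is a hypothetical helper name. The real result may carry other keys or values that need different handling.

```python
import json

# Hand-written stand-in for a result from run(...); the real shape is an assumption
sample_result = {
    "page_count": 2,
    "tables": [
        {"page": 1, "headers": ["Item", "Amount"], "rows": [["Fee", "100"], ["Tax", "8"]]},
    ],
}


def result_to_json(result: dict) -> str:
    """Serialize an extraction result to pretty-printed JSON (hypothetical helper)."""
    # default=str keeps the dump from failing on values json cannot encode natively
    return json.dumps(result, indent=2, ensure_ascii=False, default=str)


json_text = result_to_json(sample_result)
print(json_text)
```

Passing `default=str` is a defensive choice: if the result ever contains a value such as a path object or a date, it is written as a string instead of raising `TypeError`.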
README
PDF Extractor Pack
Extract text, tables, and images from PDF documents. Parse structured text, data tables, embedded images, and metadata with optional OCR for scanned documents.
Quick Start
agentnode install pdf-extractor-pack
from pdf_extractor_pack.tool import run
result = run(action="extract_pdf", file_path="/tmp/document.pdf")
print(result["text"])
Usage
Extract Text and Tables
result = run(
    action="extract_pdf",
    file_path="/tmp/report.pdf",
    extract=["text", "tables"],
    pages="1-10"
)
print(result["text"])
for table in result["tables"]:
    print(table["headers"], table["rows"][:2])
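Once a table is available as `headers` plus `rows`, it is often convenient to turn it into per-row dictionaries or CSV text. The following is a minimal sketch, assuming each table is a dict with a `headers` list and a `rows` list of lists, as the example above suggests. `sample_table`, `table_to_records`, and `table_to_csv` are illustrative names, not part of the package.

```python
import csv
import io

# Hand-written table in the shape used above (headers plus rows); not real package output
sample_table = {
    "headers": ["Quarter", "Revenue"],
    "rows": [["Q3", "1.2"], ["Q4", "1.5"]],
}


def table_to_records(table: dict) -> list[dict]:
    """Pair each row with the headers to get one dict per row (hypothetical helper)."""
    # zip stops at the shorter sequence, so ragged rows silently lose trailing cells
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


def table_to_csv(table: dict) -> str:
    """Render the table as CSV text using the standard csv module."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buffer.getvalue()


records = table_to_records(sample_table)
csv_text = table_to_csv(sample_table)
print(records)
print(csv_text)
```

If extracted rows can be shorter or longer than the header row, it may be worth validating lengths before calling `table_to_records`.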
Extract Embedded Images
result = run(
    action="extract_pdf",
    file_path="/tmp/brochure.pdf",
    extract=["images"],
    image_output_dir="/tmp/extracted_images/"
)
for img in result["images"]:
    print(f"Saved {img['path']} ({img['width']}x{img['height']})")
OCR Scanned Documents
result = run(
    action="extract_pdf",
    file_path="/tmp/scanned-contract.pdf",
    extract=["text"],
    ocr_fallback=True,
    language="eng"
)
print(result["text"])
print(f"OCR used: {result['ocr_applied']}")
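Because OCR goes through pytesseract, which wraps the separately installed Tesseract engine, a quick pre-flight check before enabling `ocr_fallback` can save a confusing failure. The sketch below uses only the standard library. `ocr_prerequisites` is a hypothetical helper, and whether the pack installs these pieces itself or degrades gracefully without them is not stated in this listing.

```python
import importlib.util
import shutil


def ocr_prerequisites() -> dict:
    """Report whether the pieces pytesseract relies on appear to be installed."""
    return {
        # True if the pytesseract Python package can be found for import
        "pytesseract_module": importlib.util.find_spec("pytesseract") is not None,
        # True if a tesseract executable is on PATH
        "tesseract_binary": shutil.which("tesseract") is not None,
    }


status = ocr_prerequisites()
print(status)
if not all(status.values()):
    print("OCR may be unavailable; consider installing the missing pieces first.")
```

The PATH lookup is a heuristic: pytesseract can be pointed at a Tesseract binary in a non-standard location, in which case this check reports a false negative.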
API Reference
| Capability | Description |
|---|---|
| extract_pdf | Extract text, tables, images, and metadata from PDF files with optional OCR |
Requirements
No API keys required. All processing runs locally.
License
MIT
Files (3)
Compatibility
Runtime: python
Python Version: >=3.10
Trust & Security
Publisher: AgentNode (@agentnode)