py-pdf / benchmarks
Benchmarking PDF libraries
☆321 · Updated 7 months ago
Alternatives and similar repositories for benchmarks
Users who are interested in benchmarks are comparing it to the libraries listed below.
- Python bindings to PDFium, reasonably cross-platform. ☆719 · Updated last week
- Extract structured text from PDFs quickly ☆656 · Updated 7 months ago
- Simple package to extract text with coordinates from programmatic PDFs ☆236 · Updated this week
- ☆201 · Updated this week
- A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. ☆552 · Updated 3 months ago
- Docling core data types and transformations ☆225 · Updated this week
- ☆182 · Updated last week
- ⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍 ☆642 · Updated 5 months ago
- Lightweight, performant, deep table extraction ☆524 · Updated 3 weeks ago
- UniTable: Towards a Unified Table Foundation Model ☆521 · Updated last year
- Fast lexical search implementing BM25 in Python using NumPy, Numba and SciPy ☆1,473 · Updated this week
- Late Interaction Models Training & Retrieval ☆694 · Updated 3 weeks ago
- Streamlit PDF viewer ☆195 · Updated this week
- Adobe PDFServices Python SDK Samples ☆161 · Updated 6 months ago
- Python API for https://vespa.ai, the open big data serving engine ☆158 · Updated this week
- Software that makes labeling PDFs easy. ☆426 · Updated last year
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆409 · Updated 3 years ago
- OCR Benchmark ☆613 · Updated 3 months ago
- Fast Multimodal Semantic Deduplication & Filtering ☆879 · Updated 2 weeks ago
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes ☆214 · Updated 4 months ago
- 📚 Process PDFs, Word documents and more with spaCy ☆847 · Updated 10 months ago
- Repository for deepdoctection tutorial notebooks ☆50 · Updated last month
- Lite & Super-fast re-ranking for your search & retrieval pipelines. Supports SoTA Listwise and Pairwise reranking based on LLMs and cro… ☆933 · Updated last month
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets ☆218 · Updated 2 years ago
- ☆392 · Updated 2 years ago
- Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc on task… ☆184 · Updated last year
- A Python library to chunk/group your texts based on semantic similarity. ☆103 · Updated last year
- Viewer for the structure extracted by Grobid on PDF documents ☆57 · Updated 2 months ago
- DocLLM: A layout-aware generative language model for multimodal document understanding ☆137 · Updated 2 years ago
- Generalist and Lightweight Model for Text Classification ☆169 · Updated last week