py-pdf / benchmarks
Benchmarking PDF libraries
☆315 Updated 5 months ago
Alternatives and similar repositories for benchmarks
Users who are interested in benchmarks are comparing it to the libraries listed below.
- Python bindings to PDFium, reasonably cross-platform. ☆684 Updated this week
- ☆199 Updated last week
- Simple package to extract text with coordinates from programmatic PDFs ☆218 Updated 3 weeks ago
- A python library to define and validate data types in Docling. ☆211 Updated this week
- Extract structured text from pdfs quickly ☆629 Updated 5 months ago
- A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. ☆486 Updated last month
- ☆169 Updated 3 weeks ago
- ☆389 Updated last year
- 📚 Process PDFs, Word documents and more with spaCy ☆816 Updated 8 months ago
- ⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍 ☆624 Updated 3 months ago
- Late Interaction Models Training & Retrieval ☆661 Updated 3 weeks ago
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆1,413 Updated this week
- Adobe PDFServices python SDK Samples ☆160 Updated 4 months ago
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆397 Updated 2 years ago
- Show the differences between two strings/text as a compact text, in markdown/HTML, in the terminal and more. ☆147 Updated last week
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆150 Updated last month
- UniTable: Towards a Unified Table Foundation Model ☆514 Updated last year
- Streamlit PDF viewer ☆189 Updated 3 weeks ago
- Software that makes labeling PDFs easy. ☆422 Updated last year
- Lite & Super-fast re-ranking for your search & retrieval pipelines. Supports SoTA Listwise and Pairwise reranking based on LLMs and cro… ☆892 Updated 2 months ago
- A Python library to chunk/group your texts based on semantic similarity. ☆100 Updated last year
- OCR Benchmark ☆595 Updated last month
- Excel spreadsheet crawler and table parser for data extraction and querying ☆164 Updated 9 months ago
- Fast Semantic Text Deduplication & Filtering ☆848 Updated last month
- A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode. ☆348 Updated last year
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes ☆214 Updated 2 months ago
- DocLLM: A layout-aware generative language model for multimodal document understanding ☆130 Updated last year
- Pinecone text client library ☆66 Updated 3 months ago
- LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing, optimized for cloud deployment. ☆65 Updated last year
- Lightweight, performant, deep table extraction ☆517 Updated 3 months ago