py-pdf / benchmarks
Benchmarking PDF libraries
☆226 · Updated last year
Related projects
Alternatives and complementary repositories for benchmarks
- Python bindings to PDFium ☆427 · Updated 3 weeks ago
- A fast and lightweight pure Python library for splitting text into semantically meaningful chunks. ☆182 · Updated 4 months ago
- ☆162 · Updated 3 weeks ago
- Streamlit PDF viewer ☆110 · Updated 3 weeks ago
- A spaCy wrapper for GliNER ☆91 · Updated 4 months ago
- multimodal document analysis ☆160 · Updated 5 months ago
- UniTable: Towards a Unified Table Foundation Model ☆377 · Updated 5 months ago
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes ☆183 · Updated last month
- Extract structured text from pdfs quickly ☆342 · Updated this week
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆908 · Updated last week
- Viewer for the structure extracted by Grobid on PDF documents ☆38 · Updated 2 weeks ago
- Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from R… ☆291 · Updated this week
- Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices. Appropriate for calculating the K … ☆75 · Updated 4 months ago
- Software that makes labeling PDFs easy. ☆391 · Updated 6 months ago
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆274 · Updated last year
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,095 · Updated last week
- Lite & Super-fast re-ranking for your search & retrieval pipelines. Supports SoTA Listwise and Pairwise reranking based on LLMs and cro… ☆665 · Updated last month
- Generalist and Lightweight Model for Relation Extraction (Extract any relationship types from text) ☆128 · Updated this week
- Easily embed, cluster and semantically label text datasets ☆463 · Updated 7 months ago
- RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF ☆548 · Updated 2 weeks ago
- 🦦 weasel: A small and easy workflow system ☆68 · Updated 4 months ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆173 · Updated last year
- This package, developed as part of our research detailed in the Chroma Technical Report, provides tools for text chunking and evaluation.… ☆160 · Updated last month
- Adobe PDFServices python SDK Samples ☆131 · Updated 2 weeks ago
- DocLLM: A layout-aware generative language model for multimodal document understanding ☆113 · Updated 10 months ago
- A Python client for the Unstructured hosted API ☆82 · Updated this week
- Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON) ☆348 · Updated 7 months ago
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets ☆203 · Updated last year
- `pdfstructure` detects, splits and organizes the document's text content into its natural structure as envisioned by the author. ☆101 · Updated 7 months ago
- SpanMarker for Named Entity Recognition ☆401 · Updated 3 months ago