datalab-to / pdftext
Extract structured text from pdfs quickly
☆605 Updated 3 months ago
Alternatives and similar repositories for pdftext
Users who are interested in pdftext are comparing it to the libraries listed below.
- Lightweight, performant, deep table extraction ☆510 Updated last month
- Detect and extract tables to markdown and csv ☆750 Updated 8 months ago
- TF-ID: Table/Figure IDentifier for academic papers ☆240 Updated last year
- A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. ☆375 Updated last month
- UniTable: Towards a Unified Table Foundation Model ☆506 Updated last year
- OCR Benchmark ☆570 Updated 4 months ago
- Simple package to extract text with coordinates from programmatic PDFs ☆197 Updated 2 weeks ago
- ☆194 Updated 3 weeks ago
- Python bindings to PDFium, reasonably cross-platform. ☆647 Updated this week
- Benchmarking PDF libraries ☆312 Updated 3 months ago
- Use late-interaction multi-modal models such as ColPali in just a few lines of code. ☆826 Updated 8 months ago
- Structured information extraction from documents ☆317 Updated last year
- Recipes for learning, fine-tuning, and adapting ColPali to your multimodal RAG use cases. 👨🏻‍🍳 ☆336 Updated 4 months ago
- Fast Semantic Text Deduplication & Filtering ☆810 Updated 3 weeks ago
- ☆147 Updated last month
- PyMuPDF4LLM ☆1,058 Updated 2 months ago
- 📚 Process PDFs, Word documents and more with spaCy ☆761 Updated 6 months ago
- Math OCR model that outputs LaTeX and markdown ☆1,078 Updated 8 months ago
- ☆236 Updated 3 months ago
- A python library to define and validate data types in Docling. ☆185 Updated this week
- Parse PDFs into markdown using Vision LLMs ☆429 Updated 2 weeks ago
- Vision-Augmented Retrieval and Generation (VARAG) - Vision first RAG Engine ☆481 Updated 2 months ago
- ☆124 Updated 7 months ago
- Code for explaining and evaluating late chunking (chunked pooling) ☆453 Updated 9 months ago
- ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows. ☆1,422 Updated last month
- TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inf… ☆205 Updated 4 months ago
- Unattended Lightweight Text Classifiers with LLM Embeddings ☆184 Updated last year
- Lite & Super-fast re-ranking for your search & retrieval pipelines. Supports SoTA Listwise and Pairwise reranking based on LLMs and cro… ☆863 Updated 2 weeks ago
- A very simple news crawler with a funny name ☆403 Updated this week
- High-performance retrieval engine for unstructured data ☆1,502 Updated 2 months ago