VikParuchuri / pdftext
Extract structured text from PDFs quickly
☆450 · Updated last month
Alternatives and similar repositories for pdftext:
Users who are interested in pdftext are comparing it to the libraries listed below.
- TF-ID: Table/Figure IDentifier for academic papers ☆230 · Updated 8 months ago
- Fast Semantic Text Deduplication ☆597 · Updated last week
- Detect and extract tables to markdown and csv ☆734 · Updated 2 months ago
- Structured information extraction from documents ☆312 · Updated 6 months ago
- Lightweight, performant, deep table extraction ☆440 · Updated this week
- UniTable: Towards a Unified Table Foundation Model ☆452 · Updated 9 months ago
- Use late-interaction multi-modal models such as ColPali in just a few lines of code. ☆757 · Updated 2 months ago
- Recipes for learning, fine-tuning, and adapting ColPali to your multimodal RAG use cases. 👨🏻‍🍳 ☆264 · Updated 3 months ago
- OCR Benchmark ☆332 · Updated this week
- A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. ☆272 · Updated this week
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,359 · Updated 2 weeks ago
- Python bindings to PDFium ☆552 · Updated 2 weeks ago
- ☆176 · Updated 2 weeks ago
- This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats. ☆1,203 · Updated this week
- RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF ☆842 · Updated this week
- ☆215 · Updated 3 months ago
- Simple package to extract text with coordinates from programmatic PDFs ☆85 · Updated last week
- A Comprehensive Benchmark for Document Parsing and Evaluation ☆305 · Updated last month
- Code for explaining and evaluating late chunking (chunked pooling) ☆355 · Updated 3 months ago
- FastAPI wrapper around DSPy ☆235 · Updated last year
- 📚 Process PDFs, Word documents and more with spaCy ☆500 · Updated 3 weeks ago
- A Docker-powered service for PDF document layout analysis. This service provides a powerful and flexible PDF analysis service. The servic… ☆287 · Updated this week
- High-performance retrieval engine for unstructured data ☆1,292 · Updated this week
- Developer APIs to Accelerate LLM Projects ☆1,620 · Updated 5 months ago
- Running Docling as an API service ☆196 · Updated this week
- HTML to Markdown converter and crawler. ☆532 · Updated last year
- 🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite ☆876 · Updated 2 weeks ago
- Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc on task… ☆160 · Updated 6 months ago
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆1,078 · Updated last week
- Solving data for LLMs - Create quality synthetic datasets! ☆145 · Updated 2 months ago