Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
☆1,692 Dec 21, 2024 Updated last year
Alternatives and similar repositories for extractous
Users who are interested in extractous are comparing it to the libraries listed below
- Vision infrastructure to turn complex documents into RAG/LLM-ready data ☆2,940 Sep 24, 2025 Updated 5 months ago
- Convert PDF to markdown + JSON quickly with high accuracy ☆32,069 Updated this week
- Fast, flexible LLM inference ☆6,653 Feb 27, 2026 Updated last week
- OCR, layout analysis, reading order, table recognition in 90+ languages ☆19,392 Updated this week
- OCR & Document Extraction using vision models ☆12,155 May 20, 2025 Updated 9 months ago
- Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents… ☆2,987 Dec 8, 2025 Updated 2 months ago
- Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean… ☆14,135 Updated this week
- A Comprehensive Toolkit for High-Quality PDF Content Extraction ☆9,433 Jan 3, 2025 Updated last year
- Get your documents ready for gen AI ☆54,754 Updated this week
- 💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows ☆12,247 Feb 25, 2026 Updated last week
- Developer-friendly OSS embedded retrieval library for multimodal AI. Search More; Manage Less. ☆9,275 Updated this week
- AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer. script performs an intelligent page-by-page analysis of PDF books, met… ☆1,577 Jan 20, 2025 Updated last year
- An open-source RAG-based tool for chatting with your documents. ☆25,168 Updated this week
- Plano is an AI-native proxy and data plane for agentic apps — with built-in orchestration, safety, observability, and smart LLM routing s… ☆5,841 Updated this week
- All-in-one platform for search, recommendations, RAG, and analytics offered via API ☆2,609 Jan 25, 2026 Updated last month
- Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking, and markdown formatting of scanned PDFs ☆2,880 Updated this week
- SoTA production-ready AI retrieval system. Agentic Retrieval-Augmented Generation (RAG) with a RESTful API. ☆7,711 Nov 7, 2025 Updated 4 months ago
- screenpipe turns your computer into a personal AI that knows everything you've done. record. search. automate. all local, all private, al… ☆17,068 Updated this week
- Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks ☆6,804 Dec 12, 2025 Updated 2 months ago
- Modern, fast, document parser written in 🦀 ☆572 Feb 18, 2026 Updated 2 weeks ago
- Detect and extract tables to markdown and csv ☆754 Jan 24, 2025 Updated last year
- SeekStorm - sub-millisecond full-text search library & multi-tenancy server in Rust ☆1,842 Feb 16, 2026 Updated 2 weeks ago
- Toolkit for linearizing PDFs for LLM datasets/training ☆16,947 Feb 19, 2026 Updated 2 weeks ago
- Multi-modal OCR pipeline optimized for ML training (text, figure, math, tables, diagrams) ☆682 May 20, 2025 Updated 9 months ago
- ContextGem: Effortless LLM extraction from documents ☆1,805 Feb 22, 2026 Updated last week
- Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model ☆8,089 Feb 10, 2025 Updated last year
- OpenSource Production ready Customer service with built in Evals and monitoring ☆1,437 Jan 12, 2026 Updated last month
- Python tool for converting files and office documents to Markdown. ☆88,637 Feb 20, 2026 Updated 2 weeks ago
- This React component is used to render Markdown into a beautiful poster image, with support for copying as an image. Md to Poster/Image/Q… ☆1,856 Mar 5, 2025 Updated last year
- High-performance retrieval engine for unstructured data ☆1,562 Nov 10, 2025 Updated 3 months ago
- MemFree - Hybrid AI Search Engine & AI Page Generator ☆1,490 Aug 8, 2025 Updated 6 months ago
- Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust ☆14,638 Updated this week
- The first open-source agent skills builder. Define skills by vibe workflow, run on Claude Code, Cursor, Codex & more. Build Clawdbot 🦞· … ☆6,845 Feb 28, 2026 Updated last week
- ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows. ☆1,485 Aug 27, 2025 Updated 6 months ago
- No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents ☆6,452 Updated this week
- ⚙️🦀 Build modular and scalable LLM Applications in Rust ☆6,221 Updated this week
- Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows. ☆55,275 Updated this week
- 📃 A better UX for chat, writing content, and coding with LLMs. ☆5,389 Feb 25, 2026 Updated last week
- Minimalist ML framework for Rust ☆19,509 Feb 28, 2026 Updated last week