Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
☆1,754 · Dec 21, 2024 · Updated last year
Alternatives and similar repositories for extractous
Users who are interested in extractous are comparing it to the libraries listed below.
- Convert PDF to markdown + JSON quickly with high accuracy ☆35,659 · May 5, 2026 · Updated last month
- Vision infrastructure to turn complex documents into RAG/LLM-ready data ☆2,945 · Apr 9, 2026 · Updated 2 months ago
- Fast, flexible LLM inference ☆7,255 · Updated this week
- Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean… ☆14,841 · Updated this week
- OCR, layout analysis, reading order, table recognition in 90+ languages ☆20,618 · Jun 2, 2026 · Updated last week
- OCR & Document Extraction using vision models ☆12,236 · May 20, 2025 · Updated last year
- Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents… ☆3,103 · Dec 8, 2025 · Updated 6 months ago
- A Comprehensive Toolkit for High-Quality PDF Content Extraction ☆9,696 · Jan 3, 2025 · Updated last year
- Get your documents ready for gen AI ☆60,897 · Updated this week
- Modern, fast, document parser written in 🦀 ☆604 · Apr 26, 2026 · Updated last month
- 💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows ☆12,622 · Jun 1, 2026 · Updated last week
- AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer. script performs an intelligent page-by-page analysis of PDF books, met… ☆2,144 · Jan 20, 2025 · Updated last year
- SoTA production-ready AI retrieval system. Agentic Retrieval-Augmented Generation (RAG) with a RESTful API. ☆7,872 · Nov 7, 2025 · Updated 7 months ago
- An open-source RAG-based tool for chatting with your documents. ☆25,438 · Updated this week
- Multi-modal OCR pipeline optimized for ML training (text, figure, math, tables, diagrams) ☆682 · May 13, 2026 · Updated 3 weeks ago
- ContextGem: Effortless LLM extraction from documents ☆1,844 · Updated this week
- Developer-friendly OSS embedded retrieval library for multimodal AI. Search More; Manage Less. ☆10,504 · Updated this week
- Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking, and markdown formatting of scanned PDFs ☆2,929 · Mar 22, 2026 · Updated 2 months ago
- This React component is used to render Markdown into a beautiful poster image, with support for copying as an image. Md to Poster/Image/Q… ☆1,938 · Mar 5, 2025 · Updated last year
- Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model ☆8,137 · Feb 10, 2025 · Updated last year
- SeekStorm: vector & lexical search - in-process library & multi-tenancy server, in Rust. ☆1,888 · Updated this week
- All-in-one platform for search, recommendations, RAG, and analytics offered via API ☆2,672 · Jan 25, 2026 · Updated 4 months ago
- ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows. ☆1,559 · Aug 27, 2025 · Updated 9 months ago
- Toolkit for linearizing PDFs for LLM datasets/training ☆17,375 · Mar 25, 2026 · Updated 2 months ago
- High-performance retrieval engine for unstructured data ☆1,584 · Nov 10, 2025 · Updated 6 months ago
- Plano is an AI-native proxy and data plane for agentic apps — with built-in orchestration, safety, observability, and smart LLM routing s… ☆6,565 · Jun 1, 2026 · Updated last week
- Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust ☆15,305 · Jun 2, 2026 · Updated last week
- YC (S26) | AI that knows what you've seen, said, or heard. Records everything you do, say, hear 24/7, local, private, secure ☆19,062 · Jun 2, 2026 · Updated last week
- Detect and extract tables to markdown and csv ☆752 · Jan 24, 2025 · Updated last year
- Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks ☆7,493 · Dec 12, 2025 · Updated 5 months ago
- No-code ETL and data pipelines with AI and NLP ☆317 · Feb 20, 2025 · Updated last year
- Minimalist ML framework for Rust ☆20,426 · Updated this week
- ⚙️🦀 Build modular and scalable LLM Applications in Rust ☆7,497 · Jun 2, 2026 · Updated last week
- MemFree - Hybrid AI Search Engine & AI Page Generator ☆1,496 · Aug 8, 2025 · Updated 10 months ago
- LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows ☆6,635 · Updated this week
- Python tool for converting files and office documents to Markdown. ☆146,834 · May 26, 2026 · Updated 2 weeks ago
- pingcap/autoflow is a Graph RAG based and conversational knowledge base tool built with TiDB Serverless Vector Storage. Demo: https://tid… ☆2,788 · Apr 27, 2026 · Updated last month
- Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows. ☆66,024 · May 31, 2026 · Updated last week
- An example of how to use Rust to parse PDF's in Elixir and LiveView ☆29 · Jan 29, 2025 · Updated last year