yobix-ai/extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

☆1,767

Alternatives and similar repositories for extractous

Users who are interested in extractous are comparing it to the libraries listed below.

datalab-to / marker
Convert PDF to markdown + JSON quickly with high accuracy
☆37,711 · Updated this week
shcherbak-ai / contextgem
ContextGem: Effortless LLM extraction from documents
☆1,856 · Jun 6, 2026 · Updated last month
xberg-io / xberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office…
☆8,682 · Updated this week
lumina-ai-inc / chunkr
Vision infrastructure to turn complex documents into RAG/LLM-ready data
☆4,038 · Apr 9, 2026 · Updated 3 months ago
Unstructured-IO / unstructured
Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean…
☆15,176 · Updated this week
EricLBuehler / mistral.rs
Fast, flexible LLM inference
☆7,508 · Updated this week
docling-project / docling
Get your documents ready for gen AI
☆63,561 · Updated this week
getomni-ai / zerox
OCR & Document Extraction using vision models
☆12,260 · May 20, 2025 · Updated last year
datalab-to / surya
OCR, layout analysis, reading order, table recognition in 90+ languages
☆21,130 · Updated this week
lancedb / lancedb
Developer-friendly OSS embedded retrieval library for multimodal AI. Search More; Manage Less.
☆10,945 · Updated this week
CatchTheTornado / text-extract-api
Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents…
☆3,143 · Dec 8, 2025 · Updated 7 months ago
NanoNets / docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
☆2,032 · Mar 17, 2026 · Updated 4 months ago
Anush008 / fastembed-rs
Rust library for generating vector embeddings, reranking locally!
☆970 · Updated this week
katanemo / plano
Plano is an AI-native proxy server and data plane for agentic apps. Smart LLM routing, observability, agent orchestration, and guardrails…
☆6,882 · Updated this week
echohive42 / AI-reads-books-page-by-page
AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer. script performs an intelligent page-by-page analysis of PDF books, met…
☆2,291 · Jun 27, 2026 · Updated 3 weeks ago
Cinnamon / kotaemon
An open-source RAG-based tool for chatting with your documents.
☆25,572 · Jul 14, 2026 · Updated last week
0xPlaygrounds / rig
⚙️🦀 Build modular and scalable LLM Applications in Rust
☆7,992 · Updated this week
ucbepic / docetl
A system for agentic LLM-powered data processing and ETL
☆3,909 · Updated this week
opendatalab / PDF-Extract-Kit
A Comprehensive Toolkit for High-Quality PDF Content Extraction
☆9,797 · Jan 3, 2025 · Updated last year
neuml / txtai
💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows
☆12,741 · Updated this week
allenai / olmocr
Toolkit for linearizing PDFs for LLM datasets/training
☆19,151 · Mar 25, 2026 · Updated 3 months ago
Michael-A-Kuykendall / shimmy
⚡ Pure-Rust WebGPU inference engine, OpenAI-API compatible, GGUF native, runs on any GPU. No Python. No llama.cpp. Single binary.
☆5,678 · Updated this week
screenpipe / screenpipe
YC (S26) | Record your screen 24/7 and plug into your agents. Local, private, secure. Connect to OpenClaw, Hermes agent and 100+ apps
☆20,362 · Updated this week
google / langextract
A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive vi…
☆37,641 · Jul 2, 2026 · Updated 2 weeks ago
bytedance / Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
☆9,037 · Mar 25, 2026 · Updated 3 months ago
gcui-art / markdown-to-image
This React component is used to render Markdown into a beautiful poster image, with support for copying as an image. Md to Poster/Image/Q…
☆1,951 · Mar 5, 2025 · Updated last year
refly-ai / refly
The first open-source agent skills builder. Define skills by vibe workflow, run on Claude Code, Cursor, Codex & more. Build Clawdbot 🦞· …
☆7,450 · Mar 25, 2026 · Updated 3 months ago
lance-format / lance
Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data ve…
☆6,832 · Updated this week
AmineDiro / ferrules
Modern, fast, document parser written in 🦀
☆613 · Apr 26, 2026 · Updated 2 months ago
HelixDB / helix-db
HelixDB is an OLTP graph-vector database built in Rust on Object Storage.
☆5,657 · Updated this week
pykeio / ort
Fast ML inference & training for ONNX models in Rust
☆2,412 · Updated this week
morphik-org / morphik-core
Open-source multimodal retrieval engine (Morphik Core). By Morphik, AI back office for skilled nursing & senior living (morphik.ai).
☆3,633 · Jul 5, 2026 · Updated 2 weeks ago
Ucas-HaoranWei / GOT-OCR2.0
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
☆8,155 · Feb 10, 2025 · Updated last year
SciPhi-AI / R2R
SoTA production-ready AI retrieval system. Agentic Retrieval-Augmented Generation (RAG) with a RESTful API.
☆7,937 · Nov 7, 2025 · Updated 8 months ago
benbrandt / text-splitter
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from R…
☆621 · Updated this week
cocoindex-io / cocoindex
Incremental engine for long horizon agents 🌟 Star if you like it!
☆10,977 · Updated this week
superradcompany / microsandbox
🧱 easy, fast and local-first microVM runtime
☆6,987 · Updated this week
raphael-seo / Versatile-OCR-Program
Multi-modal OCR pipeline optimized for ML training (text, figure, math, tables, diagrams)
☆677 · May 13, 2026 · Updated 2 months ago
memfreeme / memfree
MemFree - Hybrid AI Search Engine & AI Page Generator
☆1,506 · Jul 6, 2026 · Updated 2 weeks ago