NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

☆2,031

Alternatives and similar repositories for docext

Users that are interested in docext are comparing it to the libraries listed below.

chatdoc-com / OCRFlux
OCRFlux is a lightweight yet powerful multimodal toolkit that significantly advances PDF-to-Markdown conversion, excelling in complex lay…
☆2,523 · Apr 14, 2026 · Updated 3 months ago
bytedance / Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
☆9,038 · Mar 25, 2026 · Updated 3 months ago
Yuliang-Liu / MonkeyOCR
A lightweight LMM-based Document Parsing Model
☆6,608 · Updated this week
studio-dots-ai / dots.ocr
Multilingual Document Layout Parsing in a Single Vision-Language Model
☆9,025 · Mar 24, 2026 · Updated 3 months ago
allenai / olmocr
Toolkit for linearizing PDFs for LLM datasets/training
☆19,164 · Mar 25, 2026 · Updated 3 months ago
shcherbak-ai / contextgem
ContextGem: Effortless LLM extraction from documents
☆1,859 · Jun 6, 2026 · Updated last month
chatclimate-ai / ParseStudio
python package to parse pdfs with different parsers
☆269 · Sep 12, 2025 · Updated 10 months ago
NanoNets / docstrange
Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with int…
☆1,508 · Oct 31, 2025 · Updated 8 months ago
opendatalab / DocLayout-YOLO
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
☆2,234 · Apr 14, 2025 · Updated last year
Ucas-HaoranWei / GOT-OCR2.0
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
☆8,153 · Feb 10, 2025 · Updated last year
datalab-to / surya
OCR, layout analysis, reading order, table recognition in 90+ languages
☆21,140 · Updated this week
docling-project / docling
Get your documents ready for gen AI
☆63,674 · Updated this week
datalab-to / marker
Convert PDF to markdown + JSON quickly with high accuracy
☆37,784 · Updated this week
opendatalab / MinerU
Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.
☆75,550 · Updated this week
google / langextract
A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive vi…
☆37,771 · Jul 2, 2026 · Updated 3 weeks ago
landing-ai / agentic-doc
Legacy Python library for Agentic Document Extraction (ADE). Use the landingai-ade library for all new projects.
☆2,396 · Mar 24, 2026 · Updated 3 months ago
getomni-ai / zerox
OCR & Document Extraction using vision models
☆12,259 · May 20, 2025 · Updated last year
PaddlePaddle / PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/…
☆86,108 · Updated this week
opendatalab / OmniDocBench
[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation
☆1,910 · Jun 26, 2026 · Updated 3 weeks ago
opendatalab / PDF-Extract-Kit
A Comprehensive Toolkit for High-Quality PDF Content Extraction
☆9,804 · Jan 3, 2025 · Updated last year
alibaba / Logics-Parsing
☆1,393 · May 13, 2026 · Updated 2 months ago
ucbepic / docetl
A system for agentic LLM-powered data processing and ETL
☆3,922 · Updated this week
morphik-org / morphik-core
Open-source multimodal retrieval engine (Morphik Core). By Morphik — AI back office for skilled nursing & senior living (morphik.ai).
☆3,634 · Updated this week
infiniflow / ragflow
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to creat…
☆85,786 · Updated this week
getzep / graphiti
Build Real-Time Knowledge Graphs for AI Agents
☆29,110 · Updated this week
RapidAI / RapidOCR
📄 Awesome OCR multiple programing languages toolkits based on ONNX Runtime, OpenVINO, MNN, PaddlePaddle, TensorRT and PyTorch.
☆7,244 · Updated this week
agno-agi / agno
Build, run, and manage agent platforms.
☆41,381 · Updated this week
ispras / dedoc
Dedoc is a library (service) for automate documents parsing and bringing to a uniform format. It automatically extracts content, logical …
☆716 · Updated this week
yobix-ai / extractous
Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
☆1,767 · Dec 21, 2024 · Updated last year
unslothai / unsloth
Unsloth is a local UI for training and running Gemma 4, Qwen3.6, DeepSeek, Kimi, GLM and other models.
☆68,793 · Updated this week
datalab-to / chandra
OCR model that handles complex tables, forms, handwriting with full layout.
☆11,763 · Jun 26, 2026 · Updated 3 weeks ago
huridocs / pdf-document-layout-analysis
A Docker-powered service for PDF document layout analysis. This service provides a powerful and flexible PDF analysis service. The servic…
☆1,273 · Jul 13, 2026 · Updated last week
Cinnamon / kotaemon
An open-source RAG-based tool for chatting with your documents.
☆25,579 · Jul 14, 2026 · Updated last week
HKUDS / LightRAG
[EMNLP2025] "LightRAG: Simple and Fast Retrieval-Augmented Generation"
☆38,033 · Updated this week
xberg-io / xberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office…
☆8,690 · Updated this week
X-PLUG / mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
☆2,409 · May 30, 2025 · Updated last year
QwenLM / Qwen-Agent
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
☆16,840 · Mar 4, 2026 · Updated 4 months ago
Unstructured-IO / unstructured
Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean…
☆15,189 · Updated this week
VectifyAI / PageIndex
📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG
☆34,187 · Updated this week