CatchTheTornado/text-extract-api

Document (PDF, Word, PPTX ...) extraction and parsing API using state-of-the-art OCRs and Ollama-supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.

☆3,152

Alternatives and similar repositories for text-extract-api

Users that are interested in text-extract-api are comparing it to the libraries listed below.

QuivrHQ / MegaParse
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
☆7,410 · Feb 21, 2025 · Updated last year
getomni-ai / zerox
OCR & Document Extraction using vision models
☆12,261 · May 20, 2025 · Updated last year
datalab-to / surya
OCR, layout analysis, reading order, table recognition in 90+ languages
☆21,176 · Updated this week
echohive42 / AI-reads-books-page-by-page
AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer. script performs an intelligent page-by-page analysis of PDF books, met…
☆2,294 · Jun 27, 2026 · Updated last month
lumina-ai-inc / chunkr
Vision infrastructure to turn complex documents into RAG/LLM-ready data
☆4,055 · Apr 9, 2026 · Updated 3 months ago
imanoop7 / Ollama-OCR
☆2,680 · Mar 17, 2025 · Updated last year
opendatalab / PDF-Extract-Kit
A Comprehensive Toolkit for High-Quality PDF Content Extraction
☆9,811 · Jan 3, 2025 · Updated last year
docling-project / docling
Get your documents ready for gen AI
☆63,950 · Updated this week
enoch3712 / ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
☆1,587 · Aug 27, 2025 · Updated 11 months ago
GitHamza0206 / simba
OpenSource Production ready Customer service with built in Evals and monitoring
☆1,451 · Jun 18, 2026 · Updated last month
Cinnamon / kotaemon
An open-source RAG-based tool for chatting with your documents.
☆25,666 · Jul 14, 2026 · Updated 2 weeks ago
ucbepic / docetl
A system for agentic LLM-powered data processing and ETL
☆3,950 · Jul 21, 2026 · Updated last week
Nutlope / llama-ocr
Document to Markdown OCR library with Llama 3.2 vision
☆2,429 · Jul 12, 2026 · Updated 2 weeks ago
allenai / olmocr
Toolkit for linearizing PDFs for LLM datasets/training
☆19,209 · Mar 25, 2026 · Updated 4 months ago
yigitkonur / api-llm-ocr
PDF to markdown using vision LLMs — tables, layouts, and structure preserved
☆899 · Feb 21, 2026 · Updated 5 months ago
datalab-to / marker
Convert PDF to markdown + JSON quickly with high accuracy
☆37,994 · Jul 20, 2026 · Updated last week
steel-dev / steel-browser
🔥 Open Source Browser API for AI Agents & Apps. Steel Browser is a batteries-included browser sandbox that lets you automate the web wit…
☆7,395 · Updated this week
VikParuchuri / tabled
Detect and extract tables to markdown and csv
☆748 · Jan 24, 2025 · Updated last year
wisupai / e2m
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with ded…
☆1,295 · Sep 8, 2024 · Updated last year
katanaml / sparrow
Structured data extraction, instruction calling and agentic workflows with ML, LLM and Vision LLM
☆5,188 · Jun 30, 2026 · Updated 3 weeks ago
microsoft / data-formulator
🪄 Data Formulator is an interactive AI-powered data analysis system that makes it easy to connect, explore and visualize data.
☆15,987 · Updated this week
SouthBridgeAI / offmute
An experiment in meeting transcription and diarization with just an LLM. Maybe I went a little overboard though
☆568 · Apr 8, 2026 · Updated 3 months ago
tjmlabs / ColiVara
Colivara is a suite of services that allows you to store, search, and retrieve documents based on their visual embedding. ColiVara has st…
☆1,484 · Jul 8, 2026 · Updated 3 weeks ago
Nutlope / logocreator
A free + OSS logo generator powered by Flux on Together AI
☆7,182 · Jun 26, 2026 · Updated last month
superlinear-ai / raglite
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
☆1,196 · Jul 9, 2026 · Updated 2 weeks ago
langchain-ai / open-canvas
📃 A better UX for chat, writing content, and coding with LLMs.
☆5,494 · Feb 25, 2026 · Updated 5 months ago
agno-agi / agno
Build, run, and manage agent platforms.
☆41,489 · Updated this week
shcherbak-ai / contextgem
ContextGem: Effortless LLM extraction from documents
☆1,865 · Updated this week
bytedance / Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
☆9,041 · Mar 25, 2026 · Updated 4 months ago
gptme / gptme
Your agent in your terminal, equipped with local tools: writes code, uses the terminal, browses the web. Make your own persistent autonom…
☆4,372 · Updated this week
bjesus / pipet
Swiss-army tool for scraping and extracting data from online assets, made for hackers
☆4,759 · Oct 12, 2024 · Updated last year
Ucas-HaoranWei / GOT-OCR2.0
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
☆8,212 · Feb 10, 2025 · Updated last year
adithya-s-k / omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
☆7,650 · Dec 12, 2025 · Updated 7 months ago
opendatalab / MinerU
Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.
☆76,160 · Updated this week
egoist / sitefetch
Fetch an entire site and save it as a text file (to be used with AI models).
☆1,734 · Jan 18, 2025 · Updated last year
Dicklesworthstone / llm_aided_ocr
Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking, and markdown formatting of scanned PDFs
☆2,947 · Mar 22, 2026 · Updated 4 months ago
microsoft / PromptWizard
Task-Aware Agent-driven Prompt Optimization Framework
☆3,906 · Oct 13, 2025 · Updated 9 months ago
unclecode / crawl4ai
🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN
☆75,515 · Updated this week
zaidmukaddam / scira
Scira (Formerly MiniPerplx) is a minimalistic AI-powered search engine that helps you find information on the internet and cites it too. …
☆11,817 · Mar 20, 2026 · Updated 4 months ago