CatchTheTornado / text-extract-api
Document (PDF, Word, PPTX, ...) extraction and parsing API using state-of-the-art OCRs and Ollama-supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
☆2,546 Updated last week
Alternatives and similar repositories for text-extract-api:
Users who are interested in text-extract-api are comparing it to the libraries listed below.
- AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer. The script performs an intelligent page-by-page analysis of PDF books, met… ☆1,465 Updated 3 months ago
- Vision infrastructure to turn complex documents into RAG/LLM-ready data ☆2,124 Updated this week
- File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs. ☆5,997 Updated 2 months ago
- Document to Markdown OCR library with Llama 3.2 vision ☆2,262 Updated 3 months ago
- Portable KMS (knowledge management system) designed to integrate seamlessly with any Retrieval-Augmented Generation (RAG) system ☆1,184 Updated last week
- ☆1,464 Updated last month
- An open-source OCR API that leverages OpenAI's powerful language models with optimized performance techniques like parallel processing an… ☆848 Updated 7 months ago
- 🔥 Open Source Browser API for AI Agents & Apps. Steel Browser is a batteries-included browser instance that lets you automate the web wi… ☆4,239 Updated this week
- A text extraction library supporting PDFs, images, office documents and more ☆1,784 Updated 2 weeks ago
- E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with ded… ☆1,070 Updated 7 months ago
- A visual playground for agentic workflows: Iterate over your agents 10x faster ☆4,611 Updated 2 weeks ago
- NVIDIA Ingest is an early access set of microservices for parsing hundreds of thousands of complex, messy unstructured PDFs and other ent… ☆2,653 Updated this week
- Company Researcher tool helps you instantly understand any company inside out. ☆1,158 Updated 2 months ago
- Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc) int… ☆552 Updated last month
- 🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be! ☆2,932 Updated this week
- SOTA Open-Source Browser Agent for autonomously performing complex tasks on the web ☆1,548 Updated this week
- Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages. ☆1,064 Updated 4 months ago
- The ultimate LLM Ops platform - Monitoring, Analytics, Evaluations, Datasets and Prompt Optimization ✨ ☆1,308 Updated this week
- Open source multi-modal RAG for building AI apps over private knowledge. ☆1,667 Updated this week
- Turn any webpage into structured data using LLMs ☆4,762 Updated 7 months ago
- Swiss-army tool for scraping and extracting data from online assets, made for hackers ☆3,442 Updated 6 months ago
- Transform PDFs into AI podcasts for engaging on-the-go audio content. ☆623 Updated last week
- The Python library for real-time communication ☆3,750 Updated this week
- A free and open-source, self-hosted, AI-based live meeting note taker and minutes summary generator that can run completely in your Local … ☆4,453 Updated this week
- Build Real-Time Knowledge Graphs for AI Agents ☆4,092 Updated this week
- An Open Source implementation of Notebook LM with more flexibility and features ☆1,371 Updated 2 weeks ago
- Toolkit for linearizing PDFs for LLM datasets/training ☆11,889 Updated this week
- A superfast full-text search application ☆1,088 Updated 4 months ago
- Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections. ☆2,623 Updated last month
- A Comprehensive Toolkit for High-Quality PDF Content Extraction ☆7,431 Updated 3 months ago