CatchTheTornado / text-extract-api
Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown
★ 2,509 · Updated last week
Alternatives and similar repositories for text-extract-api:
Users who are interested in text-extract-api are comparing it to the libraries listed below.
- File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs. ★ 5,917 · Updated last month
- ★ 1,391 · Updated 2 weeks ago
- Document to Markdown OCR library with Llama 3.2 vision ★ 2,238 · Updated 2 months ago
- Vision infrastructure to turn complex documents into RAG/LLM-ready data ★ 2,090 · Updated this week
- AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer. script performs an intelligent page-by-page analysis of PDF books, met… ★ 1,440 · Updated 2 months ago
- E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It's easy to install, with ded… ★ 1,057 · Updated 6 months ago
- 🕷️ An undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy again! ★ 2,798 · Updated last week
- 🔥 Open Source Browser API for AI Agents & Apps. Steel Browser is a batteries-included browser instance that lets you automate the web wi… ★ 4,073 · Updated this week
- Detect and extract tables to markdown and csv ★ 734 · Updated 2 months ago
- Portable KMS (knowledge management system) designed to integrate seamlessly with any Retrieval-Augmented Generation (RAG) system ★ 1,150 · Updated this week
- Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc) int… ★ 476 · Updated 3 weeks ago
- Turn any webpage into structured data using LLMs ★ 4,679 · Updated 7 months ago
- A Powerful web scraper powered by LLM | OpenAI, Gemini & Ollama ★ 1,647 · Updated 2 months ago
- Company Researcher tool helps you instantly understand any company inside out. ★ 1,142 · Updated 2 months ago
- A Comprehensive Toolkit for High-Quality PDF Content Extraction ★ 7,222 · Updated 3 months ago
- Colivara is a suite of services that allows you to store, search, and retrieve documents based on their visual embedding. ColiVara has st… ★ 879 · Updated last month
- NVIDIA Ingest is an early access set of microservices for parsing hundreds of thousands of complex, messy unstructured PDFs and other ent… ★ 2,631 · Updated this week
- A text extraction library supporting PDFs, images, office documents and more ★ 1,712 · Updated this week
- Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages. ★ 1,031 · Updated 3 months ago
- ★ 2,685 · Updated last week
- Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections. ★ 2,595 · Updated last month
- Task-Aware Agent-driven Prompt Optimization Framework ★ 3,073 · Updated last week
- 🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite ★ 876 · Updated 2 weeks ago
- Sample apps to help developers get started with Structured Outputs ★ 622 · Updated 2 months ago
- ★ 1,205 · Updated 6 months ago
- Keep searching, reading webpages, reasoning until it finds the answer (or exceeding the token budget) ★ 3,725 · Updated this week
- AI-powered markdown note taking app - leverage vector embeddings and LLMs with your knowledge base - 100% local or in the cloud ★ 1,184 · Updated this week
- A system for agentic LLM-powered data processing and ETL ★ 1,728 · Updated last week
- AI Agent Framework For Software Engineers ★ 1,136 · Updated this week
- 💼 Your own AI-powered voice interviewer for hiring. ★ 763 · Updated 3 weeks ago