Goldziher / kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
☆2,314 Updated this week
Alternatives and similar repositories for kreuzberg
Users who are interested in kreuzberg are comparing it to the libraries listed below.
- A powerful document AI question-answering tool that connects to your local Ollama models. Create, manage, and interact with RAG systems f… ☆1,066 Updated 3 weeks ago
- ☆862 Updated 3 months ago
- Concurrent Python made simple ☆1,463 Updated 6 months ago
- Detect and extract tables to markdown and csv ☆753 Updated 7 months ago
- Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages. ☆1,228 Updated 8 months ago
- Multi-modal OCR pipeline optimized for ML training (text, figure, math, tables, diagrams) ☆665 Updated 3 months ago
- A self-hosted API that takes a URL and returns a file with browser screenshots. ☆1,042 Updated 5 months ago
- An open-source OCR API that leverages OpenAI's powerful language models with optimized performance techniques like parallel processing an… ☆867 Updated 11 months ago
- High Performance IDE for Jupyter Notebooks ☆2,219 Updated last week
- The SOTA Open-Source Browser Agent for autonomously performing complex tasks on the web ☆2,319 Updated 2 months ago
- Open-source platform for extracting structured data from documents using AI. ☆1,406 Updated 3 months ago
- 🦛 CHONK your texts with Chonkie ✨ — The no-nonsense RAG chunking library ☆2,076 Updated this week
- The most accurate document search and store for building AI apps ☆3,152 Updated this week
- Lightweight library for scraping websites with LLMs ☆1,212 Updated this week
- CLI tool and Python library to inspect databases fast. ☆498 Updated 2 months ago
- ☆1,994 Updated 5 months ago
- Deep inspection of Python objects ☆1,888 Updated last week
- Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents… ☆2,815 Updated 3 weeks ago
- Turn docstrings into LLM-functions ☆499 Updated 4 months ago
- A hub for various industry-specific schemas to be used with VLMs. ☆532 Updated 3 months ago
- PgQueuer is a Python library leveraging PostgreSQL for efficient job queuing. ☆1,341 Updated last week
- NeMo Retriever extraction is a scalable, performance-oriented document content and metadata extraction microservice. NeMo Retriever extra… ☆2,733 Updated this week
- 🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL ☆1,056 Updated this week
- Vision infrastructure to turn complex documents into RAG/LLM-ready data ☆2,796 Updated last week
- A web framework for building products with Python. ☆620 Updated this week
- A Python Library for Generating PDFs and Images from HTML, powered by PlutoBook ☆501 Updated this week
- ContextGem: Effortless LLM extraction from documents ☆1,477 Updated this week
- Portable KMS (knowledge management system) designed to integrate seamlessly with any Retrieval-Augmented Generation (RAG) system ☆1,350 Updated 3 weeks ago
- A tool for Python developers to easily debug the HTTP(S) client and server requests in a Python program. ☆876 Updated this week
- Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections. ☆2,742 Updated 6 months ago