Goldziher / kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Go, and TypeScript/Node.js—or use via CLI, REST API, or MCP server.

☆2,538

Alternatives and similar repositories for kreuzberg

Users who are interested in kreuzberg are comparing it to the libraries listed below.

PragmaticMachineLearning / probly
☆876 Updated 6 months ago
pyper-dev / pyper
Concurrent Python made simple
☆1,510 Updated 9 months ago
DonTizi / rlama
A powerful document AI question-answering tool that connects to your local Ollama models. Create, manage, and interact with RAG systems f…
☆1,085 Updated 3 months ago
yobix-ai / extractous
Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
☆1,634 Updated 11 months ago
goodreasonai / ScrapeServ
A self-hosted API that takes a URL and returns a file with browser screenshots.
☆1,050 Updated 8 months ago
lmnr-ai / index
The SOTA Open-Source Browser Agent for autonomously performing complex tasks on the web
☆2,323 Updated 5 months ago
yigitkonur / llm-based-ocr
High-accuracy PDF-to-Markdown OCR API using LLMs with vision capabilities. Features parallel processing, batching, and auto-retry logic f…
☆875 Updated this week
VikParuchuri / tabled
Detect and extract tables to markdown and csv
☆755 Updated 10 months ago
plutoprint / plutoprint
A Python Library for Generating PDFs and Images from HTML, powered by PlutoBook
☆1,004 Updated last week
ses4255 / Versatile-OCR-Program
Multi-modal OCR pipeline optimized for ML training (text, figure, math, tables, diagrams)
☆678 Updated 6 months ago
shcherbak-ai / contextgem
ContextGem: Effortless LLM extraction from documents
☆1,727 Updated 2 weeks ago
evangelosmeklis / peepdb
CLI tool and Python library to inspect databases fast.
☆496 Updated 5 months ago
morphik-org / morphik-core
The most accurate document search and store for building AI apps
☆3,383 Updated last week
chonkie-inc / chonkie
🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust RAG pipelines
☆3,288 Updated this week
DocumindHQ / documind
Open-source platform for extracting structured data from documents using AI.
☆1,453 Updated 6 months ago
CatchTheTornado / text-extract-api
Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents…
☆2,940 Updated 2 months ago
gsidhu / buzee-tauri
A superfast full-text search application
☆1,137 Updated 2 months ago
Pravko-Solutions / FlashLearn
Integrate LLM in any pipeline - fit/predict pattern, JSON driven flows, and built-in concurrency support.
☆606 Updated 8 months ago
GitHamza0206 / simba
Portable KMS (knowledge management system) designed to integrate seamlessly with any Retrieval-Augmented Generation (RAG) system
☆1,379 Updated 3 months ago
zasper-io / zasper
High Performance IDE for Jupyter Notebooks
☆2,269 Updated last month
lumina-ai-inc / chunkr
Vision infrastructure to turn complex documents into RAG/LLM-ready data
☆2,913 Updated 2 months ago
koaning / smartfunc
Turn docstrings into LLM-functions
☆509 Updated 3 weeks ago
ofek / pyapp
Runtime installer for Python applications
☆1,874 Updated last month
igrek51 / wat
Deep inspection of Python objects
☆1,919 Updated 3 months ago
bodo-run / yek
A fast Rust based tool to serialize text-based files in a repository or directory for LLM consumption
☆2,375 Updated last month
plexe-ai / plexe
✨ Build a machine learning model from a prompt
☆2,275 Updated 3 months ago
NanoNets / docstrange
Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with int…
☆1,059 Updated last month
janbjorge / pgqueuer
PgQueuer is a Python library leveraging PostgreSQL for efficient job queuing.
☆1,403 Updated this week
imanoop7 / Ollama-OCR
☆2,069 Updated 8 months ago
ngafar / llama-scan
Transcribe PDFs with local LLMs
☆734 Updated last month