DS4SD / docling-parse
Simple package to extract text with coordinates from programmatic PDFs
☆21 · Updated last week
Related projects
Alternatives and complementary repositories for docling-parse
- Create fast graph language models from converted PDF documents for knowledge extraction and Q&A. ☆21 · Updated 3 weeks ago
- ☆34 · Updated last week
- Running Docling as an API service ☆13 · Updated last month
- Build document-native LLM applications ☆50 · Updated 2 months ago
- A python library to define and validate data types in Docling. ☆28 · Updated this week
- Examples using the Deep Search functionalities ☆44 · Updated 3 months ago
- Codes and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref… ☆23 · Updated last month
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆59 · Updated last week
- Data preparation code for Amber 7B LLM ☆82 · Updated 6 months ago
- C++ inference wrappers for running blazing fast embedding services on your favourite serverless like AWS Lambda. By Prithivi Da, PRs welc… ☆19 · Updated 8 months ago
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ☆53 · Updated 2 weeks ago
- This project is a collection of fine-tuning scripts to help researchers fine-tune Qwen 2 VL on HuggingFace datasets. ☆47 · Updated last month
- Open source and AI-powered web search engine: local, private, dockerized and supported by a fluffy llama🦙 ☆51 · Updated 3 months ago
- Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provi… ☆30 · Updated 2 months ago
- Using open source LLMs to build synthetic datasets for direct preference optimization ☆40 · Updated 8 months ago
- GLiNER model in a FastAPI microservice. ☆28 · Updated 2 weeks ago
- ☆43 · Updated 3 months ago
- DocLLM: A layout-aware generative language model for multimodal document understanding ☆112 · Updated 10 months ago
- Indox is an advanced search and retrieval technique that efficiently extracts data from diverse document types, including PDFs and HTML, … ☆16 · Updated 2 weeks ago
- A pipeline parallel training script for LLMs. ☆83 · Updated this week
- Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc on task… ☆131 · Updated last month
- Self-host LLMs with vLLM and BentoML ☆72 · Updated last week
- awesome synthetic (text) datasets ☆239 · Updated 2 weeks ago
- Enhancing Translation with RAG-Powered Large Language Models ☆64 · Updated 3 weeks ago
- Pipeline for converting PDFs to raw text with PaddleOCR ☆20 · Updated last year
- A fast and lightweight pure Python library for splitting text into semantically meaningful chunks. ☆175 · Updated 4 months ago
- Small python package to measure OCR quality and other related metrics. ☆20 · Updated 8 months ago
- Implementation of nougat that focuses on processing pdf locally. ☆73 · Updated 6 months ago
- MLX-Embeddings is the best package for running Vision and Language Embedding models locally on your Mac using MLX. ☆76 · Updated 3 weeks ago