huridocs / pdf-text-extraction
This project extracts text from PDF files using the output of the pdf-document-layout-analysis service. It relies on that service's segmentation and classification of page content to automate text extraction.
☆33 · Updated 3 months ago
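As a rough illustration of the idea described above, the sketch below collects text from classified layout segments. It is not the project's actual code: the segment records, the field names (`page_number`, `type`, `text`), the type labels, and the `extract_text` helper are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical segment records. The field names and type labels are
# assumptions for illustration, not a confirmed output schema of the
# layout analysis service.
SEGMENTS = [
    {"page_number": 1, "type": "Title", "text": "Annual Report"},
    {"page_number": 1, "type": "Text", "text": "First paragraph."},
    {"page_number": 1, "type": "Page footer", "text": "1"},
    {"page_number": 2, "type": "Text", "text": "Second paragraph."},
]


def extract_text(segments, wanted_types):
    """Group the text of segments whose type is in wanted_types, keeping input order."""
    grouped = defaultdict(list)
    for segment in segments:
        if segment["type"] in wanted_types:
            grouped[segment["type"]].append(segment["text"])
    return dict(grouped)


# Keep titles and body text; the page footer is dropped.
result = extract_text(SEGMENTS, {"Title", "Text"})
print(result)
```

Filtering by segment type is what lets a caller keep body text while discarding headers, footers, and similar page furniture.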
Alternatives and similar repositories for pdf-text-extraction
Users who are interested in pdf-text-extraction compare it to the libraries listed below.
- ☆180 · Updated 3 weeks ago
- Simple package to extract text with coordinates from programmatic PDFs ☆121 · Updated last month
- A python library to define and validate data types in Docling. ☆131 · Updated this week
- Create fast graph language models from converted PDF documents for knowledge extraction and Q&A. ☆48 · Updated 3 months ago
- Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset ☆27 · Updated 2 years ago
- ☆113 · Updated 2 weeks ago
- LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing, optimized for cloud deployment. ☆51 · Updated 7 months ago
- Lightweight, performant, deep table extraction ☆459 · Updated 2 weeks ago
- Pipeline for converting PDFs to raw text with PaddleOCR ☆23 · Updated last year
- UniTable: Towards a Unified Table Foundation Model ☆465 · Updated 11 months ago
- Python API for https://vespa.ai, the open big data serving engine ☆122 · Updated this week
- This project aims to extract Table of Contents (TOC) information from PDF files using the outputs generated by the pdf-document-layout-an… ☆17 · Updated 3 months ago
- A Docker-powered service for PDF document layout analysis. This service provides a powerful and flexible PDF analysis service. The servic… ☆538 · Updated this week
- ☆22 · Updated last year
- A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. ☆300 · Updated last month
- Build document-native LLM applications ☆53 · Updated 8 months ago
- Structured information extraction from documents ☆314 · Updated 7 months ago
- [CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation ☆424 · Updated last month
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆339 · Updated 2 years ago
- DocLLM: A layout-aware generative language model for multimodal document understanding ☆125 · Updated last year
- Extract structured text from pdfs quickly ☆474 · Updated 2 months ago
- An enterprise-grade AI retriever designed to streamline AI integration into your applications, ensuring cutting-edge accuracy. ☆286 · Updated 2 weeks ago
- TF-ID: Table/Figure IDentifier for academic papers ☆231 · Updated 10 months ago
- A High-efficiency Open-source Toolkit for Table-to-Latex Task ☆237 · Updated 5 months ago
- Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provi… ☆36 · Updated last month
- Generalist and Lightweight Model for Text Classification ☆124 · Updated last week
- A Python library to chunk/group your texts based on semantic similarity. ☆97 · Updated 10 months ago
- Interact with the Deep Search platform for new knowledge explorations and discoveries ☆198 · Updated 3 months ago
- Code for explaining and evaluating late chunking (chunked pooling) ☆382 · Updated 4 months ago
- YOLOv10 trained on DocLayNet dataset. ☆73 · Updated 6 months ago