huridocs / pdf-table-of-contents-extractor
This project aims to extract Table of Contents (TOC) information from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool, this project automates the process of identifying and structuring the document's TOC.
☆17Updated 3 months ago
Alternatives and similar repositories for pdf-table-of-contents-extractor
Users who are interested in pdf-table-of-contents-extractor compare it to the libraries listed below.
- This project aims to extract text from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging t…☆33Updated 3 months ago
- ☆113Updated 2 weeks ago
- A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents☆23Updated 2 years ago
- An index of PDF-centric corpora☆128Updated last month
- Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset☆27Updated 2 years ago
- Simple package to extract text with coordinates from programmatic PDFs☆121Updated last month
- Examples using the Deep Search functionalities☆78Updated 3 months ago
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis☆339Updated 2 years ago
- Extract tables from PDFs using LLMWhisperer and extract structured information from those tables using Langchain☆38Updated 7 months ago
- Create fast graph language models from converted PDF documents for knowledge extraction and Q&A.☆48Updated 3 months ago
- Ergonomic line-by-line transcription of scanned text.☆51Updated 4 years ago
- ☆13Updated 9 months ago
- LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing, optimized for cloud deployment.☆51Updated 7 months ago
- Build document-native LLM applications☆53Updated 8 months ago
- A basic tool that extracts the structure from the PDF files of scientific articles.☆74Updated 3 years ago
- Document Layout Analysis☆372Updated this week
- PAGE XML format collection for document image page content and more☆67Updated 3 years ago
- ☆64Updated 6 months ago
- Interact with the Deep Search platform for new knowledge explorations and discoveries☆198Updated 3 months ago
- YOLOv10 trained on DocLayNet dataset.☆73Updated 6 months ago
- Object Detection Model for Scanned Documents☆93Updated 2 months ago
- A step-by-step C# implementation of the Docstrum algorithm☆23Updated 4 years ago
- Logical structure analysis for visually structured documents☆89Updated 2 years ago
- A faster LayoutReader model based on LayoutLMv3 that sorts OCR bounding boxes into reading order.☆219Updated 11 months ago
- VerifAI initiative to build open-source easy-to-deploy generative question-answering engine that can reference and verify answers for cor…☆70Updated 2 months ago
- Industry-based resolutions for issues and errata reported against any PDF-related specification☆73Updated last week
- ☆122Updated 2 months ago
- ☆91Updated this week
- Pipeline for converting PDFs to raw text with PaddleOCR☆23Updated last year
- A list of selected resources, methods, and tools dedicated to legal data schemes and ontologies.☆114Updated last year