huridocs / pdf_paragraphs_extraction
☆49Updated 4 months ago
Related projects
Alternatives and complementary repositories for pdf_paragraphs_extraction
- A faster LayoutReader model based on LayoutLMv3 that sorts OCR bboxes into reading order.☆90Updated 5 months ago
- My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model"☆68Updated this week
- DocLLM: A layout-aware generative language model for multimodal document understanding☆112Updated 10 months ago
- Object Detection Model for Scanned Documents☆82Updated last year
- Code and data for "StructLM: Towards Building Generalist Models for Structured Knowledge Grounding" (COLM 2024)☆68Updated 3 weeks ago
- This project is a collection of fine-tuning scripts to help researchers fine-tune Qwen 2 VL on HuggingFace datasets.☆46Updated last month
- A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.☆172Updated 3 months ago
- Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset☆23Updated last year
- [EMNLP 2024] LongRAG: A Dual-perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering☆73Updated this week
- Universal text classifier for generative models☆20Updated 3 months ago
- Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers☆53Updated 5 months ago
- Deploy a light and full OpenAI API for production with vLLM, supporting /v1/embeddings with all embedding models.☆37Updated 3 months ago
- Official repository for paper "TableBench: A Comprehensive and Complex Benchmark for Table Question Answering"☆29Updated 3 weeks ago
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte…☆52Updated last week
- A Python library to chunk/group your texts based on semantic similarity.☆85Updated 3 months ago
- YOLO models trained on DocLayNet - power your Document Intelligence with Layout Analysis☆60Updated last month
- Code, datasets, and checkpoints for the paper "CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval an…☆25Updated last month
- C++ inference wrappers for running blazing fast embedding services on your favourite serverless like AWS Lambda. By Prithivi Da, PRs welc…☆19Updated 8 months ago
- A library to extract the main content from html. Developed for information on LLM and for feeding data into LangChain and LlamaIndex.☆21Updated 5 months ago
- Open Source Text Embedding Models with OpenAI Compatible API☆131Updated 3 months ago
- Official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding"☆126Updated 5 months ago