huridocs / pdf-reading-order
☆12 Updated 11 months ago
Alternatives and similar repositories for pdf-reading-order:
Users who are interested in pdf-reading-order are comparing it to the libraries listed below.
- Small python package to measure OCR quality and other related metrics. ☆21 Updated last year
- ☆13 Updated 2 months ago
- This repository is part of an NLP course for humanities and cultural studies. This course uses historical newspapers as a source and appl… ☆15 Updated last week
- Segmenting a given document using recursive xy-cut algorithm. ☆12 Updated 6 years ago
- A collection of building blocks for building fine-tunable metric learning models ☆32 Updated 2 months ago
- Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and… ☆19 Updated last week
- Large-scale query-focused multi-document Summarization dataset ☆10 Updated 3 years ago
- ☆16 Updated 3 years ago
- Index of URLs to pdf files all over the internet and scripts ☆23 Updated last year
- Lightweight Non-Parametric Embedding Fine-Tuning ☆23 Updated 6 months ago
- ☆21 Updated 2 months ago
- Demo example of consumer goods categorization ☆26 Updated last year
- No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval ☆29 Updated 2 years ago
- Tokenization across languages. Useful as preprocessing for subword tokenization. ☆22 Updated 2 years ago
- DocAI helps developers quickly build document, image and text processing pipelines using open source and cloud-based machine learning mod… ☆20 Updated 2 years ago
- ☆20 Updated 9 months ago
- CTE: Contextualized Table Extraction Dataset ☆17 Updated 2 years ago
- Search through Facebook Research's PyTorch BigGraph Wikidata-dataset with the Weaviate vector search engine ☆31 Updated 3 years ago
- Official repository of the paper: "A Comprehensive Gold Standard and Benchmark for Comics Text Detection and Recognition" ☆26 Updated last year
- ☆24 Updated 3 years ago
- KITE (Knowledge-Intensive Task Evaluation) is an end-to-end benchmark for RAG pipelines ☆14 Updated 7 months ago
- Ranking of fine-tuned HF models as base models. ☆35 Updated last year
- PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training" ☆23 Updated last week
- Efficient few-shot learning with cross-encoders. ☆50 Updated last year
- NLG Best Practices for Data-Efficient Modeling: How to Train Production-Ready Models with Little Data ☆10 Updated 3 years ago
- ☆43 Updated last month
- ☆14 Updated 5 months ago
- Seed Machine Translation Data ☆31 Updated 4 months ago
- ☆22 Updated last year
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip… ☆67 Updated last week