andreysaf / PDF-SDK-Evaluation-2021
What to look for in a PDF library in 2021
☆27Updated 4 years ago
Alternatives and similar repositories for PDF-SDK-Evaluation-2021:
Users who are interested in PDF-SDK-Evaluation-2021 are comparing it to the libraries listed below
- Convert a corpus of PDF to clean text files on a distributed architecture☆38Updated last year
- This repository provides various Python methods for finding and aggregating synonyms for an individual word or a list of words.☆34Updated 2 years ago
- Wrapper for pdftohtml that tries to extract paragraph structure☆50Updated 6 years ago
- multimodal document analysis☆164Updated 9 months ago
- Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼☆22Updated 2 months ago
- An intelligent OCR tool that detects tables and plain text inside PDFs and produces a CSV file and a TXT file from them☆14Updated 6 years ago
- Framework for information extraction from tables☆41Updated 5 years ago
- ☆14Updated 2 years ago
- 📑 Python Package to reconstruct the original continuous text from PDFs with language models☆32Updated last year
- Annotate entities directly onto a PDF with automatic OCR for scanned PDFs☆59Updated last year
- Scripts to parse arxiv documents for NLP tasks☆17Updated last year
- Indri search implementation on top of Lucene search engine☆34Updated last year
- ☆36Updated 6 years ago
- Pipeline for converting PDFs to raw text with PaddleOCR☆21Updated last year
- Python tools for Tesseract OCR training☆25Updated 2 years ago
- Dense Article Dataset (DAD): A Benchmark Dataset for Document Layout Analysis☆15Updated 3 years ago
- Summarize. is a Streamlit application that performs automatic text summarization using both extractive and abstractive models.☆16Updated 3 years ago
- ReadingBank: A Benchmark Dataset for Reading Order Detection☆104Updated 7 months ago
- `pdfstructure` detects, splits and organizes the document's text content into its natural structure as envisioned by the author.☆103Updated 11 months ago
- Fast Word Segmentation with Triangular Matrix☆81Updated 3 years ago
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML☆63Updated 2 months ago
- Application configuration and scripts for search on https://docs.vespa.ai/☆12Updated last week
- Chrome Extension to highlight text and save marks on any website☆14Updated 2 years ago
- A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents☆23Updated 2 years ago
- Logical structure analysis for visually structured documents☆86Updated 2 years ago
- Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Sear…☆85Updated 3 years ago
- PDF Extraction Toolkit☆41Updated 4 years ago
- Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai☆40Updated 2 years ago
- Implementation of BertGrid : https://arxiv.org/abs/1909.04948☆30Updated 11 months ago
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an…☆22Updated 4 years ago