ckorzen / pdf-text-extraction-benchmark
A project for benchmarking and evaluating existing PDF extraction tools on their semantic ability to extract body text from PDF documents, especially scientific articles.
☆67Updated 4 years ago
Alternatives and similar repositories for pdf-text-extraction-benchmark
Users who are interested in pdf-text-extraction-benchmark compare it to the repositories listed below.
- A Named-Entity Recogniser based on Grobid.☆53Updated 2 weeks ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification☆177Updated 2 years ago
- BERT and ELECTRA models trained on Europeana Newspapers☆38Updated 3 years ago
- spaCy pipeline component for generating spaCy KnowledgeBase Alias Candidates for Entity Linking☆85Updated 2 years ago
- GROBID extension for identifying and normalizing physical quantities.☆82Updated 2 weeks ago
- A basic tool that extracts the structure from the PDF files of scientific articles.☆74Updated 3 years ago
- multimodal document analysis☆164Updated 11 months ago
- Identifying Historical People, Places and other Entities: Shared Task on Named Entity Recognition and Linking on Historical Newspapers at…☆22Updated 10 months ago
- Toolbox for OCR post-correction☆121Updated 5 years ago
- Functional and structural analysis of tables in research papers (Table disentangling)☆20Updated 7 years ago
- Corpus of Open Access articles from multiple fields in Science, Technology, and Medicine.☆73Updated 8 years ago
- PDF to XML ALTO file converter☆240Updated last week
- Mining Legal Arguments in Court Decisions - Data and software☆68Updated 2 years ago
- Source code for the paper "Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models"☆36Updated last year
- A Flexible Deep Learning Approach to Fuzzy String Matching☆145Updated 7 months ago
- The Semantic Scholar Search Reranker☆110Updated 4 years ago
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata☆161Updated 2 years ago
- A Python module for evaluating NERC and NEL system performance as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer).☆14Updated 11 months ago
- Companion code to the paper "Extracting Scientific Figures with Distantly Supervised Neural Networks" 🤖☆141Updated 2 years ago
- An OCR evaluation tool☆66Updated 3 weeks ago
- Framework for information extraction from tables☆41Updated 6 years ago
- A spaCy wrapper of OpenTapioca for named entity linking on Wikidata☆94Updated 2 years ago
- An open information extraction system that provides compact extractions☆91Updated 3 years ago
- METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL)☆53Updated 2 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation☆49Updated 2 years ago
- The official tool for transforming doccano format into common dataset formats.☆107Updated 2 years ago