ckorzen / pdf-text-extraction-benchmark
A project that benchmarks and evaluates existing PDF extraction tools on their semantic ability to extract body text from PDF documents, especially scientific articles.
☆68Updated 4 years ago
Alternatives and similar repositories for pdf-text-extraction-benchmark
Users who are interested in pdf-text-extraction-benchmark are comparing it to the libraries listed below.
- A Named-Entity Recogniser based on Grobid.☆53Updated last month
- Incorporating VIsual LAyout Structures for Scientific Text Classification☆178Updated 2 years ago
- GROBID extension for identifying and normalizing physical quantities.☆82Updated last week
- Framework for information extraction from tables☆41Updated 6 years ago
- Functional and structural analysis of tables in research papers (Table disentangling)☆20Updated 7 years ago
- Identifying Historical People, Places and other Entities: Shared Task on Named Entity Recognition and Linking on Historical Newspapers at…☆22Updated 10 months ago
- BERT and ELECTRA models trained on Europeana Newspapers☆38Updated 3 years ago
- A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents☆25Updated 2 years ago
- A basic tool that extracts the structure from the PDF files of scientific articles.☆74Updated 3 years ago
- A simple web application for searching Word2Vec embeddings derived from approximately 2,000 law reports published by The Incorporated…☆26Updated 2 years ago
- multimodal document analysis☆164Updated last year
- Get annotation suggestions for the INCEpTION text annotation platform from spaCy, Sentence BERT, scikit-learn and more. Runs as a web-ser…☆46Updated 8 months ago
- ☆32Updated 2 years ago
- spaCy pipeline component for generating spaCy KnowledgeBase Alias Candidates for Entity Linking☆85Updated 2 years ago
- ☆91Updated 3 years ago
- ☆80Updated 3 years ago
- Corpus of Open Access articles from multiple fields in Science, Technology, and Medicine.☆73Updated 8 years ago
- Companion code to the paper "Extracting Scientific Figures with Distantly Supervised Neural Networks" 🤖☆141Updated 3 years ago
- Toolbox for OCR post-correction☆121Updated 5 years ago
- Neuralized version of the Reference String Parser component of the ParsCit package.☆81Updated 3 years ago
- 🚀GUI for training spaCy models☆55Updated 4 years ago
- Open Access PDF harvester☆40Updated last year
- Mining Legal Arguments in Court Decisions - Data and software☆68Updated 2 years ago
- MultiCite code and data. Models are available on Huggingface.☆32Updated 3 years ago
- Service for converting and enhancing heterogeneous publisher XML formats into TEI☆56Updated 9 months ago
- An example of how to use spaCy for extremely large files without running into memory issues☆36Updated 2 years ago
- SciWING is a modern toolkit for scientific document processing from WING-NUS☆63Updated 2 years ago
- Post-processing OCR errors with seq2seq models☆28Updated 4 years ago
- A machine learning software for extracting information from scholarly documents☆23Updated 4 years ago
- Some examples of usage of Grobid in a third party java project.☆19Updated 2 years ago