A project for benchmarking and evaluating existing PDF extraction tools on their ability to semantically extract body text from PDF documents, especially scientific articles.
☆69 Nov 7, 2020 Updated 5 years ago
Alternatives and similar repositories for pdf-text-extraction-benchmark
Users who are interested in pdf-text-extraction-benchmark are comparing it to the libraries listed below
- The repository of Icecite, a research paper management system. ☆15 Mar 29, 2018 Updated 7 years ago
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆76 Jan 4, 2022 Updated 4 years ago
- A Knowledge Base for research software relying on large-scale text mining and curated knowledge sources ☆17 May 14, 2023 Updated 2 years ago
- my take at a PDF text extraction utility ☆25 Jun 15, 2015 Updated 10 years ago
- PDF Extraction Toolkit ☆42 Nov 23, 2020 Updated 5 years ago
- Open Access PDF harvester ☆42 May 3, 2024 Updated last year
- Reference implementation of algorithms for reinforcement learning and Markov decision processes. ☆12 Jan 28, 2021 Updated 5 years ago
- Material parsers and other tools and scripts, initially developed for Grobid Superconductor ☆13 Feb 21, 2025 Updated last year
- Softcite software mention recognizer, finding mentions and citations to software from within the academic literature ☆82 Sep 30, 2025 Updated 5 months ago
- Data and code for the paper "CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding" ☆14 Sep 8, 2022 Updated 3 years ago
- ☆13 Jun 2, 2022 Updated 3 years ago
- utilities for filesystem exploration and automated builds ☆21 Jan 24, 2026 Updated last month
- S2APLER: S2 Agglomeration of Papers with Low Error Rate (it's for academic paper clustering) ☆21 Nov 4, 2025 Updated 4 months ago
- Packaging Metadata Comparisons ☆18 Apr 3, 2020 Updated 5 years ago
- Python wrapper for xpdf ☆19 Nov 28, 2019 Updated 6 years ago
- A browser extension providing Open Access bibliographical services ☆18 Dec 9, 2022 Updated 3 years ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 Aug 7, 2017 Updated 8 years ago
- Science Parse parses scientific papers (in PDF form) and returns them in structured form. ☆696 May 26, 2024 Updated last year
- ☆17 Feb 1, 2023 Updated 3 years ago
- DeltaCode: compare two codebase scans (from ScanCode) to detect significant changes. ☆22 Sep 3, 2024 Updated last year
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆179 Mar 18, 2023 Updated 2 years ago
- Grobid module for superconductor material and properties extraction ☆22 May 17, 2025 Updated 9 months ago
- An introduction to global assessment techniques using Python ☆12 Apr 24, 2023 Updated 2 years ago
- A Test Collection of Computer Science Papers for Faceted Query by Example ☆22 Nov 28, 2021 Updated 4 years ago
- Repository for NAACL 2019 paper on Citation Intent prediction ☆129 Dec 1, 2019 Updated 6 years ago
- Superconductors material dataset ☆26 Dec 5, 2023 Updated 2 years ago
- COMM 4940: Governing Human-Algorithm Behavior ☆24 Updated this week
- SciGen ☆24 Aug 10, 2021 Updated 4 years ago
- Convert between Tesseract hOCR and ALTO XML using XSL stylesheets ☆59 Sep 25, 2025 Updated 5 months ago
- Scripts as a service. Builds on systemd (for Linux) ☆21 Jun 18, 2023 Updated 2 years ago
- Code accompanying NCA paper titled "Attribute-based Regularization of Latent Spaces for Variational Auto-Encoders" ☆27 May 26, 2021 Updated 4 years ago
- Open Access PDF harvester, metadata aggregator and full-text ingester ☆62 May 3, 2024 Updated last year
- Code for ECIR 2022 paper Local Citation Recommendation with Hierarchical-Attention Text Encoder and SciBERT-based Reranking ☆25 Jul 30, 2024 Updated last year
- A machine learning software for extracting information from scholarly documents ☆4,670 Updated this week
- ☆1,040 Jul 9, 2025 Updated 7 months ago
- An online rss reader written in clojure & javascript & java. ☆148 May 13, 2013 Updated 12 years ago
- Maxwell Forbes & Yejin Choi — ACL 2017 ☆25 Dec 23, 2021 Updated 4 years ago
- An open-source CRF Reference String Parsing Package ☆161 May 6, 2020 Updated 5 years ago
- a large scientific paraphrase dataset for longer paraphrase generation ☆39 Oct 17, 2022 Updated 3 years ago