A project that benchmarks and evaluates existing PDF extraction tools on their ability to semantically extract body text from PDF documents, especially scientific articles.
☆69 · Nov 7, 2020 · Updated 5 years ago
Alternatives and similar repositories for pdf-text-extraction-benchmark
Users that are interested in pdf-text-extraction-benchmark are comparing it to the libraries listed below.
- my take at a PDF text extraction utility ☆25 · Jun 15, 2015 · Updated 10 years ago
- The repository of Icecite, a research paper management system. ☆15 · Mar 29, 2018 · Updated 7 years ago
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆76 · Jan 4, 2022 · Updated 4 years ago
- A Knowledge Base for research software relying on large-scale text mining and curated knowledge sources ☆17 · May 14, 2023 · Updated 2 years ago
- Service to scan licenses from source code ☆12 · Aug 14, 2023 · Updated 2 years ago
- Multi-Entity Extraction Framework for Academic Documents (with default extraction tools) ☆31 · Oct 3, 2023 · Updated 2 years ago
- Data and code for the paper "CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding" ☆14 · Sep 8, 2022 · Updated 3 years ago
- A library for parsing security advisories ☆13 · Feb 5, 2026 · Updated last month
- Open Access PDF harvester ☆42 · May 3, 2024 · Updated last year
- Softcite software mention recognizer, finding mentions and citations to software from within the academic literature ☆82 · Sep 30, 2025 · Updated 5 months ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆180 · Mar 18, 2023 · Updated 3 years ago
- Finding mentions and citations to named and implicit research datasets from within the academic literature ☆30 · Jun 14, 2025 · Updated 9 months ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 · Aug 7, 2017 · Updated 8 years ago
- A pure python rpm reader ☆20 · Apr 11, 2024 · Updated last year
- X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents (JCDL 2022) ☆14 · Jul 22, 2022 · Updated 3 years ago
- Project archived. See: https://github.com/plus3it/gravitybee/issues/486#issuecomment-1414501191 ☆18 · Feb 13, 2023 · Updated 3 years ago
- A browser extension providing Open Access bibliographical services ☆18 · Dec 9, 2022 · Updated 3 years ago
- This Python project develops a LDA model which trains on various Wikipedia articles based on a keyword and then suggests Wikipedia articl… ☆10 · Oct 22, 2019 · Updated 6 years ago
- An index data structure for approximate string search. ☆23 · May 6, 2019 · Updated 6 years ago
- An open-source CRF Reference String Parsing Package ☆161 · May 6, 2020 · Updated 5 years ago
- utilities for filesystem exploration and automated builds ☆21 · Jan 24, 2026 · Updated 2 months ago
- A simple toolkit for speaker segmentation and identification ☆31 · Jun 15, 2013 · Updated 12 years ago
- Data structures and code to read/write BioC XML and Json. ☆33 · Aug 21, 2023 · Updated 2 years ago
- Superconductors material dataset ☆27 · Dec 5, 2023 · Updated 2 years ago
- ☆98 · May 20, 2022 · Updated 3 years ago
- Repository for NAACL 2019 paper on Citation Intent prediction ☆129 · Dec 1, 2019 · Updated 6 years ago
- Converter from UD-trees to BART representation ☆35 · Mar 6, 2024 · Updated 2 years ago
- Python bindings for the Unitex/GramLab corpus processor ☆10 · Nov 25, 2022 · Updated 3 years ago
- Science-parse version 2 ☆255 · Nov 20, 2019 · Updated 6 years ago
- Grobid module for superconductor material and properties extraction ☆22 · May 17, 2025 · Updated 10 months ago
- A machine learning software for extracting information from scholarly documents ☆4,727 · Updated this week
- Audit python packages for known vulnerabilities ☆34 · Mar 9, 2022 · Updated 4 years ago
- The Semantic Scholar Search Reranker ☆112 · Oct 26, 2020 · Updated 5 years ago
- A text parser. ☆33 · Feb 27, 2026 · Updated last month
- Building an effective preprocessing tool for African languages ☆13 · Jan 24, 2024 · Updated 2 years ago
- The NLPStatTest project ☆12 · Mar 12, 2022 · Updated 4 years ago
- Used Python, NLTK, NLP techniques to make a search engine that ranks documents based on search keyword, based on TF-IDF weights and cosin… ☆17 · Jul 11, 2017 · Updated 8 years ago
- A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network ☆298 · Sep 28, 2024 · Updated last year
- ☆20 · Apr 22, 2018 · Updated 7 years ago