ckorzen/pdf-text-extraction-benchmark

A project for benchmarking and evaluating existing PDF extraction tools on their semantic ability to extract body text from PDF documents, especially scientific articles.

☆73

Alternatives and similar repositories for pdf-text-extraction-benchmark

Users who are interested in pdf-text-extraction-benchmark compare it to the libraries listed below.

allenai / figura11y
AI Assistance for Writing Scientific Alt Text
☆14 · Feb 7, 2024 · Updated 2 years ago
ad-freiburg / pdfact
A basic tool that extracts the structure from the PDF files of scientific articles.
☆77 · Jan 4, 2022 · Updated 4 years ago
softcite / softcite_kb
A Knowledge Base for research software relying on large-scale text mining and curated knowledge sources
☆18 · May 14, 2023 · Updated 3 years ago
kermitt2 / biblio_glutton_harvester
Open Access PDF harvester
☆42 · May 3, 2024 · Updated 2 years ago
copenlu / cite-worth
Data and code for the paper "CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding"
☆14 · Sep 8, 2022 · Updated 3 years ago
RedHatProductSecurity / advisory-parser
A library for parsing security advisories
☆14 · Jul 16, 2026 · Updated last week
softcite / software-mentions
Softcite software mention recognizer, finding mentions and citations to software from within the academic literature
☆85 · Jun 6, 2026 · Updated last month
allenai / vila
Incorporating VIsual LAyout Structures for Scientific Text Classification
☆180 · Mar 18, 2023 · Updated 3 years ago
uranusjr / packaging-metadata-comparisons
Packaging Metadata Comparisons
☆18 · Apr 3, 2020 · Updated 6 years ago
valentinhofmann / flota
☆18 · Feb 1, 2023 · Updated 3 years ago
knmnyn / ParsCit
An open-source CRF Reference String Parsing Package
☆161 · May 6, 2020 · Updated 6 years ago
mjvm / pyrpm
A pure Python RPM reader
☆20 · Apr 11, 2024 · Updated 2 years ago
DataSeer / dataseer-ml
DataSeer machine-learning service
☆28 · Sep 4, 2025 · Updated 10 months ago
fujimotos / TinyFastSS
An index data structure for approximate string search.
☆23 · May 6, 2019 · Updated 7 years ago
singh1114 / theJekyllProject
A Django project to help users to create free, fast and secure blogs on GitHub Pages and Jekyll.
☆20 · Dec 8, 2022 · Updated 3 years ago
allenai / science-parse
Science Parse parses scientific papers (in PDF form) and returns them in structured form.
☆702 · May 26, 2024 · Updated 2 years ago
cburgmer / pdfserver
Pdfserver is a webservice that offers common PDF operations like joining documents, selecting pages or "n pages on one".
☆17 · Mar 11, 2012 · Updated 14 years ago
michaelfaerber / scholarly-entity-usage-detection
Identifying Used Methods and Datasets in Scientific Publications
☆18 · Jan 14, 2021 · Updated 5 years ago
kootenpv / sysdm
Scripts as a service. Builds on systemd (for Linux)
☆21 · Mar 10, 2026 · Updated 4 months ago
lfoppiano / SuperMat
Superconductors material dataset
☆28 · Dec 5, 2023 · Updated 2 years ago
kermitt2 / article_dataset_builder
Open Access PDF harvester, metadata aggregator and full-text ingester
☆62 · May 3, 2024 · Updated 2 years ago
bionlplab / bioc
Data structures and code to read/write BioC XML and JSON.
☆34 · Aug 21, 2023 · Updated 2 years ago
ogrisel / wheelhouse-uploader
Script to help maintain a wheelhouse folder on a cloud storage.
☆33 · Aug 4, 2020 · Updated 5 years ago
lfoppiano / grobid-superconductors
Grobid module for superconductor material and properties extraction
☆23 · May 17, 2025 · Updated last year
uqfoundation / pox
Utilities for filesystem exploration and automated builds
☆21 · Jun 22, 2026 · Updated last month
allenai / scicite
Repository for NAACL 2019 paper on Citation Intent prediction
☆130 · Dec 1, 2019 · Updated 6 years ago
allenai / pybart
Converter from UD-trees to BART representation
☆35 · Mar 6, 2024 · Updated 2 years ago
LaSTUS-TALN-UPF / TSAR-2022-Shared-Task
TSAR2022 Shared Task on Lexical Simplification - Datasets and Evaluation scripts
☆10 · Oct 27, 2022 · Updated 3 years ago
allenai / spv2
Science-parse version 2
☆257 · Nov 20, 2019 · Updated 6 years ago
therkildsen-lab / data-processing
Scripts for going from raw .fastq files to processed and quality-checked .bam files for downstream analysis
☆14 · Nov 23, 2021 · Updated 4 years ago
wernsey / miscsrc
My collection of miscellaneous source code
☆38 · Aug 31, 2025 · Updated 10 months ago
allenai / s2search
The Semantic Scholar Search Reranker
☆112 · Oct 26, 2020 · Updated 5 years ago
omnilib / attribution
Generate changelogs from commit tags and shortlogs
☆28 · Nov 2, 2025 · Updated 8 months ago
allenai / S2APLER
S2APLER: S2 Agglomeration of Papers with Low Error Rate (it's for academic paper clustering)
☆22 · Jul 8, 2026 · Updated 3 weeks ago
cverluise / openPatstat
Load, build and explore Patstat using the Google Cloud Platform
☆10 · Jan 19, 2019 · Updated 7 years ago
nlp-stat-test / nlp-stat-test
The NLPStatTest project
☆12 · Mar 12, 2022 · Updated 4 years ago
UB-Mannheim / FAIR-GPT
Documentation for FAIR GPT, a virtual RDM consultant
☆16 · Oct 10, 2024 · Updated last year
NabaviLab / CNV-Sim
Copy Number Variations (CNV) Simulator
☆11 · Jul 30, 2018 · Updated 7 years ago
buschmo / Simple-German-Corpus
Code to create the dataset from "A New Aligned Simple German Corpus"
☆11 · Jan 8, 2024 · Updated 2 years ago