grobidOrg/grobid

grobidOrg / grobid

A machine learning software for extracting information from scholarly documents

☆5,010

Alternatives and similar repositories for grobid

Users who are interested in grobid are comparing it to the libraries listed below.

grobidOrg / grobid-client-python
Python client for GROBID Web services
☆410 · Mar 5, 2026 · Updated 4 months ago
allenai / science-parse
Science Parse parses scientific papers (in PDF form) and returns them in structured form.
☆699 · May 26, 2024 · Updated 2 years ago
kermitt2 / biblio-glutton
A high performance bibliographic information service: https://biblio-glutton.readthedocs.io
☆150 · Apr 8, 2026 · Updated 3 months ago
titipata / scipdf_parser
Python PDF parser for scientific publications: content and figures
☆455 · Mar 21, 2024 · Updated 2 years ago
CeON / CERMINE
Content ExtRactor and MINEr
☆512 · Jun 30, 2022 · Updated 4 years ago
allenai / s2orc-doc2json
Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON)
☆472 · Apr 11, 2024 · Updated 2 years ago
kermitt2 / delft
a Deep Learning Framework for Text https://delft.readthedocs.io/
☆416 · Updated this week
allenai / spv2
Science-parse version 2
☆257 · Nov 20, 2019 · Updated 6 years ago
allenai / s2orc
S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/
☆1,073 · Apr 26, 2024 · Updated 2 years ago
kermitt2 / pdfalto
PDF to XML ALTO file converter
☆272 · Updated this week
kermitt2 / article_dataset_builder
Open Access PDF harvester, metadata aggregator and full-text ingester
☆62 · May 3, 2024 · Updated 2 years ago
kermitt2 / entity-fishing
A machine learning tool for fishing entities
☆268 · Feb 27, 2026 · Updated 4 months ago
allenai / pdffigures2
Given a scholarly PDF, extract figures, tables, captions, and section titles.
☆750 · Mar 10, 2024 · Updated 2 years ago
lfoppiano / grobid-quantities
GROBID extension for identifying and normalizing physical quantities.
☆85 · Apr 8, 2026 · Updated 3 months ago
softcite / software-mentions
Softcite software mention recognizer, finding mentions and citations to software from within the academic literature
☆85 · Jun 6, 2026 · Updated last month
eLifePathways / sciencebeam-parser
A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools…
☆297 · Jul 8, 2026 · Updated 2 weeks ago
facebookresearch / nougat
Implementation of Nougat Neural Optical Understanding for Academic Documents
☆10,046 · Feb 21, 2025 · Updated last year
allenai / scibert
A BERT model for scientific text.
☆1,705 · Feb 22, 2022 · Updated 4 years ago
kermitt2 / grobid-example
Some examples of usage of Grobid in a third party java project.
☆20 · Jun 14, 2023 · Updated 3 years ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆10,575 · Updated this week
Layout-Parser / layout-parser
A Unified Toolkit for Deep Learning Based Document Image Analysis
☆5,764 · Aug 15, 2024 · Updated last year
grobidOrg / grobid-ner
A Named-Entity Recogniser based on Grobid.
☆55 · May 14, 2025 · Updated last year
pdfminer / pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
☆7,002 · Mar 13, 2026 · Updated 4 months ago
allenai / specter
SPECTER: Document-level Representation Learning using Citation-informed Transformers
☆586 · Jun 12, 2023 · Updated 3 years ago
datalab-to / marker
Convert PDF to markdown + JSON quickly with high accuracy
☆37,711 · Updated this week
allenai / scispacy
A full spaCy pipeline and models for scientific/biomedical documents.
☆1,975 · Dec 4, 2025 · Updated 7 months ago
neuml / paperetl
📄 ⚙️ ETL processes for medical and scientific papers
☆697 · Dec 7, 2025 · Updated 7 months ago
Future-House / paper-qa
High accuracy RAG for answering questions from scientific documents with citations
☆8,909 · Updated this week
allenai / papermage
library supporting NLP and CV research on scientific papers
☆800 · Nov 8, 2024 · Updated last year
kermitt2 / biblio_glutton_harvester
Open Access PDF harvester
☆42 · May 3, 2024 · Updated 2 years ago
Unstructured-IO / unstructured
Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean…
☆15,176 · Updated this week
CrossRef / pdfextract
MOVED TO https://gitlab.com/crossref/pdfextract
☆510 · Jul 26, 2017 · Updated 8 years ago
pymupdf / PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
☆10,283 · Updated this week
neuml / txtai
💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows
☆12,741 · Updated this week
WING-NUS / Neural-ParsCit
Neuralized version of the Reference String Parser component of the ParsCit package.
☆81 · May 27, 2022 · Updated 4 years ago
kermitt2 / Pub2TEI
Service for converting and enhancing heterogeneous publisher XML formats into TEI
☆65 · Apr 12, 2026 · Updated 3 months ago
allenai / pawls
Software that makes labeling PDFs easy.
☆433 · May 13, 2024 · Updated 2 years ago
CrossRef / rest-api-doc
Documentation for Crossref's REST API. For questions or suggestions, see https://community.crossref.org/
☆798 · Sep 25, 2024 · Updated last year
allenai / vila
Incorporating VIsual LAyout Structures for Scientific Text Classification
☆180 · Mar 18, 2023 · Updated 3 years ago