titipata/scipdf_parser

titipata / scipdf_parser

Python PDF parser for scientific publications: content and figures

☆455

Alternatives and similar repositories for scipdf_parser

Users who are interested in scipdf_parser are comparing it to the libraries listed below.

grobidOrg / grobid-client-python
Python client for GROBID Web services
☆410 · Mar 5, 2026 · Updated 4 months ago
grobidOrg / grobid
A machine learning software for extracting information from scholarly documents
☆5,010 · Updated this week
allenai / science-parse
Science Parse parses scientific papers (in PDF form) and returns them in structured form.
☆699 · May 26, 2024 · Updated 2 years ago
allenai / pdffigures2
Given a scholarly PDF, extract figures, tables, captions, and section titles.
☆750 · Mar 10, 2024 · Updated 2 years ago
cat-lemonade / PDFDataExtractor
A toolkit for automatically extracting semantic information from PDF files of scientific articles
☆76 · Dec 20, 2023 · Updated 2 years ago
kermitt2 / article_dataset_builder
Open Access PDF harvester, metadata aggregator and full-text ingester
☆62 · May 3, 2024 · Updated 2 years ago
allenai / spv2
Science-parse version 2
☆257 · Nov 20, 2019 · Updated 6 years ago
UCREL / science_parse_py_api
Python API for Science Parse
☆13 · Mar 27, 2021 · Updated 5 years ago
sairin1202 / SciXGen
Dataset and model in the paper "SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation"
☆13 · Feb 14, 2022 · Updated 4 years ago
allenai / papermage
library supporting NLP and CV research on scientific papers
☆800 · Nov 8, 2024 · Updated last year
allenai / s2orc
S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/
☆1,073 · Apr 26, 2024 · Updated 2 years ago
grobidOrg / grobid-ner
A Named-Entity Recogniser based on Grobid.
☆55 · May 14, 2025 · Updated last year
pengyuanli / PDFigCapX
https://doi.org/10.1093/bioinformatics/btz228
☆45 · Nov 19, 2024 · Updated last year
ScienciaLAB / document-qa
Scientific Document Insight Q/A
☆37 · Jun 7, 2026 · Updated last month
paperfetcher / paperfetcher
Pip-installable Python package to automate handsearching and citation searching for systematic reviews.
☆20 · Jul 13, 2024 · Updated 2 years ago
anHALytics / anhalytics-core
Analytic platform for the HAL research archive (in development)
☆12 · Oct 2, 2020 · Updated 5 years ago
papercast-dev / papercast
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GRO…
☆53 · Mar 17, 2025 · Updated last year
kermitt2 / pdfalto
PDF to XML ALTO file converter
☆272 · Updated this week
ScienciaLAB / structure-vision
Viewer for the structure extracted by Grobid on PDF documents
☆57 · Nov 7, 2025 · Updated 8 months ago
DAMO-NLP-SG / CoI-Agent
Official code for paper: Chain of Ideas: Revolutionizing Research via Novel Idea Development with LLM Agents
☆510 · Jan 15, 2025 · Updated last year
kermitt2 / biblio-glutton
A high performance bibliographic information service: https://biblio-glutton.readthedocs.io
☆150 · Apr 8, 2026 · Updated 3 months ago
danielnsilva / semanticscholar
Unofficial Python client library for Semantic Scholar APIs.
☆475 · Jul 3, 2026 · Updated 2 weeks ago
IBM / science-result-extractor
☆100 · May 20, 2022 · Updated 4 years ago
davendw49 / sciparser
PDF parsing toolkit for preparing academic text corpus
☆64 · Jul 12, 2024 · Updated 2 years ago
allenai / vila
Incorporating VIsual LAyout Structures for Scientific Text Classification
☆180 · Mar 18, 2023 · Updated 3 years ago
lfoppiano / material-parsers
Material parsers and other tools, scripts Initially developed for Grobid Superconductor
☆14 · Feb 21, 2025 · Updated last year
cverluise / openPatstat
Load, build and explore Patstat using the Google Cloud Platform
☆10 · Jan 19, 2019 · Updated 7 years ago
scholarly-python-package / scholarly
Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
☆1,877 · Mar 24, 2026 · Updated 3 months ago
Future-House / paper-qa
High accuracy RAG for answering questions from scientific documents with citations
☆8,909 · Updated this week
tanmayb123 / BertPreTraining
☆11 · Nov 10, 2020 · Updated 5 years ago
kermitt2 / datastet
Finding mentions and citations to named and implicit research datasets from within the academic literature
☆31 · Jun 14, 2025 · Updated last year
lfoppiano / SuperMat
Superconductors material dataset
☆28 · Dec 5, 2023 · Updated 2 years ago
allenai / scispacy
A full spaCy pipeline and models for scientific/biomedical documents.
☆1,975 · Dec 4, 2025 · Updated 7 months ago
laurentromary / stdfSpec
Specification of a stand-off element for the TEI guidelines
☆12 · Apr 29, 2021 · Updated 5 years ago
eLifePathways / sciencebeam-parser
A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools…
☆297 · Jul 8, 2026 · Updated 2 weeks ago
moaraio / SS-self-hosting
This is a public repository to enable researchers to begin their journey of self-hosting data from Semantic Scholar.
☆46 · Nov 7, 2024 · Updated last year
jerry3027 / PolyIE
☆17 · Jan 26, 2024 · Updated 2 years ago
quest-bih / oddpub
Algorithm for Open Data Detection in Publications (ODDPub)
☆44 · Updated this week
howisonlab / softcite-dataset
A gold-standard dataset of software mentions in research publications.
☆39 · Jul 27, 2023 · Updated 2 years ago