mattbierbaum/arxiv-public-datasets

mattbierbaum / arxiv-public-datasets

A set of scripts to grab public datasets from resources related to arXiv

☆479

Alternatives and similar repositories for arxiv-public-datasets

Users interested in arxiv-public-datasets are comparing it to the repositories listed below.

sdmhans / arxiv_dataset_extraction
A simple script for extracting plain text from arxiv dataset: https://www.kaggle.com/Cornell-University/arxiv
☆15 · Dec 7, 2020 · Updated 5 years ago
lukasschwab / arxiv.py
Python wrapper for the arXiv API
☆1,538 · Jul 10, 2026 · Updated 2 weeks ago
armancohan / arxiv-tools
Tools to bulk download arxiv data
☆134 · Oct 29, 2018 · Updated 7 years ago
kermitt2 / arxiv_harvester
Poor man's simple harvester for arXiv resources
☆14 · Jul 14, 2023 · Updated 3 years ago
allenai / s2orc
S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/
☆1,075 · Apr 26, 2024 · Updated 2 years ago
webis-de / scidata22-stereo-scientific-text-reuse
☆11 · Dec 2, 2024 · Updated last year
nveldt / HypergraphHomophily
☆12 · Sep 17, 2022 · Updated 3 years ago
junipertcy / Network-Code-Repository
A centralized repository for network analysis tools.
☆11 · Dec 16, 2019 · Updated 6 years ago
rmovva / LLM-publication-patterns-public
[NAACL 2024] Topics, Authors, and Institutions in Large Language Model Research: Trends from 17K arXiv Papers https://arxiv.org/abs/2307.…
☆17 · Jan 27, 2024 · Updated 2 years ago
allenai / grobid
A machine learning software for extracting information from scholarly documents
☆23 · Jun 15, 2026 · Updated last month
allenai / science-parse
Science Parse parses scientific papers (in PDF form) and returns them in structured form.
☆702 · May 26, 2024 · Updated 2 years ago
IllDepence / unarXive
A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
☆305 · Sep 28, 2024 · Updated last year
kermitt2 / biblio_glutton_harvester
Open Access PDF harvester
☆42 · May 3, 2024 · Updated 2 years ago
grobidOrg / grobid
A machine learning software for extracting information from scholarly documents
☆5,022 · Updated this week
hamanlp / hama-py
🦛 Python library for Korean text processing. Python Korean Morphological Analyzer
☆19 · Feb 4, 2025 · Updated last year
tingyaohsu / SciCap
SciCap Dataset
☆59 · Nov 5, 2021 · Updated 4 years ago
maxdotio / mighty-batch
Highly concurrent and fast content processing for Mighty Inference Server
☆10 · Feb 6, 2023 · Updated 3 years ago
allenai / s2orc-doc2json
Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON)
☆473 · Apr 11, 2024 · Updated 2 years ago
WING-NUS / scisumm-corpus
Scientific Document Summarization Corpus and Annotations from the WING NUS group.
☆215 · May 22, 2023 · Updated 3 years ago
arXiv / zzzArchived_arxiv-fulltext
arXiv plain text extraction
☆42 · Dec 8, 2022 · Updated 3 years ago
cverluise / openPatstat
Load, build and explore Patstat using the Google Cloud Platform
☆10 · Jan 19, 2019 · Updated 7 years ago
dennishnf / project-arxiv-citation-network
This project involved the analysis of the ArXiv citation network.
☆15 · Jan 29, 2022 · Updated 4 years ago
lordgrilo / ComplexNetworks2021-tda-crash-courses
☆19 · Nov 30, 2021 · Updated 4 years ago
kermitt2 / datastet
Finding mentions and citations to named and implicit research datasets from within the academic literature
☆31 · Jun 14, 2025 · Updated last year
anHALytics / anhalytics-core
Analytic platform for the HAL research archive (in development)
☆12 · Oct 2, 2020 · Updated 5 years ago
jacobeisenstein / jos-gender-2014
Software for the paper "Gender and Lexical Variation in Social Media" with David Bamman and Tyler Schnoebelen
☆17 · Nov 10, 2015 · Updated 10 years ago
com3dian / Grobidmonkey
The grobidmonkey package is an open-source package designed for postprocessing GROBID outputs.
☆12 · Mar 27, 2024 · Updated 2 years ago
greenelab / crossref
Download metadata for all DOIs using the Crossref API
☆66 · Sep 25, 2018 · Updated 7 years ago
laurentromary / stdfSpec
Specification of a stand-off element for the TEI guidelines
☆12 · Apr 29, 2021 · Updated 5 years ago
robjsliwa / pyprolog
Prolog implemented in Python
☆12 · Sep 6, 2024 · Updated last year
levguy / talksumm
TalkSumm - Scientific Paper Summarization Based on Conference Talks
☆43 · Aug 24, 2021 · Updated 4 years ago
castorini / pyserini
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
☆2,102 · Jul 16, 2026 · Updated last week
jtonglet / Numerical-Hybrid-QA-Literature
A list of Numerical Multimodal reasoning papers and their implementation
☆11 · May 13, 2024 · Updated 2 years ago
sairin1202 / SciXGen
Dataset and model in the paper "SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation"
☆13 · Feb 14, 2022 · Updated 4 years ago
acrule / janus
Jupyter Notebook Extension that assists with notebook cleaning
☆20 · Mar 14, 2018 · Updated 8 years ago
allenai / vila
Incorporating VIsual LAyout Structures for Scientific Text Classification
☆180 · Mar 18, 2023 · Updated 3 years ago
allenai / ForeCite
☆35 · Sep 16, 2022 · Updated 3 years ago
allenai / scibert
A BERT model for scientific text.
☆1,705 · Feb 22, 2022 · Updated 4 years ago
istex-archives / istex-browser-extension
ISTEX button: a web extension that dynamically inserts, on the web page being viewed, a link to the full text of a document if the latt…
☆11 · May 30, 2023 · Updated 3 years ago