allenai/peS2o

Pretraining Efficiently on S2ORC!

☆187

Alternatives and similar repositories for peS2o

Users interested in peS2o often compare it to the repositories listed below.

allenai / aspire
Repo for Aspire - A scientific document similarity model based on matching fine-grained aspects of scientific papers.
☆55 · Aug 20, 2023 · Updated 2 years ago
allenai / scirepeval
SciRepEval benchmark training and evaluation scripts
☆89 · May 5, 2026 · Updated 2 months ago
allenai / s2orc-doc2json
Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON)
☆473 · Apr 11, 2024 · Updated 2 years ago
allenai / s2orc
S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/
☆1,075 · Apr 26, 2024 · Updated 2 years ago
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,528 · Nov 5, 2025 · Updated 8 months ago
Kel-Lu / SciGen
SciGen
☆25 · Aug 10, 2021 · Updated 4 years ago
malteos / scincl
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)
☆79 · Dec 29, 2025 · Updated 6 months ago
IllDepence / unarXive
A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
☆305 · Sep 28, 2024 · Updated last year
allenai / SPECTER2
☆139 · Feb 24, 2026 · Updated 5 months ago
stardust-coder / japanese-lm-med-harness
☆11 · Oct 2, 2024 · Updated last year
kazuki-irie / kv-memory-brain
Official Code Repository for the paper "Key-value memory in the brain"
☆32 · Feb 25, 2025 · Updated last year
WING-NUS / SciAssist
☆20 · Feb 17, 2024 · Updated 2 years ago
zoranmedic / mdcr
Benchmark dataset for the evaluation of scientific article representations on the task of citation recommendation across various scientif…
☆12 · Oct 21, 2022 · Updated 3 years ago
IDSIA / rtrl-elstm
Official repository for the paper "Exploring the Promise and Limits of Real-Time Recurrent Learning" (ICLR 2024)
☆13 · Jun 11, 2025 · Updated last year
hyp1231 / ICLR2023-OpenReviewData
Crawl & visualize ICLR papers and reviews.
☆18 · Nov 5, 2022 · Updated 3 years ago
XueFuzhao / OpenMoE
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
☆1,691 · Mar 8, 2024 · Updated 2 years ago
bdusell / stack-attention
Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns"
☆18 · Mar 15, 2024 · Updated 2 years ago
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆195 · Feb 17, 2025 · Updated last year
UKPLab / naacl2019-does-my-rebuttal-matter
☆28 · Jul 29, 2023 · Updated 2 years ago
allenai / scicite
Repository for NAACL 2019 paper on Citation Intent prediction
☆130 · Dec 1, 2019 · Updated 6 years ago
nreimers / beir-sparta
Re-Implementation of SPARTA model
☆13 · Oct 1, 2021 · Updated 4 years ago
copenlu / scientific-information-change
Code for the paper "Modeling Information Change in Science Communication with Semantically Matched Paraphrases" from EMNLP 2022
☆13 · Oct 20, 2022 · Updated 3 years ago
NousResearch / StripedHyenaTrainer
☆67 · Dec 8, 2023 · Updated 2 years ago
huggingface / datablations
Scaling Data-Constrained Language Models
☆345 · Jun 28, 2025 · Updated last year
microsoft / Efficient-Large-LM-Trainer
☆39 · Jul 25, 2024 · Updated 2 years ago
allenai / specter
SPECTER: Document-level Representation Learning using Citation-informed Transformers
☆587 · Jun 12, 2023 · Updated 3 years ago
Aratako / Task-Vector-Merge-Optimzier
☆16 · Apr 11, 2024 · Updated 2 years ago
limenlp / safer-instruct
This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"
☆17 · Feb 22, 2024 · Updated 2 years ago
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,237 · Updated this week
GAIR-NLP / MathPile
[NeurIPS D&B 2024] Generative AI for Math: MathPile
☆418 · Apr 4, 2025 · Updated last year
huggingface / cosmopedia
☆573 · Nov 20, 2024 · Updated last year
tetsu9923 / SciReviewGen
Official dataset repository for "SciReviewGen: A Large-scale Dataset for Automatic Literature Review Generation."
☆21 · Jun 4, 2023 · Updated 3 years ago
allenai / vila
Incorporating VIsual LAyout Structures for Scientific Text Classification
☆180 · Mar 18, 2023 · Updated 3 years ago
allenai / S2APLER
S2APLER: S2 Agglomeration of Papers with Low Error Rate (it's for academic paper clustering)
☆22 · Jul 8, 2026 · Updated 2 weeks ago
cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆80 · Nov 14, 2024 · Updated last year
tsafavi / cascader
CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction (arXiv 22)
☆13 · Jun 17, 2022 · Updated 4 years ago
EleutherAI / rnngineering
Engineering the state of RNN language models (Mamba, RWKV, etc.)
☆33 · May 25, 2024 · Updated 2 years ago
maximzubkov / fft-scan
Efficient PScan implementation in PyTorch
☆17 · Jan 2, 2024 · Updated 2 years ago
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 · Apr 7, 2024 · Updated 2 years ago