r-three / common-pile

Code for collecting, processing, and preparing datasets for the Common Pile

☆180

Alternatives and similar repositories for common-pile

Users who are interested in common-pile are comparing it to the repositories listed below.

jxmorris12 / bm25_pt
Minimal PyTorch implementation of BM25 (with sparse tensors)
☆102 Updated last year
mungg / FABLES
☆57 Updated 9 months ago
MeLeLBGU / SaGe
Code for SaGe subword tokenizer (EACL 2023)
☆25 Updated 7 months ago
huggingface / fineweb-2
☆160 Updated 2 weeks ago
liujch1998 / infini-gram
☆54 Updated last month
jxmorris12 / cde
code for training & evaluating Contextual Document Embedding models
☆194 Updated 2 months ago
Pleias / marginalia
☆67 Updated last year
sileod / tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
☆184 Updated last week
marzenakrp / nocha
☆52 Updated 8 months ago
Knowledgator / FlashDeBERTa
Truly flash implementation of the DeBERTa disentangled attention mechanism.
☆62 Updated 2 months ago
Aleph-Alpha-Research / trigrams
☆56 Updated 2 months ago
Pleias / Pleias-RAG-Library
Python library to use Pleias-RAG models
☆58 Updated 2 months ago
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆88 Updated 9 months ago
Data-Provenance-Initiative / Data-Provenance-Collection
☆239 Updated 3 months ago
bminixhofer / zett
Code for Zero-Shot Tokenizer Transfer
☆133 Updated 6 months ago
Pleias / OCRoscope
Small python package to measure OCR quality and other related metrics.
☆24 Updated last year
kevinwu23 / StanfordFineTuneBench
☆30 Updated 8 months ago
Hannibal046 / nanoColBERT
Simple replication of [ColBERT-v1](https://arxiv.org/abs/2004.12832).
☆80 Updated last year
Pleias / Quest-Best-Tokens
An introduction to LLM Sampling
☆79 Updated 7 months ago
AnswerDotAI / ModernBERT-Instruct-mini-cookbook
☆48 Updated 5 months ago
MinishLab / tokenlearn
Pre-train Static Word Embeddings
☆84 Updated last month
allenai / hybrid-preferences
Learning to route instances for Human vs AI Feedback (ACL 2025 Main)
☆23 Updated 2 months ago
cohere-ai / magikarp
Code for the paper "Fishing for Magikarp"
☆157 Updated 2 months ago
LAION-AI / AIW
Alice in Wonderland code base for experiments and raw experiments data
☆131 Updated 3 weeks ago
allenai / infinigram-api
☆69 Updated last month
pchizhov / picky_bpe
BPE modification that removes intermediate tokens during tokenizer training.
☆24 Updated 7 months ago
allenai / peS2o
Pretraining Efficiently on S2ORC!
☆164 Updated 8 months ago
davanstrien / haiku-dpo
Using open source LLMs to build synthetic datasets for direct preference optimization
☆65 Updated last year
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆160 Updated last year
jenna-russell / human_detectors
human_detectors hosts the data released from the paper "People who frequently use ChatGPT for writing tasks are accurate and robust detec…
☆36 Updated 2 months ago