allenai/decon



decontamination

☆35

Alternatives and similar repositories for decon

Users who are interested in decon are comparing it to the libraries listed below.


allenai / duplodocus
Tooling for exact and MinHash deduplication of large-scale text datasets
☆91 · Mar 24, 2026 · Updated 4 months ago
allenai / datamap-rs
Data mapping framework for rust stuff
☆56 · Mar 25, 2026 · Updated 3 months ago
cisnlp / multypo
A Multilingual Keyboard Layout-Based Typo Generator
☆17 · Nov 23, 2025 · Updated 8 months ago
sanderland / script_tok
Code for the paper "BPE stays on SCRIPT", "Which Pieces Does Unigram Tokenization Really Need?" and MinGram
☆18 · Jun 26, 2026 · Updated 3 weeks ago
lgessler / microbert
A tiny BERT for low-resource monolingual models
☆32 · Dec 24, 2025 · Updated 7 months ago
GeneZC / MiniMoE
Code for ACL 2023 paper titled "Lifting the Curse of Capacity Gap in Distilling Language Models"
☆29 · Jul 14, 2023 · Updated 3 years ago
AI-Hypercomputer / kithara
☆19 · Feb 18, 2026 · Updated 5 months ago
huggingface / gaia
Hugging Face and Pyserini interoperability
☆20 · May 18, 2023 · Updated 3 years ago
stefan-it / xlm-v-experiments
Experiments for XLM-V Transformers Integration
☆13 · Feb 8, 2023 · Updated 3 years ago
iPieter / llmq
A Scheduler for Batched LLM Inference
☆19 · Oct 5, 2025 · Updated 9 months ago
kensho-technologies / pathpiece
PathPiece tokenizer
☆14 · Nov 10, 2024 · Updated last year
allenai / OLMo-core
PyTorch building blocks for the OLMo ecosystem
☆1,417 · Updated this week
CodeCreator / datatools
Common tools for data processing
☆22 · Dec 8, 2025 · Updated 7 months ago
phykn / python-causality-handbook
[KOR] Causal Inference for the Brave and True. A light-hearted yet rigorous approach to learning about impact estimation and causality.
☆11 · Jun 21, 2026 · Updated last month
taidopurason / tokenizer-extension
☆15 · Dec 4, 2025 · Updated 7 months ago
ivanmontero / autobot
Implementation of the paper 'Sentence Bottleneck Autoencoders from Transformer Language Models'
☆17 · Mar 14, 2022 · Updated 4 years ago
bigganbing / Fairseq_MorphTE
[NeurIPS 2022] MorphTE: Injecting Morphology in Tensorized Embeddings
☆17 · Oct 29, 2022 · Updated 3 years ago
marcotchen / SimpleGPT
[ICML 2026] Improving GPT via a simple normalization strategy
☆15 · May 22, 2026 · Updated 2 months ago
vincentamato / mlx-esm-2
An MLX implementation of Meta AI's ESM-2 protein language model
☆16 · Aug 16, 2025 · Updated 11 months ago
jqueguiner / wav2vec2-sprint
docker for HF wav2vec2-sprint
☆13 · Mar 26, 2021 · Updated 5 years ago
allenai / olmes
Reproducible, flexible LLM evaluations
☆388 · Mar 24, 2026 · Updated 4 months ago
Leukas / CUTE
☆20 · Apr 26, 2026 · Updated 2 months ago
UIC-Liu-Lab / DGA
[EMNLP 2022] Adapting a Language Model While Preserving its General Knowledge
☆21 · Feb 12, 2023 · Updated 3 years ago
huggingface / model_card
☆30 · Sep 27, 2021 · Updated 4 years ago
angelvillar96 / TemplaTorch
Simple template for quick prototyping and standardization of deep learning projects
☆11 · Dec 29, 2023 · Updated 2 years ago
Breakend / SelfDestructingModels
☆14 · Aug 9, 2023 · Updated 2 years ago
NathanGodey / headless-lm
Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…
☆29 · Apr 17, 2024 · Updated 2 years ago
pchizhov / picky_bpe
BPE modification that removes intermediate tokens during tokenizer training.
☆27 · Nov 25, 2024 · Updated last year
DunZhang / Jasper-Token-Compression-Training
The training code for Jasper-Token-Compression-600M
☆20 · Nov 19, 2025 · Updated 8 months ago
cimeister / tokenizer-intrinsic-evals
TokEval: intrinsic quality metrics for tokenizers across natural language, code, and math
☆46 · Jul 4, 2026 · Updated 2 weeks ago
allenai / staged-training
Staged Training for Transformer Language Models
☆33 · Mar 31, 2022 · Updated 4 years ago
yixiaoer / tpux
A set of Python scripts that make your experience on TPU better
☆56 · Sep 18, 2025 · Updated 10 months ago
nateraw / spaces-docker-templates
🚀🤗 A collection of templates for Hugging Face Spaces
☆35 · Oct 9, 2023 · Updated 2 years ago
insait-institute / ritranslation
[ACL'26 Findings] Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
☆20 · Jun 27, 2026 · Updated 3 weeks ago
goodfire-ai / scribe-task-suite
A suite of interpretability tasks to evaluate agents using Scribe for notebook access
☆18 · Oct 2, 2025 · Updated 9 months ago
sander-wood / chord_generation
Reproduction of "Chord Generation from Symbolic Melody Using BLSTM Networks".
☆15 · May 10, 2022 · Updated 4 years ago
LLM-QC / judgezoo
A collection of judges for evaluating LLM model output for safety & toxicity with a standardized API.
☆15 · Jan 7, 2026 · Updated 6 months ago
adapter-hub / hgiyt
Research code for the paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models"
☆28 · Oct 3, 2021 · Updated 4 years ago
hipe-eval / HIPE-2022-data
Data for the HIPE 2022 shared task.
☆23 · May 15, 2026 · Updated 2 months ago