allenai/duplodocus

Tooling for exact and MinHash deduplication of large-scale text datasets

☆72

Alternatives and similar repositories for duplodocus

Users who are interested in duplodocus are comparing it to the libraries listed below.

allenai / datamap-rs
Data mapping framework for rust stuff
☆46 · Feb 26, 2026 · Updated last week
allenai / dolma3
☆48 · Jan 20, 2026 · Updated last month
PRIME-RL / RL-Compositionality
FROM $f(x)$ AND $g(x)$ TO $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones
☆64 · Jan 26, 2026 · Updated last month
ppriyank / -Online-Soft-Mining-and-Class-Aware-Attention-Pytorch
(Pytorch and Tensorflow) Implementation of Weighted Contrastive Loss (Deep Metric Learning by Online Soft Mining and Class-Aware Attentio…
☆21 · Oct 21, 2019 · Updated 6 years ago
allenai / OLMo-core
PyTorch building blocks for the OLMo ecosystem
☆839 · Updated this week
mipypf / practical-mi-guide
☆37 · Sep 21, 2025 · Updated 5 months ago
huggingface / AIEnergyScore
AI Energy Score: Initiative to establish comparable energy efficiency ratings for AI models.
☆37 · Dec 2, 2025 · Updated 3 months ago
vsTerminus / SnowRunner
Modifications to initial.pak for general improvements
☆15 · Jan 29, 2026 · Updated last month
flowaicom / flow-judge
Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte…
☆84 · Oct 29, 2024 · Updated last year
Faildes / Universal-Model-Merge-Scripter
Creates a CMM script that can be executed directly on Kaggle from an easy merge script
☆14 · Jan 12, 2026 · Updated last month
TParcollet / E2E-SincNet
E2E-SincNet: Toward fully end-to-end speech recognition
☆30 · Feb 1, 2020 · Updated 6 years ago
outscale / osc-bsu-csi-driver
The OSC BSU CSI Driver is a CSI driver for Kubernetes allowing the use of Outscale Block Storage Units (BSU) volumes
☆10 · Updated this week
KoelLabs / ML
Koel Labs innovates open-source speech research, inclusive speech technologies, and real-time pronunciation feedback for language learner…
☆18 · Feb 25, 2026 · Updated last week
kemingy / rabitq
rabitq rust implementation
☆10 · Feb 4, 2026 · Updated last month
wlzhao22 / tsdg
TSDG: An efficient index graph for graph-based nearest neighbor search
☆10 · Jul 14, 2022 · Updated 3 years ago
sheepit / sheepit
A minimalistic deployment software focused on simplicity and clarity.
☆11 · Feb 12, 2022 · Updated 4 years ago
munshkr / Marea.sc
Some kind of TidalCycles implementation for SuperCollider
☆14 · May 29, 2020 · Updated 5 years ago
Minju-nimm / MIT_PJT
A fairy-tale creation service for children, My AI Fairy-Tale
☆11 · Apr 7, 2023 · Updated 2 years ago
Blue-Yonder-OSS / cyclic-boosting
implementation of Cyclic Boosting machine learning algorithms
☆95 · Sep 2, 2024 · Updated last year
artbataev / end2end
Losses and decoders for end-to-end ASR and OCR
☆34 · Oct 30, 2020 · Updated 5 years ago
rugby0823 / bert-predict
simplify the prediction process for a finetuned bert model
☆11 · Jun 19, 2019 · Updated 6 years ago
quant-aq / aeromancy
⚗️ Aeromancy: A framework for performing reproducible AI and ML
☆11 · Jun 5, 2025 · Updated 9 months ago
SharpCoder / rust-kernel
A rust operating system for the ARM V7-A running on a beaglebone black
☆12 · Mar 11, 2021 · Updated 4 years ago
lemonade-sdk / peel
Get aid from local LLMs right in your PowerShell
☆15 · May 2, 2025 · Updated 10 months ago
CyberGrandChallenge / linux-source-3.13.2-cgc
DARPA Cyber Grand Challenge Linux source code
☆17 · Jul 9, 2015 · Updated 10 years ago
G-EDM / G-EDM
G-EDM is a wire and sinker EDM machine for the DIY community with focus on a mostly 3d printed concept
☆22 · Nov 9, 2025 · Updated 3 months ago
allenai / sso
Repository for Skill Set Optimization
☆14 · Jul 26, 2024 · Updated last year
metterian / korean_bert_score
BERT score for text generation
☆12 · Jan 15, 2025 · Updated last year
iwatake2222 / opencv_sample_in_rust
OpenCV Sample Projects in Rust
☆12 · Nov 27, 2021 · Updated 4 years ago
maestro-os / maestro-utils
Utility commands for Maestro operating system
☆14 · Oct 30, 2025 · Updated 4 months ago
Jiaju-Chen / UpliftRec
This is a work about UpliftRec
☆10 · Dec 10, 2024 · Updated last year
taorui-plus / Chinese-ASR-gitbook
E-book on industrial-grade Chinese speech recognition systems
☆13 · Oct 30, 2020 · Updated 5 years ago
JustlyAI / lmss_entity_extractor
Tool to apply Legal Matter Specification Standard (LMSS) to documents
☆12 · Aug 15, 2024 · Updated last year
oksome / Skink
Control the DOM from Python using Websockets
☆12 · Mar 5, 2018 · Updated 7 years ago
liuzl / pullword
Unsupervised Word Discovery
☆10 · Jul 26, 2019 · Updated 6 years ago
camenduru / LGM-ply-to-glb-replicate
☆16 · Feb 18, 2024 · Updated 2 years ago
yoichi1484 / subspace
An implementation of "Subspace Representations for Soft Set Operations and Sentence Similarities" (NAACL 2024)
☆10 · May 31, 2024 · Updated last year
muellerzr / fastai2-Starlette
A Starlette example for deployment in fastai2
☆11 · Dec 18, 2020 · Updated 5 years ago
fabienbaradel / Tensorflow-tutorials
Seminar: intro to deep learning with tensorflow
☆13 · Jun 27, 2017 · Updated 8 years ago