facebookresearch / SemDeDup
Code for "SemDeDup", a simple method for identifying and removing semantic duplicates from a dataset (data pairs which are semantically similar, but not exactly identical).
☆136 Updated last year
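To make "semantic duplicates" concrete, here is a minimal sketch of threshold-based deduplication over embeddings. It is not the repository's implementation: the helper name `drop_semantic_duplicates`, the `threshold` value, and the greedy keep-first strategy are all assumptions chosen for illustration, and the embeddings are assumed to be precomputed, non-zero vectors.

```python
import numpy as np

def drop_semantic_duplicates(embeddings, threshold=0.95):
    """Return indices of items to keep (hypothetical helper, not SemDeDup's API)."""
    x = np.asarray(embeddings, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    # Assumes no row is the zero vector.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    kept = []
    for i in range(len(x)):
        # Keep item i only if it is not too similar to any item already kept.
        if not kept or np.max(x[kept] @ x[i]) < threshold:
            kept.append(i)
    return kept

# Toy data: the second vector points almost the same way as the first.
toy = [[1.0, 0.0], [0.999, 0.04], [0.0, 1.0]]
print(drop_semantic_duplicates(toy))  # [0, 2]
```

The greedy pass compares each item against every kept item, so its cost grows quadratically in the worst case; it is meant only to show the idea of removing near-identical pairs, not to scale to large datasets.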
Alternatives and similar repositories for SemDeDup
Users interested in SemDeDup are comparing it to the libraries listed below.
- [ICML 2024] Selecting High-Quality Data for Training Language Models ☆176 Updated last year
- Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models ☆142 Updated 2 years ago
- [ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight) ☆149 Updated 4 months ago
- DSIR large-scale data selection framework for language model training ☆251 Updated last year
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]