facebookresearch / SemDeDup

Code for "SemDeDup", a simple method for identifying and removing semantic duplicates from a dataset (data pairs which are semantically similar, but not exactly identical).

☆144

Alternatives and similar repositories for SemDeDup

Users who are interested in SemDeDup are comparing it to the repositories listed below.

p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆263 Updated last year
benzakenelad / BitFit
Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
☆142 Updated 3 years ago
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆192 Updated last year
huggingface / OBELICS
Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d…
☆206 Updated last year
thu-coai / PICL
Code for ACL2023 paper: Pre-Training to Learn in Context
☆107 Updated last year
hkust-nlp / llm-compression-intelligence
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆142 Updated last year
IBM / SALMON
Self-Alignment with Principle-Following Reward Models
☆168 Updated last month
microsoft / AdaMix
This is the implementation of the paper AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning (https://arxiv.org/abs/2205.1…
☆135 Updated 2 years ago
CodeCreator / WebOrganizer
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆67 Updated 5 months ago
yegcjs / mixinglaws
☆106 Updated 3 months ago
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆173 Updated 8 months ago
Cohere-Labs-Community / parameter-efficient-moe
☆271 Updated last year
TIGER-AI-Lab / LongICLBench
Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]
☆108 Updated 8 months ago
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆223 Updated 11 months ago
prateeky2806 / ties-merging
☆194 Updated last year
swj0419 / in-context-pretraining
☆54 Updated last year
thunlp / MoEfication
☆140 Updated last year
TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆148 Updated last year
princeton-nlp / HELMET
The HELMET Benchmark
☆178 Updated 2 months ago
facebookresearch / RLCD
Reproduction of "RLCD Reinforcement Learning from Contrast Distillation for Language Model Alignment"
☆69 Updated 2 years ago
princeton-nlp / AutoCompressors
[EMNLP 2023] Adapting Language Models to Compress Long Contexts
☆314 Updated last year
princeton-nlp / CEPE
[ACL 2024] Long-Context Language Modeling with Parallel Encodings
☆165 Updated last year
neelsjain / NEFTune
Official repository of NEFTune: Noisy Embeddings Improve Instruction Finetuning
☆401 Updated last year
kaistAI / CoT-Collection
[EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
☆248 Updated last year
locuslab / scaling_laws_data_filtering
☆65 Updated last year
XiangLi1999 / ContrastiveDecoding
contrastive decoding
☆202 Updated 2 years ago
QingruZhang / PASTA
PASTA: Post-hoc Attention Steering for LLMs
☆125 Updated 11 months ago
sangmichaelxie / doremi
PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆342 Updated last year
rabeehk / compacter
☆130 Updated 3 years ago
lucidrains / CALM-pytorch
Implementation of CALM from the paper "LLM Augmented LLMs: Expanding Capabilities through Composition", out of Google DeepMind
☆177 Updated last year