☆1,258 · Jul 30, 2024 · Updated last year
Alternatives and similar repositories for deduplicate-text-datasets
Users who are interested in deduplicate-text-datasets are comparing it to the libraries listed below.
- All-in-one text de-duplication ☆744 · Feb 24, 2026 · Updated last week
- Tools to download and cleanup Common Crawl data ☆1,039 · Apr 25, 2023 · Updated 2 years ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,915 · Updated this week
- MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW ☆2,874 · Jan 20, 2026 · Updated last month
- Data and tools for generating and inspecting OLMo pre-training data. ☆1,416 · Nov 5, 2025 · Updated 4 months ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆319 · Mar 20, 2023 · Updated 2 years ago
- Pipeline for pulling and processing online language model pretraining data from the web ☆177 · Jul 31, 2023 · Updated 2 years ago
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets ☆226 · Nov 16, 2024 · Updated last year
- The RedPajama-Data repository contains code for preparing large datasets for training large language models. ☆4,923 · Dec 7, 2024 · Updated last year
- A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF) ☆4,738 · Jan 8, 2024 · Updated 2 years ago
- Toolkit for creating, sharing and using natural language prompts. ☆2,998 · Oct 23, 2023 · Updated 2 years ago
- A framework for few-shot evaluation of language models. ☆11,540 · Updated this week
- Accessible large language models via k-bit quantization for PyTorch. ☆7,997 · Feb 26, 2026 · Updated last week
- Unsupervised text tokenizer for Neural Network-based text generation. ☆11,668 · Feb 22, 2026 · Updated last week
- The pipeline for the OSCAR corpus ☆176 · Nov 9, 2025 · Updated 3 months ago
- ☆1,560 · Feb 20, 2026 · Updated 2 weeks ago
- Aligning pretrained language models with instruction data generated by themselves. ☆4,580 · Mar 27, 2023 · Updated 2 years ago
- A modular RL library to fine-tune language models to human preferences ☆2,380 · Mar 1, 2024 · Updated 2 years ago
- ☆565 · Nov 20, 2024 · Updated last year
- DSIR large-scale data selection framework for language model training ☆270 · Apr 7, 2024 · Updated last year
- ☆2,950 · Jan 15, 2026 · Updated last month
- Ongoing research training transformer models at scale ☆15,461 · Updated this week
- A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets. ☆2,095 · Oct 16, 2025 · Updated 4 months ago
- Robust recipes to align language models with human and AI preferences ☆5,510 · Sep 8, 2025 · Updated 5 months ago
- Train transformer language models with reinforcement learning. ☆17,523 · Updated this week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,314 · Feb 20, 2026 · Updated 2 weeks ago
- maximal update parametrization (µP) ☆1,686 · Jul 17, 2024 · Updated last year
- A Unified Library for Parameter-Efficient and Modular Transfer Learning ☆2,801 · Updated this week
- Reproduce results and replicate training of T0 (Multitask Prompted Training Enables Zero-Shot Task Generalization) ☆465 · Nov 5, 2022 · Updated 3 years ago
- Scaling Data-Constrained Language Models ☆342 · Jun 28, 2025 · Updated 8 months ago
- An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries ☆7,395 · Feb 3, 2026 · Updated last month
- ☆1,636 · Apr 27, 2023 · Updated 2 years ago
- Efficient few-shot learning with Sentence Transformers ☆2,688 · Dec 11, 2025 · Updated 2 months ago
- Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" ☆6,489 · Jan 14, 2026 · Updated last month
- Tools for merging pretrained large language models. ☆6,826 · Updated this week
- The implementation of DeBERTa ☆2,197 · Sep 29, 2023 · Updated 2 years ago
- Fast and memory-efficient exact attention ☆22,460 · Updated this week
- Expanding natural instructions ☆1,035 · Dec 11, 2023 · Updated 2 years ago
- NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations ☆786 · May 19, 2024 · Updated last year