ChenghaoMou / text-dedup

All-in-one text de-duplication

☆723

Alternatives and similar repositories for text-dedup

Users who are interested in text-dedup are comparing it to the libraries listed below.

bigscience-workshop / data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆316 Updated 2 years ago
google-research / deduplicate-text-datasets
☆1,250 Updated last year
facebookresearch / contriever
Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning
☆759 Updated 2 years ago
allenai / natural-instructions
Expanding natural instructions
☆1,022 Updated last year
facebookresearch / atlas
Code repository supporting the paper "Atlas: Few-shot Learning with Retrieval Augmented Language Models" (https://arxiv.org/abs/2208.03…
☆548 Updated last year
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,031 Updated 2 years ago
ContextualAI / gritlm
Generative Representational Instruction Tuning
☆675 Updated 3 months ago
declare-lab / instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
☆548 Updated last year
princeton-nlp / ALCE
[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627
☆497 Updated last year
zhilizju / Awesome-instruction-tuning
A curated list of awesome instruction tuning datasets, models, papers and repositories.
☆340 Updated 2 years ago
texttron / tevatron
Tevatron - Unified Document Retrieval Toolkit across Scale, Language, and Modality. Demo in SIGIR 2023, SIGIR 2025.
☆701 Updated 2 weeks ago
Shark-NLP / OpenICL
OpenICL is an open-source framework to facilitate research, development, and prototyping of in-context learning.
☆575 Updated 2 years ago
sangmichaelxie / doremi
PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆342 Updated last year
nelson-liu / lost-in-the-middle
Code and data for "Lost in the Middle: How Language Models Use Long Contexts"
☆360 Updated last year
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆571 Updated 10 months ago
bigscience-workshop / xmtf
Crosslingual Generalization through Multitask Finetuning
☆537 Updated last year
ZigeW / data_management_LLM
Collection of training data management explorations for large language models
☆334 Updated last year
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆223 Updated 11 months ago
huggingface / cosmopedia
☆544 Updated 11 months ago
neelsjain / NEFTune
Official repository of NEFTune: Noisy Embeddings Improves Instruction Finetuning
☆401 Updated last year
hsing-wang / Awesome-LLM-MT
☆247 Updated last year
SinclairCoder / Instruction-Tuning-Papers
Reading list for instruction tuning, a trend that started with Natural-Instructions (ACL 2022), FLAN (ICLR 2022), and T0 (ICLR 2022).
☆770 Updated 2 years ago
facebookresearch / FiD
Fusion-in-Decoder
☆588 Updated 2 years ago
SeanLee97 / AnglE
Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard
☆556 Updated this week
epfLLM / Megatron-LLM
Distributed trainer for LLMs
☆581 Updated last year
google-research / FLAN
☆1,547 Updated 2 months ago
FranxYao / Long-Context-Data-Engineering
Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context"
☆477 Updated last year
facebookresearch / dpr-scale
Scalable training for dense retrieval models.
☆297 Updated 4 months ago
p-lambda / dsir
DSIR: a large-scale data selection framework for language model training
☆261 Updated last year
raunak-agarwal / instruction-datasets
Datasets for Instruction Tuning of Large Language Models
☆257 Updated last year