sileod / tasksource

Dataset collection and preprocessing framework for NLP extreme multitask learning

☆185

Alternatives and similar repositories for tasksource

Users who are interested in tasksource compare it to the repositories listed below.

huggingface / olm-datasets
Pipeline for pulling and processing online language model pretraining data from the web
☆177 Updated 2 years ago
jxmorris12 / bm25_pt
minimal pytorch implementation of bm25 (with sparse tensors)
☆104 Updated last year
facebookresearch / tart
Code and model release for the paper "Task-aware Retrieval with Instructions" by Asai et al.
☆163 Updated last year
allenai / catwalk
This project studies the performance and robustness of language models and task-adaptation methods.
☆150 Updated last year
chaitanyamalaviya / ExpertQA
[Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers
☆131 Updated last year
bminixhofer / zett
Code for Zero-Shot Tokenizer Transfer
☆133 Updated 6 months ago
bigscience-workshop / data_tooling
Tools for managing datasets for governance and training.
☆85 Updated 2 months ago
seonghyeonye / Flipped-Learning
[ICLR 2023] Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners
☆116 Updated last month
bigscience-workshop / lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
☆105 Updated 2 years ago
allenai / peS2o
Pretraining Efficiently on S2ORC!
☆165 Updated 9 months ago
huggingface / that_is_good_data
☆66 Updated last year
facebookresearch / dpr-scale
Scalable training for dense retrieval models.
☆299 Updated last month
shayne-longpre / a-pretrainers-guide
☆72 Updated 2 years ago
zetaalphavector / InPars
Inquisitive Parrots for Search
☆193 Updated 2 months ago
akoksal / LongForm
Reverse Instructions to generate instruction tuning data with corpus examples
☆214 Updated last year
huggingface / llm-swarm
Manage scalable open LLM inference endpoints in Slurm clusters
☆268 Updated last year
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆223 Updated 8 months ago
allenai / bff
☆39 Updated last year
kernelmachine / cbtm
Code repository for the c-BTM paper
☆107 Updated last year
kyleliang919 / Long-context-transformers
Exploring finetuning public checkpoints on filter 8K sequences on Pile
☆116 Updated 2 years ago
Rallio67 / language-model-agents
Experiments with generating open-source language model assistants
☆97 Updated 2 years ago
mega002 / lm-debugger
The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models.
☆178 Updated 3 years ago
McGill-NLP / instruct-qa
Code and Data for "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering"
☆86 Updated 11 months ago
daniel-furman / sft-demos
Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets.
☆77 Updated 9 months ago
schen149 / sub-sentence-encoder
The official code repo for "Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations".
☆83 Updated last year
orhonovich / unnatural-instructions
☆180 Updated 2 years ago
jakespringer / echo-embeddings
☆152 Updated last year
imoneoi / multipack_sampler
Multipack distributed sampler for fast padding-free training of LLMs
☆199 Updated 11 months ago
IBM / fastfit
FastFit ⚡ When LLMs are Unfit, Use FastFit ⚡ Fast and Effective Text Classification with Many Classes
☆210 Updated 2 months ago
google-research-datasets / swim-ir
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la…
☆49 Updated last year