bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.

☆87

Alternatives and similar repositories for data_tooling

Users who are interested in data_tooling compare it to the repositories listed below.

huggingface / olm-datasets
Pipeline for pulling and processing online language model pretraining data from the web
☆178 Updated 2 years ago
huggingface / olm-training
Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data; it should work with any Hugging Face text dataset.
☆96 Updated 2 years ago
shayne-longpre / a-pretrainers-guide
☆72 Updated 2 years ago
huggingface / that_is_good_data
☆65 Updated 2 years ago
google-research / t5x_retrieval
☆101 Updated 2 years ago
oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆174 Updated 3 weeks ago
sileod / tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
☆189 Updated 4 months ago
CPJKU / wechsel
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
☆85 Updated last year
bigscience-workshop / multilingual-modeling
BLOOM+1: Adapting BLOOM model to support a new unseen language
☆74 Updated last year
EleutherAI / openwebtext2
☆92 Updated 3 years ago
leogao2 / lm_dataformat
☆78 Updated 2 years ago
bigscience-workshop / lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
☆105 Updated 2 years ago
google-research / longt5
☆184 Updated 2 years ago
salesforce / TaiChi
Open source library for few shot NLP
☆78 Updated 2 years ago
bigscience-workshop / metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
☆31 Updated 2 years ago
mbzuai-nlp / bactrian-x
A Multilingual Replicable Instruction-Following Model
☆95 Updated 2 years ago
cisnlp / Glot500
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023
☆106 Updated last year
microsoft / xtreme-distil-transformers
XtremeDistil framework for distilling/compressing massive multilingual neural network models to tiny and efficient models for AI at scale
☆157 Updated last year
facebookresearch / EditEval
An instruction-based benchmark for text improvements.
☆143 Updated 3 years ago
UniversalNER / UniversalNER
☆27 Updated 9 months ago
kwang2049 / easy-elasticsearch
Using business-level retrieval system (BM25) with Python in just a few lines.
☆31 Updated 2 years ago
google-research / metricx
☆118 Updated 11 months ago
yxuansu / Contrastive_Search_Is_What_You_Need
[TMLR'23] Contrastive Search Is What You Need For Neural Text Generation
☆121 Updated 2 years ago
amazon-science / mintaka
Dataset from the paper "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering" (COLING 2022)
☆117 Updated 3 years ago
bloomberg / minilmv2.bb
Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188)
☆61 Updated 2 years ago
martiansideofthemoon / rankgen
Official code and model checkpoints for our EMNLP 2022 paper "RankGen - Improving Text Generation with Large Ranking Models" (https://arx…
☆138 Updated 2 years ago
malteos / llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
☆63 Updated last year
allenai / bff
☆38 Updated last year
terrierteam / pyterrier_colbert
☆87 Updated 8 months ago
cisnlp / GlotLID
💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
☆173 Updated 2 weeks ago