Tools for managing datasets for governance and training.
☆90Mar 16, 2026Updated this week
Alternatives and similar repositories for data_tooling
Users who are interested in data_tooling are comparing it to the libraries listed below.
- Code used for sourcing and cleaning the BigScience ROOTS corpus☆319Mar 20, 2023Updated 3 years ago
- Generate BERT vocabularies and pretraining examples from Wikipedias☆17May 11, 2020Updated 5 years ago
- Submission for AIviVN sentiment analysis contest https://www.aivivn.com/contests/1☆15Oct 12, 2021Updated 4 years ago
- All-in-one text de-duplication☆745Mar 9, 2026Updated last week
- Code and Data for Evaluation WG☆42May 4, 2022Updated 3 years ago
- Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.☆1,010Jul 29, 2024Updated last year
- Multilingual BERT retrained on news + squad2 for Vietnamese☆24Feb 16, 2020Updated 6 years ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language☆74Mar 2, 2024Updated 2 years ago
- ☆78Dec 7, 2023Updated 2 years ago
- Pre-training script for BART in JAX/Flax☆38Aug 4, 2022Updated 3 years ago
- SemEval-2021 Multilingual and Cross-lingual Word-in-Context Task☆18May 27, 2021Updated 4 years ago
- CMU Linguistic Annotation Backend☆15Sep 22, 2025Updated 5 months ago
- Blue House national petition data archive☆15Aug 29, 2020Updated 5 years ago
- A utility for storing and reading files for Korean LM training 💾☆35Oct 15, 2025Updated 5 months ago
- Pipeline for pulling and processing online language model pretraining data from the web☆178Jul 31, 2023Updated 2 years ago
- Noise-robust de-duplication at scale☆19Apr 9, 2023Updated 2 years ago
- Viewer for text datasets in formats like HuggingFace, JSONL, etc.☆15Feb 25, 2025Updated last year
- A compact audio-to-phoneme aligner for singing voice☆12Jan 17, 2024Updated 2 years ago
- QLoRA with Enhanced Multi GPU Support☆38Aug 8, 2023Updated 2 years ago
- UFSAC is a resource containing all WordNet Sense Annotated Corpora, and a Java library for manipulating them☆38May 17, 2022Updated 3 years ago
- The pipeline for the OSCAR corpus☆176Nov 9, 2025Updated 4 months ago
- NLP Affective Computing - text-based emotion recognition with Deep Learning and LLMs☆16Nov 10, 2025Updated 4 months ago
- ☆12Dec 8, 2022Updated 3 years ago
- ☆13Jan 18, 2020Updated 6 years ago
- Code and data for the paper "Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?"☆26Jun 3, 2025Updated 9 months ago
- ☆1,262Jul 30, 2024Updated last year
- Intro to Machine Learning and Deep Learning for Earth-Life Sciences☆14Jun 29, 2019Updated 6 years ago
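- This project organizes the open-source MOSS SFT data and converts it into the mnbvc multi-turn dialogue format. MOSS-003 covers three aspects (helpfulness, faithfulness, and harmlessness), with 3.53 million samples in total; MOSS-003 contains finer-grained helpfulness category labels, broader harmlessness data, and longer dialogues, with 6.3 million samples in total.☆12Dec 3, 2023Updated 2 years ago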
- Anh - LAION's multilingual assistant datasets and models☆27Apr 5, 2023Updated 2 years ago
- An example of fine-tuning kogpt with oslo.☆23Aug 26, 2022Updated 3 years ago
- ☆95Jul 16, 2022Updated 3 years ago
- This repository contains the data and code created under the project NLP4Rare-cm-uc3m.☆10Sep 14, 2021Updated 4 years ago
- Korean Named Entity Corpus☆25May 12, 2023Updated 2 years ago
- ☆13Aug 29, 2020Updated 5 years ago
- Experiments related to LLMs for Vietnamese (inspired by the Physics of LLMs series)☆11Oct 21, 2024Updated last year
- Translation demonstrator☆38May 12, 2020Updated 5 years ago
- ☆15Dec 12, 2019Updated 6 years ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2☆1,438Mar 20, 2024Updated 2 years ago
- ☆23Jun 27, 2019Updated 6 years ago