huggingface/that_is_good_data


huggingface / that_is_good_data

☆65

Alternatives and similar repositories for that_is_good_data

Users who are interested in that_is_good_data are comparing it to the repositories listed below.


bigscience-workshop / metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
☆29 · Jun 12, 2023 · Updated 3 years ago
UKPLab / on-emergence
Codes and files for the paper Are Emergent Abilities in Large Language Models just In-Context Learning
☆33 · Jan 9, 2025 · Updated last year
rstodden / TS_annotation_tool
Annotation Tool for Text Simplification Corpora
☆16 · Oct 5, 2023 · Updated 2 years ago
gucci-j / light-transformer-emnlp2021
EMNLP 2021 - Frustratingly Simple Pretraining Alternatives to Masked Language Modeling
☆34 · Nov 21, 2021 · Updated 4 years ago
chorusai / brave
Brave is a simple visualisation library for NLP information extraction, built on top of embedded BRAT.
☆15 · Dec 25, 2019 · Updated 6 years ago
mainlp / awesome-human-label-variation
A curated list of awesome datasets with human label variation (un-aggregated labels) in Natural Language Processing and Computer Vision, …
☆102 · Apr 15, 2024 · Updated 2 years ago
chanzuckerberg / ChemDisGene
Bio relation extraction labeled dataset
☆46 · Apr 15, 2022 · Updated 4 years ago
ptlmasking / maskbert
☆20 · Dec 16, 2020 · Updated 5 years ago
Babelscape / wikineural
Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2…
☆72 · Jan 27, 2023 · Updated 3 years ago
dennlinger / klexikon
Klexikon: A German Dataset for Joint Summarization and Simplification
☆17 · Oct 5, 2022 · Updated 3 years ago
cleanlab / cleanlab-tlm
Python client library for Cleanlab Trustworthy Language Model
☆24 · Dec 9, 2025 · Updated 7 months ago
shoaibahmed / metadata_archaeology
Official code for the paper: "Metadata Archaeology"
☆19 · May 10, 2023 · Updated 3 years ago
r-three / t-few
Code for T-Few from "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning"
☆460 · Sep 6, 2023 · Updated 2 years ago
SamsungSAILMontreal / PAPA
Repository for the PopulAtion Parameter Averaging (PAPA) paper
☆31 · Apr 11, 2024 · Updated 2 years ago
EleutherAI / polyglot-data
data related codebase for polyglot project
☆19 · Mar 30, 2023 · Updated 3 years ago
DFKI-NLP / MobIE
[Konvens21] This repository contains the DFKI MobIE Corpus, a dataset of 3,232 German-language documents that have been annotated with fi…
☆12 · Sep 17, 2024 · Updated last year
explosion / prodigy-evaluate
🔎 A Prodigy plugin for evaluating spaCy pipelines
☆13 · Mar 26, 2024 · Updated 2 years ago
hotchpotch / yasem
YASEM - Yet Another Splade|Sparse Embedder - A simple and efficient library for SPLADE embeddings
☆13 · May 22, 2025 · Updated last year
kabirkhan / recon
Recon NER, Debug and correct annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality …
☆104 · Feb 26, 2024 · Updated 2 years ago
google-research-datasets / WebRED
WebRED is a large and diverse manually annotated dataset for extracting relationships from a variety of text found on the World Wide Web.
☆22 · Mar 11, 2021 · Updated 5 years ago
cyk1337 / Highway-Transformer
[ACL'20] Highway Transformer: A Gated Transformer.
☆33 · Dec 5, 2021 · Updated 4 years ago
adapter-hub / hgiyt
Research code for the paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models"
☆28 · Oct 3, 2021 · Updated 4 years ago
pdufter / staticlama
☆13 · Apr 16, 2021 · Updated 5 years ago
uds-lsv / TOKEN-is-a-MASK
Code for our TSD paper "TOKEN is a MASK: Few-shot Named Entity Recognition with Pre-trained Language Models"
☆14 · Aug 19, 2022 · Updated 3 years ago
aiintelligentsystems / next-level-bert
☆16 · Jun 14, 2024 · Updated 2 years ago
BinWang28 / Sentence-Embedding-S3E
Efficient Sentence Embedding via Semantic Subspace Analysis
☆14 · Feb 25, 2020 · Updated 6 years ago
IDSIA / lmtool-fwp
PyTorch Language Modeling Toolkit for Fast Weight Programmers
☆22 · Jun 11, 2025 · Updated last year
malteos / clp-transfer
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
☆30 · Jan 25, 2023 · Updated 3 years ago
jacobandreas / geca
☆41 · Jan 11, 2021 · Updated 5 years ago
MeLeLBGU / SaGe
Code for SaGe subword tokenizer (EACL 2023)
☆28 · Nov 30, 2024 · Updated last year
hammoudhasan / DiversitySSL
Original code base for On Pretraining Data Diversity for Self-Supervised Learning
☆14 · Dec 30, 2024 · Updated last year
leonweber / pedl
Search the biomedical literature for protein interactions and protein associations
☆11 · Nov 24, 2023 · Updated 2 years ago
epfl-dlab / pairformance
Tool to perform paired evaluation of automatic systems
☆13 · Oct 20, 2021 · Updated 4 years ago
bozheng-hit / VoCapXLM
Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"
☆20 · Nov 12, 2021 · Updated 4 years ago
MaLA-LM / GlotEval
GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way
☆18 · Nov 4, 2025 · Updated 8 months ago
Weyaxi / scrape-open-llm-leaderboard
Scrape and export data from the Open LLM Leaderboard.
☆48 · Dec 17, 2024 · Updated last year
adapter-hub / efficient-task-transfer
Research code for "What to Pre-Train on? Efficient Intermediate Task Selection", EMNLP 2021
☆37 · Dec 21, 2021 · Updated 4 years ago
Pleias / Pleias-Rag
☆17 · Feb 25, 2025 · Updated last year
jdf-prog / LLM-Gen
A simple generate script utils using fastchat conv template for generation of Large Language Models
☆21 · Jun 21, 2023 · Updated 3 years ago