Retrieves Parquet files from Hugging Face and uses pandas to identify and quantify junk data, duplication, contamination, and biased content in a dataset
☆53Jul 6, 2023Updated 2 years ago
Alternatives and similar repositories for HuggingFace-Datasets-Text-Quality-Analysis
Users who are interested in HuggingFace-Datasets-Text-Quality-Analysis are comparing it to the repositories listed below
- Ongoing research training transformer language models at scale, including: BERT & GPT-2☆19Jul 20, 2023Updated 2 years ago
- ☆27Oct 30, 2023Updated 2 years ago
- Text deduplication☆78May 23, 2024Updated last year
- Multi-Task instruction-tuned LLaMA☆14May 5, 2023Updated 2 years ago
- Code for the paper "REV: Information-Theoretic Evaluation of Free-Text Rationales"☆16Aug 11, 2023Updated 2 years ago
- ☆27Dec 13, 2024Updated last year
- query by humming system☆19Aug 7, 2015Updated 10 years ago
- ☆13Aug 13, 2023Updated 2 years ago
- Study notes on the survey book "Large Language Models"☆13Aug 2, 2024Updated last year
- Corresponding source code for the study "Real-time Synthesis of Imagined Speech Processes from Minimally Invasive Recordings of Neural Ac…☆11Jul 30, 2021Updated 4 years ago
- code and speech demo for speech reconstruction from ECoG recordings☆12May 21, 2025Updated 9 months ago
- ☆11May 20, 2023Updated 2 years ago
- A set of tools for headphone correction and binaural synthesis of spatial audio systems on headphones☆31Mar 14, 2026Updated last week
- code and data for paper "Learning Kernel-Smoothed Machine Translation with Retrieved Examples"☆24Mar 16, 2022Updated 4 years ago
- Awesome Entity Alignment is a collection of EA techniques, including papers, codes, and datasets.☆11Oct 27, 2022Updated 3 years ago
- ☆13Feb 1, 2026Updated last month
- Implementation of Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems☆14Nov 11, 2023Updated 2 years ago
- This is the official implementation of our multi-channel multi-speaker multi-spatial neural audio codec architecture.☆51Mar 17, 2025Updated last year
- Adding random noise to a text dataset, and controlling very accurately the quality of the result☆20Mar 14, 2026Updated last week
- Learning effect regulated object categories☆15Nov 4, 2025Updated 4 months ago
- Chinese whole-word masking (wwm) and n-gram masking based on jieba☆11Jul 25, 2019Updated 6 years ago
- ☆81Feb 24, 2026Updated 3 weeks ago
- Code to implement the model of No.2 in Task 1 of the Auditory EEG Challenge (ICASSP 2024)☆12Jan 29, 2024Updated 2 years ago
- Code for paper "Nearest Neighbor Knowledge Distillation for Neural Machine Translation" by Zhixian Yang, Renliang Sun, and Xiaojun Wan. T…☆32Jul 16, 2022Updated 3 years ago
- This is the official implementation of PGUSE☆35Jun 7, 2025Updated 9 months ago
- Code for the ACL 2022 paper "Contextual Representation Learning beyond Masked Language Modeling"☆33Oct 23, 2022Updated 3 years ago
- [ACL 2023] Few-shot Reranking for Multi-hop QA via Language Model Prompting☆27Oct 19, 2025Updated 5 months ago
- [ICANN 2024 (Oral)] MISS: A Generative Pre-training and Fine-tuning Approach for Med-VQA☆12Aug 8, 2024Updated last year
- AI immortal cultivation (xiuxian)☆11Jul 8, 2025Updated 8 months ago
- Dataset of Chinese classical literature☆20Jun 29, 2023Updated 2 years ago
- ☆13Feb 26, 2023Updated 3 years ago
- Ant Financial natural language processing competition.☆10Sep 3, 2018Updated 7 years ago
- Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking☆13Feb 5, 2023Updated 3 years ago
- Code for the ICML 2021 paper "Sharing Less is More: Lifelong Learning in Deep Networks with Selective Layer Transfer"☆12Aug 17, 2021Updated 4 years ago
- ☆68Dec 30, 2025Updated 2 months ago
- Luotuo Embedding (骆驼嵌入) is a text embedding model developed by 李鲁鲁, 冷子昂, 陈启源, 蒟蒻, and others.☆267Aug 25, 2023Updated 2 years ago
- Rogue your vibe hero like rogue like.☆28Nov 8, 2025Updated 4 months ago
- Official baseline, dataset and evaluation scripts for the ICASSP 2026 URGENT challenge.☆33Nov 12, 2025Updated 4 months ago
- Understanding the correlation between different LLM benchmarks☆29Jan 11, 2024Updated 2 years ago