onesuper/HuggingFace-Datasets-Text-Quality-Analysis



Retrieves Parquet files from Hugging Face and uses pandas to identify and quantify junk data, duplication, contamination, and biased content in a dataset

☆54

Alternatives and similar repositories for HuggingFace-Datasets-Text-Quality-Analysis

Users who are interested in HuggingFace-Datasets-Text-Quality-Analysis are comparing it to the libraries listed below.


gpengzhi / CrossConST-MT
Code for Findings of ACL 2023 paper "Improving Zero-shot Multilingual Neural Machine Translation by Leveraging Cross-lingual Consistency …
☆10 · Jul 18, 2023 · Updated 3 years ago
LydiaXiaohongLi / Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆19 · Jul 20, 2023 · Updated 3 years ago
meysam81 / node-importance-LPA
Label propagation algorithm for community detection based on node importance and label influence
☆12 · Feb 15, 2018 · Updated 8 years ago
harsh07bharvada / structures-wiz
An optimised implementation of Data structures & Algorithms like Fenwick Trees, Segment Trees, Stacks, Priority Queues, Linked Lists etc…
☆10 · Aug 9, 2021 · Updated 4 years ago
tianyaqu / guess-your-song
Query-by-humming system
☆19 · Aug 7, 2015 · Updated 10 years ago
facebookresearch / dual-system-for-visual-language-reasoning
GitHub repo for Peifeng's internship project
☆13 · Nov 7, 2023 · Updated 2 years ago
r3kall / FastCommunityDetection
Finding community structure in very large networks, by Clauset-Newman-Moore
☆15 · Apr 12, 2018 · Updated 8 years ago
ptlmasking / maskbert
☆20 · Dec 16, 2020 · Updated 5 years ago
conradlee / network-community-benchmark
For benchmarking community detection algorithms on social networks with metadata
☆17 · Sep 19, 2014 · Updated 11 years ago
jin-woo-lee / nfs-binaural
☆13 · Aug 13, 2023 · Updated 2 years ago
ValeriaTodaro / genES-MDA
genES-MDA is a generic Python open-source software package to solve inverse problems via the Ensemble Smoother with Multiple Data Assimil…
☆12 · Mar 9, 2026 · Updated 4 months ago
ahollocou / scoda
A streaming algorithm for community detection in very large networks
☆15 · Mar 8, 2017 · Updated 9 years ago
chaot4 / frechet_distance
An implementation for the fast computation and decision of Fréchet distances.
☆13 · Feb 10, 2021 · Updated 5 years ago
PanShi2016 / Community_Detection
Baseline Algorithms for Community Detection
☆16 · May 25, 2022 · Updated 4 years ago
BAAI-WuDao / CPM
Introduction to CPM
☆17 · Jun 22, 2021 · Updated 5 years ago
cognitive-systems-lab / closed-loop-seeg-speech-synthesis
Corresponding source code for the study "Real-time Synthesis of Imagined Speech Processes from Minimally Invasive Recordings of Neural Ac…
☆11 · Jul 30, 2021 · Updated 4 years ago
Xiefeng69 / Awesome-Entity-Alignment
Awesome Entity Alignment is a collection of EA techniques, including papers, code, and datasets.
☆11 · Oct 27, 2022 · Updated 3 years ago
MaoXinn / DATTI
☆12 · Jul 7, 2022 · Updated 4 years ago
ichi131 / Direction-based-BiTSE
☆15 · Sep 19, 2024 · Updated last year
multimodal-art-projection / I-SHEEP
I-SHEEP: Iterative Self-enHancEmEnt Paradigm of LLMs through Self-Instruct and Self-Assessment
☆17 · Jan 16, 2025 · Updated last year
LianxinRay / bert_wwm_ngram_masking_of_chinese
Chinese whole-word masking (wwm) and n-gram masking based on jieba
☆11 · Jul 25, 2019 · Updated 7 years ago
FadedCosine / kNN-KD
Code for paper "Nearest Neighbor Knowledge Distillation for Neural Machine Translation" by Zhixian Yang, Renliang Sun, and Xiaojun Wan. T…
☆32 · Jul 16, 2022 · Updated 4 years ago
anton-jeran / MULTI-AUDIODEC
This is the official implementation of our multi-channel multi-speaker multi-spatial neural audio codec architecture.
☆54 · Mar 17, 2025 · Updated last year
FUZHIYI / TACO
Code for the ACL 2022 paper "Contextual Representation Learning beyond Masked Language Modeling"
☆33 · Oct 23, 2022 · Updated 3 years ago
TIMMY-CHAN / MISS
[ICANN 2024 (Oral)] MISS: A Generative Pre-training and Fine-tuning Approach for Med-VQA
☆12 · Aug 8, 2024 · Updated last year
marijnkoolen / fuzzy-search
Fuzzy search modules for searching lists of words in low quality OCR and HTR text.
☆23 · Jun 29, 2026 · Updated 3 weeks ago
uwescience / GossipMap
GossipMap: distributed parallel community detection algorithm
☆22 · Sep 3, 2015 · Updated 10 years ago
shincling / discreteSeparation
The demo for "Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem".
☆12 · Oct 25, 2021 · Updated 4 years ago
kyegomez / MELLE
An open source community implementation of the model MELLE from the paper: "Autoregressive Speech Synthesis without Vector Quantization"
☆16 · Updated this week
LTU-Machine-Learning / Inner_Speech_EEG_fMRI
This repository contains the code used to preprocess the EEG and fMRI data along with the stimulation protocols used to generate the Bimo…
☆20 · Aug 22, 2023 · Updated 2 years ago
mukhal / PromptRank
[ACL 2023] Few-shot Reranking for Multi-hop QA via Language Model Prompting
☆27 · Oct 19, 2025 · Updated 9 months ago
leolle / atec_nlp
Ant Financial natural language processing competition.
☆10 · Sep 3, 2018 · Updated 7 years ago
KoreaMGLEE / Concept-based-curriculum-masking
Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking
☆13 · Feb 5, 2023 · Updated 3 years ago
hollobit / Awesome-GenAITech
Awesome-GenAITech: a curated list of Generative AI Techniques
☆11 · Jul 11, 2023 · Updated 3 years ago
ZhjGo / ai-game
AI xiuxian (immortal cultivation)
☆11 · Jul 8, 2025 · Updated last year
ledmaster / unified-embeddings
Implementation of Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems
☆15 · Nov 11, 2023 · Updated 2 years ago
Lifelong-ML / LASEM
Code for the ICML 2021 paper "Sharing Less is More: Lifelong Learning in Deep Networks with Selective Layer Transfer"
☆12 · Aug 17, 2021 · Updated 4 years ago
euaurora / HappyShares-CNST-HEU
☆17 · Jun 15, 2023 · Updated 3 years ago
LC1332 / Luotuo-Text-Embedding
Luotuo Embedding (骆驼嵌入) is a text embedding model developed by 李鲁鲁, 冷子昂, 陈启源, 蒟蒻, and others.
☆265 · Aug 25, 2023 · Updated 2 years ago