onesuper / HuggingFace-Datasets-Text-Quality-Analysis

Retrieves Parquet files from Hugging Face, then uses pandas to identify and quantify junk data, duplication, contamination, and biased content in a dataset
50 · Updated last year
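
As a rough illustration of the kind of check described above, here is a minimal pandas sketch that quantifies exact duplicates and junk rows in a text column. It is not taken from the repository: the function name `text_quality_report`, the `text` column name, the `min_chars` threshold, and the junk heuristic are all assumptions made for the example.

```python
import pandas as pd


def text_quality_report(df: pd.DataFrame, text_col: str = "text", min_chars: int = 20) -> dict:
    """Return simple duplication and junk ratios for one text column.

    Hypothetical helper for illustration; thresholds and rules are assumptions.
    """
    # Normalize: treat missing values as empty strings and trim whitespace.
    text = df[text_col].fillna("").astype(str).str.strip()

    # Exact duplicates (case-insensitive): rows whose text already appeared earlier.
    dup_mask = text.str.lower().duplicated(keep="first")

    # Assumed junk heuristic: very short rows, or rows with no ASCII letters at all.
    junk_mask = (text.str.len() < min_chars) | ~text.str.contains(r"[A-Za-z]", regex=True)

    return {
        "rows": len(df),
        "duplicate_ratio": float(dup_mask.mean()),
        "junk_ratio": float(junk_mask.mean()),
    }


# Small in-memory sample standing in for a dataset split.
sample = pd.DataFrame({"text": [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog.",
    "!!!! ####",
    None,
    "A reasonably long and unique sentence for testing.",
]})
report = text_quality_report(sample)
print(report)
```

The same function could be applied to a DataFrame loaded with `pd.read_parquet` from a downloaded Parquet file. Contamination and bias checks would need additional reference data and are not sketched here.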

Related projects

Alternatives and complementary repositories for HuggingFace-Datasets-Text-Quality-Analysis