Retrieves Parquet files from Hugging Face and uses pandas to identify and quantify junk data, duplication, contamination, and biased content in a dataset
☆53 · Jul 6, 2023 · Updated 2 years ago
Alternatives and similar repositories for HuggingFace-Datasets-Text-Quality-Analysis
Users that are interested in HuggingFace-Datasets-Text-Quality-Analysis are comparing it to the libraries listed below.
- Code for Findings of ACL 2023 paper "Improving Zero-shot Multilingual Neural Machine Translation by Leveraging Cross-lingual Consistency … ☆10 · Jul 18, 2023 · Updated 2 years ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆19 · Jul 20, 2023 · Updated 2 years ago
- ☆27 · Oct 30, 2023 · Updated 2 years ago
- Text deduplication ☆77 · May 23, 2024 · Updated 2 years ago
- A series of BERT and ALBERT model checkpoints trained to reduce gendered correlations in pre-training ☆11 · Oct 22, 2020 · Updated 5 years ago
- ☆26 · Dec 13, 2024 · Updated last year
- Findings of ACL 2021 ☆24 · May 8, 2021 · Updated 5 years ago
- ☆13 · Aug 13, 2023 · Updated 2 years ago
- Study notes on the survey book "Large Language Models" ☆12 · Aug 2, 2024 · Updated last year
- ACL 2023 Dual-Alignment Pre-training for Cross-lingual Sentence Embedding ☆24 · Aug 21, 2024 · Updated last year
- Decoding of the speech envelope from EEG using the VLAAI deep neural network ☆14 · Sep 28, 2022 · Updated 3 years ago
- Corresponding source code for the study "Real-time Synthesis of Imagined Speech Processes from Minimally Invasive Recordings of Neural Ac… ☆11 · Jul 30, 2021 · Updated 4 years ago
- genES-MDA is a generic Python open-source software package to solve inverse problems via the Ensemble Smoother with Multiple Data Assimil… ☆12 · Mar 9, 2026 · Updated 3 months ago
- ☆12 · May 20, 2023 · Updated 3 years ago
- A curated list of full-duplex spoken dialogue models & benchmarks ☆94 · Updated this week
- Code and data for paper "Learning Kernel-Smoothed Machine Translation with Retrieved Examples" ☆24 · Mar 16, 2022 · Updated 4 years ago
- Awesome Entity Alignment is a collection of EA techniques, including papers, codes, and datasets. ☆11 · Oct 27, 2022 · Updated 3 years ago
- ☆12 · Jul 7, 2022 · Updated 3 years ago
- Vocabulary Trimming (VT) is a model compression technique, which reduces a multilingual LM vocabulary to a target language by deleting ir… ☆67 · Oct 25, 2024 · Updated last year
- PASE: Phonologically Anchored Speech Enhancer ☆67 · Apr 9, 2026 · Updated 2 months ago
- ☆15 · Sep 1, 2023 · Updated 2 years ago
- ☆18 · Feb 1, 2026 · Updated 4 months ago
- This repository contains the code used to preprocess the EEG and fMRI data along with the stimulation protocols used to generate the Bimo… ☆20 · Aug 22, 2023 · Updated 2 years ago
- I-SHEEP: Iterative Self-enHancEmEnt Paradigm of LLMs through Self-Instruct and Self-Assessment ☆17 · Jan 16, 2025 · Updated last year
- Fuzzy search modules for searching lists of words in low quality OCR and HTR text. ☆23 · Mar 30, 2026 · Updated 2 months ago
- Code for paper "Nearest Neighbor Knowledge Distillation for Neural Machine Translation" by Zhixian Yang, Renliang Sun, and Xiaojun Wan. T… ☆32 · Jul 16, 2022 · Updated 3 years ago
- This is the official implementation of PGUSE ☆40 · Jun 7, 2025 · Updated last year
- The demo for "Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem". ☆12 · Oct 25, 2021 · Updated 4 years ago
- [ACL 2023] Few-shot Reranking for Multi-hop QA via Language Model Prompting ☆27 · Oct 19, 2025 · Updated 7 months ago
- ☆23 · Jul 15, 2025 · Updated 11 months ago
- A set of tools for headphone correction and binaural synthesis of spatial audio systems on headphones ☆41 · Mar 14, 2026 · Updated 3 months ago
- ☆88 · Feb 24, 2026 · Updated 3 months ago
- AI xiuxian (immortal cultivation) ☆11 · Jul 8, 2025 · Updated 11 months ago
- ☆27 · May 5, 2025 · Updated last year
- ☆17 · Jun 15, 2023 · Updated 3 years ago
- Ant Financial natural language processing competition. ☆10 · Sep 3, 2018 · Updated 7 years ago
- Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking ☆13 · Feb 5, 2023 · Updated 3 years ago
- Code for the ICML 2021 paper "Sharing Less is More: Lifelong Learning in Deep Networks with Selective Layer Transfer" ☆12 · Aug 17, 2021 · Updated 4 years ago
- ☆23 · Aug 7, 2023 · Updated 2 years ago