Retrieves parquet files from Hugging Face and uses pandas to identify and quantify junk data, duplication, contamination, and biased content in a dataset
☆53Jul 6, 2023Updated 2 years ago
Alternatives and similar repositories for HuggingFace-Datasets-Text-Quality-Analysis
Users that are interested in HuggingFace-Datasets-Text-Quality-Analysis are comparing it to the libraries listed below.
- Code for Findings of ACL 2023 paper "Improving Zero-shot Multilingual Neural Machine Translation by Leveraging Cross-lingual Consistency …☆10Jul 18, 2023Updated 2 years ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2☆19Jul 20, 2023Updated 2 years ago
- ☆27Oct 30, 2023Updated 2 years ago
- Text deduplication☆77May 23, 2024Updated 2 years ago
- Towards Systematic Measurement for Long Text Quality☆38Sep 5, 2024Updated last year
- Multi-Task instruction-tuned LLaMA☆14May 5, 2023Updated 3 years ago
- [DEPRECATED] Symbolic MIDI Music AI implementation☆20Jun 11, 2022Updated 3 years ago
- query by humming system☆19Aug 7, 2015Updated 10 years ago
- ☆13Aug 13, 2023Updated 2 years ago
- Study notes on the complete survey book "Large Language Models"☆12Aug 2, 2024Updated last year
- ACL 2023 Dual-Alignment Pre-training for Cross-lingual Sentence Embedding☆24Aug 21, 2024Updated last year
- [ICLR2026] FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates☆49Apr 13, 2026Updated last month
- A streaming algorithm for community detection in very large networks☆15Mar 8, 2017Updated 9 years ago
- Decoding of the speech envelope from EEG using the VLAAI deep neural network☆14Sep 28, 2022Updated 3 years ago
- Corresponding source code for the study "Real-time Synthesis of Imagined Speech Processes from Minimally Invasive Recordings of Neural Ac…☆11Jul 30, 2021Updated 4 years ago
- genES-MDA is a generic Python open-source software package to solve inverse problems via the Ensemble Smoother with Multiple Data Assimil…☆12Mar 9, 2026Updated 2 months ago
- An implementation for the fast computation and decision of Fréchet distances.☆13Feb 10, 2021Updated 5 years ago
- code and speech demo for speech reconstruction from ECoG recordings☆12May 21, 2025Updated last year
- This is the repository for our WSDM 2020 publication: Interpretable Click-through Rate Prediction through Hierarchical Attention☆40Oct 29, 2019Updated 6 years ago
- ☆12May 20, 2023Updated 3 years ago
- code and data for paper "Learning Kernel-Smoothed Machine Translation with Retrieved Examples"☆24Mar 16, 2022Updated 4 years ago
- ☆12Jul 7, 2022Updated 3 years ago
- Vocabulary Trimming (VT) is a model compression technique, which reduces a multilingual LM vocabulary to a target language by deleting ir…☆67Oct 25, 2024Updated last year
- ☆17Feb 1, 2026Updated 3 months ago
- This repository contains the code used to preprocess the EEG and fMRI data along with the stimulation protocols used to generate the Bimo…☆20Aug 22, 2023Updated 2 years ago
- Chinese whole-word masking (wwm) and n-gram masking based on jieba☆11Jul 25, 2019Updated 6 years ago
- I-SHEEP: Iterative Self-enHancEmEnt Paradigm of LLMs through Self-Instruct and Self-Assessment☆17Jan 16, 2025Updated last year
- Code to implement the model of No.2 in Task 1 of the Auditory EEG Challenge (ICASSP 2024)☆12Jan 29, 2024Updated 2 years ago
- Code for paper "Nearest Neighbor Knowledge Distillation for Neural Machine Translation" by Zhixian Yang, Renliang Sun, and Xiaojun Wan. T…☆32Jul 16, 2022Updated 3 years ago
- A list of advice on doing research that is useful for me :)☆13Aug 17, 2019Updated 6 years ago
- Drift detection module for machine learning pipelines.☆25Jun 21, 2023Updated 2 years ago
- The demo for "Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem".☆12Oct 25, 2021Updated 4 years ago
- [ICANN 2024 (Oral)] MISS: A Generative Pre-training and Fine-tuning Approach for Med-VQA☆12Aug 8, 2024Updated last year
- [ACL 2023] Few-shot Reranking for Multi-hop QA via Language Model Prompting☆27Oct 19, 2025Updated 7 months ago
- ☆23Jul 15, 2025Updated 10 months ago
- A set of tools for headphone correction and binaural synthesis of spatial audio systems on headphones☆39Mar 14, 2026Updated 2 months ago
- ☆88Feb 24, 2026Updated 3 months ago
- AI Xiuxian (AI immortal cultivation)☆11Jul 8, 2025Updated 10 months ago
- Ant Financial natural language processing competition.☆10Sep 3, 2018Updated 7 years ago