onesuper / HuggingFace-Datasets-Text-Quality-Analysis
Retrieves parquet files from Hugging Face, then uses pandas to identify and quantify junk data, duplication, contamination, and biased content in a dataset
☆53Updated last year
Alternatives and similar repositories for HuggingFace-Datasets-Text-Quality-Analysis
Users who are interested in HuggingFace-Datasets-Text-Quality-Analysis are comparing it to the repositories listed below
- Light local website for displaying performances from different chat models.☆87Updated last year
- MultilingualShareGPT, the free multi-language corpus for LLM training☆72Updated 2 years ago
- AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark☆144Updated 6 months ago
- A Multi-Turn Dialogue Corpus based on Alpaca Instructions☆171Updated 2 years ago
- ☆128Updated 2 years ago
- ☆34Updated last year
- A curated list on the role of small models in the LLM era☆101Updated 9 months ago
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models☆40Updated last year
- This repository contains the code to train Flan-T5 with Alpaca instructions and low-rank adaptation.☆51Updated 2 years ago
- Chinese large language model evaluation, phase 1☆109Updated last year
- ☆68Updated 2 years ago
- A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.☆85Updated last year
- Unofficial implementation of AlpaGasus☆91Updated last year
- Leveraging passage embeddings for efficient listwise reranking with large language models.☆44Updated 6 months ago
- Imitate OpenAI with Local Models☆87Updated 9 months ago
- ☆31Updated 2 years ago
- [EMNLP 2023 Demo] "CLEVA: Chinese Language Models EVAluation Platform" [ACL 2025 Findings] "C2LEVA: Toward Comprehensive and Contaminatio…☆63Updated last month
- ☆172Updated 2 years ago
- The "GPT-API-Accelerate" project provides a set of Python classes for accelerating the process of generating responses to prompts using t…☆23Updated 8 months ago
- Fast LLM training codebase with dynamic strategy selection [Deepspeed+Megatron+FlashAttention+CudaFusionKernel+Compiler]☆38Updated last year
- 🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation.☆140Updated last month
- A summary of open-source large language models and low-cost replication methods for ChatGPT.☆137Updated 2 years ago
- Code for Scaling Laws of RoPE-based Extrapolation☆73Updated last year
- YuLan-IR: Information Retrieval Boosted LMs☆222Updated last year
- Text deduplication☆72Updated last year
- LogiQA 2.0 dataset: logical reasoning in MRC and NLI tasks☆92Updated last year
- Counting-Stars (★)☆83Updated 3 weeks ago
- Chinese large language model evaluation, phase 2☆70Updated last year
- The multilingual variant of GLM, a general language model trained with autoregressive blank infilling objective☆62Updated 2 years ago
- ⏳ ChatLog: Recording and Analysing ChatGPT Across Time☆99Updated last year