alanshi/charset_mnbvc

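This project aims to quickly detect and convert the encodings of large numbers of text files, to support data cleaning for the MNBVC corpus project.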

☆70

Alternatives and similar repositories for charset_mnbvc

Users who are interested in charset_mnbvc are comparing it to the libraries listed below.

aplmikex / deduplication_mnbvc
Text deduplication
☆77 · May 23, 2024 · Updated 2 years ago
Mythos-Rudy / mnbvc-fasttext-classification
MNBVC text quality classification using fastText
☆16 · Oct 2, 2023 · Updated 2 years ago
esbatmop / MNBVC
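MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, targeting the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also various niche cultures and even "Martian script" (火星文) data. The MNBVC dataset includes news, essays, novels, books, magazines…
☆4,245 · Jul 13, 2026 · Updated last week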
pany8125 / ShareGPTQAExtractor-mnbvc
MNBVC project: ShareGPT corpus cleaning
☆16 · Oct 4, 2023 · Updated 2 years ago
bicici / FDA
Feature Decay Algorithms
☆11 · Mar 5, 2014 · Updated 12 years ago
luojie1024 / MossQA-mnbvc
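This project organizes the open-source MOSS SFT data and converts it into the mnbvc multi-turn dialogue format. MOSS-003 covers three aspects (usefulness, faithfulness, and harmlessness), with 3.53 million samples in total; MOSS-003 includes finer-grained usefulness category labels, broader harmlessness data, and more dialogue turns, with 6.30 million samples in total,
☆13 · Dec 3, 2023 · Updated 2 years ago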
xiatingyu / SFT-DataSelection-at-scale
☆34 · Feb 9, 2025 · Updated last year
wangguojim / LargeScale
☆19 · May 11, 2024 · Updated 2 years ago
ProjectD-AI / LLaMA-Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆69 · Jul 20, 2023 · Updated 3 years ago
THUDM / FasterTransformer
Transformer-related optimization, including BERT, GPT
☆39 · Feb 10, 2023 · Updated 3 years ago
NielsRogge / tapas_utils
A package containing utils for the PyTorch version of the Tapas algorithm.
☆11 · Apr 29, 2021 · Updated 5 years ago
bigscience-workshop / data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆318 · Mar 20, 2023 · Updated 3 years ago
shibing624 / deep-research
Python implementation of an AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, w…
☆49 · Mar 22, 2025 · Updated last year
Oneflow-Inc / one-glm
A more efficient GLM implementation!
☆54 · Feb 18, 2023 · Updated 3 years ago
zhexiao / office-parser
Parses data from educational informatization systems, such as Word exam questions, Excel test papers, and knowledge points, into JSON content.
☆14 · Mar 3, 2020 · Updated 6 years ago
MonolithFoundation / Bumblebee
A simple MLLM that surpassed QwenVL-Max using only open-source data with a 14B LLM.
☆38 · Sep 9, 2024 · Updated last year
jiangnanboy / llm_corpus_quality
Chinese corpus cleaning and quality evaluation for large-model pre-training
☆80 · Jul 25, 2024 · Updated last year
UKPLab / on-emergence
Code and files for the paper "Are Emergent Abilities in Large Language Models just In-Context Learning?"
☆33 · Jan 9, 2025 · Updated last year
bojone / bytepiece
A purer Tokenizer with a higher compression rate
☆488 · Nov 27, 2024 · Updated last year
asahi417 / lm-vocab-trimmer
Vocabulary Trimming (VT) is a model compression technique, which reduces a multilingual LM vocabulary to a target language by deleting ir…
☆67 · Oct 25, 2024 · Updated last year
OpenNLPLab / ETSC-Exact-Toeplitz-to-SSM-Conversion
[EMNLP 2023] Official implementation of the algorithm ETSC: Exact Toeplitz-to-SSM Conversion, from our EMNLP 2023 paper - Accelerating Toeplitz…
☆14 · Oct 17, 2023 · Updated 2 years ago
ssbuild / llm_rlhf
Reinforcement learning training for LLMs such as GPT-2, LLaMA, BLOOM, and so on
☆27 · Sep 19, 2023 · Updated 2 years ago
hpandana / gradient-accumulation-tf-estimator
Gradient accumulation on tf.estimator
☆12 · Dec 15, 2020 · Updated 5 years ago
onesuper / HuggingFace-Datasets-Text-Quality-Analysis
Retrieves parquet files from Hugging Face, identifies and quantifies junky data, duplication, contamination, and biased content in datase…
☆54 · Jul 6, 2023 · Updated 3 years ago
fandongmeng / DTMT_InDec
Implementation of DTMT with incremental decoding
☆13 · Feb 20, 2021 · Updated 5 years ago
Fu-Dayuan / AgentRefine
(ICLR 2025) AgentRefine: Enhancing Agent Generalization through Refinement Tuning
☆20 · Nov 22, 2025 · Updated 7 months ago
fitphp / dataman
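The data management platform (DataMan) is completely free and open source; anyone can modify the code and deploy the service without restriction. This makes it a good choice for many application platforms that need data management: low cost buys an efficient management solution, supported by a healthy ecosystem.
☆13 · Feb 25, 2022 · Updated 4 years ago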
SivilTaram / code-html-to-markdown
A lightweight script for converting HTML pages to Markdown format, with support for code blocks
☆81 · Apr 14, 2024 · Updated 2 years ago
commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Jun 19, 2026 · Updated last month
EleutherAI / pilev2
☆13 · Jan 20, 2023 · Updated 3 years ago
songmzhang / CBMI
The code of the ACL 2022 paper "Conditional Bilingual Mutual Information based Adaptive Training for Neural Machine Translation".
☆14 · Aug 6, 2022 · Updated 3 years ago
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 · Mar 9, 2026 · Updated 4 months ago
yudiandoris / csi
End-to-End Chinese Speaker Identification
☆11 · Nov 17, 2022 · Updated 3 years ago
YidingYu / To-be-a-researcher
A list of advice on doing research that is useful for me :)
☆13 · Aug 17, 2019 · Updated 6 years ago
trevelyan / saito
Saito --> NEW REPOSITORY -->
☆12 · Dec 31, 2025 · Updated 6 months ago
LeoVogiatzis / GNN_based_NILM
Non-Intrusive Load Monitoring based on Graph Neural Networks and Representation Learning
☆11 · Oct 18, 2022 · Updated 3 years ago
H-TayyarMadabushi / SemEval_2022_Task2-idiomaticity
Data and preprocessing scripts for SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding
☆16 · Feb 3, 2022 · Updated 4 years ago
SFFAI-AIKT / AIKT-Natural_Language_Processing
This repository is a sub-branch of AI Knowledge Tree, mainly focused on Natural Language Processing.
☆27 · Jun 14, 2021 · Updated 5 years ago
multi-swe-bench / MagentLess
☆13 · Jul 31, 2025 · Updated 11 months ago