CASIA-LM/ChineseWebText

CASIA-LM / ChineseWebText

☆186

Alternatives and similar repositories for ChineseWebText

Users who are interested in ChineseWebText are comparing it to the libraries listed below.

CASIA-LM / ChineseWebText-2.0
Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information
☆40 · Dec 2, 2024 · Updated last year
FlagOpen / FlagData
☆364 · Jun 13, 2024 · Updated 2 years ago
adamgallas / SpinalDLA
[FPL'24] This repository contains the source code for the paper “Revealing Untapped DSP Optimization Potentials for FPGA-based Systolic M…
☆22 · May 6, 2024 · Updated 2 years ago
WanyueZhang-ai / spatial-understanding
☆19 · Sep 3, 2025 · Updated 10 months ago
CASIA-LM / MoDS
☆153 · Apr 16, 2024 · Updated 2 years ago
aplmikex / deduplication_mnbvc
Text deduplication
☆77 · May 23, 2024 · Updated 2 years ago
Ultramarine-spec / huggingface_downloader
☆12 · Apr 15, 2024 · Updated 2 years ago
esbatmop / MNBVC
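MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, targeting the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also various niche subcultures and even "Martian script" (火星文) text. The MNBVC dataset includes news, essays, novels, books, magazines…
☆4,246 · Jul 13, 2026 · Updated last week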
Chinese-Tiny-LLM / Chinese-Tiny-LLM
☆237 · May 10, 2024 · Updated 2 years ago
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆599 · Dec 9, 2024 · Updated last year
FudanNLPLAB / CBook-150K
MD5 links for a Chinese book corpus
☆217 · Jan 31, 2024 · Updated 2 years ago
SkyworkAI / Skywork
Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sour…
☆1,496 · Mar 7, 2025 · Updated last year
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆874 · Mar 17, 2025 · Updated last year
GAIR-NLP / DatasetResearch
DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery
☆22 · Sep 24, 2025 · Updated 9 months ago
tianyi-lab / Cherry_LLM
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
☆416 · Jun 25, 2025 · Updated last year
datajuicer / data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
☆6,744 · Updated this week
facebookresearch / cc_net
Tools to download and clean up Common Crawl data
☆1,046 · Apr 25, 2023 · Updated 3 years ago
thu-coai / CritiqueLLM
☆147 · Jul 1, 2024 · Updated 2 years ago
multimodal-art-projection / MAP-NEO
☆985 · Feb 7, 2025 · Updated last year
RUCKBReasoning / GLM-Dialog
☆59 · Aug 1, 2023 · Updated 2 years ago
sangmichaelxie / doremi
PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆357 · Dec 26, 2023 · Updated 2 years ago
OpenMOSS / Say-I-Dont-Know
[ICML'2024] Can AI Assistants Know What They Don't Know?
☆86 · Feb 5, 2024 · Updated 2 years ago
yinzhangyue / EoT
Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication
☆21 · Mar 21, 2024 · Updated 2 years ago
icip-cas / AutoAlign
A toolkit for automated alignment research.
☆15 · Jul 3, 2026 · Updated 2 weeks ago
yegcjs / mixinglaws
☆113 · Jul 15, 2025 · Updated last year
LianjiaTech / BELLE
BELLE: Be Everyone's Large Language model Engine (an open-source Chinese conversational large language model)
☆8,273 · Oct 16, 2024 · Updated last year
CASIA-LM / OpenS2S
OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
☆119 · Mar 28, 2026 · Updated 3 months ago
jeffeuxMartin / meta-learning-hlp
A publishing website of a table collecting meta-learning-related papers in the area of human language processing.
☆17 · Aug 2, 2021 · Updated 4 years ago
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 · Mar 9, 2026 · Updated 4 months ago
RUC-GSAI / Llama-3-SynE
Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continued pre-training to improve …
☆40 · May 31, 2025 · Updated last year
OFA-Sys / InsTag
InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning
☆287 · Aug 20, 2023 · Updated 2 years ago
fxmeng / mixtral_spliter
Converting Mixtral-8x7B to Mixtral-[1~7]x7B
☆22 · Mar 4, 2024 · Updated 2 years ago
PhoebusSi / Alpaca-CoT
We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., lora, p-tunin…
☆2,791 · Dec 12, 2023 · Updated 2 years ago
Azure99 / BlossomData
A fluent, scalable, and easy-to-use LLM data processing framework.
☆28 · Jan 31, 2026 · Updated 5 months ago
Academic-Hammer / HammerLLM
1.4B sLLM for Chinese and English - HammerLLM🔨
☆44 · Apr 7, 2024 · Updated 2 years ago
THUDM / AlignBench
A multi-dimensional Chinese alignment evaluation benchmark for large language models (ACL 2024)
☆430 · Oct 25, 2025 · Updated 8 months ago
chatnoir-eu / web-content-extraction-benchmark
Web Content Extraction Benchmark
☆27 · Dec 16, 2025 · Updated 7 months ago
infly-ai / INF-LLM
The official repo of INF-34B models trained by INF Technology.
☆34 · Jul 25, 2024 · Updated last year
BAAI-Zlab / COIG
☆128 · May 27, 2023 · Updated 3 years ago