Text deduplication
☆78 May 23, 2024 Updated last year
Alternatives and similar repositories for deduplication_mnbvc
Users that are interested in deduplication_mnbvc are comparing it to the libraries listed below.
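- This project performs fast encoding detection and conversion on large numbers of text files to support data cleaning for the MNBVC corpus project. ☆70 Oct 17, 2025 Updated 5 months ago
- ☆44 Jun 18, 2023 Updated 2 years ago
- MNBVC (Massive Never-ending BT Vast Chinese corpus), a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also various niche cultures and even "Martian script" (火星文) data. The MNBVC dataset includes news, essays, novels, books, magazines… ☆4,143 Mar 8, 2026 Updated 2 weeks ago
- ☆185 Nov 13, 2023 Updated 2 years ago
- Retrieves parquet files from Hugging Face, identifies and quantifies junky data, duplication, contamination, and biased content in datase… ☆53 Jul 6, 2023 Updated 2 years ago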
- ☆364 Jun 13, 2024 Updated last year
- ☆70 Apr 14, 2023 Updated 2 years ago
- ☆15 Sep 24, 2023 Updated 2 years ago
- BERT model acceleration and deployment with TensorRT ☆10 Apr 1, 2022 Updated 3 years ago
- All-in-one text de-duplication ☆747 Mar 9, 2026 Updated 2 weeks ago
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models ☆49 Mar 7, 2024 Updated 2 years ago
- Extract Chinese/English QA Data from WikiHow pages. ☆16 May 21, 2023 Updated 2 years ago
- Large-scale Pre-training Corpus for Chinese (100G Chinese pre-training corpus) ☆1,002 Feb 6, 2026 Updated last month
- Reproduction of the paper "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks" ☆16 Jun 13, 2021 Updated 4 years ago
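- ☆28 Aug 27, 2025 Updated 7 months ago
- Fine-tuning Chinese large language models with QLoRA, including ChatGLM, Chinese-LLaMA-Alpaca, and BELLE ☆89 Jun 27, 2023 Updated 2 years ago
- ☆11 Nov 9, 2022 Updated 3 years ago
- Firefly Chinese LLaMA-2 large model; supports incremental pre-training of Baichuan2, Llama2, Llama, Falcon, Qwen, Baichuan, InternLM, Bloom, and other large models ☆416 Oct 21, 2023 Updated 2 years ago
- Sohu 2017 competition. We won the third prize. ☆18 Jun 19, 2017 Updated 8 years ago
- This project organizes the open-source MOSS SFT data and converts it into the MNBVC multi-turn dialogue format. MOSS-003 covers three aspects (helpfulness, faithfulness, harmlessness), 3.53M samples in total; MOSS-003 includes finer-grained helpfulness category labels, broader harmlessness data, and more dialogue turns, 6.3M samples in total, ☆12 Dec 3, 2023 Updated 2 years ago
- Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing" ☆23 Mar 18, 2025 Updated last year
- ☆15 Nov 22, 2023 Updated 2 years ago
- Study notes on the full survey book "Large Language Models" (《大语言模型》) ☆13 Aug 2, 2024 Updated last year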
- Fine-tune the gpt2-ml Chinese model on your own dataset ☆44 May 22, 2023 Updated 2 years ago
- ☆462 Jun 9, 2024 Updated last year
- Defeasible Natural Language Inference ☆13 Dec 4, 2020 Updated 5 years ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆318 Mar 20, 2023 Updated 3 years ago
- CommonsenseQA ☆10 Mar 28, 2020 Updated 5 years ago
- MD5 links for a Chinese book corpus ☆217 Jan 31, 2024 Updated 2 years ago
- ☆41 Apr 30, 2025 Updated 10 months ago
- ☆313 Apr 6, 2023 Updated 2 years ago
- A manually refined Chinese dialogue dataset and a piece of ChatGLM fine-tuning code ☆1,194 May 3, 2025 Updated 10 months ago
- A tool for manual annotation and ranking of response data in the RLHF stage of large models. ☆256 Aug 1, 2023 Updated 2 years ago
- This repo shows how to generate PPT files using Python ☆43 Aug 10, 2024 Updated last year
- DSIR large-scale data selection framework for language model training ☆271 Apr 7, 2024 Updated last year
- A purer tokenizer with a higher compression rate ☆488 Nov 27, 2024 Updated last year
- GitHub repo for Peifeng's internship project ☆13 Nov 7, 2023 Updated 2 years ago
- TXT novel reader with read-aloud ☆16 Jan 15, 2026 Updated 2 months ago
- The complete training code of the open-source high-performance Llama model, including the full process from pre-training to RLHF. ☆68 Mar 27, 2023 Updated 3 years ago