Text deduplication
☆78May 23, 2024Updated 2 years ago
Alternatives and similar repositories for deduplication_mnbvc
Users that are interested in deduplication_mnbvc are comparing it to the libraries listed below.
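- This project aims to perform fast encoding detection and conversion on large numbers of text files, supporting the data cleaning work of the mnbvc corpus project☆71Oct 17, 2025Updated 7 months ago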
- This repo performs mnbvc text quality classification using fastText☆16Oct 2, 2023Updated 2 years ago
- ☆44Jun 18, 2023Updated 2 years ago
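- MNBVC (Massive Never-ending BT Vast Chinese corpus), a very-large-scale Chinese corpus. Benchmarked against the 40T of data used to train chatGPT. The MNBVC dataset includes not only mainstream culture but also data from various niche cultures and even Martian script (火星文). The MNBVC dataset includes news, essays, novels, books, magazines…☆4,199May 5, 2026Updated 3 weeks ago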
- ☆185Nov 13, 2023Updated 2 years ago
- Retrieves parquet files from Hugging Face, identifies and quantifies junky data, duplication, contamination, and biased content in datase…☆53Jul 6, 2023Updated 2 years ago
- ☆364Jun 13, 2024Updated last year
- JAVA API client for AUTOMATIC1111/stable-diffusion-webui☆18Nov 16, 2023Updated 2 years ago
- ☆70Apr 14, 2023Updated 3 years ago
- Accelerated deployment of Bert models with TensorRT☆10Apr 1, 2022Updated 4 years ago
- All-in-one text de-duplication☆759Mar 9, 2026Updated 2 months ago
- Extract Chinese/English QA Data from WikiHow pages.☆16May 21, 2023Updated 3 years ago
- Large-scale Pre-training Corpus for Chinese 100G☆1,007Feb 6, 2026Updated 3 months ago
- A simple implementation of stable-diffusion-webui API calls☆24Apr 18, 2023Updated 3 years ago
- Reproduction of the paper "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks"☆16Jun 13, 2021Updated 4 years ago
- ☆28Aug 27, 2025Updated 8 months ago
- Fine-tuning Chinese large language models with qlora, including ChatGLM, Chinese-LLaMA-Alpaca, and BELLE☆88Jun 27, 2023Updated 2 years ago
- ☆11Nov 9, 2022Updated 3 years ago
- Sohu 2017 competition. We won the third prize.☆18Jun 19, 2017Updated 8 years ago
- Firefly Chinese LLaMA-2 large model, supporting incremental pre-training of Baichuan2, Llama2, Llama, Falcon, Qwen, Baichuan, InternLM, Bloom, and other large models☆414Oct 21, 2023Updated 2 years ago
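- This project mainly organizes the open-source MOSS SFT data and converts it into the mnbvc multi-turn dialogue format. MOSS-003 covers three dimensions (usefulness, faithfulness, and harmlessness), with 3.53 million samples in total; MOSS-003 includes finer-grained usefulness category labels, broader harmlessness data, and more dialogue turns, with 6.3 million samples in total,☆13Dec 3, 2023Updated 2 years ago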
- vLLM Router☆55Mar 11, 2024Updated 2 years ago
- ☆15Nov 22, 2023Updated 2 years ago
- Fine-tune the gpt2-ml Chinese model on your own dataset☆44May 22, 2023Updated 3 years ago
- Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing"☆24Mar 18, 2025Updated last year
- ☆462Jun 9, 2024Updated last year
- Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷☆6,427Updated this week
- Pile Deduplication Code☆18May 15, 2023Updated 3 years ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus☆318Mar 20, 2023Updated 3 years ago
- CommonsenseQA☆10Mar 28, 2020Updated 6 years ago
- Introductory tutorial on recommender systems, including fundamentals and corresponding runnable examples☆11Jan 9, 2024Updated 2 years ago
- genES-MDA is a generic Python open-source software package to solve inverse problems via the Ensemble Smoother with Multiple Data Assimil…☆12Mar 9, 2026Updated 2 months ago
- MD5 links for a Chinese book corpus☆217Jan 31, 2024Updated 2 years ago
- ☆41Apr 30, 2025Updated last year
- A GitHub repository associated with paper "Learn to Earn: Enabling Coordination Within a Ride-Hailing Fleet"☆10Jun 22, 2020Updated 5 years ago
- ☆310Apr 6, 2023Updated 3 years ago
- ACL-2022 paper: Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents.☆38Apr 22, 2022Updated 4 years ago
- A manually refined Chinese dialogue dataset and a piece of chatglm fine-tuning code☆1,191May 3, 2025Updated last year
- A tool for manually annotating and ranking response data in the RLHF stage of large models.☆255Aug 1, 2023Updated 2 years ago