Text deduplication
☆78 May 23, 2024 Updated last year
Alternatives and similar repositories for deduplication_mnbvc
Users who are interested in deduplication_mnbvc are comparing it to the libraries listed below.
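- This project performs fast encoding detection and conversion on large numbers of text files to support data cleaning for the MNBVC corpus project ☆70 Oct 17, 2025 Updated 5 months ago
- MNBVC text quality classification using fastText ☆16 Oct 2, 2023 Updated 2 years ago
- ☆44 Jun 18, 2023 Updated 2 years ago
- MNBVC (Massive Never-ending BT Vast Chinese corpus), an ultra-large-scale Chinese corpus benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche cultures and even Martian script (火星文). It includes news, essays, novels, books, magazines… ☆4,159 Apr 6, 2026 Updated last week
- ☆12 Apr 10, 2023 Updated 3 years ago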
- ☆185 Nov 13, 2023 Updated 2 years ago
- ☆365 Jun 13, 2024 Updated last year
- ☆70 Apr 14, 2023 Updated 3 years ago
- ☆15 Sep 24, 2023 Updated 2 years ago
- BERT model acceleration and deployment with TensorRT ☆10 Apr 1, 2022 Updated 4 years ago
- All-in-one text de-duplication ☆750 Mar 9, 2026 Updated last month
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models ☆49 Mar 7, 2024 Updated 2 years ago
- Extract Chinese/English QA Data from WikiHow pages. ☆16 May 21, 2023 Updated 2 years ago
- ☆28 Aug 27, 2025 Updated 7 months ago
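- Fine-tuning Chinese large language models with QLoRA, including ChatGLM, Chinese-LLaMA-Alpaca, and BELLE ☆89 Jun 27, 2023 Updated 2 years ago
- ☆11 Nov 9, 2022 Updated 3 years ago
- Sohu 2017 competition. We won the third prize. ☆18 Jun 19, 2017 Updated 8 years ago
- Firefly Chinese LLaMA-2 large model; supports continued pre-training of large models such as Baichuan2, Llama2, Llama, Falcon, Qwen, Baichuan, InternLM, and Bloom ☆416 Oct 21, 2023 Updated 2 years ago
- This project organizes the open-source MOSS SFT data and converts it into the MNBVC multi-turn dialogue format. MOSS-003 covers three aspects (helpfulness, faithfulness, and harmlessness) with 3.53 million samples in total; MOSS-003 includes finer-grained helpfulness category labels, broader harmlessness data, and more dialogue turns, with 6.3 million samples in total, ☆12 Dec 3, 2023 Updated 2 years ago
- vLLM Router ☆55 Mar 11, 2024 Updated 2 years ago
- ☆15 Nov 22, 2023 Updated 2 years ago
- Study notes on the complete survey book 《大语言模型》 (Large Language Models) ☆12 Aug 2, 2024 Updated last year
- Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing" ☆24 Mar 18, 2025 Updated last year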
- ☆462 Jun 9, 2024 Updated last year
- Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷 ☆6,281 Updated this week
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆317 Mar 20, 2023 Updated 3 years ago
- CommonsenseQA ☆10 Mar 28, 2020 Updated 6 years ago
- MD5 links for a Chinese book corpus ☆217 Jan 31, 2024 Updated 2 years ago
- ☆310 Apr 6, 2023 Updated 3 years ago
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models ☆26 May 31, 2025 Updated 10 months ago
- A manually curated Chinese dialogue dataset and a piece of ChatGLM fine-tuning code ☆1,193 May 3, 2025 Updated 11 months ago
- A tool for manually annotating and ranking response data in the RLHF stage of large models. ☆256 Aug 1, 2023 Updated 2 years ago
- Extractive machine reading comprehension based on PyTorch + BERT ☆21 Dec 8, 2022 Updated 3 years ago
- DSIR large-scale data selection framework for language model training ☆272 Apr 7, 2024 Updated 2 years ago
- This repo shows how to generate PPT files using Python ☆43 Aug 10, 2024 Updated last year
- TXT novel reader with read-aloud ☆16 Jan 15, 2026 Updated 3 months ago
- Github repo for Peifeng's internship project ☆13 Nov 7, 2023 Updated 2 years ago
- The complete training code of the open-source high-performance Llama model, including the full process from pre-training to RLHF. ☆68 Mar 27, 2023 Updated 3 years ago
- A framework for cleaning Chinese dialog data ☆273 May 14, 2021 Updated 4 years ago