Text deduplication
☆77May 23, 2024Updated 2 years ago
Alternatives and similar repositories for deduplication_mnbvc
Users that are interested in deduplication_mnbvc are comparing it to the libraries listed below.
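- This project performs fast encoding detection and conversion on large numbers of text files to support data cleaning for the MNBVC corpus project.☆71Oct 17, 2025Updated 7 months ago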
- ☆43Jun 18, 2023Updated 2 years ago
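- MNBVC (Massive Never-ending BT Vast Chinese corpus), an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even Martian script (火星文). It includes news, essays, novels, books, magazines…☆4,211May 23, 2026Updated 3 weeks ago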
- ☆185Nov 13, 2023Updated 2 years ago
- Retrieves parquet files from Hugging Face, identifies and quantifies junky data, duplication, contamination, and biased content in datase…☆53Jul 6, 2023Updated 2 years ago
- ☆366Jun 13, 2024Updated 2 years ago
- ☆70Apr 14, 2023Updated 3 years ago
- ☆15Sep 24, 2023Updated 2 years ago
- BERT model acceleration and deployment with TensorRT☆10Apr 1, 2022Updated 4 years ago
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models☆49Mar 7, 2024Updated 2 years ago
- Extract Chinese/English QA Data from WikiHow pages.☆17May 21, 2023Updated 3 years ago
- Large-scale Pre-training Corpus for Chinese, 100G☆1,012Feb 6, 2026Updated 4 months ago
- ☆29Aug 27, 2025Updated 9 months ago
- Fine-tuning Chinese large language models with QLoRA, including ChatGLM, Chinese-LLaMA-Alpaca, and BELLE☆88Jun 27, 2023Updated 2 years ago
- ☆11Nov 9, 2022Updated 3 years ago
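- This project organizes the open-source MOSS SFT data and converts it into the MNBVC multi-turn dialogue format. MOSS-003 covers three aspects (helpfulness, honesty, and harmlessness), 3.53 million samples in total; MOSS-003 includes finer-grained helpfulness category labels, broader harmlessness data, and more dialogue turns, 6.3 million samples in total,☆13Dec 3, 2023Updated 2 years ago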
- String deduplication package for Go☆19Jan 10, 2024Updated 2 years ago
- vLLM Router☆55Mar 11, 2024Updated 2 years ago
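- Document deduplication that addresses semantically duplicate documents in search engines, using a semantic fingerprint algorithm based on multiple hashing.☆11Aug 17, 2013Updated 12 years ago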
- ☆15Nov 22, 2023Updated 2 years ago
- Find duplicate text files.☆14Jan 14, 2025Updated last year
- Study notes on the full survey book 《大语言模型》 (Large Language Models)☆12Aug 2, 2024Updated last year
- Fine-tune the gpt2-ml Chinese model on your own dataset☆44May 22, 2023Updated 3 years ago
- ☆462Jun 9, 2024Updated 2 years ago
- Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷☆6,525Updated this week
- Pile Deduplication Code☆18May 15, 2023Updated 3 years ago
- RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems☆17May 25, 2020Updated 6 years ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus☆318Mar 20, 2023Updated 3 years ago
- genES-MDA is a generic Python open-source software package to solve inverse problems via the Ensemble Smoother with Multiple Data Assimil…☆12Mar 9, 2026Updated 3 months ago
- MD5 links for a Chinese book corpus☆217Jan 31, 2024Updated 2 years ago
- A GitHub repository associated with paper "Learn to Earn: Enabling Coordination Within a Ride-Hailing Fleet"☆10Jun 22, 2020Updated 5 years ago
- ☆310Apr 6, 2023Updated 3 years ago
- ACL-2022 paper: Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents.☆38Apr 22, 2022Updated 4 years ago
- Find near-duplicate documents using minhashing implemented in Go.☆16Dec 22, 2015Updated 10 years ago
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models☆27May 31, 2025Updated last year
- A manually refined Chinese dialogue dataset and a ChatGLM fine-tuning script☆1,190May 3, 2025Updated last year
- A tool for manually annotating and ranking response data in the RLHF stage of large language models.☆253Aug 1, 2023Updated 2 years ago
- Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.☆18Aug 28, 2023Updated 2 years ago
- Extractive machine reading comprehension based on PyTorch + BERT☆21Dec 8, 2022Updated 3 years ago