Text deduplication
☆78May 23, 2024Updated last year
Alternatives and similar repositories for deduplication_mnbvc
Users that are interested in deduplication_mnbvc are comparing it to the libraries listed below
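- This project aims to quickly detect and convert the encodings of large numbers of text files, to support data cleaning for the mnbvc corpus project☆70Oct 17, 2025Updated 4 months ago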
- ☆185Nov 13, 2023Updated 2 years ago
- Accelerated deployment of BERT models with TensorRT☆10Apr 1, 2022Updated 3 years ago
- ☆11Nov 9, 2022Updated 3 years ago
- Extract Chinese/English QA Data from WikiHow pages.☆16May 21, 2023Updated 2 years ago
- ☆363Jun 13, 2024Updated last year
- ☆15Sep 24, 2023Updated 2 years ago
- Reproduction of the paper "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks"☆16Jun 13, 2021Updated 4 years ago
- ☆70Apr 14, 2023Updated 2 years ago
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models☆48Mar 7, 2024Updated 2 years ago
- ☆27Aug 27, 2025Updated 6 months ago
- Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing"☆23Mar 18, 2025Updated 11 months ago
- An introduction to using docker and docker compose.☆21Sep 4, 2024Updated last year
- Large-scale Pre-training Corpus for Chinese 100G☆997Feb 6, 2026Updated last month
- vLLM Router☆55Mar 11, 2024Updated last year
- Fine-tuning Chinese large language models with QLoRA, including ChatGLM, Chinese-LLaMA-Alpaca, and BELLE☆89Jun 27, 2023Updated 2 years ago
- All-in-one text de-duplication☆744Feb 24, 2026Updated last week
- Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷☆6,001Updated this week
- Firefly Chinese LLaMA-2 large models, with support for incremental pre-training of Baichuan2, Llama2, Llama, Falcon, Qwen, Baichuan, InternLM, Bloom, and other large models☆416Oct 21, 2023Updated 2 years ago
- Chinese instruction datasets for fine-tuning LLMs☆29Apr 12, 2023Updated 2 years ago
- ☆460Jun 9, 2024Updated last year
- Chinese safety prompts for evaluating and improving the safety of LLMs.☆1,132Feb 27, 2024Updated 2 years ago
- A dataset template for guiding chat-models to self-cognition, including information about the model’s identity, capabilities, usage, limi…☆29Sep 4, 2023Updated 2 years ago
- [NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…☆416Jun 25, 2025Updated 8 months ago
- The complete training code of the open-source high-performance Llama model, including the full process from pre-training to RLHF.☆67Mar 27, 2023Updated 2 years ago
- ☆36Sep 6, 2024Updated last year
- A concise guide for new students of Lab 246 at BLCU (Beijing Language and Culture University)☆10May 30, 2022Updated 3 years ago
- ☆22Feb 11, 2026Updated 3 weeks ago
- We propose the Text-to-CQL task and provide the dataset.☆35Jun 26, 2023Updated 2 years ago
- Marxist philosophy: from "Critique of Hegel's Philosophy of Right" to "Capital"☆16Jul 13, 2024Updated last year
- ☆313Apr 6, 2023Updated 2 years ago
- [COLING 2024] CMNEE: A Large-Scale Document-Level Event Extraction Dataset based on Open-Source Chinese Military News☆44Jan 26, 2026Updated last month
- Source code for "A Two-Stream AMR-enhanced Model for Document-level Event Argument Extraction" @ NAACL 2022☆37May 7, 2022Updated 3 years ago
- ccks2021 event extraction competition☆30Jul 21, 2021Updated 4 years ago
- ☆235May 10, 2024Updated last year
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models☆86Dec 6, 2023Updated 2 years ago
- ⭐️ NLP Algorithms with transformers lib. Supporting Text-Classification, Text-Generation, Information-Extraction, Text-Matching, RLHF, SF…☆2,409Sep 29, 2023Updated 2 years ago
- ☆148Apr 16, 2024Updated last year
- Chinese-LLaMA 1&2 and Chinese-Falcon base models; ChatFlow Chinese dialogue model; Chinese OpenLLaMA model; NLP pre-training and instruction fine-tuning datasets☆3,055Apr 14, 2024Updated last year