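This project aims to quickly detect and convert the encodings of large numbers of text files, to support data cleaning for the mnbvc corpus project.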
☆70Oct 17, 2025Updated 6 months ago
Alternatives and similar repositories for charset_mnbvc
Users that are interested in charset_mnbvc are comparing it to the libraries listed below.
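- Text deduplication☆78May 23, 2024Updated last year
- MNBVC text quality classification using fastText☆16Oct 2, 2023Updated 2 years ago
- MNBVC (Massive Never-ending BT Vast Chinese corpus), an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train chatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes news, essays, novels, books, magazines…☆4,173Apr 6, 2026Updated 3 weeks ago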
- Feature Decay Algorithms☆11Mar 5, 2014Updated 12 years ago
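- This project organizes the open-source MOSS SFT data and converts it into the mnbvc multi-turn dialogue format. MOSS-003 covers three dimensions (helpfulness, honesty, and harmlessness), with 3.53 million samples in total; MOSS-003 includes finer-grained helpfulness category labels, broader harmlessness data, and more dialogue turns, with 6.3 million samples in total,☆12Dec 3, 2023Updated 2 years ago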
- Materials for "Prompting is not a substitute for probability measurements in large language models" (EMNLP 2023)☆24Oct 24, 2023Updated 2 years ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2☆69Jul 20, 2023Updated 2 years ago
- ☆19May 11, 2024Updated last year
- Implementing BERT + CRF with PyTorch for Chinese NER.☆10Mar 7, 2022Updated 4 years ago
- MD5 encryption using the NDK☆32Jun 2, 2016Updated 9 years ago
- Transformer related optimization, including BERT, GPT☆39Feb 10, 2023Updated 3 years ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus☆318Mar 20, 2023Updated 3 years ago
- Python implementation of AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, w…☆49Mar 22, 2025Updated last year
- Cleaning and quality evaluation of Chinese corpora for large-model pre-training☆81Jul 25, 2024Updated last year
- A more efficient GLM implementation!☆54Feb 18, 2023Updated 3 years ago
- A simple MLLM that surpasses QwenVL-Max using only open-source data, with a 14B LLM.☆38Sep 9, 2024Updated last year
- A purer Tokenizer with a higher compression rate☆488Nov 27, 2024Updated last year
- A package containing utils for the PyTorch version of the Tapas algorithm.☆11Apr 29, 2021Updated 5 years ago
- Reinforcement learning training for GPT-2, LLaMA, BLOOM, and other LLMs☆27Sep 19, 2023Updated 2 years ago
- TextPy: Collaborative Agent Workflow through Programming and Prompting☆27May 9, 2025Updated 11 months ago
- [EMNLP 2023] Official implementation of the ETSC (Exact Toeplitz-to-SSM Conversion) algorithm from our EMNLP 2023 paper - Accelerating Toeplitz…☆14Oct 17, 2023Updated 2 years ago
- Implementation of DTMT with incremental decoding☆13Feb 20, 2021Updated 5 years ago
- Reproduction of Soft-Masked BERT, from the paper Spelling Error Correction with Soft-Masked BERT☆12Oct 14, 2020Updated 5 years ago
- The data management platform (DataMan) is completely free and open source. Anyone can modify the code and deploy the service without restriction, which makes it a good choice for many application platforms that need data management: a low cost buys an efficient management solution, backed by a healthy ecosystem.☆13Feb 25, 2022Updated 4 years ago
- ☆13Jan 20, 2023Updated 3 years ago
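- Open-source AI video editing tool. Automatically splits long videos into viral short clips · multi-format export in 9:16/1:1/16:9 · local Whisper subtitles · Rust rendering pipeline · no upload required☆53Updated this week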
- Build Neo4J Knowledge Graphs from Excel files☆23Nov 18, 2024Updated last year
- Pattern of Resume.☆17Aug 6, 2017Updated 8 years ago
- [ACL2024] Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios☆71Aug 5, 2025Updated 8 months ago
- I-SHEEP: Iterative Self-enHancEmEnt Paradigm of LLMs through Self-Instruct and Self-Assessment☆17Jan 16, 2025Updated last year
- Making large AI models cheaper, faster and more accessible☆15Apr 20, 2023Updated 3 years ago
- C++ library for loading XDF files☆17Updated this week
- Applications of graph neural networks in recommender systems☆13Aug 26, 2021Updated 4 years ago
- ☆18Feb 4, 2026Updated 2 months ago
- ☆20Dec 31, 2020Updated 5 years ago
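- Dataset of Chinese classical texts☆21Jun 29, 2023Updated 2 years ago
- A Markdown document tool designed for RAG systems, offering two main features: automatic heading-structure extraction and document splitting. It fully preserves the document hierarchy, solving the problem of traditional splitters losing heading levels and breaking table integrity. A hierarchy-preserving Markdown document splitter for RAG…☆13Jan 2, 2025Updated last year
- OpenAI Realtime WebRTC Python client☆47Dec 29, 2024Updated last year
- A lab project implementing the Neural State Machine☆17Dec 28, 2020Updated 5 years ago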