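This project aims to quickly detect and convert the encodings of large numbers of text files, in support of data cleaning for the MNBVC corpus project.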
☆71 · Oct 17, 2025 · Updated 7 months ago
Alternatives and similar repositories for charset_mnbvc
Users that are interested in charset_mnbvc are comparing it to the libraries listed below.
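- Text deduplication ☆77 · May 23, 2024 · Updated 2 years ago
- MNBVC text quality classification using fastText ☆16 · Oct 2, 2023 · Updated 2 years ago
- MNBVC (Massive Never-ending BT Vast Chinese corpus), an ultra-large-scale Chinese corpus, intended to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also various niche cultures and even "Martian script" (火星文) data. It includes news, essays, novels, books, magazines… ☆4,206 · May 23, 2026 · Updated 2 weeks ago
- Feature Decay Algorithms ☆11 · Mar 5, 2014 · Updated 12 years ago
- ☆33 · Feb 9, 2025 · Updated last year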
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆69 · Jul 20, 2023 · Updated 2 years ago
- ☆19 · May 11, 2024 · Updated 2 years ago
- MD5 encryption using the NDK ☆32 · Jun 2, 2016 · Updated 10 years ago
- Transformer related optimization, including BERT, GPT ☆39 · Feb 10, 2023 · Updated 3 years ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆318 · Mar 20, 2023 · Updated 3 years ago
- Cleaning and quality evaluation of Chinese corpora for large-model pre-training ☆81 · Jul 25, 2024 · Updated last year
- A more efficient GLM implementation! ☆54 · Feb 18, 2023 · Updated 3 years ago
- A purer Tokenizer with a higher compression ratio ☆487 · Nov 27, 2024 · Updated last year
- A project that enables role-play with small-parameter large language models in one click; everything from data construction to training is included in the project ☆26 · Mar 31, 2024 · Updated 2 years ago
- Code and files for the paper Are Emergent Abilities in Large Language Models just In-Context Learning ☆33 · Jan 9, 2025 · Updated last year
- Fine-tuned BERT on SQuAD 2.0 Dataset. Applied Knowledge Distillation (KD) and fine-tuned DistilBERT (student) using BERT as the teacher m… ☆26 · Feb 13, 2021 · Updated 5 years ago
- Vocabulary Trimming (VT) is a model compression technique, which reduces a multilingual LM vocabulary to a target language by deleting ir… ☆67 · Oct 25, 2024 · Updated last year
- Reinforcement learning training for LLMs such as GPT-2, LLaMA, and BLOOM ☆27 · Sep 19, 2023 · Updated 2 years ago
- [EMNLP 2023] Official implementation of the algorithm ETSC: Exact Toeplitz-to-SSM Conversion, from our EMNLP 2023 paper - Accelerating Toeplitz… ☆14 · Oct 17, 2023 · Updated 2 years ago
- (ICLR 2025) AgentRefine: Enhancing Agent Generalization through Refinement Tuning ☆19 · Nov 22, 2025 · Updated 6 months ago
- A lightweight script for converting HTML pages to Markdown format, with support for code blocks ☆82 · Apr 14, 2024 · Updated 2 years ago
- INSET: Sentence Infilling with Inter-sentential Transformer ☆30 · Nov 21, 2020 · Updated 5 years ago
- A reproduction of Soft-Masked BERT, from the paper Spelling Error Correction with Soft-Masked BERT ☆12 · Oct 14, 2020 · Updated 5 years ago
- Web archiving utility library ☆11 · May 5, 2026 · Updated last month
- Generate video with voice narration from PPT/PDF slides ☆11 · Sep 4, 2023 · Updated 2 years ago
- The data management platform (DataMan) is completely free and open source; anyone can modify the code and deploy the service without restriction, which makes it a good choice for many application platforms that need data management: an efficient management solution at low cost, backed by a healthy ecosystem. ☆13 · Feb 25, 2022 · Updated 4 years ago
- ☆13 · Jan 20, 2023 · Updated 3 years ago
- Data and code for BioBERT-MRC ☆11 · Oct 5, 2021 · Updated 4 years ago
- The code of ACL2022 paper "Conditional Bilingual Mutual Information based Adaptive Training for Neural Machine Translation". ☆14 · Aug 6, 2022 · Updated 3 years ago
- [ACL2024] Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios ☆71 · Aug 5, 2025 · Updated 10 months ago
- I-SHEEP: Iterative Self-enHancEmEnt Paradigm of LLMs through Self-Instruct and Self-Assessment ☆17 · Jan 16, 2025 · Updated last year
- Porter is a data cleaning tool designed to assist with full data extraction from MySQL, MongoDB, and text files (CSV/TSV/JSON) and push t… ☆16 · Sep 16, 2024 · Updated last year
- Mainly the steps for survival analysis in Python, including survival analysis (stepwise and univariate), KM curves, decision curves, ROC curves, and a comparison of training and test sample distributions ☆11 · Dec 21, 2020 · Updated 5 years ago
- Making large AI models cheaper, faster and more accessible ☆15 · Apr 20, 2023 · Updated 3 years ago
- A list of advice on doing research that is useful for me :) ☆13 · Aug 17, 2019 · Updated 6 years ago
- Saito --> NEW REPOSITORY --> ☆12 · Dec 31, 2025 · Updated 5 months ago
- Graduation project: a cloud application platform fusing smart-city and meteorological data, based on AI + GraphCast ☆17 · May 5, 2024 · Updated 2 years ago
- Data and preprocessing scripts for SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding ☆16 · Feb 3, 2022 · Updated 4 years ago
- A huge dataset for Document Visual Question Answering ☆22 · Jul 29, 2024 · Updated last year