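This project aims to quickly detect and convert the encodings of large numbers of text files, in support of data cleaning for the MNBVC corpus project.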
☆70Oct 17, 2025Updated 5 months ago
Alternatives and similar repositories for charset_mnbvc
Users that are interested in charset_mnbvc are comparing it to the libraries listed below
- Text deduplication☆78May 23, 2024Updated last year
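- MNBVC (Massive Never-ending BT Vast Chinese corpus), an ultra-large-scale Chinese corpus that targets the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also various niche cultures and even text written in "Martian script" (火星文). The MNBVC dataset includes news, essays, novels, books, magazines…☆4,143Mar 8, 2026Updated last week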
- MNBVC project: ShareGPT corpus cleaning☆15Oct 4, 2023Updated 2 years ago
- Feature Decay Algorithms☆11Mar 5, 2014Updated 12 years ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2☆69Jul 20, 2023Updated 2 years ago
- ☆19May 11, 2024Updated last year
- Implementing BERT + CRF with PyTorch for Chinese NER.☆10Mar 7, 2022Updated 4 years ago
- Transformer related optimization, including BERT, GPT☆39Feb 10, 2023Updated 3 years ago
- Cleaning and quality evaluation of Chinese corpora for large-model pre-training☆76Jul 25, 2024Updated last year
- Code used for sourcing and cleaning the BigScience ROOTS corpus☆319Mar 20, 2023Updated 3 years ago
- The implementation of "Shallow-to-Deep Training for Neural Machine Translation"☆10Oct 26, 2020Updated 5 years ago
- A more efficient GLM implementation!☆54Feb 18, 2023Updated 3 years ago
- Python implementation of AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, w…☆49Mar 22, 2025Updated 11 months ago
- This repo shows how to generate PPT files using Python☆43Aug 10, 2024Updated last year
- A purer Tokenizer with a higher compression rate☆488Nov 27, 2024Updated last year
- Parses data from educational information systems, such as Word exam questions, Excel test papers, and knowledge points, into JSON content.☆13Mar 3, 2020Updated 6 years ago
- Implements reinforcement learning training for LLMs such as GPT-2, LLaMA, BLOOM, and so on☆27Sep 19, 2023Updated 2 years ago
- Vocabulary Trimming (VT) is a model compression technique, which reduces a multilingual LM vocabulary to a target language by deleting ir…☆63Oct 25, 2024Updated last year
- Implementation of DTMT with incremental decoding☆13Feb 20, 2021Updated 5 years ago
- Gradient accumulation on tf.estimator☆12Dec 15, 2020Updated 5 years ago
- (ICLR 2025) AgentRefine: Enhancing Agent Generalization through Refinement Tuning☆19Nov 22, 2025Updated 3 months ago
- INSET: Sentence Infilling with Inter-sentential Transformer☆30Nov 21, 2020Updated 5 years ago
- Applications of graph neural networks in recommender systems☆13Aug 26, 2021Updated 4 years ago
- A lightweight script for processing HTML page to markdown format with support for code blocks☆82Apr 14, 2024Updated last year
- Web archiving utility library☆11Mar 11, 2026Updated last week
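- The data management platform (DataMan) is completely free and open source; anyone can modify the code and deploy the service without restriction. This makes it a good choice for many application platforms that need data management: a low cost buys an efficient management solution, backed by a healthy ecosystem.☆13Feb 25, 2022Updated 4 years ago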
- Build Neo4J Knowledge Graphs from Excel files☆23Nov 18, 2024Updated last year
- I-SHEEP: Iterative Self-enHancEmEnt Paradigm of LLMs through Self-Instruct and Self-Assessment☆17Jan 16, 2025Updated last year
- All-in-one text de-duplication