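This project aims to quickly detect and convert the encodings of large numbers of text files, to support data cleaning for the MNBVC corpus project.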
☆70 · Oct 17, 2025 · Updated 5 months ago
Alternatives and similar repositories for charset_mnbvc
Users that are interested in charset_mnbvc are comparing it to the libraries listed below.
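- Text deduplication ☆78 · May 23, 2024 · Updated last year
- MNBVC text quality classification using fastText ☆16 · Oct 2, 2023 · Updated 2 years ago
- MNBVC (Massive Never-ending BT Vast Chinese corpus), a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. The MNBVC dataset includes news, essays, novels, books, magazines… ☆4,157 · Mar 22, 2026 · Updated 2 weeks ago
- MNBVC project: ShareGPT corpus cleaning ☆15 · Oct 4, 2023 · Updated 2 years ago
- Feature Decay Algorithms ☆11 · Mar 5, 2014 · Updated 12 years ago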
- ☆33 · Feb 9, 2025 · Updated last year
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆69 · Jul 20, 2023 · Updated 2 years ago
- ☆19 · May 11, 2024 · Updated last year
- Implementing BERT + CRF with PyTorch for Chinese NER. ☆10 · Mar 7, 2022 · Updated 4 years ago
- MD5 encryption using the NDK ☆32 · Jun 2, 2016 · Updated 9 years ago
- Transformer related optimization, including BERT, GPT ☆39 · Feb 10, 2023 · Updated 3 years ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆318 · Mar 20, 2023 · Updated 3 years ago
- The implementation of "Shallow-to-Deep Training for Neural Machine Translation" ☆10 · Oct 26, 2020 · Updated 5 years ago
- Chinese corpus cleaning and quality evaluation for large-model pre-training ☆79 · Jul 25, 2024 · Updated last year
- Python implementation of AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, w… ☆49 · Mar 22, 2025 · Updated last year
- A more efficient GLM implementation! ☆54 · Feb 18, 2023 · Updated 3 years ago
- A purer Tokenizer with a higher compression ratio ☆487 · Nov 27, 2024 · Updated last year
- Code and files for the paper "Are Emergent Abilities in Large Language Models just In-Context Learning" ☆33 · Jan 9, 2025 · Updated last year
- A package containing utils for the PyTorch version of the Tapas algorithm. ☆11 · Apr 29, 2021 · Updated 4 years ago
- Vocabulary Trimming (VT) is a model compression technique, which reduces a multilingual LM vocabulary to a target language by deleting ir… ☆63 · Oct 25, 2024 · Updated last year
- Reinforcement learning training for GPT-2, LLaMA, BLOOM, and other LLMs ☆27 · Sep 19, 2023 · Updated 2 years ago
- Retrieves parquet files from Hugging Face, identifies and quantifies junky data, duplication, contamination, and biased content in datase… ☆53 · Jul 6, 2023 · Updated 2 years ago
- ☆25 · Apr 25, 2024 · Updated last year
- (ICLR 2025) AgentRefine: Enhancing Agent Generalization through Refinement Tuning ☆19 · Nov 22, 2025 · Updated 4 months ago
- A lightweight script for processing HTML pages into Markdown format, with support for code blocks ☆82 · Apr 14, 2024 · Updated last year
- A reproduction of Soft-Masked BERT, from the paper "Spelling Error Correction with Soft-Masked BERT" ☆12 · Oct 14, 2020 · Updated 5 years ago
- Web archiving utility library ☆11 · Mar 11, 2026 · Updated 3 weeks ago
- The code of ACL2022 paper "Conditional Bilingual Mutual Information based Adaptive Training for Neural Machine Translation". ☆14 · Aug 6, 2022 · Updated 3 years ago
- Machine translation implemented with the Transformer model ☆12 · Dec 6, 2020 · Updated 5 years ago
- All-in-one text de-duplication ☆750 · Mar 9, 2026 · Updated last month
- Making large AI models cheaper, faster and more accessible ☆15 · Apr 20, 2023 · Updated 2 years ago
- A list of advice on doing research that is useful for me :) ☆13 · Aug 17, 2019 · Updated 6 years ago
- Saito --> NEW REPOSITORY --> ☆12 · Dec 31, 2025 · Updated 3 months ago
- Remote MCP server that gives LLMs access to run network commands ☆51 · Apr 1, 2026 · Updated last week
- Data and preprocessing scripts for SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding ☆15 · Feb 3, 2022 · Updated 4 years ago
- Text classification with Few-Shot methods, based on the THUCNews data ☆10 · Jun 4, 2020 · Updated 5 years ago
- ☆20 · Dec 31, 2020 · Updated 5 years ago
- A Markdown document tool designed for RAG systems, offering two main features: automatic heading-structure extraction and document splitting. It fully preserves the document hierarchy, addressing the problems of traditional splitters losing heading levels and breaking table integrity. A hierarchy-preserving Markdown document splitter for RAG… ☆13 · Jan 2, 2025 · Updated last year
- Official code of paper "MaskSim: Detection of synthetic images by masked spectrum similarity analysis", CVPRW 2024. ☆16 · Jul 16, 2025 · Updated 8 months ago